by Luuk » Thu Feb 20, 2025 5:54 am
In my original reply, the link was for another user, who was also having problems with Unicode characters and "|" in Replace(3).
So now I'm guessing it's probably best to avoid using Replace(3) when there are many different Unicode characters to be replaced?
The original poster seemed to only have problems with some 3-byte characters, but now I see it's also some 2-byte ones!
It seems like the bytes of some characters keep Replace(3) from interpreting '|' as 'Next-Match'?
But RegEx(1), Character Translations, and Javascript don't seem to have any problem with those characters.
Also, to remove the combining cedilla/diaeresis from 3-byte characters (a base letter plus a 2-byte combining mark), the v2 RegEx(1) can use a "Match" like...
\xCC[\xA7\x88]/g
Or if you need it to look more logical...
(.)(\xCC\xA7|\xCC\x88)/g
$1
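For the javascript route, here is a rough sketch of the same removal. Javascript strings are not raw UTF-8 bytes, so I'm assuming the marks have to be written as \u0327 (cedilla) and \u0308 (diaeresis) instead of \xCC\xA7 and \xCC\x88. The function name is made up, and how it plugs into your renaming script is left out.

```javascript
// Hypothetical helper, not part of any existing script.
// Removes combining cedilla (U+0327, UTF-8 bytes CC A7)
// and combining diaeresis (U+0308, UTF-8 bytes CC 88).
function stripCombiningMarks(name) {
  return name.replace(/[\u0327\u0308]/g, '');
}

// 'a' + combining diaeresis, then 'c' + combining cedilla
console.log(stripCombiningMarks('a\u0308c\u0327')); // ac
```

The base letter is never touched, only the mark that follows it gets dropped.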
Also, you could just add either of the regexes into the javascript code, to keep everything there.
But of course, "Character Translations" still needs one line for each character to be translated.
Remember, some 2-byte/3-byte characters look identical, so ä (the 2-byte one) would not be converted by these regexes.
So all the other non-3-byte characters would still have to be mapped, and I don't know any of them.
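To show the difference between the two look-alikes, here is a small sketch that can be run in Node. It also tries one possible workaround that is not from the posts above: String.prototype.normalize('NFD') splits the 2-byte ä into 'a' plus the combining mark, so the same removal catches both. I'm assuming the javascript engine you use supports normalize and TextEncoder; the function names are made up.

```javascript
const precomposed = '\u00E4'; // ä as one code point (UTF-8 bytes C3 A4)
const decomposed = 'a\u0308'; // 'a' + combining diaeresis (UTF-8 bytes 61 CC 88)

// Count the UTF-8 bytes of a string.
const utf8Length = (s) => new TextEncoder().encode(s).length;

console.log(utf8Length(precomposed)); // 2
console.log(utf8Length(decomposed)); // 3
console.log(precomposed === decomposed); // false

// Hypothetical helper: decompose first, then drop the marks.
function stripAfterNormalize(name) {
  return name.normalize('NFD').replace(/[\u0327\u0308]/g, '');
}

console.log(stripAfterNormalize(precomposed)); // a
console.log(stripAfterNormalize(decomposed)); // a
```

This only helps for characters that actually decompose into a letter plus a mark; anything else would still need its own mapping.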
I'm just now learning about many of them, as you post the characters to other pages, etc.
If you have any kind of complete list giving their conversion characters...
I'm sure I could convert it into either regex, javascript, or whatever you prefer.
But of course if you did, you'd probably already be finished by now, so still experimenting.
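Just to show the kind of thing such a list could be turned into, here is a minimal javascript sketch. The two entries are only placeholders I picked for illustration (they are not from your files), and the names `translations` and `translate` are made up.

```javascript
// Placeholder entries only, to be replaced by the real list.
const translations = new Map([
  ['\u00F8', 'o'],  // ø
  ['\u00DF', 'ss'], // ß
]);

// Hypothetical helper: swap each listed character, leave the rest alone.
function translate(name) {
  return Array.from(name, (ch) =>
    translations.has(ch) ? translations.get(ch) : ch
  ).join('');
}

console.log(translate('stra\u00DFe')); // strasse
```

Adding a character then means adding one line to the table, much like "Character Translations".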
Anyway, feel free to post any more problem characters!