Unicode characters and the "|" separator

Would you like Bulk Rename Utility to offer new functionality? Post your comments here!

Postby MKCA » Thu Nov 25, 2021 4:04 pm

There seems to be a bug affecting the "|" separator and Unicode characters.

When I try to do multiple replacements with alphanumeric characters everything works properly:
Code:
Replace:     po|re
With:        ka|su

port.txt     >      kart.txt
retf.txt     >      sutf.txt
??.txt      >      ??.txt
??.txt      >      ??.txt
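
For reference, the behaviour expected from the "|" separator can be sketched in a few lines of Python. This is only an illustration of the semantics the example above implies, not Bulk Rename Utility's actual code; the function name `replace_multi` and the rule of pairing terms by position are assumptions.

```python
def replace_multi(name: str, replace: str, with_: str, sep: str = "|") -> str:
    """Pair each 'Replace' term with the 'With' term at the same position.

    Hypothetical helper: it mirrors the po|re -> ka|su example, nothing more.
    """
    finds = replace.split(sep)
    subs = with_.split(sep)
    # Assumption: a missing 'With' term means "replace with nothing".
    subs += [""] * (len(finds) - len(subs))
    for find, sub in zip(finds, subs):
        if find:  # skip empty terms, e.g. the one after a trailing "|"
            name = name.replace(find, sub)
    return name


print(replace_multi("port.txt", "po|re", "ka|su"))  # kart.txt
print(replace_multi("retf.txt", "po|re", "ka|su"))  # sutf.txt
```

Note the guard that skips empty terms: without it, a trailing "|" would produce an empty search string.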


When I try to do a single replacement with Japanese characters, it also works properly:
Code:
Replace:     ?
With:        ka

port.txt     >      port.txt
retf.txt     >      retf.txt
??.txt      >      ka?.txt
??.txt      >      ??.txt


However, when I try to use the "|" separator with Japanese characters, it bugs out:
Code:
Replace:     ?|??
With:        ka|sute

port.txt     >      kapkaokarkatka.txt
retf.txt     >      karkaekatkafka.txt
??.txt      >      ka.txt
??.txt      >      ka.txt


It actually happens as soon as the "|" separator is entered:
Code:
Replace:     ?|
With:        ka|

port.txt     >      kapkaokarkatka.txt
retf.txt     >      karkaekatkafka.txt
??.txt      >      ka.txt
??.txt      >      ka.txt
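
The broken output above has the same shape as a replacement whose search string is empty, which would fit the affected characters being read as empty. The Python sketch below only reproduces the pattern; that the application really ends up with an empty search term is a guess, and `replace_stem` is a hypothetical helper.

```python
def replace_stem(filename: str, find: str, sub: str) -> str:
    """Replace 'find' with 'sub' in the name part only, leaving the extension alone."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension present
        stem, ext = filename, ""
    # In Python, an empty 'find' matches before every character and at the end.
    return stem.replace(find, sub) + dot + ext


print(replace_stem("port.txt", "", "ka"))  # kapkaokarkatka.txt
print(replace_stem("retf.txt", "", "ka"))  # karkaekatkafka.txt
```

If the search term is lost somewhere before matching, this is the result one would expect to see.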



But will work correctly if additional alphanumerical characters are introduced:
Code:
Replace:     ?a|??b
With:        ka|sute

port.txt     >      port.txt
retf.txt     >      retf.txt
?a?.txt     >      ka?.txt
??b.txt     >      sute.txt


The issue occurs with the "With" field too, except here it only causes the characters to be read as empty:
Code:
Replace:     po|re
With:        ?|?

port.txt     >      rt.txt
retf.txt     >      tf.txt


Replace:     po|re
With:        ?a|b?

port.txt     >      ?art.txt
retf.txt     >      b?tf.txt


It affects other types of Unicode characters too:
Code:
Replace:     po|
With:        ????|

port.txt     >      rt.txt


Replace:     po|
With:        ??|

port.txt     >      rt.txt


Replace:     po|
With:        ?|

port.txt     >      rt.txt


But not all, for example Cyrillic or Greek works fine:
Code:
Replace:     po|
With:        ??????|

port.txt     >      ??????rt.txt

Replace:     po|
With:        ??????|

port.txt     >      ??????rt.txt


I tested this on multiple computers, with different system locales, on both the 32-bit and 64-bit versions, both portable and installed.
MKCA
 
Posts: 6
Joined: Thu Nov 25, 2021 1:36 am

Re: Unicode characters and the "|" separator

Postby therube » Fri Nov 26, 2021 5:41 pm

The characters you intended aren't displaying correctly, at least for me (they simply show up as ? [question marks]), though I'd think they could, since the board doesn't look like it should be affecting what is seen.
Code:
?    ?    ?    ?    ?    ?    ?

Ah, they were fine in the Preview, but once the post was submitted they were replaced by ?, so it is the board.
So maybe you could host the sample file names elsewhere?

A site like snippet.host should work.
(Above link valid for 1 month only.)
therube
 
Posts: 1314
Joined: Mon Jan 18, 2016 6:23 pm

Re: Unicode characters and the "|" separator

Postby MKCA » Sat Nov 27, 2021 1:43 am

? ?? ?? ????
MKCA
 
Posts: 6
Joined: Thu Nov 25, 2021 1:36 am

Re: Unicode characters and the "|" separator

Postby MKCA » Sat Nov 27, 2021 1:49 am

LMAO
The board can't even handle the characters that actually worked in the application in the first place! It even gave me an error when submitting! :lol:
I will try posting links then.
MKCA
 
Posts: 6
Joined: Thu Nov 25, 2021 1:36 am

Re: Unicode characters and the "|" separator

Postby MKCA » Sat Nov 27, 2021 2:24 am

Attempt 2: Unicode boogaloo
(Thank you therube, for suggesting snippet.host)
/////////////////////////////////////////////////////////////////////////////////////
There seems to be a bug affecting the "|" separator and Unicode characters.

When I try to do multiple replacements with alphanumeric characters everything works properly:
https://snippet.host/voap

When I try to do a single replacement with Japanese characters, it also works properly:
https://snippet.host/awgy

However, when I try to use the "|" separator with Japanese characters, it bugs out:
https://snippet.host/knvw

It actually happens as soon as the "|" separator is inputted:
https://snippet.host/omea

But will work correctly if additional alphanumerical characters are introduced:
https://snippet.host/zabw

The issue occurs with the "With" field too, except here it only causes the characters to be read as empty:
https://snippet.host/bzxr

It affects other types of Unicode characters too:
https://snippet.host/buur

But not all, for example Cyrillic or Greek works fine:
https://snippet.host/zgqy

I tested this on multiple computers, with different system locales, on both the 32-bit and 64-bit versions, both portable and installed.
MKCA
 
Posts: 6
Joined: Thu Nov 25, 2021 1:36 am

Re: Unicode characters and the "|" separator

Postby Admin » Sat Nov 27, 2021 9:50 am

We need to check this issue for the next update!
Admin
Site Admin
 
Posts: 2343
Joined: Tue Mar 08, 2005 8:39 pm

Re: Unicode characters and the "|" separator

Postby MKCA » Thu Aug 18, 2022 5:11 pm

The issue persists as of version 3.4.4.0.
MKCA
 
Posts: 6
Joined: Thu Nov 25, 2021 1:36 am

Unicode characters and the "|" separator

Postby Luuk » Fri Aug 19, 2022 3:19 pm

It's unfortunate I didn't find this post sooner, because I also found a similar glitch when using the "|" character.
Except mine happens when trying to use the different \modifiers\ like \first\, \third\, \last\, \end\, etc.
Many times, I experimented with a "Replace" like: \first\a|\second\b|\last\c|\end\xyz

But only the very first \modifier\ is honored, and the others are ignored, because "|" does not terminate the modifier!
So any "Replace" like \first\a|\second\b|\first\c|\last\xyz|b|c replaces either the 1st:a -or- 1st:b -or- 1st:c.
So it's just like using \first\a|b|c, because everything after \first\ except "|" is treated as plain text.
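
To make the expected behaviour concrete, here is a rough Python sketch in which every "|" ends the current term, so each term can carry its own modifier. It is only a model of what is described above, not how Bulk Rename Utility parses the field. It handles just \first\ and \last\, and the "With" value in the example is made up for illustration.

```python
import re

# Assumption: a modifier is a leading \name\ at the start of a term.
MODIFIER = re.compile(r"^\\(first|last)\\(.*)$", re.DOTALL)


def parse_terms(replace: str):
    """Split on "|" and read an optional leading modifier for each term."""
    terms = []
    for raw in replace.split("|"):
        m = MODIFIER.match(raw)
        terms.append((m.group(1), m.group(2)) if m else (None, raw))
    return terms


def apply_term(name: str, modifier, find: str, sub: str) -> str:
    if not find:
        return name
    if modifier == "first":
        return name.replace(find, sub, 1)
    if modifier == "last":
        head, sep, tail = name.rpartition(find)
        return head + sub + tail if sep else name
    return name.replace(find, sub)  # no modifier: replace every occurrence


def replace_with_modifiers(name: str, replace: str, with_: str) -> str:
    subs = with_.split("|")
    for i, (modifier, find) in enumerate(parse_terms(replace)):
        sub = subs[i] if i < len(subs) else ""
        name = apply_term(name, modifier, find, sub)
    return name


print(parse_terms(r"\first\a|\last\c|b"))
print(replace_with_modifiers("abcabc", r"\first\a|\last\c|b", "1|3|2"))
```

In this model the second and third terms get their own treatment instead of inheriting \first\ from the first one.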

I experimented with many things like "||", "\stop\", and "\quit\", but nothing could ever terminate the very first \modifier\.
So I'm guessing this problem is related, but not the same, and probably comes from how many bytes the character needs??
I'm saying this because, looking at the other website, it only seems to happen with some of the Unicode characters??

I think an easy workaround for the "With" box is to use "\" as the very first character, with Remove(5) then removing any "\".
That way, with only one "bad-byte character" to replace, if it's "waiting for more bytes", the "\" gives extra bytes to satisfy the "|"???
And if there are a lot of characters to replace, Remove(5) could just remove the "\" from the final replacement.

Of course, I'm only guessing about the reasons, and this cannot work inside the "Replace" box anyway.
This is me guessing again, but it's probably because "Replace" wants to reserve "\" for its \modifiers\??
So for right now, the only workarounds seem to be using either RegEx(1) or Character Translations.

I'd also like to report that RegEx(1) has the exact same problem with these "bad-byte characters".
But since RegEx(1) uses "\" to mean "literal", the "\" workaround works properly for both the "Match" and "Replace".
The bad part of using RegEx(1) is that it needs the "v2" box checked, and then some very long formats like:
\SomeBadByteChar1/g(?X)\SomeBadByteChar2/g(?X)...
\Any-Replacements-1(?X)\Any-Replacements-2(?X)...

So for now, I think most users will prefer Character Translations as a workaround:
SomeBadByteChar1=SomeOtherChar1
SomeBadByteChar2=SomeOtherChar2
a,b,BadByteChar3=A,B,SomeChar3
...

It does seem strange that only RegEx(1) and Replace(3) both have this exact same problem with the "bad-byte characters".
This makes me believe that our replace-strings in Replace(3) are really being converted into regex first, before being applied?
If anybody else has similar experiences, please post them here, because I think it can provide clues to the programmers.
Luuk
 
Posts: 691
Joined: Fri Feb 21, 2020 10:58 pm

3-byte UTF8 Unicodes and the "|" separator

Postby Luuk » Tue Aug 30, 2022 4:35 am

Many thanks to MKCA for posting this, because it does help a lot in making many discoveries!!
It seems that the problem characters are always ones that take exactly 3 bytes in UTF-8.
If anybody is curious, UTF-8 encodes its characters with either 1, 2, 3, or 4 bytes ...

Code:
==UTF8==    Byte1    Byte2    Byte3    Byte4     [Ranges] to match them with "v2" regexes
(ASCII)     00-7F    -----    -----    -----     [\x00-\x7f]
(2-bytes)   C2-DF    80-BF    -----    -----     [\xc2-\xdf][\x80-\xbf]
(3-bytes)   E0-EF    80-BF    80-BF    -----     [\xe0-\xef][\x80-\xbf][\x80-\xbf]
(4-bytes)   F0-F4    80-BF    80-BF    80-BF     [\xf0-\xf4][\x80-\xbf][\x80-\xbf][\x80-\xbf]

So it's a good format, because Byte1 always says how many more bytes will follow to make up the whole character.
The 2-4 byte sequences are the non-ASCII Unicode characters, but I don't know their real names, so I'm just naming them by how many bytes they use.
Only the 3-byte characters present this problem, and only inside RegEx(1) or Replace(3) with "|".
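
For anyone who wants to check which group a character falls into, a short Python sketch follows. It only measures UTF-8 lengths and says nothing about how the application stores its strings internally. The sample characters are arbitrary picks, and the byte pattern is the simplified 3-byte range from the table above.

```python
import re


def utf8_len(ch: str) -> int:
    """Number of bytes the character takes when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def len_from_lead_byte(lead: int) -> int:
    """Read the sequence length off the first byte, as in the table."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


# Simplified 3-byte pattern, applied to the encoded bytes.
THREE_BYTE = re.compile(rb"[\xe0-\xef][\x80-\xbf][\x80-\xbf]")

# Latin, Cyrillic, Greek, Japanese, emoji
for ch in ["a", "я", "λ", "か", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(), utf8_len(ch), len_from_lead_byte(encoded[0]))

print(THREE_BYTE.findall("かa.txt".encode("utf-8")))
```

Cyrillic and Greek letters come out as 2 bytes and Japanese kana as 3, which lines up with the reports earlier in the thread.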

So now I'm wondering if adding the "|" is what tells Replace(3) to convert our strings into regexes?
I'm also wondering if it's the same reason that the "|" cannot terminate the different \modifiers\?
I have no idea about programming code, but maybe this information can help the programmers.
Luuk
 
Posts: 691
Joined: Fri Feb 21, 2020 10:58 pm

Re: 3-byte UTF8 Unicodes and the "|" separator

Postby MKCA » Wed Sep 07, 2022 12:50 pm

Luuk wrote:Many thanks to MKCA for posting this, because it does help a lot in making many discoveries!!

I gotchu bro! 8)
We, the "|" enjoyers must stick together!
MKCA
 
Posts: 6
Joined: Thu Nov 25, 2021 1:36 am

