Character replacement results in 'replacement character'


Matching hex-bytes with RegEx(1)

Postby Luuk » Fri Feb 21, 2025 10:01 pm

So that nobody gets confused, this answer is only for... https://www.bulkrenameutility.co.uk/forum/viewtopic.php?f=4&t=6882&start=15#p19518

Originally, when you posted about the combining-diaeresis and combining-cedilla, I thought those two were the only combining-diacritics!
So I just provided \xcc[\xa7\x88]/g to match them, but now that you've provided http://www.fileformat.info/info/unicode/block/combining_diacritical_marks/list.htm
I'm changing it to \xcc[\x80-\xbf]|\xcd[\x80-\xa2]/g to match all of them (except the last 13, which look like regular characters).

So you could remove your \xA7\x88, since both of them are already within \x80-\xbf's range.
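
As a rough sanity check of that range, here is a minimal Python sketch (not BRU syntax). It assumes the \x escapes are matched against the UTF-8 bytes of the name, as the Match above implies. The names COMBINING and strip_combining are made up for illustration.

```python
import re

# Same byte ranges as the Match above, written as a Python bytes pattern.
COMBINING = re.compile(rb"\xcc[\x80-\xbf]|\xcd[\x80-\xa2]")

def strip_combining(name: str) -> str:
    """Delete combining diacritics U+0300..U+0362 at the UTF-8 byte level."""
    return COMBINING.sub(b"", name.encode("utf-8")).decode("utf-8")

# Which code points of the Combining Diacritical Marks block (U+0300..U+036F)
# does the pattern cover, and which does it leave alone?
covered = [cp for cp in range(0x0300, 0x0370)
           if COMBINING.fullmatch(chr(cp).encode("utf-8"))]
missed = [cp for cp in range(0x0300, 0x0370) if cp not in covered]

print(f"covered U+{covered[0]:04X}..U+{covered[-1]:04X}, missed {len(missed)}")
print(strip_combining("a\u0308c\u0327"))  # 'a' + diaeresis, 'c' + cedilla
```

The code points it leaves alone are the last 13 of the block (U+0363 to U+036F), which matches the "except the last 13" remark.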
Also, be careful adding + in front of /g: it won't hurt with your deletions, but replacements are different.
The /g always matches "all", but if you started using RegEx(1) to conduct re-mapping, the difference would be...

[ä]/g
a
AäääZ ----> AaaaZ

[ä]+/g
a
AääääZ ---> AaZ
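
The same difference can be reproduced outside BRU. This is only a sketch in Python, whose re.sub replaces every match by default (similar in spirit to /g). It assumes the ä is the precomposed character U+00E4.

```python
import re

A_UMLAUT = "\u00e4"  # precomposed ä (assumption; a decomposed a + U+0308 would not match)

# Without +, each ä is its own match, so each one becomes an "a".
one_each = re.sub(f"[{A_UMLAUT}]", "a", f"A{A_UMLAUT * 3}Z")

# With +, a whole run of ä is a single match, so the run becomes one "a".
one_per_run = re.sub(f"[{A_UMLAUT}]+", "a", f"A{A_UMLAUT * 4}Z")

print(one_each)     # AaaaZ
print(one_per_run)  # AaZ
```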

So the way you're using it so far (even with \xA7\x88), there shouldn't be any problems! Just make sure the re-mappings run first.
Also, if you wanted to make both of those minor changes, then the whole new "Match" would look like...
\xcc[\x80-\xbf]|\xcd[\x80-\xa2]/g(?X)(?!€|£)[\xc2-\xf4][\x80-\xbf]/g(?X)[^]A-Za-z0-9@_',;!£$€%&=#~ `^\-+[(){}.]/g

Also, if your remaps of both € and £ run before RegEx(1), then (?!€|£) is not needed to prevent their deletion.
Luuk
 
Posts: 809
Joined: Fri Feb 21, 2020 10:58 pm

Re: Character replacement results in 'replacement character'

Postby Luuk » Sun Feb 23, 2025 1:51 am

Many apologies!! If you decide to make both changes, the whole "Match" should instead look like...
\xcc[\x80-\xbf]|\xcd[\x80-\xa2]/g(?X)(?!€|£)[\xc2-\xf4][\x80-\xbf]+/g(?X)[^]A-Za-z0-9@_',;!£$€%&=#~ `^\-+[(){}.]/g

During the editing, I accidentally removed ALL of the + signs, instead of just the added-ones (again NO problem with deletions).
The original [\x80-\xbf]+ is critical, to match 1-or-more trailing-unicode-bytes (for all 2, 3, and 4-byte characters) to be deleted.
The (?!€|£) is only needed to exempt those two characters from deletion if they're not being replaced before RegEx(1) runs.
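
To illustrate why that + matters, here is a small Python sketch of the middle part of the Match only. It is not BRU itself and assumes the pattern is applied to the UTF-8 bytes of the name. WITH_PLUS, WITHOUT_PLUS and delete are hypothetical names.

```python
import re

# € is the bytes E2 82 AC and £ is C2 A3 in UTF-8, so the lookahead is spelled out in bytes.
KEEP = rb"(?!\xe2\x82\xac|\xc2\xa3)"

# Lead byte followed by 1-or-more trailing bytes (the corrected pattern).
WITH_PLUS = re.compile(KEEP + rb"[\xc2-\xf4][\x80-\xbf]+")
# The same pattern with the + accidentally removed.
WITHOUT_PLUS = re.compile(KEEP + rb"[\xc2-\xf4][\x80-\xbf]")

def delete(pattern: re.Pattern, name: str) -> str:
    """Delete matches at the byte level; invalid leftovers decode as U+FFFD."""
    raw = pattern.sub(b"", name.encode("utf-8"))
    return raw.decode("utf-8", errors="replace")

name = "ï€£日"  # 2-byte ï, the two exempt characters, 3-byte 日

print(repr(delete(WITH_PLUS, name)))     # ï and 日 are removed whole
print(repr(delete(WITHOUT_PLUS, name)))  # 日 loses only 2 of its 3 bytes
```

In the second case the orphaned trailing byte is no longer valid UTF-8, so it decodes as the replacement character U+FFFD.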

Re: Character replacement results in 'replacement character'

Postby TheGhost78 » Sun Feb 23, 2025 3:04 pm

Thanks, Luuk. No, I'm keeping the pound and Euro signs in the filenames.
TheGhost78
 
Posts: 181
Joined: Fri Jul 19, 2024 11:25 am


Re: Character replacement results in 'replacement character'

Postby Admin » Thu Feb 27, 2025 4:36 am



This is a bug with Remove -> Accents and characters which are surrogate pairs, like the examples you posted.
In a UTF-16 encoded string (what BRU supports), text is stored as 16-bit code units. Characters in the Basic Multilingual Plane (BMP) take one code unit, while characters outside the BMP are represented using a surrogate pair, a combination of two code units:
- The high surrogate is in the range 0xD800–0xDBFF.
- The low surrogate is in the range 0xDC00–0xDFFF.
A character that is a surrogate pair confuses Remove -> Accents in BRU.
We will fix this in the next build.
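
For readers unfamiliar with surrogate pairs, the ranges above can be sketched in Python as follows. This is not BRU's code, only an illustration of the standard UTF-16 arithmetic. The helper names to_surrogate_pair and utf16_units are made up for the example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into its high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits, lands in 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits, lands in 0xDC00..0xDFFF
    return high, low

def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units of text as UTF-16 stores them."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

print([hex(u) for u in utf16_units("\u00e4")])      # BMP character: one code unit
print([hex(u) for u in utf16_units("\U0001F600")])  # outside the BMP: two code units
print([hex(u) for u in to_surrogate_pair(0x1F600)])
```

Code that treats every code unit as a complete character would see two separate "characters" for U+1F600, neither of which is valid on its own.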
Admin
Site Admin
 
Posts: 2923
Joined: Tue Mar 08, 2005 8:39 pm

Re: Character replacement results in 'replacement character'

Postby Admin » Tue Mar 04, 2025 12:21 am

