Remove everything except A-Z and 0-9?

Post by BairStrokes » Sun Mar 28, 2021 11:23 am

Hi team,

I need to remove characters that are invalid on Windows from 1.6 TB of data that is normally used on Mac machines. Because of that, lots of file names contain special Unicode characters. I tried various remove options, but it looks like the only way to remove a Unicode character is to enter it into Remove (5) ==> Chars. Given the amount of data and the number of files, it's a challenge to find every different Unicode character across all the data.

One thing that came to mind: is there a way to tell BRU to keep certain characters and remove the rest? For example, keep just A-Z, a-z, 0-9, "space", etc.

Is it possible with BRU?

Regards,
BairStrokes
 

Re: Remove everything except A-Z and 0-9?

Post by Luuk » Sun Mar 28, 2021 1:35 pm

To remove only the non-ASCII (Unicode) characters, you can try a 'v2' regex with a "Match" like...
[^\x20-\x7F]+/g

If you would rather specify all of the characters to keep, then use a "Match" like...
[^-A-Za-z0-9 ._!]/g
It keeps only the characters from your description, plus '-', '.', '_' and '!'. To keep more characters, just add them after the '!'.
The ^ at the very start of the brackets means "everything except", so leave "Replace" empty to delete everything else.
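
If you want to check what the two patterns would do before running them over the real data, here is a rough Python sketch. It is not BRU itself: the function names and the sample file name are made up for illustration, the trailing /g is dropped because Python's re.sub already replaces every match, and Python's regex engine is assumed to treat these simple character classes the same way as BRU's 'v2' engine.

```python
import re

# Patterns copied from the post, without the trailing /g modifier.
# (re.sub replaces every match by default, which is what /g asks for.)
NON_ASCII = re.compile(r"[^\x20-\x7F]+")
NOT_ALLOWED = re.compile(r"[^-A-Za-z0-9 ._!]")


def strip_non_ascii(name: str) -> str:
    """Hypothetical helper: delete every run of characters outside \\x20-\\x7F."""
    return NON_ASCII.sub("", name)


def keep_allowed(name: str) -> str:
    """Hypothetical helper: delete every character not in the keep-list."""
    return NOT_ALLOWED.sub("", name)


# Made-up sample name containing accented letters and a star symbol.
sample = "R\u00e9sum\u00e9_2021\u2605 (final).txt"

print(strip_non_ascii(sample))  # Rsum_2021 (final).txt
print(keep_allowed(sample))     # Rsum_2021 final.txt
```

The first pattern leaves the parentheses alone because they are plain ASCII, while the keep-list pattern removes them because they are not listed inside the brackets.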
Luuk
 

