Negative matching


Postby t000ny » Mon Feb 10, 2014 12:24 pm

Hi all, I couldn't find anything on this subject by searching, so I thought I'd ask.

Trying to do a negative match, to replace all "strange" characters in file names on my USB stick. (The car's media player can't handle them. So I have to replace them with standard characters.)

I can't get a negative-match regular expression to work in the program.
In Perl I can use all of the following expressions with success:
Code:
$text =~ s|([a-zA-Z0-9 -_\n]*)[^a-zA-Z0-9 -_]+([a-zA-Z0-9 -_\n]*)|\1X\2|g;

$text =~ s|[^a-zA-Z0-9 -_\n]|X|g;

$text =~ s|[^\w\n]|X|g;

$text =~ s|[\W]|X|g;

All of them replace the characters outside the specified set, as expected:
Code:
#!/usr/bin/perl
my $text = "01 - Blinkar Blå (singelversion).mp3
02 - Från min radio.mp3
03 - Mr Jones maskin.mp3
04 - Kroppens automatik.mp3
05 - Krafter vi aldrig känner.mp3
06 - Stockholmsserenad.mp3
07 - Vidare.mp3
08 - Bärande våg.mp3
09 - 5-e avenyn.mp3
10 - Blinkar blå (maxiversion).mp3";

$text =~ s|([a-zA-Z0-9 -_\n]*)[^a-zA-Z0-9 -_]+([a-zA-Z0-9 -_\n]*)|\1X\2|g;
print "$text \n";

$text =~ s|[^a-zA-Z0-9 -_\n]|X|g;
print "$text \n";

$text =~ s|[^\w\n]|X|g;
print "$text \n";

$text =~ s|[\W]|X|g;
print "$text \n";

Gives the expected output:
Code:
01 - Blinkar BlX (singelversion).mp3
02 - FrXn min radio.mp3
03 - Mr Jones maskin.mp3
04 - Kroppens automatik.mp3
05 - Krafter vi aldrig kXnner.mp3
06 - Stockholmsserenad.mp3
07 - Vidare.mp3
08 - BXrande vXg.mp3
09 - 5-e avenyn.mp3
10 - Blinkar blX (maxiversion).mp3

01 - Blinkar BlX (singelversion).mp3
02 - FrXn min radio.mp3
03 - Mr Jones maskin.mp3
04 - Kroppens automatik.mp3
05 - Krafter vi aldrig kXnner.mp3
06 - Stockholmsserenad.mp3
07 - Vidare.mp3
08 - BXrande vXg.mp3
09 - 5-e avenyn.mp3
10 - Blinkar blX (maxiversion).mp3

01XXXBlinkarXBlXXXsingelversionXXmp3
02XXXFrXnXminXradioXmp3
03XXXMrXJonesXmaskinXmp3
04XXXKroppensXautomatikXmp3
05XXXKrafterXviXaldrigXkXnnerXmp3
06XXXStockholmsserenadXmp3
07XXXVidareXmp3
08XXXBXrandeXvXgXmp3
09XXX5XeXavenynXmp3
10XXXBlinkarXblXXXmaxiversionXXmp3

01XXXBlinkarXBlXXXsingelversionXXmp3X02XXXFrXnXminXradioXmp3X03XXXMrXJonesXmaskinXmp3X04XXXKroppensXVidareXmp3X08XXXBXrandeXvXgXmp3X09XXX5XeXavenynXmp3X10XXXBlinkarXblXXXmaxiversionXXmp3X

(The last row converts the CR LF characters too, hence the single line.)
But I can't get any of those expressions to work inside Bulk Rename Utility. :-(
[two screenshots, and so on]

Technically there's no need for the newline(\n) here, but it makes no difference. Have tried without that too.
Any ideas?
t000ny
 
Posts: 5
Joined: Mon Feb 10, 2014 11:56 am

Re: Negative matching

Postby Stefan » Mon Feb 10, 2014 8:47 pm

With BRU, the regular expression has to match the whole file name.
You can't just search for an independent pattern somewhere inside the name.


For your issue, you could try Character Translations:
Code:
å=a
è=e
ü=u,e
ê=e



See Character Translations HowTo >>> www.bulkrenameutility.co.uk/forum/viewtopic.php?f=4&t=1329&p=3705
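
For anyone who wants to see the idea outside BRU: a character translation is just a lookup table applied to every character of the name, however often it occurs. Here is a minimal Python sketch (not BRU's own mechanism). The table name `TRANSLATIONS`, the helper `translate_name`, and reading `ü=u,e` as "ü becomes ue" are assumptions made for illustration.

```python
# Hypothetical translation table modelled on the entries above.
# Assumption: "ü=u,e" is read as "ü becomes the two characters u and e".
TRANSLATIONS = {
    "å": "a",
    "è": "e",
    "ü": "ue",
    "ê": "e",
}


def translate_name(name: str) -> str:
    """Replace every listed character, wherever and however often it occurs."""
    return name.translate(str.maketrans(TRANSLATIONS))


# "ä" is not in the table, so it is left alone; "å" is translated.
print(translate_name("08 - Bärande våg.mp3"))
```

Characters that are not listed stay untouched, so the table has to name every character you want changed.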



Stefan
 
Posts: 736
Joined: Fri Mar 11, 2005 7:46 pm
Location: Germany, EU

Re: Negative matching

Postby t000ny » Tue Feb 11, 2014 9:56 am

I am matching the whole name in example 1, but it doesn't work in BRU. It works perfectly in Perl. So I am wondering if it is a bug of some kind.

The problem is that it is not just those four characters I want to translate; it is any unexpected character not in the square brackets...
t000ny
 
Posts: 5
Joined: Mon Feb 10, 2014 11:56 am

RegexMatch cannot match global/all occurrences

Postby truth » Tue Feb 11, 2014 2:17 pm

BRU can't specify a global-occurrence-match via regex (like the g in your expressions)
So the number of occurrences must be spec'd within the match-side of your expression (BRU's MatchBox)
I know, I know... How about a little global-checkbox in there?? Perhaps one day, idk

You could always use:
([a-zA-Z0-9 -_]*)[^a-zA-Z0-9 -_]+([a-zA-Z0-9 -_]*)(.*)
\1X\2\3

But without the g-modifier, the negated-set can only match the 1st occurrence of StrangeChar(s)
Note how Group3 prevents filename-cropping if Group2 finishes matching before $EndFileName
That's what a lot of people mean by 'whole-name' matching.
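
To see why one pass only fixes the first run of strange characters, and why repeating the pass eventually gets them all, here is a rough sketch in Python rather than BRU. It assumes Python's `re` module treats this pattern the way the match/replace pair above is meant to work; `replace_once` and `replace_until_stable` are made-up helper names.

```python
import re

# The match/replace pair suggested above, written for Python's re module.
PATTERN = re.compile(r"([a-zA-Z0-9 -_]*)[^a-zA-Z0-9 -_]+([a-zA-Z0-9 -_]*)(.*)")
REPLACEMENT = r"\1X\2\3"


def replace_once(name: str) -> str:
    """One non-global pass: only the first run of strange characters changes."""
    return PATTERN.sub(REPLACEMENT, name, count=1)


def replace_until_stable(name: str, max_passes: int = 20) -> str:
    """Repeat the single pass until the name stops changing."""
    for _ in range(max_passes):
        new_name = replace_once(name)
        if new_name == name:
            break
        name = new_name
    return name


print(replace_once("08 - Bärande våg.mp3"))          # first occurrence only
print(replace_until_stable("08 - Bärande våg.mp3"))  # all occurrences
```

Group 3 carries the untouched tail of the name forward, which is what lets the next pass find the next strange character.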
truth
 
Posts: 221
Joined: Tue Jun 25, 2013 3:39 am
Location: Earth, OrionArm, MilkyWay

Re: Negative matching

Postby t000ny » Fri Feb 14, 2014 10:57 am

Aaah, OK thanks for that clarification!
But this still doesn't work I'm afraid.
Your example does not match the file names it should. It doesn't match anything at all in the list. So there seems to be something a bit fishy about the caret negation.

Another example:
[^z]
X
This replaces the full name of every file in the listing from my first post that has only standard characters. But without any quantifier it shouldn't: it should only match the first character in the string that is not the letter 'z'. So it is not matching the full file name.
The same happens with [^z]+ as well.

Not sure if this is the right place for bug reports though? I have no need to get this fixed myself, since I renamed my files using Perl when I couldn't get this working. ;-)
t000ny
 
Posts: 5
Joined: Mon Feb 10, 2014 11:56 am

Match-It AND Group-It, or Lose-It

Postby truth » Fri Feb 14, 2014 5:34 pm

Note that our posts might be using chars not in your filenames.
Also, BRU's regex may deal with Unicode chars as 'unknowns' (even: unknown if not-part of set!)
For the record, here's the results I get:

--- OrigName ------------------ NewNameWithGroup3 -------- NewNameWithoutGroup3
Från min radio ---------------- FrXn min radio ----------------- FrXn min radio
Krafter vi aldrig känner ------ Krafter vi aldrig kXnner ------- Krafter vi aldrig kXnner
Bärande våg ------------------- BXrande våg--------------------- BXrande v

I mainly wanted to show how adding (.*) could prevent name-cropping,
when the preceding match might not extend all the way to $EndName
(so the regex could be run multiple times to capture more occurrences).

Here's a good rule-of-thumb with BRU's regex:
Match-it and Group-It or Lose-It!

The [^z] matches any filename with a non-z char, but doesn't group anything, so all chars are lost
(since there's no way to represent them in BRU's ReplaceBox).

It's functionally equivalent to the expression .*[^z].* where your filename chars are indeed matched,
but not grouped, so again: lost, since there's no way to input them with \1, \2, etc.

To specify names beginning with any non-z char, just use ^[^z]
Not sure if that's what you're testing; it sounds like the main thing is just to group anything you plan on keeping.
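
Here is a toy Python model of the "match it and group it, or lose it" behaviour, in case it helps to play with. It is only an assumption about how BRU behaves, pieced together from the results in this thread, not BRU's actual code; `bru_like_rename` is a made-up name.

```python
import re


def bru_like_rename(match: str, replace: str, name: str) -> str:
    """Toy model (assumption): if the pattern is found anywhere in the name,
    the new name is just the expanded replacement; otherwise nothing changes."""
    found = re.search(match, name)
    if found is None:
        return name
    return found.expand(replace)


WITHOUT_GROUP3 = r"([a-zA-Z0-9 -_]*)[^a-zA-Z0-9 -_]+([a-zA-Z0-9 -_]*)"
WITH_GROUP3 = WITHOUT_GROUP3 + r"(.*)"

print(bru_like_rename(r"[^z]", "X", "Vidare"))                    # nothing grouped
print(bru_like_rename(WITHOUT_GROUP3, r"\1X\2", "Bärande våg"))   # tail is cropped
print(bru_like_rename(WITH_GROUP3, r"\1X\2\3", "Bärande våg"))    # tail is kept
```

Under this model, anything the replacement does not reference through a group simply never makes it into the new name.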
truth
 
Posts: 221
Joined: Tue Jun 25, 2013 3:39 am
Location: Earth, OrionArm, MilkyWay

Re: Match-It AND Group-It, or Lose-It

Postby t000ny » Fri Feb 14, 2014 8:07 pm

truth wrote:Note that our posts might be using chars not in your filenames.
Also, BRUs regex may deal with Unicodes as 'unknowns' (even: unknown if not-part of set!)
For the record, here's the results I get:

--- OrigName ------------------ NewNameWithGroup3 -------- NewNameWithoutGroup3
Från min radio ---------------- FrXn min radio ----------------- FrXn min radio
Krafter vi aldrig känner ------ Krafter vi aldrig kXnner ------- Krafter vi aldrig kXnner
Bärande våg ------------------- BXrande våg--------------------- BXrande v

OK, I get that any Unicode character can be "unknown", but shouldn't it still be a positive match for the negated set? I.e., the "[^a-zB-Z0-9 -_]" part should match an "å", for example?
As shown in my first post, I am not getting the same results. :-(
Since the characters display the same in the GUI, it seems strange that it would be a problem with the code page or the filesystem encoding. But I am no expert on that... And I had no problems doing the whole rename operation in Perl. (After setting the code page correctly.)
truth wrote:I mainly wanted to show how adding (.*) could prevent name-cropping,
when the preceding match might not extend all the way to $EndName
(so the regex could be run multiple times to capture more occurrences).

Yes, got it. Thanks for that! Which then leads me to wonder if it can be written recursively (since there's no global replace option). But this is just an exercise for my mind, not needed for anything right now. I could probably match, say, ten occurrences of non-standard characters and see if that would be enough. :-)
truth wrote:Here's a good rule-of-thumb with BRUs regex:
Match-it and Group-It or Lose-It!

Indeed! Group and reference it. :-)

truth wrote:The [^z] matches any filename with a non-z char, but doesnt group anything, so all chars are lost
(since there's no way to represent them in BRUs ReplaceBox).

Its functionally equivalent to the expression .*[^z].* where your filename chars are indeed matched,
but not grouped, so again: lost since theres no way to input them with \1, \2, etc.

To specify Names beginning with any non-z char, just use ^[^z]
Not sure if thats what you're testing, it sounds like the main thing is just group-anything you plan on keeping.

OK, so I don't have to match the complete file name then, like Stefan said in the second post? That was my main point there: Without a quantifier it shouldn't match any full file name.

Going back to my original query about the caret for negated matching... I tried the same on a directory with just standard characters in the file names, and it is still not playing ball. I'm trying to replace upper-case 'A' this time.
Match: ([a-zB-Z0-9 -_]*)([^a-zB-Z0-9 -_]+)([a-zB-Z0-9 -_]*)(.*)
Replace: \1 X \2 Y \3 Z \4
Result:
atdcm64a.sys atdcm64a.sys
oneATILog.dll oneATILog.dll
twoATIManifestDLMExt.dll twoATIManifestDLMExt.dll
ATISetup.exe ATISetup.exe
DetectionManager.dll DetectionManager.dll


But if I go for lower case 'a' as well:
Match: ([b-zB-Z0-9 -_]*)([^b-zB-Z0-9 -_]+)([b-zB-Z0-9 -_]*)(.*)
Replace: \1 X \2 Y \3 Z \4
Result:
atdcm64a.sys X a Y tdcm64 Z a.sys
oneATILog.dll oneATILog.dll
twoATIManifestDLMExt.dll twoATIM X a Y nifestDLMExt Z .dll
ATISetup.exe ATISetup.exe
DetectionManager.dll DetectionM X a Y n Z ager.dll

Capital 'A' in this case is never replaced as non-matched by the expression.
Could well be something on my computer that is different. I'm running Windows 7 64 bit with UK locale and NTFS file system, so it shouldn't be that special really...
t000ny
 
Posts: 5
Joined: Mon Feb 10, 2014 11:56 am

Matching Negated Sets

Postby truth » Sat Feb 15, 2014 11:31 am

Correct, you don't have to match the entire filename, but NotMatched=NotGrouped=Lost
So it's usually desirable to be as match-specific as possible.

Remember, the [^z] example is equivalent to .*[^z].*
That means the entire fullname IS matched (except names like zz, zzzzzzzzzzz, etc)
But without grouping: ReplaceBox becomes NewName for all files (except above examples)

Note this is BRU's general match-behavior; it's not from using negated sets.
Place e into the MatchBox & you get the same .*e.* match-behavior.
For filenames with e anywhere: all-chars matched, but not-grouped, & lost to ReplaceBox.

I can't say for certain why the regex fails on your system.
All I can do is copy the forum-chars, & hope they're the correct ones to work with.
You could create filenames from forum-chars, just to verify we're using the same chars?

I'm using default code-page 437, both Vista&XP, 32bit, & BRU v2.7.0.2
I strongly suspect a version issue, as v2.7.1.2 is famous for that 'unknown if not-part of set' feature.
In fact, it can fail to regex-match any filename with such chars, even with .* as your match!

I should leave this alone for now, since we really need to determine your version# first.
I do have a copy of v2.7.1.2, so no probs testing with it, if needed.

Generally speaking however, CharTrans is BRU's equivalent to the s|old|new|g commands,
while its MatchBox sets the $text-variable (matched-names) to then be substituted.
The benefit, of course, being a substitute-command that doesn't alter filename-matching.

The last example looks like it would fail because of the [ -_] set usage.
That's a very far-reaching 'set': unescaped, space-dash-underscore is a range that includes SubSets like A-Z, 0-9, etc!
Basically, it includes just about every printable ASCII char besides lowercase letters, the backtick & {|}~
So [^ -_B-Z0-7] never matches any uppercase letter or digit (since [^ -_] already excludes A-Z0-9).
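
If you want to check exactly what the space-to-underscore range swallows, a few lines of Python will list it. This is just a sketch that walks the ASCII code points; the names `covered` and `not_covered` are made up for illustration.

```python
# Every character covered by the class range " -_"
# (space, U+0020, through underscore, U+005F).
covered = {chr(code) for code in range(ord(" "), ord("_") + 1)}

# Printable ASCII characters that the range does NOT cover.
not_covered = [chr(code) for code in range(0x20, 0x7F) if chr(code) not in covered]

print(len(covered), "characters covered")
print("A" in covered, "7" in covered, "." in covered)  # upper case, digit, dot
print("".join(not_covered))                            # what is left over
```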

I probably added to the confusion with my 1st post, here's a much cleaner version:
([ -_a-z]*)[^ -_a-z]+([ -_a-z]*)(.*)
\1X\2\3
It processes identically, without the redundant A-Z0-9's (on version 2.7.0.2!)
I've prob gone too far already, not knowing your version, but it shouldn't matter in this last case.
For now, it's prob wise to await confirmation before going any further.
truth
 
Posts: 221
Joined: Tue Jun 25, 2013 3:39 am
Location: Earth, OrionArm, MilkyWay

Re: Negative matching

Postby t000ny » Mon Feb 17, 2014 12:44 pm

Thanks for the reply truth!
Feeling stupid here for missing the accidental (but obvious) range in the last regexp. The dash should of course be escaped:
Match: ([a-zB-Z0-9 \-_]*)([^a-zB-Z0-9 \-_]+)([a-zB-Z0-9 \-_]*)(.*)
and that gives the correct results. :oops:
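
For comparison, here is the same difference shown with Python's `re` module instead of BRU (so it says nothing about BRU's own engine). The variable names `unescaped` and `escaped` are made up; the two character classes are the negated sets from the expressions above.

```python
import re

unescaped = re.compile(r"[^a-zB-Z0-9 -_]")  # " -_" is read as a range
escaped = re.compile(r"[^a-zB-Z0-9 \-_]")   # "\-" is a literal dash

name = "ATISetup.exe"

# With the accidental range, 'A' (and '.') sit inside the set, so nothing is "strange".
print(unescaped.search(name))

# With the dash escaped, 'A' falls outside the set; so does '.', which is not listed.
print(escaped.findall(name))
```

Note that with the escaped dash the dot is no longer covered either, so it would need to be listed explicitly if it should count as a standard character.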
I am indeed running the latest version here: 2.7.1.2. I had a look on the web site but could not find any archive of previous versions, so I can't try an earlier one.

I copy-pasted the forum text to create new files with those names, and still no luck:
Match: ([A-Za-z0-9 \-_]*)([^A-Za-z0-9 \-_]+)([A-Za-z0-9 \-_]*)(.*)
Replace: \1X\3\4
Result:
01 - Blinkar Blå (singelversion).mp3__01 - Blinkar Blå (singelversion).mp3
02 - Från min radio.mp3___________02 - Från min radio.mp3
03 - Mr Jones maskin.mp3_________03 - Mr Jones maskin.mp3
04 - Kroppens automatik.mp3______04 - Kroppens automatik.mp3
05 - Krafter vi aldrig känner.mp3___05 - Krafter vi aldrig känner.mp3
06 - Stockholmsserenad.mp3______06 - Stockholmsserenad.mp3
07 - Vidare.mp3_________________07 - Vidare.mp3
08 - Bärande våg.mp3____________08 - Bärande våg.mp3
09 - 5-e avenyn.mp3_____________09 - 5-e avenyn.mp3
10 - Blinkar blå (maxiversion).mp3_10 - Blinkar blå (maxiversion).mp3

Nothing changed. So it is probably, like you say, a problem with 'unknown if not-part of set'. The characters I'm struggling with here are:
Code 228: ä
Code 229: å
Code 246: ö
So they are outside the normal 7-bit ASCII range (these are their Latin-1 codes).
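
A quick way to double-check those numbers is to ask Python for the code points. This is only a sketch to confirm the values and show how the same characters look in two common encodings; it does not say which encoding BRU uses internally. The name `codes` is made up.

```python
# Code points of the three characters listed above.
codes = {ch: ord(ch) for ch in "äåö"}

for ch, code in codes.items():
    # 7-bit ASCII stops at 127, so all three fall outside it.
    # In Latin-1 each is a single byte; in UTF-8 each takes two bytes.
    print(ch, code, code > 127, ch.encode("latin-1"), ch.encode("utf-8"))
```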
t000ny
 
Posts: 5
Joined: Mon Feb 10, 2014 11:56 am

2.7.1.2 Updated CharSets ?

Postby truth » Mon Feb 17, 2014 9:04 pm

Okay, thanks for verifying, we're definitely dealing with the same chars.
Version 2.7.0.2 is listed under 'Other Files' as a 'No-Installer' version.
They keep it avail for Win98/Me users, but I've never had a problem running it.

I'm now using 2.7.1.2 trying to see if there's a way to duplicate the results.
It doesn't seem likely; here are some test results I got:
(.*)
--\1--

It refused to match any name with those chars!
If .* can't match them, I doubt anything else (in the match-expression) would help.

Solutions might involve changing locale, modding .dll's or other system-settings.
But so long as .* can't recognize the chars, I wouldn't try modifying the match.
I couldn't find any extra settings within 2.7.1.2 to allow the match.

Take that with a grain of salt, I never use this version, so perhaps I'm missing something?
But the logic holds true: So long as (.*) won't match, the culprit isn't your match-expression.
Hopefully I'm wrong, I refuse to accept there's no way around this, yet still no 2.7.1.2 solution.

I did try negating ScriptChars as /P{Script}, but in all cases, the results were identical to above.
Filenames with such chars were always unmatchable (regardless of match-expression).
It's prob some blacklist-vs-whitelist method of how the underlying code matches chars as 'PartOfSet' ?

I also tried the reverse:
Placing å into match (even as 0-occurrence) guaranteed that NO filename could be matched!
I tried a few other things too, mostly just out of curiosity, all to no avail.
So far it seems like CharTrans is the best solution for 2.7.1.2
truth
 
Posts: 221
Joined: Tue Jun 25, 2013 3:39 am
Location: Earth, OrionArm, MilkyWay

