Need help with complicated renaming please

A swapping-ground for Regular Expression syntax

Re: Need help with complicated renaming please

Postby trm2 » Thu Feb 13, 2020 9:09 pm

The original author of BRU and BRC was Jim Willsher. He sold them to TGRMN Software some time ago. I would guess
that the author here is now a programming team and not one person, depending on the size of the company.
They are based in Australia, although the program and the email are still UK. That is why I have some 'English'
humor in the book.
trm2
 
Posts: 156
Joined: Wed Jan 15, 2020 12:47 pm

Re: Need help with complicated renaming please

Postby bru » Thu Feb 13, 2020 9:57 pm

Gotcha, I truly believed that Admin here was probably Jim Willsher. Thanks for the info.
I'll post the logic behind the regexes soon.
bru
 
Posts: 62
Joined: Wed Jan 31, 2018 7:35 pm

Re: Need help with complicated renaming please

Postby trm2 » Fri Feb 14, 2020 11:37 pm

Post logic -

Beat you to it - took about a day but now I understand BRC (fast learner but equally fast forgetter)

Excerpt Taken from Volume 2 - from your entry (all formatting removed because it doesn't show up in forum except as garbled together):

I didn't post the whole analysis .. but it is complete in the book.

Your RegEx works for the pattern of two or more uppercase letters followed by <hyphen>.
However, your RegEx does not work for the pattern:

Uppercase letters followed by <space> <hyphen> <space>, or similar variations (the alternatives listed
further down show what I mean).

i.e.

CAN - FirstLast.docx


I also had to change your [|] to [}] because the | is an illegal character in a filename.

The key is in reg1

^([A-Z]{2,})-([A-Z][^A-Z -]*?)([A-Z].*)$

you have to add an alternative for the possible:

<hyphen> | <hyphen> <space> | <space> <hyphen> | <space> <hyphen> <space>

Something similar.

I will await your new RegEx .. until then ….
------------------------------------------

Tim Notes -

General Analysis of the Batch file:

1. cd - Change to the folder containing the files to rename
2. set - Assign a RegEx to each environment variable: reg1, reg2, max10, reg3 and reg4
3. brc32 - Executes the BRC program
4. /Pattern:*-*.docx - Process all Word docx files in the directory whose names contain a hyphen
5. %reg1% %max10% %reg3% %reg4% - Execute all RegEx in the order specified
6. /Execute - This parameter performs the actual renaming. Leaving it off gives a text preview of which
files would be renamed and which would not.

Notes:

1. reg2 is executed once for each additional name found in the string, up to 9 times (max10 holds nine copies of reg2)
2. Each RegEx carries its Replace string at the end, after the colon (Match:Replace)



Tim Analysis:


The RegEx breaks down as:

reg1:
Match: ^([A-Z]{2,})-([A-Z][^A-Z -]*?)([A-Z].*)$
Replace: \1}\2 \3

reg2:
Match: ^(.*[}][^-]*)-([A-Z][^A-Z -]*?)([A-Z].*)$
Replace: \1, \2 \3

reg3:
Match: (.*[}].*), (.*)
Replace: \1 and \2

reg4:
Match: (.*)[}](.*)
Replace: \1 - \2


reg1 executes first followed by max10, reg3 and reg4.

max10 acts a bit like a subroutine call (GOSUB/RETURN in BASIC terms, CALL in Batch terms): control is passed to another RegEx,
and when that RegEx finishes, control is returned and processing continues. Strictly speaking, %max10% simply expands to nine
copies of %reg2% on the command line, so reg2 is tried nine times in a row. Each run that still finds a hyphen makes one more change.
Once the string is exhausted (no more matches) the remaining runs have no effect, and reg3 and reg4 then execute.


Using sample: UK-PeterFord-JeremyJordan-LisaWendt.pdf (to show the flexibility I used a different sample)

After running the batch file and with /Execute removed, it displays a preview:

image missing

To rename the file, rerun the batch file with the /Execute restored.

Final result:

UK - Peter Ford, Jeremy Jordan and Lisa Wendt.pdf
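For anyone who wants to trace the chain outside BRC, here is a minimal Python sketch of the same four steps. It assumes that Python's re module behaves like BRC's regex engine for these particular patterns, that each RegEx is applied to the name without its extension, and that a RegEx which does not match leaves the name unchanged. The function name rename_chain is made up for illustration.

```python
import re

# The four Match/Replace pairs from the analysis, using } as the tag.
REG1 = (r"^([A-Z]{2,})-([A-Z][^A-Z -]*?)([A-Z].*)$", r"\1}\2 \3")
REG2 = (r"^(.*[}][^-]*)-([A-Z][^A-Z -]*?)([A-Z].*)$", r"\1, \2 \3")
REG3 = (r"(.*[}].*), (.*)", r"\1 and \2")
REG4 = (r"(.*)[}](.*)", r"\1 - \2")

# Same order as the command line: reg1, nine copies of reg2, reg3, reg4.
CHAIN = [REG1] + [REG2] * 9 + [REG3, REG4]


def rename_chain(stem):
    """Apply every regex in order; a regex that does not match changes nothing."""
    for match, replace in CHAIN:
        stem = re.sub(match, replace, stem, count=1)
    return stem


print(rename_chain("UK-PeterFord-JeremyJordan-LisaWendt"))
print(rename_chain("UK - Peter Ford"))  # already renamed, so left alone
```

The first call should print the same final name as above (minus the extension), and the second should come back untouched because reg1 never tags it.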




Using sample: UK-PeterFord-JeremyJordan-LisaWendt.pdf


reg1 –

Match: ^([A-Z]{2,})-([A-Z][^A-Z -]*?)([A-Z].*)$
Replace: \1}\2 \3

image missing

Where:

\1 =UK
\2 = Peter
\3 =Ford-JeremyJordan-LisaWendt


Capture Group 1:
([A-Z]{2,})
Capture Group 2:
([A-Z][^A-Z -]*?)
Capture Group 3:
([A-Z].*)


1. ^
Start at beginning of string.

2. [A-Z]
Match against a class consisting of an uppercase letter
Capture Group 1 = U

3. {2,}
Limits the repetitions to 2 or more. This means that in order to be TRUE, the RegEx must match against at
least two uppercase letters in sequence.= TRUE = UK.
Capture Group 1 = UK

4. -
<hyphen> (Not Captured) followed by ..

5. [A-Z]
Match against a class consisting of an uppercase letter
Capture Group 2 = P

6. [^A-Z -] Negated Class.
Matches any single character that is NOT an uppercase letter, a space or a hyphen.
Capture Group 2 = Pe

7. *
Repeat the class zero or more times (Greedy on its own)
Capture Group 2 = Peter

8. ?
But not too Greedy (Non-Greedy Metacharacter): match as few characters as possible, expanding only as needed
Capture Group 2 = P

9. [A-Z]
Match against a class consisting of an uppercase letter
Capture Group 3 = F
Changes Capture Group 2 = Peter

10. .*
Capture remaining string
Capture Group 3 = Ford-JeremyJordan-LisaWendt

11. $
Must be at the end of the string (the end of the file name)

Conclusion: reg1-
a. Substitutes a } for the first hyphen after the first two uppercase letters in string
b. Space is added between first run-on word group by Replacement String.
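The capture groups listed for reg1 can be checked with a short Python sketch. This assumes Python's re module treats the pattern the same way BRC does. The second pattern, with a plain greedy * in group 2, is my own variation added only for comparison.

```python
import re

SAMPLE = "UK-PeterFord-JeremyJordan-LisaWendt"

# reg1 exactly as analysed, with the lazy *? in group 2.
lazy = re.match(r"^([A-Z]{2,})-([A-Z][^A-Z -]*?)([A-Z].*)$", SAMPLE).groups()

# Variation (not from the batch): greedy * in group 2.
greedy = re.match(r"^([A-Z]{2,})-([A-Z][^A-Z -]*)([A-Z].*)$", SAMPLE).groups()

print(lazy)
print(greedy)
```

For this sample both versions capture the same three groups, because the negated class cannot cross an uppercase letter either way; the lazy ? only changes the order in which the engine tries the lengths.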



reg2 –

Match: ^(.*[}][^-]*)-([A-Z][^A-Z -]*?)([A-Z].*)$
Replace: \1, \2 \3


Run #1 Sample after reg1 = UK}Peter Ford-JeremyJordan-LisaWendt.pdf

missing image

Where:
\1 =UK}Peter Ford
\2 = Jeremy
\3 =Jordan-LisaWendt

missing image

Run #2 Sample after reg2 Run #1 = UK}Peter Ford, Jeremy Jordan-LisaWendt.pdf

missing image

Where:
\1 = UK}Peter Ford, Jeremy Jordan
\2 = Lisa
\3 = Wendt

missing image

After Run #2 – String is Exhausted – No more Renames. Will return back and Process reg3 next.

missing image

Conclusion: reg2 –
a. Substitutes a comma and space for each remaining hyphen after the } tag that reg1 inserted.
b. Adds spaces between all subsequent run-on word groups (Uppercase words) (up to 9 runs of reg2).
c. Each run of reg2 performs the ‘substitute comma, add space’ moving forward through the string from
left to right until the string is exhausted – no more renames can be done – RegEx fails.
d. Space is added between current run-on word group by Replacement String in each run.
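The run-by-run behaviour of reg2 can be traced with a small Python sketch. It assumes Python's re module matches the way BRC does for this pattern, and the helper name trace_reg2 is hypothetical. It stops as soon as a run fails to match, which gives the same result as letting the remaining runs do nothing.

```python
import re

REG2_MATCH = r"^(.*[}][^-]*)-([A-Z][^A-Z -]*?)([A-Z].*)$"
REG2_REPLACE = r"\1, \2 \3"


def trace_reg2(stem, max_runs=9):
    """Return the captured groups of each successful run and the final name."""
    runs = []
    for _ in range(max_runs):
        m = re.match(REG2_MATCH, stem)
        if m is None:  # string exhausted: later runs would change nothing
            break
        runs.append(m.groups())
        # The match spans the whole name, so the expansion is the new name.
        stem = m.expand(REG2_REPLACE)
    return runs, stem


runs, final = trace_reg2("UK}Peter Ford-JeremyJordan-LisaWendt")
for number, groups in enumerate(runs, start=1):
    print("Run", number, groups)
print(final)
```

On the sample, only two runs match before the hyphens are used up.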




Original sample: UK-PeterFord-JeremyJordan-LisaWendt.pdf
Current sample: UK}Peter Ford-JeremyJordan-LisaWendt.pdf (After Reg1)
After reg2 Run #1: UK}Peter Ford, Jeremy Jordan-LisaWendt.pdf

reg2 –
Match: ^(.*[}][^-]*)-([A-Z][^A-Z -]*?)([A-Z].*)$
Replace: \1, \2 \3



Where:
\1 =UK}Peter Ford
\2 = Jeremy
\3 =Jordan-LisaWendt

Capture Group 1:
(.*[}][^-]*)

Capture Group 2
([A-Z][^A-Z -]*?)

Capture Group 3
([A-Z].*)

1. ^
Start at beginning of string

2. .
Match against single character
Capture Group 1 = U

3. *
Make it Greedy until …
Capture Group 1 = entire string

4. [}]
.. Match against a class consisting of right brace character
Capture Group 1 = UK}

5. [^-]
Match any single character that is not a hyphen
Capture Group 1 = UK}P

6. *
Make it Greedy until …
Capture Group 1 = UK}Peter Ford

7. -
.. <hyphen> (Not Captured) followed by ..

8. [A-Z]
Match against a class consisting of an uppercase letter, followed by ..
Capture Group 2 = J

9. [^A-Z -] Negated Class.
Matches any single character that is NOT an uppercase letter, a space or a hyphen.
Capture Group 2 = Je

10. *
Repeat the class zero or more times (Greedy on its own)
Capture Group 2 = Jeremy

11. ?
But not too Greedy (Non-Greedy Metacharacter): match as few characters as possible, expanding only as needed
Capture Group 2 = J

12. [A-Z]
Match against a class consisting of an uppercase letter
Capture Group 3 = J
Changes Capture Group 2 = Jeremy

13. .*
Capture remaining string
Capture Group 3 = Jordan-LisaWendt

14. $
Must be at the end of the string (the end of the file name)

Before reg2 runs: UK}Peter Ford-JeremyJordan-LisaWendt.pdf
After the string is exhausted: UK}Peter Ford, Jeremy Jordan, Lisa Wendt.pdf

Original sample: UK-PeterFord-JeremyJordan-LisaWendt.pdf
Current sample: UK}Peter Ford, Jeremy Jordan, Lisa Wendt.pdf (After Reg2 exhausts)
After reg3: UK}Peter Ford, Jeremy Jordan and Lisa Wendt.pdf

etc.

------------------------------------------------

Don't forget to work on fixing that problem .. thanks.
trm2
 
Posts: 156
Joined: Wed Jan 15, 2020 12:47 pm

Re: Need help with complicated renaming please

Postby bru » Sat Feb 15, 2020 12:47 pm

Hi, Sorry about that.. Promise I was working on it, I can still post if you want.
Its pretty much done, not the best format, but nothing is colored to show groups, etc.
Explaining these things is at least 100x harder than just 'getting it done'.
My hat goes off to you on creating the manuals, its greatly appreciated.

Yes, I specifically tailored it to not match: "CAN - FirstLast.docx" (or any name with spaces).
The original poster's file-format was CAN-FirstLast-FirstLast.docx with either 2-or-3 names.
My goal was to prevent touching previously renamed files, but its just as doable with a SpaceDashSpace tagger.
I also beefed-it-up to match Only1Name, but never thought about the "CAN - Name" possibility.

In fact, I wrote another one to filter-out names like JuliaLouis-Dreyfus (would be improperly named).
Basically, it just ensures that only 2Uppercase may precede any - in filename (or end of filename).
Could you please post a link to the problem, I cant seem to find it.. My very old eyes need a bigger smart-phone.
bru
 
Posts: 62
Joined: Wed Jan 31, 2018 7:35 pm

Re: Need help with complicated renaming please

Postby trm2 » Sat Feb 15, 2020 3:36 pm

It was not one of the original poster's but a test file of mine that I was using to see if your script
could be made more universal to handle those types of files as well. If you can do it fine, but if you can't that's fine too. Let me know. Also as I noted I had to change your
[|] to [}] because the | is an illegal character. Curious though - how did you get it to work using the | as you have in your finished script? - I don't know what country you are
from but does a Windows version in another country work differently?

Another script request:

Here is another question on a related topic of run-on word groups -
IFeelTheEarthMove

(viewtopic.php?f=3&t=3061&p=7962&hilit=ifeeltheearthmove#p7962)

Again, I only solved this in a limited way using BRU (the other poster's RegEx didn't work). Same problem: it can handle that example (5 word groups) and a couple more, and then of course I run out of
capture groups. Admin supplied a successful JavaScript, and of course it can easily be done using Word Space from section #7 (Add); however, I need to present alternatives, not just one method
that works, if at all possible, because that is how I want people to learn.

With that being said, I tried to adapt your BRC script that handled multiple names and provided the needed spaces on each run using %max10%. That script relied on the hyphen as well as the
Uppercase letters to define the point of separation. This example is just run together with no hyphens. The attempts I tried failed.

My script either worked only for the specific example, 'I Feel The Earth Move', and so would not handle added words (5 words, 7 words, 10 words, etc.) as your script had done, or I would end up with
'I <multiple spaces>FeelTheEarthMove', or 'I Feel TheEarthMove', or other variations, no matter how many extra runs. The problem as I see it is that, instead of ignoring what had already been spaced and moving on to the
next run-on word group in the string on each subsequent run, as your original BRC script does, it would always start at the beginning, including I or Feel. This is because it had
nothing to lock onto, such as the hyphen. I deleted most of the attempts so I can only tell you the results. I would really like the BRC version so I can study it to see how you solved this problem.

Again, I am only using the 5 run-word groups as an example - I would like to be able to have it many more as your original script could - yours used reg2 run multiple times in max10 - I have fully analyzed it so I know
how it works but I can't make it adapt.

If you could, when we finish up here, link to that (I don't want to carry on a conversation about someone else's post; I don't think Admin would appreciate it).
trm2
 
Posts: 156
Joined: Wed Jan 15, 2020 12:47 pm

Re: Need help with complicated renaming please

Postby bru » Sat Feb 15, 2020 5:56 pm

Hi, just wanted to give a quick reply to making it also work with COUNTRY - FirstName
The 1st regex should be: /Regexp:"^([A-Z]{2,}) *- *([A-Z][^A-Z -]*?)([A-Z].*)$:\1|\2 \3"
aka:
^([A-Z]{2,}) *- *([A-Z][^A-Z -]*?)([A-Z].*)$
\1|\2 \3
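
A quick way to see what the added " *" on each side of the hyphen changes is a small Python sketch. It assumes Python's re module agrees with BRC on these patterns; the helper name tag and the test names are made up.

```python
import re

OLD = r"^([A-Z]{2,})-([A-Z][^A-Z -]*?)([A-Z].*)$"
NEW = r"^([A-Z]{2,}) *- *([A-Z][^A-Z -]*?)([A-Z].*)$"
REPLACE = r"\1|\2 \3"


def tag(pattern, stem):
    """Apply the first regex; names that do not match come back unchanged."""
    return re.sub(pattern, REPLACE, stem, count=1)


for stem in ["CAN-FirstLast", "CAN - FirstLast", "CAN- FirstLast", "CAN -FirstLast"]:
    print(repr(stem), "->", repr(tag(NEW, stem)))

# The old pattern skips the spaced form, and the new one still skips
# a name that already has a space between First and Last.
print(repr(tag(OLD, "CAN - FirstLast")))
print(repr(tag(NEW, "CAN - First Last")))
```

All four spacing variants should get tagged by the new pattern, while a previously renamed "CAN - First Last" stays protected because group 2 still refuses to cross a space.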

I'll get back to you on the other stuff soon.. Again, thanks for all that you're doing!
bru
 
Posts: 62
Joined: Wed Jan 31, 2018 7:35 pm

Re: Need help with complicated renaming please

Postby trm2 » Sat Feb 15, 2020 9:53 pm

Thanks.

Worked like a charm. I will analyze it this week. Now I will leave this post and continue in the other.
trm2
 
Posts: 156
Joined: Wed Jan 15, 2020 12:47 pm

Re: Need help with complicated renaming please

Postby bru » Mon Feb 17, 2020 6:36 am

OK.. Finally done with the coloring.. Major pain in the ***. Sorry I didnt get to all your questions.

Yes, I know were not supposed to put | in names.. I just do it cuz its unique, but I never leave them that way.
I thought you were pulling my leg with that 'illegal' character bit, so I googled it.. Turns out you're right.
You know, I'd seriously consider leaving any country that wastes its time passing such ridiculous laws.

Even if they seized your computer, how do they prove it was actually you who did the naming?
I hate to ask, but what's the penalty? Do they fine you for every little | char they find?
If you ask me, those politicians need a | stuck up their ***, a really big one!

Global-warming, poverty, wars... No time for all that.. We've got citizens putting | into their filenames!!
Sorry, I couldnt resist.. Your post gave me quite the chuckle.. Had to get you back.
There's definitely no version of Windows that allows | in filenames.

Anywho, here's the batch in its original form, along with my best effort in trying to describe it:

@echo off
cd "C:\YourFolderPath"
set reg1=/Regexp:"^([A-Z]{2,})-([A-Z][^A-Z -]*?)([A-Z].*)$:\1|\2 \3"
set reg2=/Regexp:"^(.*[|][^-]*)-([A-Z][^A-Z -]*?)([A-Z].*)$:\1, \2 \3"
set max10=%reg2% %reg2% %reg2% %reg2% %reg2% %reg2% %reg2% %reg2% %reg2%
set reg3=/Regexp:"(.*[|].*), (.*):\1 and \2"
set reg4=/Regexp:"(.*)[|](.*):\1 - \2"
brc32 /Pattern:*-*.docx %reg1% %max10% %reg3% %reg4% /Execute
pause>nul

BRC's commandline is basically: brc32 /Pattern /Reg1 /Reg2(many-times) /Reg3 /Reg4 /Execute

BATCH TECHNIQUE ***
If I plan on using many regexes, or repeating 1 many times, I usually set them as a variable.
Setting var=aaabbbcccddd just lets you type %var% instead of aaabbbcccddd wherever you want.
It makes the brc-line easier to read/edit, especially when it comes to rearranging the regex-order.

Another very helpful technique, since /Pattern is so limited in its file selection:
I use the 1st-regex as a file-matcher by inserting | or some other illegal filename character.
Then make all subsequent regexes match for | as they continue processing the filename in memory.

This ensures that nothing in your regex-chain will touch anything the 1st-regex didnt match.
You can even get fancy, having each regex add another | as a true-condition for the next regex, & so on.

Nothing ever gets renamed until the /Execute param, everything else processes in memory.
You can put multiple /Execute's on brc's commandline, but I rarely use it that way.

RENAME LOGIC ***
One thing I try to do is think ahead about NewName-formats & design my 1st-regex to NOT match them.
That way, you dont have worry about protecting already-renamed-files by moving into other folders.
Its not always possible, but definitely worth the effort, especially when helping newcomers.

BRC's regex is specified as /Regexp:"Match:Replace"
I figure its easier to convert the batch regexes into the BRU-format usually posted here.
In all cases, the below equivalents exactly match those in the batch (colon/quotes removed).

When Match begins with ^ it means: Match-only-as-beginning-text.
If the Match ends with $ it means: Match-only-as-ending-text.
Please disregard commas in the descriptions, they're just for readability.
When you see OrigName --> NewName, that's just whats happening in memory at the time.


%reg1%
^([A-Z]{2,})-([A-Z][^A-Z -]*?)([A-Z].*)$
\1|\2 \3
Match: (2orMoreUppercase)-(1Uppercase,AnythingBesides[Uppercases,Spaces,Dashes] UntilNext)(1Uppercase,Anything)
Replace: Group1|Group2 Group3
GER-MartinSchmidt-NinaWagner-RonaldStanford-JoeMahoney ---------> GER|Martin Schmidt-NinaWagner-RonaldStanford-JoeMahoney

Since we create a space in NewName (it'll stay there) the AnythingBesides[Uppercases,Spaces,Dashes] ensures we cant touch previous renames.
The | in replace 'tags' the filename.. All future regexes will seek it in their match.. They cant process anything this one doesnt tag.
I figure most songs only have 1 ArtistName, so this also converts the 1st-occurrence of ArtistName --> Artist Name


%reg2% --- BRC processes this regex 9 times as %max10%
^(.*[|][^-]*)-([A-Z][^A-Z -]*?)([A-Z].*)$
\1, \2 \3
Match: (AnythingTillLast|,AnythingBesides[Dashes]Until)FirstDash(1Uppercase, AnythingBesides[Uppercases,Spaces,Dashes] UntilNext)(1Uppercase,Anything)$
Replace: Group1, Group2 Group3
Run1: GER|Martin Schmidt-NinaWagner-RonaldStanford-JoeMahoney -----> GER|Martin Schmidt, Nina Wagner-RonaldStanford-JoeMahoney
Run2: GER|Martin Schmidt, Nina Wagner-RonaldStanford-JoeMahoney ---> GER|Martin Schmidt, Nina Wagner, Ronald Stanford-JoeMahoney
Run3: GER|Martin Schmidt, Nina Wagner, Ronald Stanford-JoeMahoney -> GER|Martin Schmidt, Nina Wagner, Ronald Stanford, Joe Mahoney
Run4: No effect, the Dash inbetween (Group1)-(Group2) can no longer match

Each run converts the 1stDash into CommaSpace, then 1Space is inserted between Group2(Firstname) & Group3(LastnameEverythingElse)
The AnythingBesides[Uppercases,Spaces,Dashes] is too strict, it could've been AnythingBesides[Uppercases] or even .*?
At the time I was concerned about matching names like JuliaLouis-Dreyfus.. I decided against it, but left %reg2% over-complicated.
I can be abit air-headed sometimes.. If I were concerned of such names, I shouldnt 'tag' them in the 1st-place!
A much easier to read %reg2% match would be:
^(.*[|].*)-([A-Z].*?)([A-Z].*)$

If you want, test it out against your massive list of names, I think its easier to follow.
I'm used to overkilling as a precaution.. Its safer, but hurts readability.. A couple more bad habits:
Originally you couldnt lazy-match .*?X to find the 1st-X, so we had to use [^X]*X instead.. Identical, just harder to read.
Also, by dashes I do mean hyphen, I know its the correct name.. Old habits die hard.
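
For anyone who wants to compare the two %reg2% matches side by side, here is a rough Python sketch. It assumes Python's re module behaves like BRC for these patterns, and the helper name run_steps is made up.

```python
import re

ORIGINAL = r"^(.*[|][^-]*)-([A-Z][^A-Z -]*?)([A-Z].*)$"
SIMPLER = r"^(.*[|].*)-([A-Z].*?)([A-Z].*)$"
REPLACE = r"\1, \2 \3"


def run_steps(pattern, stem, runs=9):
    """Apply the same regex repeatedly and record each changed name."""
    steps = []
    for _ in range(runs):
        new = re.sub(pattern, REPLACE, stem, count=1)
        if new == stem:  # no match, nothing left to do
            break
        steps.append(new)
        stem = new
    return steps


start = "GER|Martin Schmidt-NinaWagner-RonaldStanford-JoeMahoney"
original_steps = run_steps(ORIGINAL, start)
simpler_steps = run_steps(SIMPLER, start)

print(original_steps[0])
print(simpler_steps[0])
print(original_steps[-1] == simpler_steps[-1])
```

On this sample both versions end on the same name after three runs. The one visible difference is the order: the greedy .* in the simpler match grabs the last hyphen first, so it works through the names from right to left.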

With names like JuliaLouis-Dreyfus, the batch creates "Julia Louis-Dreyfus", causing all future %reg2%'s to stop matching.
So you'd end up with something like: GER - First Last, First Last and Julia Louis-Dreyfus-FirstLast-FirstLast
I did write a better version that doesnt 'tag' them in the 1st place.
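
To see where the chain stalls on a hyphenated surname, the batch's four regexes can be replayed in Python. This is only a sketch: it assumes Python's re module matches like BRC here, and the input name is a made-up test case.

```python
import re

# reg1, nine copies of reg2 (max10), reg3, reg4, as in the batch.
CHAIN = (
    [(r"^([A-Z]{2,})-([A-Z][^A-Z -]*?)([A-Z].*)$", r"\1|\2 \3")]
    + [(r"^(.*[|][^-]*)-([A-Z][^A-Z -]*?)([A-Z].*)$", r"\1, \2 \3")] * 9
    + [(r"(.*[|].*), (.*)", r"\1 and \2"),
       (r"(.*)[|](.*)", r"\1 - \2")]
)


def rename(stem):
    """Run the whole chain in memory; non-matching steps change nothing."""
    for match, replace in CHAIN:
        stem = re.sub(match, replace, stem, count=1)
    return stem


print(rename("GER-FirstLast-JuliaLouis-Dreyfus-FirstLast"))
```

After "Julia Louis" is spaced, the next hyphen is followed by "Dreyfus-", which has no second uppercase letter before the following hyphen, so every later %reg2% fails and the trailing names stay run together.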


%reg3%
(.*[|].*), (.*)
\1 and \2
Match: (AnythingUntilLast|AnythingTillLast)CommaSpace(Anything)
Replaces a final CommaSpace --> SpaceandSpace
GER|Martin Schmidt, Nina Wagner, Ronald Stanford, Joe Mahoney ---> GER|Martin Schmidt, Nina Wagner, Ronald Stanford and Joe Mahoney

Without a tag, this regex could be dangerous, since any .docx filename matched by /Pattern with CommaSpace would indeed be affected.
Thats another benefit to tagging, it often simplifies secondary regexes (instead of re-checking for a complicated-match, they just match the tag).
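
A small Python sketch of why the tag matters for %reg3%. It assumes Python's re module agrees with BRC on these patterns; the UNTAGGED pattern and the file name "Budget-Draft, final" are hypothetical, chosen only because that name would pass /Pattern:*-*.docx.

```python
import re

TAGGED = r"(.*[|].*), (.*)"   # reg3 as posted
UNTAGGED = r"(.*), (.*)"      # hypothetical variant without the tag check
REPLACE = r"\1 and \2"


def apply_regex(pattern, stem):
    """Apply one regex; names that do not match come back unchanged."""
    return re.sub(pattern, REPLACE, stem, count=1)


print(apply_regex(TAGGED, "Budget-Draft, final"))       # untouched
print(apply_regex(UNTAGGED, "Budget-Draft, final"))     # gets mangled
print(apply_regex(TAGGED, "GER|Martin Schmidt, Nina Wagner"))
```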

%reg4%
(.*)[|](.*)
\1 - \2
Match: (AnythingUntilLast)|(Anything)
The tag-killer: Replace | with -
All done, no more need for the tag.. Per the brc-commandline, processing now goes from %reg4% to /Execute
GER|Martin Schmidt, Nina Wagner, Ronald Stanford and Joe Mahoney -> GER - Martin Schmidt, Nina Wagner, Ronald Stanford and Joe Mahoney


Sorry for not making it easier to read.. Its the formatting that took so long.. Kept getting lost in the text.
Trust me, typing this post was at least 100x harder than just writing some batch/brc commandlines.
If someone told me I had to write an entire manual, I'd be like "You might as well just shoot me now, cuz it aint gonna happen!"

Here are some helpful regex techniques that I dont often see posted, thought I'd mention them.
You may already have them in the manual.. Stopped reading after the case-insensitivity.. Had to try it out!


GROUP FORMAT-MATCHING
If you wanna match something like AA11BB22CC33DD44 (any repeating series of Any2UpperCase,Any2Digits):
([A-Z]{2}[\d]{2})+ Its the numbered-occurrence modifiers like + or {#} that I almost never see.
It lets you spec a groups repeating-format (just the format-itself, not the exact text).

So if you wanted to keep only the 10th occurrence of such a repeating format:
^(.*?)([A-Z]{2}[\d]{2}){10}([A-Z]{2}[\d]{2})+(.*)$ with \1\2\4
beginAA11BB22CC33DD44EE55FF66GG77HH88II99JJ00KK11end ---> beginJJ00end

If you wanted to keep everything besides the 10th occurrence when there's 11-or-more occurrences:
^(.*?)(([A-Z]{2}[\d]{2}){9})([A-Z]{2}[\d]{2})(([A-Z]{2}[\d]{2})+.*)$ with \1\2\5
beginAA11BB22CC33DD44EE55FF66GG77HH88II99JJ00KK11end --> beginAA11BB22CC33DD44EE55FF66GG77HH88II99KK11end

Since BRU doesnt support numbered-match-specifiers in the replacement, this sometimes provides a nice workaround.
Not to mention BRU's capture-limit of 9Groups.. The {9} could just as well have been {20} with extremely long repeats.
The main thing is, if you need to capture the 1st 9Groups, you need: ((Group){9})
You can also use ([A-Z]{2}[\d]{2})+ in your #12 with regex to show names with 1-or-more of the format-repeats.
Anywho, almost never see it posted.
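
Both group-format examples can be replayed in Python as a sanity check. This sketch assumes Python's re module handles repeated capture groups the way BRU does (a repeated group keeps its last repetition); the variable names are made up.

```python
import re

sample = "beginAA11BB22CC33DD44EE55FF66GG77HH88II99JJ00KK11end"

# Keep only the 10th occurrence: group 2 ends up holding the 10th repeat.
keep_10th = re.sub(
    r"^(.*?)([A-Z]{2}[\d]{2}){10}([A-Z]{2}[\d]{2})+(.*)$",
    r"\1\2\4",
    sample,
)

# Drop only the 10th occurrence: the outer group 2 wraps all nine repeats.
drop_10th = re.sub(
    r"^(.*?)(([A-Z]{2}[\d]{2}){9})([A-Z]{2}[\d]{2})(([A-Z]{2}[\d]{2})+.*)$",
    r"\1\2\5",
    sample,
)

print(keep_10th)
print(drop_10th)
```

The contrast between the two is the point of ((Group){9}): without the outer parentheses only the last repeat is captured, with them the whole run is.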

I used this method to write an improved version of this batch to filter out names like JuliaLouis-Dreyfus.
Basically, it makes sure that only 2Uppercase may precede any - in filename (or end of filename).
(excluding the first dash right after country-code)

EXACT-REPEAT MATCHING or case-insensitive matching using the (?i) modifier
Something else I rarely see is putting \1 etc into the match for exact-repeats of a previous group's match.
Using (ABCD)(anytext)(\1)(anytext) Group3 would have to be ABCD, so you could omit either \1 or \3 from the replace to kill the repeat.
Of course, if you wanted to kill the 2nd-occurrence, use \1 not (\1).
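
A backreference \N in the match repeats whatever group N captured, so repeating the first group means writing \1. Here is a short Python sketch of both forms, assuming Python's re module behaves like BRU here; the sample name is made up.

```python
import re

stem = "ABCD_live_ABCD_2020"

# Captured form: group 3 must be an exact repeat of group 1.
groups = re.match(r"^(ABCD)(.*)(\1)(.*)$", stem).groups()

# Leave \3 out of the replace to drop the second occurrence.
captured = re.sub(r"^(ABCD)(.*)(\1)(.*)$", r"\1\2\4", stem)

# Uncaptured form: bare \1 in the match, so the groups renumber.
uncaptured = re.sub(r"^(ABCD)(.*)\1(.*)$", r"\1\2\3", stem)

print(groups)
print(captured)
print(uncaptured)
```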

Hope it all makes sense.. We appreciate all that you're going through for the manuals.. Hopefully this can give you some good ideas.
Any questions, please feel free to ask.
bru
 
Posts: 62
Joined: Wed Jan 31, 2018 7:35 pm
