Using Title Enhanced Case with Roman Numerals

Bulk Rename Utility How-To's

Postby Admin » Fri Apr 23, 2021 3:54 am

Thanks. In the next update we will work on improving Roman numeral identification when using <rnup> and <rnlo>. Not every word made of X, V, M, C, I etc. is actually a valid Roman numeral, so more logic has to go into that.

Postby vshanecurtis » Fri Apr 23, 2021 4:29 am

I am glad that my post was helpful

Postby Luuk » Mon Apr 26, 2021 11:00 am

Many thanks again, vshanecurtis! I think this issue will soon be solved thanks to your post.
If you ever need help with anything about regex, please let me know and I will do my best. I also think the word-separator problem is solved!!

So first, this is my best attempt at describing how I'm matching the Roman-Numerals...
A Roman-Numeral has four sections (that I'm calling 'digits'), but only one section must exist.
Each section must come from the strings below, and when sections exist, they must be in order.

<Digit>: <------------Possible Strings------------------> ...... Match
1stDigit: M,MM,MMM,nothing.................................. M{0,3}
2ndDigit: C,CC,CCC,CD,D,DC,DCC,DCCC,CM,nothing .... (D?C{0,3}|C[DM])?
3rdDigit: X,XX,XXX,XL,L,LX,LXX,LXXX,XC,nothing ......... (L?X{0,3}|X[LC])?
4thDigit: I,II,III,IV,V,VI,VII,VIII,IX,nothing .................. (V?I{0,3}|I[VX])?

So MXX is valid because M belongs to 1stDigit and XX belongs to 3rdDigit, and missing sections are always legal.
MXXCC is invalid because M belongs to 1stDigit, XX belongs to 3rdDigit, and CC belongs to 2ndDigit, so the sections are out of order.
Believe it or not, this was the easiest part of the regex once I finally understood the Roman-Numerals (thanks again).
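
The four-section rule above can be checked on its own. The following is a minimal Python sketch, not BRU code: it assumes Python's re module, and the helper name is_roman is made up for illustration. It tests a whole word against the four optional groups in their fixed order.

```python
import re

# One optional group per "digit", in the fixed order:
# thousands, hundreds, tens, units.
ROMAN = r"M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?"
ROMAN_RE = re.compile(ROMAN, re.IGNORECASE)


def is_roman(word: str) -> bool:
    """Return True if `word` is a non-empty, well-ordered Roman numeral."""
    # Every section is optional, so the pattern also matches the empty
    # string; that case is rejected explicitly.
    return bool(word) and ROMAN_RE.fullmatch(word) is not None


for word in ["MXX", "MXXCC", "iiii", "MMMDCCCLXXXVIII"]:
    print(word, is_roman(word))
```

MXX passes because its sections appear in order, while MXXCC fails because CC (a 2ndDigit string) comes after XX (a 3rdDigit string).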

The hardest part was the word separators, because my regex still uses \b and underscore to find the word boundaries.
And the regex administrators decided that ' is not a word character, so \b treats it as a word separator, even though many languages use "contraction words".
So with names like "Stop or I'm leaving", the regex thinks that I and m are separate words, so it capitalizes both of them!

At first I was just making exceptions for common words, but then I learned that other languages also need exceptions and follow very few rules.
So instead I decided to make one main rule for myself, though I don't know if it's a good solution for most other BRU users:
Never capitalize Roman'Word or Word'Roman, but do capitalize "word1 'Roman' word2" (a Roman-Numeral can't touch another word through an apostrophe).

So I think the only users who won't like this are those with formats like word1'Roman'word2 in their filenames?
You could add some more matching to also allow 'Roman' (when ' is always on both sides), but I think that's too much?
If anybody has more ideas about this, please present them, because this really is a difficult part for me.

I do feel bad for the programmers, or whoever must decide on the rules, because it really is a difficult part.
I'm trying to invent something to satisfy everybody, but really it's not possible to satisfy everybody at the same time.
So this "v2" regex will capitalize ALL Roman-Numerals, except when a Roman-Numeral touches another word through an apostrophe...

((\b|_)(?<!\w')(?=.)M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?(?!'\w)(?=\b|_))/ig
\U$1

Lol, it looks so simple now that it's embarrassing, because it was actually hundreds of characters long during the experiments.
Then I finally discovered some things I was doing wrong with lookarounds: I was putting them in the wrong places.
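
For readers who want to try the expression outside BRU, here is a rough Python port. It is only a sketch under assumptions: Python's re module is assumed to treat \b, \w and the lookarounds the same way BRU's v2 engine does for this pattern, which has not been checked against BRU itself, and the function name upper_romans is hypothetical. Python has no \U in replacement strings, so the uppercasing is done in a callback.

```python
import re

ROMAN = r"M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?"

# Same layout as the BRU expression:
# (boundary)(lookbehind)(numeral)(lookahead)(boundary).
# The /i modifier becomes re.IGNORECASE; /g is what re.sub does by default.
V2 = re.compile(
    r"((\b|_)(?<!\w')(?=.)" + ROMAN + r"(?!'\w)(?=\b|_))",
    re.IGNORECASE,
)


def upper_romans(name: str) -> str:
    """Uppercase every Roman numeral that is not joined to a word by '."""
    # Group 1 may be empty or start with "_"; uppercasing those is harmless.
    return V2.sub(lambda m: m.group(1).upper(), name)


print(upper_romans("king henry 'vi'"))
print(upper_romans("king henry'vi'"))
```

Because every section of the numeral is optional, the pattern also produces empty matches at word boundaries; uppercasing an empty string changes nothing, so they pass through silently.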

============================================================================
So this is how it altered many different names (in each pair, the first line is the original name and the second is the result)...
beethoven's niNTH syMPHONY part iii
beethoven's niNTH syMPHONY part III

civilzation civ iii civil mimic
civilzation CIV III civil mimic

did mi mix mixing iii mimic dd civil
did MI MIX mixing III mimic dd civil

superman-from-dc-comics-costume
superman-from-DC-comics-costume

di did iii mi mimic civ civv
DI did III MI mimic CIV civv

di dii princess di princess-di lady_di princess dii
DI DII princess DI princess-DI lady_DI princess DII

i ii m _m_ _i'm_ i'm _i'm (iv) iii i m' 'm i' i 'i' 'm' i'd i'll i
I II M _M_ _i'm_ i'm _i'm (IV) III I M' 'M I' I 'I' 'M' i'd i'll I

i m 'i i' 'm m' i'm _i'm_ 'ii' 'i i' i'd 'id' i'll i'm you'd you'x
I M 'I I' 'M M' i'm _i'm_ 'II' 'I I' i'd 'id' i'll i'm you'd you'x

ii iiii m_m_i'm_ (iv) 'iii' i'd i'll you'd x xx xxx xxxxx
II iiii M_M_i'm_ (IV) 'III' i'd i'll you'd X XX XXX xxxxx

ii_ii_iiiii_xx(iv)_xxxxxx
II_II_iiiii_XX(IV)_xxxxxx

luuk_with_long_hair!d
luuk_with_long_hair!D

king henry 'vi'
king henry 'VI'

King Henry 'mmmdccclxxxviii'
King Henry 'MMMDCCCLXXXVIII'

King Henry 'mmmdccclxxxviiiii' vi
King Henry 'mmmdccclxxxviiiii' VI

Joey with Liv in the park
Joey with LIV in the park

i-needed-20-ml-of-medicine
I-needed-20-ML-of-medicine

12 cc of fluid
12 CC of fluid

princess_di princess di princess dii _princess_di
princess_DI princess DI princess DII _princess_DI

=========================================================================
These names never changed, because the Roman-Numerals are always touching another word through an apostrophe...
'i'd_i'd_i'd_i'm_eee_i'm_x'm'vi'
_i'm_i'm_i'm_i'm i'd_i'd_i'd_i'm_i'm
i'm i'm i'm i'm_i'm_i'm_i'm_i'm
joe'xxx'iii'xxx'xxx'man
king henry'vi'
king Henry'vi_
you'd better or i'll leave

=========================================================================
I also experimented with exceptions for things like "princess di", or for when cc|ml comes after numbers...
princess_di princess di princess dii _princess_di.jpg
princess_di princess di princess DII _princess_di.jpg (not di after princess)

king henry cc needs 12 cc of fluid
king henry CC needs 12 cc of fluid (not cc after numbers)

king henry-ml needs 20-ml-of-fluid
king henry-ML needs 20-ml-of-fluid (not ml after numbers)
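
The expressions used for these exceptions are not shown in the post, so the following Python sketch is only one possible way to get the same three results. Everything in it beyond the Roman-Numeral pattern is an assumption: the rule names AFTER_NUMBER and AFTER_PRINCESS, the function name, and the choice to test the text in front of each match inside a callback instead of adding more lookbehinds.

```python
import re

ROMAN = r"M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?"
V2 = re.compile(
    r"((\b|_)(?<!\w')(?=.)" + ROMAN + r"(?!'\w)(?=\b|_))",
    re.IGNORECASE,
)

# Hypothetical exception rules, tested against the text before a numeral.
AFTER_NUMBER = re.compile(r"\d[ _-]$")                      # "12 " or "20-"
AFTER_PRINCESS = re.compile(r"princess[ _-]$", re.IGNORECASE)


def upper_romans_with_exceptions(name: str) -> str:
    def repl(m: re.Match) -> str:
        matched = m.group(1)
        word = matched.lstrip("_")          # drop a captured leading "_"
        before = name[: m.end(1) - len(word)]
        lowered = word.lower()
        if lowered == "di" and AFTER_PRINCESS.search(before):
            return matched                  # not "di" after "princess"
        if lowered in ("cc", "ml") and AFTER_NUMBER.search(before):
            return matched                  # not cc/ml after a number
        return matched.upper()

    return V2.sub(repl, name)


print(upper_romans_with_exceptions("king henry cc needs 12 cc of fluid"))
```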

But really, I think users should just enter exceptions like these into the exception box instead?
So for now, just save the settings under a name like RnUp, and it's probably good enough until the next update.
Many thanks again to vshanecurtis, I think the Romans would be proud!

Postby vshanecurtis » Mon Apr 26, 2021 7:36 pm

I am glad that I was able to help. One other thing did come to mind that isn't related to the Roman Numerals issue. It's an Enhanced Title Case issue. There should be an exception that says to ignore all file names that are all caps. There are situations where a file name is capped, and done so for a reason. As an example, Rush's YYZ Instrumental. The title is capped because YYZ is the airport code for Toronto International Airport and the song is about the airport.

Postby Luuk » Mon Apr 26, 2021 11:27 pm

I think this could be a good enhancement, because some users will uppercase a word even when it's not really a rule, just for visual emphasis. So maybe just call it something like <UpWordsStay>, so that the other users don't have to worry about it.

This would be very easy with regex, but really I don't know if Case(4) is built on regex?? I wanted to invent a regex for "Title Enhanced" but didn't know the rules, so another user provided links about them. Then I discovered there were entire books just about the 'NYT Title Rules', so I never started the experiment.

I do like this as an exception rule. It's really a good suggestion, and maybe also easy to implement?
It might be better to post this in suggestions, because I don't know if it will get noticed here.

Postby vshanecurtis » Mon Apr 26, 2021 11:43 pm

I see your point; it is an oddball situation.

Postby Luuk » Tue Apr 27, 2021 9:27 am

I like the suggestion because I think other users would also like to keep certain words in capitals, and it sounds like something easy to code.
Without something like <UpWordsStay>, the only way to do this is to enter many different word-separator exceptions into the exception box.
So if the word was "YYS", you must type many things like : YYS :_YYS_:-YYS-: YYS-:-YYS :(YYS):[YYS]:_YYS : YYS_: and many other combinations.

Since ":" doesn't mean 'word-separator', using only :YYS: would always capitalize any yys, even when it's only part of a word.
So if regex is used to build these exceptions, it should be very easy to code, but of course I have no idea if that's actually true.
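
To illustrate why a regex could replace the long list of separator combinations, here is a small Python sketch. It is not how BRU's exception box works; the function name keep_upper is invented, and the choice to treat anything other than a letter or digit as a separator is an assumption.

```python
import re


def keep_upper(name: str, word: str) -> str:
    """Uppercase `word` only where it stands alone as a whole word."""
    # [^\W_] means "a word-character other than underscore", i.e. a letter
    # or digit. The two lookarounds reject those on either side, so space,
    # _, -, brackets and the ends of the name all act as separators.
    pattern = re.compile(
        r"(?<![^\W_])" + re.escape(word) + r"(?![^\W_])",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: word.upper(), name)


print(keep_upper("the_yys-mix (yys) yyss", "yys"))
```

One expression covers _YYS_, -YYS-, (YYS) and the rest, while leaving yys alone when it is only part of a longer word.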

Postby BogStandard » Fri Apr 30, 2021 2:20 pm

Hi Luuk.

I think this Roman Numeral RegEx solution is super neat. That is, I didn't understand pretty much any of it. So I decided to break it down to see if I could get my head around it and learn from your research and implementation.

For anyone else, this is my deconstruction and interpretation...

PREAMBLE:
\w means the 63 characters [A-Za-z0-9_]; in the text below it is 'word-character'.
\b means a word boundary; in the text below it is 'word-boundary'. ###
. means any single character (by default, anything except a newline).
? has several meanings depending on its position and/or neighbouring characters.

The Lookahead and Lookbehind commands begin with:
(?= which is a positive lookahead (check that what follows is something)
(?! which is a negative lookahead (check that what follows is not something)
(?<= which is a positive lookbehind (check that what precedes is something)
(?<! which is a negative lookbehind (check that what precedes is not something)
and they all close with a paired ). ##


SHORT BREAKDOWN:
RegEx(1) v2
Match
((\b|_)(?<!\w')(?=.)M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?(?!'\w)(?=\b|_))/ig

I have split the RegEx Match into six blocks:
Block 1: ((\b|_) ==> starting at a word-boundary or an underscore
Block 2: (?<!\w')(?=.) ==> there wasn't a word-character followed by ' just before here, and there is a character next
Block 3: M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])? ==> match a valid Roman Numeral
Block 4: (?!'\w) ==> there isn't an ' followed by a word-character next after here
Block 5: (?=\b|_)) ==> there is a word-boundary or an underscore next after here
Block 6: /ig ==> ignore character-case and catch all occurrences in the filename

The RegEx Replace is one part:
\U$1 ==> make the match UPPERCASE

Anything that didn't match a Roman Numeral passes through unaltered.


LONG BREAKDOWN:
RegEx(1) v2
Match
((\b|_)(?<!\w')(?=.)M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?(?!'\w)(?=\b|_))/ig

==> Block 1:
( ==> START Capture_group_1
(\b|_) ==> at a word-boundary OR underscore **

==> Block 2:
(?<!\w') ==> check the previous two characters weren't a word-character with an ' ***
(?=.) ==> check there is a character next ****

==> Block 3:
==> Capture this sequence if there is:
M{0,3} ==> none to 3 M's
==> followed by
(D?C{0,3}|C[DM])? ==> (none to one D, AND none to three C's) OR (one C AND [one D OR one M]) all this group none to one time
==> followed by
(L?X{0,3}|X[LC])? ==> (none to one L, AND none to three X's) OR (one X AND [one L OR one C]) all this group none to one time
==> followed by
(V?I{0,3}|I[VX])? ==> (none to one V, AND none to three I's) OR (one I AND [one V OR one X]) all this group none to one time

==> Block 4:
(?!'\w) ==> check the next two characters are not a ' with a word-character ***

==> Block 5:
(?=\b|_) ==> check this position is a word-boundary OR the next character is an underscore **
) ==> END Capture_group_1

==> Block 6:
/ig ==> /i makes the regular expression case-insensitive and /g globally matches the pattern repeatedly in the filename and does not stop at the first match.

Replace
\U$1 ==> \U Convert to UPPERCASE $1 Capture_group_1 (which is a valid Roman Numeral)

Anything that didn't match a Roman Numeral passes through unaltered.
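
The same breakdown can be written as a commented pattern. This is a sketch in Python, whose re.VERBOSE flag allows whitespace and # comments inside a pattern; BRU itself is not assumed to accept this layout, and the name V2_VERBOSE is made up.

```python
import re

V2_VERBOSE = re.compile(
    r"""
    (                       # START capture group 1
      (\b|_)                # Block 1: at a word-boundary OR an underscore
      (?<!\w')              # Block 2: not preceded by word-character + '
      (?=.)                 #          and at least one character follows
      M{0,3}                # Block 3: thousands
      (D?C{0,3}|C[DM])?     #          hundreds
      (L?X{0,3}|X[LC])?     #          tens
      (V?I{0,3}|I[VX])?     #          units
      (?!'\w)               # Block 4: not followed by ' + word-character
      (?=\b|_)              # Block 5: word-boundary OR underscore is next
    )                       # END capture group 1
    """,
    re.IGNORECASE | re.VERBOSE,  # Block 6: /i (re.sub already acts like /g)
)

# Replace: the equivalent of \U$1, done with a callback.
print(V2_VERBOSE.sub(lambda m: m.group(1).upper(), "civilzation civ iii civil mimic"))
```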


NOTES:
** _ is a word-character, but Luuk treats it as a word separator in his implementation
*** Checking the numeral is not joined to a neighbouring word by an ' (as in i'm or king henry'vi')
**** Checking we are not at the end of the filename (ie there is a next character, and it is NOT a newline code)

## Here the Lookahead and Lookbehind matches are NOT captured in the Capture_group_1
### From https://perldoc.perl.org/perlrebackslash#%5Cb%7B%7D,-%5Cb,-%5CB%7B%7D,-%5CB see mid-page
\b really means (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)). That is, a word-boundary is defined in terms of word-characters only.
You could substitute the \b with (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)) and the RegEx result will be the same.
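
The equivalence is easy to check by listing the positions each form matches. A small Python sketch follows (the helper name boundary_positions is hypothetical, and the list[int] annotation assumes Python 3.9 or later):

```python
import re

WORD_BOUNDARY = r"\b"
EXPANDED = r"(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))"


def boundary_positions(pattern: str, text: str) -> list[int]:
    # Both patterns are zero-width, so each match marks a position
    # between characters instead of consuming any text.
    return [m.start() for m in re.finditer(pattern, text)]


sample = "stop or i'm leaving_now"
print(boundary_positions(WORD_BOUNDARY, sample))
print(boundary_positions(EXPANDED, sample))
```

Both calls print the same list. Note that i'm contributes a boundary on each side of the apostrophe, while leaving_now has none at the underscore.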


Regards...
BogStandard
 

Postby Luuk » Wed May 05, 2021 9:58 am

Yes, I really did not completely understand word boundaries and word characters until conducting this experiment.
It's very unfortunate that the regex administrators decided to make the underscore a word character, but not the apostrophe.
To me it seems like it should be the opposite, so I really do hope they will reconsider this decision soon.

I can understand that "contraction" means "joined words", but a contraction is still just one word.
And some languages also shorten long words with an apostrophe, like Mönchengladbach ===> M'gladbach.
But I never found any language that uses the underscore to spell its words!

It almost seems like aliens studied our language to invent regex, but then made mistakes with the underscore and apostrophe.
So I spent many searches on the internet looking for groups like (?\w:') or (?\w:!_) to say things like...
"Grant the apostrophe as a word-character" -or- "Forbid the underscore as a word-character".

But everyone was saying lookarounds are the only way, so my first experiments took a very long time.
The problem was that I never wanted to alter anything inside of (boundary)(RomanGroups)(boundary).
So instead, all of my experiments looked like... (Lookahead)(boundary)(RomanGroups)(boundary)(Lookbehind).

The lookahead worked properly with (?!_*[c-x]+'\w) to say 'apostrophe can't touch a word-character on the right'.
But when I tried a lookbehind with (?<!\w'[c-x]+) to say 'apostrophe can't touch a word-character on the left', it always failed.
The regex would always report that using + was invalid, and then it would reject my whole entire expression!
(The [c-x] could have been \w, but I'm just using c-x to show visually where I'm matching the Romans).

So don't ask me why, but the lookbehind only wants to look behind one exact number of characters (never 1-or-2, or 1-or-more, etc).
I even experimented with things like {1,15} to say 1-through-15, but it always forbade ranges and only allowed one exact number.
I haven't found anything about this in the manuals, so maybe it's the regex administrators, or maybe it's a BRU limitation??

This made the experiments very difficult, because instead of just one lookbehind, I was using very many, like...
(?<!\w'[c-x])(?<!\w'[c-x]{2})(?<!\w'[c-x]{3})(?<!\w'[c-x]{4})(?<!\w'[c-x]{5})(?<!\w'[c-x]{6})(?<!\w'[c-x]{7})(?<!\w'[c-x]{8})
Except I was going all the way to {15}, just in case of needing to match the very longest possible Roman-Numeral.

Then I tried to add exceptions, but it got much longer and more complicated, so then I asked this question... viewtopic.php?f=4&t=5494
Finally, I learned to stop using (Lookahead)(boundary)(RomanGroups)(boundary)(Lookbehind), because both lookarounds must match the Romans all over again!
But with (boundary)(Lookbehind)(RomanGroups)(Lookahead)(boundary), the lookarounds only worry about matching outside the boundaries.

Hopefully this logic can help other users, because the hardest part for me to understand was where best to place the lookarounds in the expression.
And especially how lookbehinds forbid looking behind anything other than "one exact number of characters", so always forbidding things like *, +, or {1,15}.
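
The fixed-width restriction on lookbehinds is not unique to BRU: Python's re module enforces it too, so it can serve as a stand-in to show both the failure and the fix. This is only an illustration under that assumption; which engine BRU uses is not stated in this thread, and the simplified [ivxlcdm]+ class below stands in for the full Roman-Numeral groups.

```python
import re

# A lookbehind containing + has no single width, so the pattern is
# rejected when it is compiled.
try:
    re.compile(r"(?<!\w'[ivxlcdm]+)")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("rejected:", exc)

# Placed directly after the boundary, the lookbehind only needs to look at
# the two characters in front of the numeral, so a fixed width is enough.
fixed = re.compile(r"\b(?<!\w')[ivxlcdm]+(?!'\w)\b", re.IGNORECASE)
print(fixed.findall("king henry'vi' and 'vi'"))
```

Only the free-standing 'vi' is found; the one attached to henry is skipped by the two-character lookbehind.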




Also, you can change both \w ===> [a-z\d] to stop underscores from counting as word-characters, because \w behaves like ...
iii _'iii'_ _'iii' 'iii'_ a'iii'_ iii ===========> III _'iii'_ _'iii' 'iii'_ a'iii'_ III .................(because \w matching underscore as part of a word).
SomeUnicode'iii iii'SomeUnicode iii ===> SomeUnicode'iii iii'SomeUnicode III .......(because \w matching unicode word-characters)

I prefer using \w like above, because [a-z\d] can't match unicode letters, and [a-z\d] behaves like ...
iii _'iii'_ _'iii' 'iii'_ a'iii'_ iii ==========> III _'III'_ _'III' 'III'_ a'iii'_ III ...................(because [a-z\d] not matching _ as word-character)
SomeUnicode'iii iii'SomeUnicode iii ===> SomeUnicode'III III'SomeUnicode III ....... (because [a-z\d] not matching unicodes!)

So for me, I prefer \w for a regex that does "not enough", instead of [a-z\d] that does "too much".
And you could always add another (?X)expression to fix the underscores, but for now I'm still studying the unicode characters.
I'm hoping to discover a way to match unicode letters inside of [a-z\dhere], so then everything can be inside just one expression??
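
The difference between the two choices can be reproduced with a Python sketch that builds the expression with either class in the lookarounds. The names build, strict and loose are invented for this example, and Python's \w is assumed to be close enough to BRU's (for str patterns it also matches unicode letters).

```python
import re

ROMAN = r"M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?"


def build(word_class: str) -> re.Pattern:
    # `word_class` decides what counts as "a word touching the numeral
    # through an apostrophe" in the lookbehind and the lookahead.
    return re.compile(
        r"((\b|_)(?<!" + word_class + r"')(?=.)"
        + ROMAN
        + r"(?!'" + word_class + r")(?=\b|_))",
        re.IGNORECASE,
    )


def upper_romans(pattern: re.Pattern, name: str) -> str:
    return pattern.sub(lambda m: m.group(1).upper(), name)


strict = build(r"\w")       # underscore and unicode letters block a match
loose = build(r"[a-z\d]")   # only ASCII letters and digits block a match

sample = "iii _'iii'_ _'iii' 'iii'_ a'iii'_ iii"
print(upper_romans(strict, sample))
print(upper_romans(loose, sample))
```

With strict only the two free-standing iii change, and with loose every iii except the one in a'iii'_ changes, matching the two result lines quoted above.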

Postby BogStandard » Sun May 09, 2021 12:02 am

Hi Luuk.

Because I do not want to change any Roman Numerals that are enclosed in non-space characters, this is my solution:

RegEx v2
Match
((^| )M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?(?= |$))/ig

Replace
\U$1
for UPPER case and

\L$1
for lower case.

Of course, this is developed from your excellent solution.
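
A Python sketch of the space-only variant is below, for trying it outside BRU. The function names are hypothetical, and \U$1 and \L$1 are emulated with callbacks because Python's replacement strings have no case-changing escapes.

```python
import re

ROMAN = r"M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?"

# A numeral must start at the beginning of the name or after a space, and
# must be followed by a space or the end of the name.
SPACE_ONLY = re.compile(r"((^| )" + ROMAN + r"(?= |$))", re.IGNORECASE)


def upper_romans(name: str) -> str:
    return SPACE_ONLY.sub(lambda m: m.group(1).upper(), name)   # like \U$1


def lower_romans(name: str) -> str:
    return SPACE_ONLY.sub(lambda m: m.group(1).lower(), name)   # like \L$1


print(upper_romans("di dii princess di princess-di lady_di princess dii"))
print(lower_romans("King Henry VI"))
```

Unlike the \b version, princess-di and lady_di are left alone here because - and _ are not spaces.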

Regards...
BogStandard
 

Postby Luuk » Mon May 10, 2021 3:47 pm

I think your space solution is very practical for most users, because the space is the most common word separator. Also, your format lets other users specify more separators by changing 'space' ==> [ MoreSeparators], but be careful with '-': unless it comes first, like [- MoreSeparators], the regex can read it as a range.
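
The hyphen pitfall and the extended separator list can both be shown in a short Python sketch. The separator set (hyphen, space, underscore) and the names below are only examples chosen for illustration.

```python
import re

ROMAN = r"M{0,3}(D?C{0,3}|C[DM])?(L?X{0,3}|X[LC])?(V?I{0,3}|I[VX])?"

hyphen_first = re.compile(r"[- _]")  # literal hyphen, space, underscore
hyphen_mid = re.compile(r"[ -_]")    # a RANGE from space to underscore

# The range silently includes digits, capitals and much punctuation.
print(bool(hyphen_first.fullmatch("7")))
print(bool(hyphen_mid.fullmatch("7")))

SEP = r"[- _]"
MORE_SEPARATORS = re.compile(
    r"((^|" + SEP + r")" + ROMAN + r"(?=" + SEP + r"|$))",
    re.IGNORECASE,
)


def upper_romans(name: str) -> str:
    return MORE_SEPARATORS.sub(lambda m: m.group(1).upper(), name)


print(upper_romans("princess-di lady_di king di"))
```

With the hyphen first, the class contains exactly three separators; with the hyphen in the middle, "7" is accepted as a separator because it lies between space and underscore in character order.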

It's unfortunate, but I'm not having any luck matching unicode letters, even with experts saying to just replace \w ==> \p{L}.
The \p{L} should match any unicode letter, but in my experiments it only matches the non-unicode lowercase letters!
So for now I'm still using \w, but if anyone has ideas or links about how to better match unicode, they would be much appreciated.