Help please with Regex filter - mine is too greedy?

Postby psantucc » Fri Jul 23, 2021 7:23 pm

Been struggling with this filter, attempting to create a file list for export and further manipulation.

I have datasets that are delimited by " - " (space, hyphen, space). I want to find all the files that include a comma before the first delimiter.

I'll spare you the full file listing, and just show a shortened example of what I get with my regex attempt:

Regex ^(?:.+?),(?:.+?)(?: - )

-----
Bee Gees, The - Staying Alive - BL36-02
Cher - Shoop Shoop Song, The - BL36-09
Cocker, Joe - Up Where We Belong - BL36-05
Cocker, Joe - You Can Leave Your Hat On - BL36-06
Collins, Phil - You'll Be In My Heart - BL36-11
Houston, Whitney - Waiting To Exhale - BL36-08
Jackson, Joe - Is She Really Going Out With Him - BL36-01
John, Elton - Circle Of Life, The - BL36-10
McKee, Maria - Show Me Heaven - BL36-14
Midler, Bette - Rose, The - BL36-03
O'jays, The - Love Train - BL36-15
Richie, Lionel - Endless Love - BL36-07
Streisand, Barbra - Evergreen - BL36-04
Withers, Bill - Ain't No Sunshine - BL36-12
--------

If the regex were doing what I wish, this file:
Cher - Shoop Shoop Song, The - BL36-09

Would not appear, as there is no comma before the first delimiter.

All clues welcome - and bonus points for a regex that would also filter out files for which "The", "A", or "An" follows the comma before the first delimiter.

Re: Help please with Regex filter - mine is too greedy?

Postby Luuk » Sat Jul 24, 2021 1:10 pm

I'm guessing that you probably want a 'Mask' like... ^((?! - ).)+, (?!(The|A|An)\b).+ - .+
This excludes any name without a comma before the very first delimiter.
It also forbids the words 'The', 'A' or 'An' right after that comma, but allows words like 'Theodore' or 'Angel'.

If it helps for understanding, the (?!...) parts are called negative look-aheads, and they say what must not match at that position.
The \b makes sure that The|A|An ends at a word boundary, and since we already put a space in front, each must be a whole word.
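
For anyone who wants to check this outside the renamer, here is a minimal sketch that runs both expressions over a few of the names from the question. It assumes Python's re module behaves like the renamer's regex engine for these constructs; the helper name matching is made up for the illustration.

```python
import re

# A few of the names from the question.
names = [
    "Bee Gees, The - Staying Alive - BL36-02",
    "Cher - Shoop Shoop Song, The - BL36-09",
    "Cocker, Joe - Up Where We Belong - BL36-05",
    "John, Elton - Circle Of Life, The - BL36-10",
    "Midler, Bette - Rose, The - BL36-03",
    "O'jays, The - Love Train - BL36-15",
]

# The expression from the question: nothing stops the first .+? from
# running past " - " to reach a later comma.
original = re.compile(r"^(?:.+?),(?:.+?)(?: - )")

# The suggested mask: no " - " may start anywhere before the comma,
# and the word after the comma may not be The, A or An.
mask = re.compile(r"^((?! - ).)+, (?!(The|A|An)\b).+ - .+")


def matching(pattern, items):
    """Return the items that the pattern matches (hypothetical helper)."""
    return [item for item in items if pattern.search(item)]


original_hits = matching(original, names)
mask_hits = matching(mask, names)

print(len(original_hits))  # 6, the Cher line included
for name in mask_hits:
    print(name)
# Cocker, Joe - Up Where We Belong - BL36-05
# John, Elton - Circle Of Life, The - BL36-10
# Midler, Bette - Rose, The - BL36-03
```

So the original expression is not really too greedy; its lazy .+? is simply allowed to cross the first delimiter when that is the only way to reach a comma.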

Re: Help please with Regex filter - mine is too greedy?

Postby psantucc » Sun Jul 25, 2021 5:55 pm

Luuk wrote: Im guessing that you probably want a 'Mask' like... ^((?! - ).)+, (?!(The|A|An)\b).+ - .+

That works a treat! Thanks very much, and for the explanation. Knowing that I'm working with look-aheads is key.

Match only if comma comes before a first delimiter string

Postby Luuk » Mon Jul 26, 2021 8:35 am

Without the look-aheads it's mostly like yours, but with only as many groups as are needed for a 'Mask'.
If anybody wants to adapt this for RegEx(1), the first group is the most important one to understand.
To verify that no delimiter string comes anywhere before the comma, both '.' and the look-ahead must be inside the same group.

Without a look-ahead, it's just... ^(.)+, .+Delimiter.+
So we insert a look-ahead like... ^((?!Delimiter).)+, .+Delimiter.+

Now that they are inside the same group, every time '.' is about to match one character, the look-ahead first verifies that...
'Delimiter' does not start at that character (so it is not a 'D' followed by 'elimiter'). Only then can the comma and the rest of your expression match.
If you want to group everything before the comma, just add an outer group around the '+', so maybe something like...

^((?:(?!Delimiter).)+), (.+?)(Delimiter)(.+)
\1 == Everything before 'comma,space' (but only if 'comma,space' is before the first delimiter).
\2 == Everything after 'comma,space' until the first delimiter (or use .+ instead of .+? to run until the last delimiter).
\3 == The first delimiter
\4 == Everything else
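
As a rough illustration of those four groups, here is a small sketch with " - " filled in for Delimiter. It assumes Python's re module as a stand-in for the renamer's engine, and split_name is a hypothetical helper, not anything built into the program.

```python
import re

# The grouped expression with " - " substituted for Delimiter.
pattern = re.compile(r"^((?:(?! - ).)+), (.+?)( - )(.+)")


def split_name(name):
    """Return groups \\1 to \\4 as a tuple, or None when there is no match."""
    m = pattern.match(name)
    return m.groups() if m else None


print(split_name("Midler, Bette - Rose, The - BL36-03"))
# ('Midler', 'Bette', ' - ', 'Rose, The - BL36-03')

print(split_name("Cher - Shoop Shoop Song, The - BL36-09"))
# None, because the only comma comes after the first delimiter
```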

Without putting the look-ahead and '.' in the same group, the look-ahead only forbids names starting exactly like "Delimiter".
So hopefully this can save someone else the trouble when experimenting with different delimiters.
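
The difference between the two placements can be seen in a short sketch, again assuming Python's re module in place of the renamer's engine and " - " as the delimiter. The names loose and strict are only labels for this example.

```python
import re

# Look-ahead outside the repeated group: it is checked once, at the start.
loose = re.compile(r"^(?! - ).+, .+ - .+")

# Look-ahead inside the repeated group: it is checked before every character.
strict = re.compile(r"^((?! - ).)+, .+ - .+")

cher = "Cher - Shoop Shoop Song, The - BL36-09"
starts_with_delimiter = " - Shoop Shoop Song, The - BL36-09"

print(bool(loose.search(cher)))   # True: the delimiter is not at the very start
print(bool(strict.search(cher)))  # False: a delimiter comes before the comma

print(bool(loose.search(starts_with_delimiter)))   # False
print(bool(strict.search(starts_with_delimiter)))  # False
```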

