Getting help with Regular Expressions

A swapping-ground for Regular Expression syntax

Postby Admin » Tue Nov 08, 2005 10:49 am

Regular Expressions, or RegExps, are extremely powerful, but can also be extremely complicated.

I have yet to find a problem which a RegExp cannot solve, but you might have to "play around" for a while until you get the solution you desire.

As RegExps have been around for many years, there's a lot of information available, in books and online. There are also some nifty utilities around. Here are some starters for you.

--------------------------------------------------------------------------------

For those of you who wish to unleash the power of Regular Expressions but aren't sure what they are all about, here are a couple of useful resources:

http://github.com/aloisdg/awesome-regex (A curated collection of awesome Regex libraries, tools, frameworks and software)
http://www.regular-expressions.info/ (useful notes)
http://laurent.riesterer.free.fr/regexp/ (free utility)
http://www.weitz.de/regex-coach/ (donationware, very good)
http://www.regexbuddy.com/debug.html (paid utility)
http://www.robvanderwoude.com/index.html (resources)

I would strongly recommend the Regex Coach as it gives you an excellent demonstration of the components in your expression. Tip: I find it works best with the "g" box ticked.


If you can't achieve what you are looking for then please post here.

Update 2017
Two great regular expression testers:
https://regexr.com/
https://regex101.com/
Admin
Site Admin
 
Posts: 1505
Joined: Tue Mar 08, 2005 8:39 pm

Re: Getting help with Regular Expressions

Postby ThanhLoan » Wed Dec 16, 2009 5:43 pm

Hi,
I might be wrong, but BRU seems to have a maximum of 9 replacement groups. When I have more than 9, I can't specify the group.
Example: I can't write \10; BRU will take it as group 1 followed by a '0'.

Please advise..

Regards
ThanhLoan
 
Posts: 7
Joined: Fri Dec 11, 2009 8:22 pm

Re: Getting help with Regular Expressions

Postby Stefan » Wed Jan 13, 2010 7:56 pm

ThanhLoan wrote:Hi,
I might be wrong, but BRU seems to have a maximum of 9 replacement groups. When I have more than 9, I can't specify the group.
Example: I can't write \10; BRU will take it as group 1 followed by a '0'.

Please advise..

Regards

That's not a limitation specific to BRU but a common convention in Regular Expression replacement syntax: in many tools you can only access nine (...) groups, \1 to \9 (or $1 ... $9).

(BTW: Sometimes, if supported/implemented, there is also \0 or $0 to get the whole matched text back)




Tip:
you only have to put those search terms between (...) round brackets which you want to backreference.
e.g.
FROM:
Bulk Rename Utility
TO:
Bulk Utility
DO:
RegEx(1)
Search: (.+) .+ (.+)
Repla: \1 \2

Here I want only the results of the first and third terms back, so I don't have to put the second term into a group too.

Tip:
If you want to use the (...) round brackets just to keep a better overview,
you can use "grouping without backreferencing" in BRU by prefixing the pattern inside the brackets with '?:':
Example:
FROM:
RegEx(1)
Search: (.+) (?:.+) (.+)
Repla: \1 \2

Please note that here the second pair of round brackets captures NOTHING; it just groups a search term.
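
The two tips above can be tried outside BRU. This is a minimal sketch using Python's `re` module as a stand-in; BRU's own engine is PCRE, so treat the Python calls as an assumption for illustration only. It shows that the plain pattern and the one with a non-capturing group give the same result, and that only two groups are numbered in both cases.

```python
import re

name = "Bulk Rename Utility"

# Only the first and the third word are captured; the middle one is matched but not kept.
plain = re.sub(r"(.+) .+ (.+)", r"\1 \2", name)

# Same search with the middle term wrapped in a non-capturing group (?:...).
non_capturing = re.sub(r"(.+) (?:.+) (.+)", r"\1 \2", name)

# The non-capturing group does not use up a backreference number.
group_count = re.compile(r"(.+) (?:.+) (.+)").groups

print(plain)
print(non_capturing)
print(group_count)
```

Both substitutions print "Bulk Utility", and the group count is 2, which is why \2 still refers to the last word.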

More explanation:
http://www.regular-expressions.info/brackets.html
If you do not use the backreference, you can optimize this regular expression into Set(?:Value)?.
The question mark and the colon after the opening round bracket are the special syntax that you
can use to tell the regex engine that this pair of brackets should not create a backreference.


http://gnosis.cx/publish/programming/regular_expressions.html
Backreferencing in replacement patterns is very powerful; but it is also easy to use more than nine groups in a complex regular expression.
Quite apart from using up the available backreference names, it is often more legible to refer to the parts of a replacement pattern in sequential order.
To handle this issue, some regular expression tools allow "grouping without backreferencing."

A group that should not also be treated as a back reference has a question-mark colon at the beginning of the group,
as in "(?:pattern)." In fact, you can use this syntax even when your backreferences are in the search pattern itself.
Stefan
 
Posts: 736
Joined: Fri Mar 11, 2005 7:46 pm
Location: Germany, EU

Re: Getting help with Regular Expressions

Postby Stefan » Wed Jan 13, 2010 10:22 pm

While testing, I tried another thing:
BRU's regex implementation allows "named groups" like (?P<description>pattern)

Example

FROM:
Bulk Rename Utility 2010 Test File.txt
TO:
Bulk Rename Utility Test File.txt
DO:
RegEx(1)
Search: (.+) (\d{4}) (.+)
Repla: \1 \3


Here you can also do
RegEx(1)
Search: (?P<SearchAll_Till>.+) (?P<4_digits>\d{4}) (?P<ThenFindTheRest>.+)
Repla: \1 \3


This is handy for storing the regex with comments -describing what they do- for later re-use.
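
A minimal sketch of the same idea in Python's `re` module (an assumption for illustration; BRU itself uses PCRE). Python requires group names to be valid identifiers, so the name `4_digits` from the post is written as `four_digits` here. The `\g<name>` replacement syntax is Python's and is not claimed to work in BRU.

```python
import re

name = "Bulk Rename Utility 2010 Test File.txt"

# Named groups document what each part of the pattern is for.
pattern = re.compile(
    r"(?P<search_all_till>.+) (?P<four_digits>\d{4}) (?P<then_find_the_rest>.+)"
)

# Named groups are still numbered, so \1 and \3 keep working.
by_number = pattern.sub(r"\1 \3", name)

# Python additionally allows referring to the groups by name in the replacement.
by_name = pattern.sub(r"\g<search_all_till> \g<then_find_the_rest>", name)

print(by_number)
print(by_name)
```

Both variants give "Bulk Rename Utility Test File.txt".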
Stefan
 
Posts: 736
Joined: Fri Mar 11, 2005 7:46 pm
Location: Germany, EU

Re: Getting help with Regular Expressions

Postby Stefan » Tue May 22, 2012 9:56 pm

Since this thread is sticky, I'll use it to link to other posts:


First, I'll mention our two very first posts in this regex sub-forum (on the last page by now):

Getting Started - Overview over the RegEx syntax (Regular expressions, regexes. RegExp, RE)
http://www.bulkrenameutility.co.uk/forum/viewtopic.php?f=3&t=5

    Regular expressions (regexes) are like super-intelligent wildcards.
    If you learn regexes, you too can become super-intelligent (and have more fun using BRU).

    WILDCARDS
    ---------
    Wildcards are probably familiar to all BRU users.
    You can experiment with wild cards using Windows Explorer:
    Windows Explorer > "Search" dialog box > "Search for files or folders named"
    Summary:
    ? means "any single character"
    * means "zero or more characters"

    ------------------------------------------------------------
    Expression  Matches                      Doesn't match
    ----------  ---------------------------  -------------------
    Notes*      Notes Notes_2005_0302.txt    aNotes
    *Notes*     Notes Notes_2005_0302.txt
    ?Notes*     aNotes aNotes_2005_0302.txt  Notes_2005_0302.txt
    ------------------------------------------------------------

    REGULAR EXPRESSIONS
    -------------------
    Regexes are the same general idea as wildcards, but are considerably more powerful.
    Please glance at the table, then refer to the discussion, below.
    -----------------------------------------------
           Equivalent
    Regex  Wildcard    Matches
    -----  ----------  ----------------------------
    cat    cat         The literal characters "cat"
    .      ?           Any single character
    ..     ??          Any two characters
    ...    ???         Any three characters
    .*     *           Zero or more characters
    ..*    ?*          One or more characters
    ...*   ??*         Two or more characters
    .+     ?*          One or more characters
    ..+    ??*         Two or more characters

    Discussion:
    * means "the preceding character occurs zero or more times"
    + means "the preceding character occurs one  or more times"
    Why would you ever want a character to occur zero or more times? It means that the character is optional. For example: ru.*n matches run, ruin, and ruffian
    On the other hand, ru.+n matches ruin and ruffian, but not run (because we need at least one character between the "u" and "n".)
    Note that there are two equivalent ways of saying "one or more characters": ..* and .+
    -----------------------------------------------

    There are no wildcard equivalents to the following regexes (at least, not in Windows Explorer).
    ----------------------------------------------------------------
    Regex          Matches
    -------------  -------------------------------------------------
    \.             A period.
    \t             A tab character
    \n             A newline character
    ca?t           "c" followed by zero or one  "a", followed by "t"
    ca*t           "c" followed by zero or more "a", followed by "t"
    ca+t           "c" followed by one  or more "a", followed by "t"
    [efgh]         any one of efgh
    [e-h]          any one of efgh
    [a-cF-H]       any one of abcFGH
    [e-h]*         any one of efgh, occurring zero or more times
    [e-h]+         any one of efgh, occurring one or more times
    [a-c][e-h]+    any one of abc; followed by any one of efgh, occurring one or more times
    ([a-c][e-h])+  any one of abc followed by any one of efgh; that pair occurring one or more times

    Discussion:
    \   \ in front of a regex operator changes it to an ordinary ascii character.
    \   \ Also refers to non-printable ascii characters such as tab and newline (\t and \n).
    ?   means "the preceding character occurs zero or one time"
    ? * + are called "quantifiers" because they specify the number of times a regex expression must occur
    []  always refers to a single character, picked from all those in the square brackets
    ()  parentheses are used for grouping expressions together.
    ()+ means the expression in the parentheses occurs one or more times
    ----------------------------------------------------------------

    BACKREFERENCING!!!!
    -------------------

    In addition to "grouping", there is a second, more powerful use for parentheses, called "backreferencing". The idea is that you can save the matching characters to be used later. For example, suppose you want to change date format from 12-31-2005 to 2005_1231.
    Use this as your "search-regex":
    (12)-(31)-(2005)
    and use this  as your "replace-regex":
    \3_\1\2
    In backreferencing, \1 always refers to the contents of the first pair of parentheses in the search-regex, \2 refers to the contents of the second pair, and \3 to the contents of the third pair.

    Understanding and using backreferencing is essential if you want to take advantage of the powerful regex capability of BRU.

    OTHER PROGRAMS
    --------------
    Here are some programs that can help you get comfortable with regexes, before you start changing your filenames with BRU.

    1. TextPad is shareware with an unlimited trial duration: http://www.textpad.com/
    This is my favorite text editor. The main thing you need to know is that the grouping symbol is \( \) instead of (). Otherwise, regex-gurus-in-training can assume Textpad regexes are identical to BRU.
    To get started:
    - Open a text file
    - Search menu > Find...
    -   Make sure that you've selected the "Regular expressions" check box.
    -   Type in a regex, and click the Find Next button.

    To try out the above backreferencing example:
    In a text file, type 12-31-2005
    - Search menu > Replace...
    -   Make sure that you've selected the "Regular expressions" check box.
    -   Find what: \(12\)-\(31\)-\(2005\)
    -   Replace with: \3_\1\2
    -   Click the "Find Next" button
    -   Click the "Replace" button.


    2. Visual Regex: http://laurent.riesterer.free.fr/regexp/
    Visual Regex is unique because it highlights each regex group () with a different color, then highlights the matching text in the same color. This lets you see what group is matching what text, helps you debug the regex, and helps you learn more about regular expressions.

    3. Regex Buddy: http://www.regexbuddy.com/

    4. Regex Coach: http://weitz.de/regex-coach/

    5. Regex Designer: http://www.radsoftware.com.au/regexdesigner/

    6. The Regulator: http://regex.osherove.com/
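
Two points from the overview above can be checked quickly in code: the difference between ru.*n and ru.+n, and the date backreferencing example. This is a small sketch in Python's `re` module, used here only as an assumed stand-in for trying things out; the second, generalized date pattern with \d is an addition for illustration and is not part of the original overview.

```python
import re

# Quantifiers: ru.*n allows zero characters between "u" and "n", ru.+n needs at least one.
words = ["run", "ruin", "ruffian"]
star_matches = [w for w in words if re.fullmatch(r"ru.*n", w)]
plus_matches = [w for w in words if re.fullmatch(r"ru.+n", w)]

# Backreferencing: \1, \2, \3 refer to the first, second and third pair of parentheses.
literal = re.sub(r"(12)-(31)-(2005)", r"\3_\1\2", "12-31-2005")

# Illustrative generalization: the same reordering for any MM-DD-YYYY date.
general = re.sub(r"(\d{2})-(\d{2})-(\d{4})", r"\3_\1\2", "07-04-1999")

print(star_matches)  # ['run', 'ruin', 'ruffian']
print(plus_matches)  # ['ruin', 'ruffian']
print(literal)       # 2005_1231
print(general)       # 1999_0704
```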


Go ahead - Some interesting sides about reg ex,
http://www.bulkrenameutility.co.uk/forum/viewtopic.php?f=3&t=27


Other threads with examples of common interest will follow:




My RegEx hints:
Expression:

. --> any one character (letter, digit, punctuation, blank)
a --> the char "a" itself literally
abc --> the string "abc" itself literally
(aa|bb) --> one of the alternatives "aa" or "bb", whichever is found first
(aa|bb|cc) --> one of the alternatives "aa" or "bb" or "cc", whichever is found first
[ab3-] --> one char from the list ("a" or "b" or "3" or "-") NOTE: a literal hyphen must be at the very beginning or at the end of the list (or be escaped as \-). NOT in between!
[^ab3-] --> one char, but none of those in the list (no "a", no "b", no "3" and no hyphen)
[^-] --> one char (letter, digit, whitespace, punctuation) but not a hyphen
[a-z] --> one lower case letter from the range "a", "b", "c", "d" ... to "z"
[A-Z] --> the same, but matches upper case letters.
[a-d] --> one from the range "a", "b", "c" or "d"
[A-D] --> the same, but matches upper case letters.
Note: all these A-Za-z ranges only match plain 7-bit ASCII chars (English alphabet), no umlauts or accents or such.

3 --> the digit "3"
2013 --> the number "2013" literally
[6-9] --> one digit from the range "6", "7", "8" and "9"
[^6-9] --> one char (letter, digit, whitespace, punctuation) but not a "6", "7", "8" or "9"

\w --> any one letter, digit or underscore
\d --> any one digit from the range "0", "1", "2", "3" to "9"
\s --> one whitespace character (blank, tab, ...)
- --> one hyphen literally
_ --> one underscore literally
\. --> one dot literally (the dot has to be "escaped" with a backslash, because it is a RegEx meta char (see "Meta signs" below))
\\ --> one backslash literally (the backslash has to be "escaped" with a backslash, because it is a RegEx meta char)

(...) --> group an expression to apply operators or for backreference. Instead of the three dots write your expression.
(Note: those groups are counted from left to right and can be nested too)

\W, \D, \S --> opposite of lower case \w \d \s
\W, \D, \S mean: match any one char, but NOT if it belongs to the character class \w or \d or \s
\W --> match one char but NOT a word char, \D --> match one char but NOT a digit, \S --> match one char but NOT a whitespace
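
The character lists and classes above can be explored with a short script. This is a sketch in Python's `re` module (assumed here purely as a test bench; details such as Unicode handling of \w can differ from BRU's PCRE engine). The sample strings are made up for illustration.

```python
import re

text = "a-b_3 Z9!"

# Hyphen at the end of the list is a literal hyphen, not a range operator.
from_list = re.findall(r"[ab3-]", text)

# Negated list: everything that is NOT a, b, 3 or hyphen.
not_from_list = re.findall(r"[^ab3-]", text)

# Shorthand classes and one of their upper case opposites.
word_chars = re.findall(r"\w", text)
non_digits = re.findall(r"\D", text)

# An a-z range covers only the plain ASCII letters, so the umlaut is skipped.
ascii_only = re.findall(r"[a-z]", "äbc")

print(from_list)      # ['a', '-', 'b', '3']
print(not_from_list)  # ['_', ' ', 'Z', '9', '!']
print(word_chars)     # ['a', 'b', '_', '3', 'Z', '9']
print(non_digits)     # ['a', '-', 'b', '_', ' ', 'Z', '!']
print(ascii_only)     # ['b', 'c']
```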


Note: each of the above matches only one single character!

To match more than one, just repeat them:

aa --> match two 'a' literally
aaa --> match three 'a' literally
aaaa --> match four 'a' literally
\d\d\d\d --> match exactly four single digits like '1962' or '2013'
... --> (three dots) match three of any (maybe different) signs
\s\s --> match two blanks

or use another meta sign as a quantifier.


Quantifier:
* --> match greedy zero or more times the previous expression
+ --> match greedy one or more times the previous expression
{3} --> match exactly 3 times the previous expression
{3,} --> match greedy at least 3 times the previous expression
{0,5} --> match greedy zero up to 5 times the previous expression (Note: in classic PCRE, {,5} without the leading 0 is not a quantifier but is taken literally)
{3,5} --> match greedy 5, or 4, or 3 times the previous expression
? --> behind * or + or {,} will limit the match to as few as possible (non-greedy)
? --> behind an expression matches on zero or one occurrence
Example:
\d+ --> match one-or-two-or-three-or....-or-as-many-as-possible digits. Like '3', or '42', or '123', or '5782332'
\d* --> match zero(none)-or-one-or-two-or....-or-as-many-as-possible digits. Like '' (nothing), or '3', or '42', or '123', or '5782332'
\d{4} --> match exactly four digits. Like '1962' or '2013' or '1234'
\d{2,4} --> match two, or three, or four digits. Like '08' or '2013' or '123'. Works greedy, will rather get you '2013' than '20'
\d{2}|\d{4} --> match exactly two or four digits. But it tries to match two first and then stops; even on '2013' it will get you only '20' and will never try to match four digits. Put the longer alternative first (\d{4}|\d{2}) to prefer four digits.
a{4} --> match four 'a' s
(Ho){3} --> match three times 'Ho' >>> 'HoHoHo'
(the ){2} --> match doubled 'the '
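
The quantifier behaviour listed above, in particular the order of alternatives in \d{2}|\d{4}, can be verified with a few lines. This is a minimal sketch in Python's `re` module, assumed here only as a convenient way to experiment.

```python
import re

year = "2013"

# Greedy range: takes as many digits as allowed.
greedy_range = re.search(r"\d{2,4}", year).group()

# Lazy range: the trailing ? makes it stop as early as possible.
lazy_range = re.search(r"\d{2,4}?", year).group()

# Alternatives are tried from left to right; the first one that matches wins.
two_first = re.search(r"\d{2}|\d{4}", year).group()
four_first = re.search(r"\d{4}|\d{2}", year).group()

# \d* is allowed to match nothing at all.
zero_digits = re.match(r"\d*", "abc").group()

# A quantifier after a group repeats the whole group.
ho_ho_ho = re.fullmatch(r"(Ho){3}", "HoHoHo") is not None

print(greedy_range, lazy_range, two_first, four_first, repr(zero_digits), ho_ho_ho)
```

The greedy range and the "four first" alternative return '2013'; the lazy range and the "two first" alternative return '20'.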

Boundaries:
\b --> Match at word boundary. Example: "\bfun\b" on "my fun function" will match 'fun' only.
\A or ^ --> at start of file name. Example: "^fun" on "fun function" will match first 'fun' only.
\Z or $ --> at end of file name. Example: "on$" on "onto my fun function" will match last 'on' only.
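
To see where each boundary example matches, the sketch below prints the start and end positions of every match. It uses Python's `re` module as an assumed stand-in; the helper name `spans` is made up for this illustration.

```python
import re

def spans(pattern, text):
    """Return (start, end) positions of all matches of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# \b: only the standalone word, not the "fun" inside "function".
word = spans(r"\bfun\b", "my fun function")

# ^: only at the start of the name.
at_start = spans(r"^fun", "fun function")

# $: only at the end of the name, not the "on" in "onto".
at_end = spans(r"on$", "onto my fun function")

print(word)      # [(3, 6)]
print(at_start)  # [(0, 3)]
print(at_end)    # [(18, 20)]
```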


Meta signs:
\ --> use the escape character "\" in front of a meta sign to match the meta sign itself
Meta signs are: ., \, (, ), [, {, }, +, *, ?, |, ^, $
Example:
\. --> one dot literally
\\ --> one backslash literally
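
A short sketch of why escaping matters, again in Python's `re` module as an assumed test bench. `re.escape` is a Python helper that adds the backslashes for you; BRU has no such function, so there the escaping has to be typed by hand. The file names are made up for illustration.

```python
import re

# Unescaped dot: matches ANY character, so "_mp3" is accepted too.
unescaped = re.search(r".mp3", "Song_mp3")

# Escaped dot: only a real dot matches, so nothing is found here.
escaped = re.search(r"\.mp3", "Song_mp3")

# Letting Python escape the round brackets of a literal search text.
literal_pattern = re.escape("(live)")
found = re.search(literal_pattern, "Song (live) [2013].mp3").group()

print(unescaped.group())  # _mp3
print(escaped)            # None
print(literal_pattern)    # \(live\)
print(found)              # (live)
```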


backreference on replacement:
\1 - insert here what was matched by first (...)-group
\2 - insert here what was matched by second (...)-group
\3 ... \9 - insert what was matched by third, fourth, fifth,... till ninth group
(Note: some flavours use $1 syntax instead of \1)
(Note: those backreference groups are counted from left to right and can be nested too)
(1 ... (2 ... (3... ))) (4 ... ) (5 ... ) (6 ... (7 ... ) )
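
The counting rule for nested groups (count the opening brackets from left to right) can be confirmed with a tiny example. This is a sketch in Python's `re` module, assumed here only for demonstration; the pattern is made up.

```python
import re

# Groups are numbered by the position of their opening bracket:
# 1 = ((a)(b(c))), 2 = (a), 3 = (b(c)), 4 = (c), 5 = (d)
match = re.match(r"((a)(b(c)))(d)", "abcd")
groups = match.groups()

print(groups)  # ('abc', 'a', 'bc', 'c', 'd')
```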


Some RegEx implementations allow additional rules like:
- Named groups: (?<abc>pattern) or (?P<my_description_here>pattern)
- Non-capturing group: (?:pattern)
- Comments: (?#comment)
- Positive lookahead: (?=pattern)
- Negative lookahead: (?!pattern)
- Positive lookbehind: (?<=pattern)
- Negative lookbehind: (?<!pattern)
(and many more > http://www.regular-expressions.info/refadv.html)
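
A few of the advanced constructs from the list above, sketched in Python's `re` module (an assumption for illustration; note that Python only accepts the (?P<name>...) spelling for named groups, not (?<name>...)). The sample strings are made up.

```python
import re

prices = "10 USD, 20 EUR, 30 USD"

# Positive lookahead: digits only if " USD" follows; " USD" itself is not part of the match.
usd = re.findall(r"\d+(?= USD)", prices)

# Negative lookahead: whole numbers that are NOT followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)

versions = "v1 x2 v30 7"

# Positive lookbehind: digits directly preceded by "v".
after_v = re.findall(r"(?<=v)\d+", versions)

# Negative lookbehind: digits NOT preceded by "v" (or by another digit).
not_after_v = re.findall(r"(?<![v\d])\d+", versions)

# Inline comment: ignored by the engine, useful as a note to yourself.
commented = re.search(r"\d{4}(?#the year)", "Interpret 2013").group()

print(usd)          # ['10', '30']
print(not_usd)      # ['20']
print(after_v)      # ['1', '30']
print(not_after_v)  # ['2', '7']
print(commented)    # 2013
```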


NOTE: for BRU, depending on what you want to do,
you usually have to match the whole file name, not only the part you are interested in.
Example: "Interpret 2013 - Song title.mp3"
Right way:
Match: "(.+) \d\d\d\d - (.+)"
Replace: "\1 \2"
Wrong way:
Match: "\d\d\d\d"
Replace: ""
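
Why the second way goes wrong can be modelled in a few lines. This sketch ASSUMES, as the note above implies, that the new name is built only from the Replace expression and that unmatched parts of the old name are dropped. The function `replace_whole_name` is hypothetical and only imitates that assumed behaviour in Python; it is not BRU's actual code.

```python
import re

def replace_whole_name(pattern, replacement, name):
    """Hypothetical model: the new name is ONLY the expanded Replace string."""
    match = re.search(pattern, name)
    if match is None:
        return name  # no match: leave the name alone
    return match.expand(replacement)

name = "Interpret 2013 - Song title.mp3"

# Right way: the pattern covers the whole name, the wanted parts are carried over.
right = replace_whole_name(r"(.+) \d\d\d\d - (.+)", r"\1 \2", name)

# Wrong way: only the year is matched, and nothing is carried over.
wrong = replace_whole_name(r"\d\d\d\d", "", name)

print(repr(right))  # 'Interpret Song title.mp3'
print(repr(wrong))  # ''
```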


Greed, greedy, OR non-greedy, reluctant:
By default, *, ?, +, and {min,max} are greedy because they consume all characters up through the last possible one that still satisfies the entire pattern.
To instead have them stop at the first possible character, follow them with a question mark. For example, the pattern <.+> (which lacks a question mark)
means: "search for a <, followed by one or more of any character, followed by a >". To stop this pattern from matching the entire string <em>text</em>,
append a question mark to the plus sign: <.+?>. This causes the match to stop at the first '>' and thus it matches only the first tag <em>
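
The tag example above, together with the Album1987 example in this post, can be reproduced like this. It is a minimal sketch in Python's `re` module, assumed here only as a convenient test bench.

```python
import re

html = "<em>text</em>"

# Greedy: .+ runs to the LAST '>' that still lets the pattern match.
greedy_tag = re.search(r"<.+>", html).group()

# Lazy: .+? stops at the FIRST '>'.
lazy_tag = re.search(r"<.+?>", html).group()

# Greedy .+ swallows "Album19" and leaves only the last two digits for the group.
greedy_digits = re.search(r".+(\d\d)", "Album1987").group(1)

# Lazy .+? stops as soon as two digits can follow.
lazy_digits = re.search(r".+?(\d\d)", "Album1987").group(1)

print(greedy_tag)     # <em>text</em>
print(lazy_tag)       # <em>
print(greedy_digits)  # 87
print(lazy_digits)    # 19
```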

Example:
".+(\d\d)" on "Album1987" will get you "87", because ".+" is greedy and matches the "19" too.
".+?(\d\d)" on "Album1987" will get you "19", because the '?' in ".+?" makes that expression non-greedy, so it matches only until the first two digits are found.

More examples of greedy and lazy matching:
Greedy lazy match

The RegEx : "(.+) - (.+)"
Will match: "Artist - Album" - "Title"

Explanation: Match greedy until the last hyphen.
/(.+) - (.+)/
    1st Capturing group (.+)
        .+ matches any character (except newline)
            Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    - matches the characters - literally
    2nd Capturing group (.+)
        .+ matches any character (except newline)
            Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    http://www.debuggex.com/



The RegEx : "(.+?) - (.+)"
Will match: "Artist" - "Album - Title"

Explanation: Match lazy (non-greedy) until the first hyphen.
/(.+?) - (.+)/
    1st Capturing group (.+?)
        .+? matches any character (except newline)
            Quantifier: Between one and unlimited times, as few times as possible, expanding as needed [lazy]
    - matches the characters - literally
    2nd Capturing group (.+)
        .+ matches any character (except newline)
            Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    http://www.debuggex.com/


The RegEx : "(.+) - (.+?)"
Will match: "Artist - Album" - "T"itle

Explanation: Match greedy until the last hyphen, then lazy (non-greedy) only one more character.
/(.+) - (.+?)/
    1st Capturing group (.+)
        .+ matches any character (except newline)
            Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    - matches the characters - literally
    2nd Capturing group (.+?)
        .+? matches any character (except newline)
            Quantifier: Between one and unlimited times, as few times as possible, expanding as needed [lazy]
    http://www.debuggex.com/


The RegEx : "(.+) - (.+) - (.+)"
Will match: "Artist" - "Album" - "Title"
The RegEx : "(.+ - .+) - (.+)"
Will match: "Artist - Album" - "Title"
The RegEx : "(.+ - )(.+ - .+)"
Will match: "Artist" - "Album - Title"
Explanation: Because you have made the positions of the delimiters clear.
(not 100% accurate for simpleness)
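
All of the splits listed above can be printed side by side. This is a sketch in Python's `re` module, assumed only as a test bench. It also shows the small inaccuracy mentioned in the last remark: with the final pattern, the first group includes the trailing " - ".

```python
import re

name = "Artist - Album - Title"

patterns = [
    r"(.+) - (.+)",         # greedy first group
    r"(.+?) - (.+)",        # lazy first group
    r"(.+) - (.+?)",        # lazy second group
    r"(.+) - (.+) - (.+)",  # both delimiters spelled out
    r"(.+ - .+) - (.+)",
    r"(.+ - )(.+ - .+)",
]

# Map each pattern to the tuple of texts its groups captured.
splits = {p: re.match(p, name).groups() for p in patterns}

for pattern, groups in splits.items():
    print(pattern, "-->", groups)
```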
 



Find more information:
http://www.regular-expressions.info/reference.html
http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
http://www.rexegg.com/
http://www.debuggex.com/
http://regex101.com/



########################## my template for my answers #########################

BEFORE (origin name):
Interpret 2013 - Song title.ext


AFTER (wanted new name):
Interpret - Song title.ext


Rule (what we want in plain english):


SOLUTION (our way to success):


USE (this rules/methods):
RegEx(1)
Search: "(.+) - (.+)"
Replace: "\2 - \1"
"[__] Include Ext." is unchecked.
Don't use the quotes "", they are only there for clarifying where the pattern begins and ends.


INSTRUCTIONS (how to use and which option to set):
= This solution is provided based on my tests or on assumptions from my past experience.
I can give no guarantee that your computer will not explode and delete all your files.
The solution is based on the provided information and may not work for other file name patterns.
= Remember to test this with some test files first. And always make a backup before you manipulate your important real files!
= Select a few files in the Name column to see what happens in the NewName column.
= Menu "Options > Ignore... > File Extensions" is unchecked.
= My pattern '.ext' stands for any file extension like '.mp3' or '.txt', as that often doesn't matter.
= Sometimes I use the sign '~' instead of a real space for better recognizability, like: "Interpret~~-~~Song.mp3" to "Interpret~-~Song.mp3"
= More about RegEx can be found there >>> http://www.bulkrenameutility.co.uk/forum/viewtopic.php?f=3&t=96
(that's: Board index ‹ Bulk Rename Utility ‹ Regular Expressions > "Getting help with Regular Expressions")


EXPLANATION (what have we done here step-by-step?):



HTH? :D

########################## /my template for my answers #########################
.
Stefan
 
Posts: 736
Joined: Fri Mar 11, 2005 7:46 pm
Location: Germany, EU

Re: Getting help with Regular Expressions

Postby andrewscott » Sat Nov 14, 2015 12:27 pm

hi
I am new to regular expressions, and I can't figure out how to rename research papers like "Asia europe journal volume 10 issue 3 1965 T.H.Scott"
to "1965 asia europe journal volume 10 issue 3 T.H.Scott". What I need to do is move the four-digit number to the front while leaving everything else unaltered, using regular expressions. Please help.
andrewscott
 
Posts: 1
Joined: Sat Nov 14, 2015 10:54 am

Re: Getting help with Regular Expressions

Postby Admin » Mon Nov 16, 2015 12:38 am

Match : (.*) ([0-9]{4}) (.*)
Replace : \2 \1 \3
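
For anyone who wants to check this answer before renaming real files, here is a minimal sketch in Python's `re` module (an assumption for testing only; BRU applies the same Match and Replace strings with its PCRE engine). Note that the regex only moves the year; it does not change the capital "A" to lower case.

```python
import re

old_name = "Asia europe journal volume 10 issue 3 1965 T.H.Scott"

# \1 = everything before the year, \2 = the four-digit year, \3 = everything after it.
new_name = re.sub(r"(.*) ([0-9]{4}) (.*)", r"\2 \1 \3", old_name)

print(new_name)  # 1965 Asia europe journal volume 10 issue 3 T.H.Scott
```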
Admin
Site Admin
 
Posts: 1505
Joined: Tue Mar 08, 2005 8:39 pm

Re: Getting help with Regular Expressions

Postby tekwrite » Wed Mar 16, 2016 8:47 pm

How do I delete any number of characters between brackets, like [George]
tekwrite
 
Posts: 2
Joined: Wed Mar 16, 2016 8:40 pm

Re: Getting help with Regular Expressions

Postby Admin » Thu Mar 17, 2016 12:00 am

Remove exactly George or G,e,o,r,g,e ?
See Remove (5) in Bulk Rename Utility
Admin
Site Admin
 
Posts: 1505
Joined: Tue Mar 08, 2005 8:39 pm

Re: Getting help with Regular Expressions

Postby Dustydog » Wed Mar 23, 2016 6:35 am

I'd also suggest Notepad++. It's a great free editor that supports Regular Expressions and is a good place to practice. Plus, it's a great, great utility: tabbed Notepad at the least, and it supports various programming languages and more for extra usefulness.
https://notepad-plus-plus.org/
Dustydog
 
Posts: 11
Joined: Wed Mar 23, 2016 3:32 am

Re: Getting help with Regular Expressions

Postby Dustydog » Fri Sep 02, 2016 10:08 am

Or better yet (perhaps, depending), the fellow that makes Regex Buddy, of which I'm becoming inordinately fond, makes a Lite version of his notepad app which is certainly sufficient and supports regexes like mad, ofc., and is tabbed.
https://www.editpadlite.com/download.html

Free with no strings. If you're not a programmer or want to do word processing on it, it's got you covered. Still playing around with it after using Notepad++ for a long time, but I like many things about it - including how closely tied it is into his other products. Full version likely unnecessary.

I've been enjoying using BRU with regexes created with the assistance of RegEx Buddy very much lately - it makes things exceedingly clear and is easy to edit and debug something, for instance if you needed to make a group non-greedy. It also shows the flow of a regex beautifully - it's fascinating.
Dustydog
 
Posts: 11
Joined: Wed Mar 23, 2016 3:32 am

Re: Getting help with Regular Expressions

Postby Admin » Fri Sep 02, 2016 11:48 am

Awesome Regex on GitHub
A curated collection of awesome Regex libraries, tools, frameworks and software:
http://github.com/aloisdg/awesome-regex
Admin
Site Admin
 
Posts: 1505
Joined: Tue Mar 08, 2005 8:39 pm

Re: Getting help with Regular Expressions

Postby Admin » Tue Sep 06, 2016 8:59 am

BRU supports PCRE regular expressions:
http://perldoc.perl.org/perlre.html
Admin
Site Admin
 
Posts: 1505
Joined: Tue Mar 08, 2005 8:39 pm

Re: Getting help with Regular Expressions

Postby Admin » Wed Oct 11, 2017 1:02 am

Two great regular expression testers here:
https://regexr.com/
https://regex101.com/
Admin
Site Admin
 
Posts: 1505
Joined: Tue Mar 08, 2005 8:39 pm

