Getting help with Regular Expressions

A swapping-ground for Regular Expression syntax

Post by Admin » Tue Nov 08, 2005 10:49 am

Regular Expressions, or RegExps, are extremely powerful, but can also be extremely complicated.

I have yet to find a problem which a RegExp cannot solve, but you might have to "play around" for a while until you get the solution you desire.

As RegExps have been around for many years, there's a lot of information available, in books and online. There are also some nifty utilities lurking. Here are some starters for you:

Posted: 19 Mar 2006 03:41 pm Post subject: Use RegExp resource

--------------------------------------------------------------------------------

For those of you who wish to unleash the power of Regular Expressions but aren't sure what they are all about, here are a couple of useful resources:

http://www.regular-expressions.info/ (useful notes)
http://laurent.riesterer.free.fr/regexp/ (free utility)
http://www.weitz.de/regex-coach/ (donationware, very good)
http://www.regexbuddy.com/debug.html (paid utility)
http://www.robvanderwoude.com/index.html (resources)

I would strongly recommend the Regex Coach as it gives you an excellent demonstration of the components in your expression. Tip: I find it works best with the "g" box ticked.


If you can't achieve what you are looking for then please post here.


Jim
Admin
Site Admin
 
Posts: 926
Joined: Tue Mar 08, 2005 8:39 pm

Re: Getting help with Regular Expressions

Post by ThanhLoan » Wed Dec 16, 2009 5:43 pm

Hi,
I might be wrong, but BRU seems to have a maximum of 9 replacement groups. When I have more than 9, I can't specify the group.
E.g.: I can't write \10; BRU will take it as group 1 followed by a '0'.

Please advise..

Regards
ThanhLoan
 
Posts: 7
Joined: Fri Dec 11, 2009 8:22 pm

Re: Getting help with Regular Expressions

Post by Stefan » Wed Jan 13, 2010 7:56 pm

ThanhLoan wrote:
Hi,
I might be wrong, but BRU seems to have a maximum of 9 replacement groups. When I have more than 9, I can't specify the group.
E.g.: I can't write \10; BRU will take it as group 1 followed by a '0'.

Please advise..

Regards

That's not a limitation specific to BRU but a common Regular Expression convention: in many flavours you can only access nine (...)-groups, \1 to \9 (or $1...$9).

(BTW: Sometimes, if supported/implemented, there is also \0 or $0 to get the whole match back.)




Tip:
You only have to put round brackets (...) around those parts of the search term that you want to backreference.
e.g.
FROM:
Bulk Rename Utility
TO:
Bulk Utility
DO:
RegEx(1)
Search: (.+) .+ (.+)
Repla: \1 \2

Here I only want the first and the third word back, so I don't have to put the second part into a group too.

Tip:
If you want to use the (...) round brackets just to get a better overview,
you can use "grouping without backreferencing" in BRU by starting the group with '?:':
Example:
RegEx(1)
Search: (.+) (?:.+) (.+)
Repla: \1 \2

Please note that here the second pair of round brackets captures NOTHING; it just groups a part of the search term.
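
To try the two variants outside BRU, here is a minimal sketch in Python. It assumes Python's `re` module behaves like BRU's engine for these simple patterns. The variable names are only for illustration.

```python
import re

name = "Bulk Rename Utility"

# Three capturing groups: the middle word is captured but never used.
capturing = re.sub(r"(.+) (.+) (.+)", r"\1 \3", name)

# Middle group made non-capturing with ?: so the last word becomes group 2.
non_capturing = re.sub(r"(.+) (?:.+) (.+)", r"\1 \2", name)

# Number of capturing groups each pattern defines.
groups_capturing = re.compile(r"(.+) (.+) (.+)").groups
groups_non_capturing = re.compile(r"(.+) (?:.+) (.+)").groups

print(capturing, "|", non_capturing)
print(groups_capturing, groups_non_capturing)
```

Both variants give the same new name. The non-capturing one simply uses up one group number less, which helps when you are getting close to \9.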

More explanation:
http://www.regular-expressions.info/brackets.html
If you do not use the backreference, you can optimize this regular expression into Set(?:Value)?.
The question mark and the colon after the opening round bracket are the special syntax that you
can use to tell the regex engine that this pair of brackets should not create a backreference.


http://gnosis.cx/publish/programming/regular_expressions.html
Backreferencing in replacement patterns is very powerful; but it is also easy to use more than nine groups in a complex regular expression.
Quite apart from using up the available backreference names, it is often more legible to refer to the parts of a replacement pattern in sequential order.
To handle this issue, some regular expression tools allow "grouping without backreferencing."

A group that should not also be treated as a back reference has a question-mark colon at the beginning of the group,
as in "(?:pattern)." In fact, you can use this syntax even when your backreferences are in the search pattern itself.
Stefan
 
Posts: 508
Joined: Fri Mar 11, 2005 7:46 pm
Location: Germany, EU

Re: Getting help with Regular Expressions

Post by Stefan » Wed Jan 13, 2010 10:22 pm

While testing I did another test:
BRU's regex implementation allows "named groups" like (?P<description>pattern)

Example

FROM:
Bulk Rename Utility 2010 Test File.txt
TO:
Bulk Rename Utility Test File.txt
DO:
RegEx(1)
Search: (.+) (\d{4}) (.+)
Repla: \1 \3


Here you can also do
RegEx(1)
Search: (?P<SearchAll_Till>.+) (?P<4_digits>\d{4}) (?P<ThenFindTheRest>.+)
Repla: \1 \3


This is handy for storing the regex with comments -describing what they do- for later re-use.
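
The same idea can be tried in Python, which also knows the (?P<name>...) syntax. This is only a sketch under one assumption: Python insists that group names are valid identifiers, so the name "4_digits" is spelled "FourDigits" here.

```python
import re

name = "Bulk Rename Utility 2010 Test File.txt"

# Python requires group names to be valid identifiers, so "4_digits"
# from the post is spelled "FourDigits" in this sketch.
pattern = re.compile(
    r"(?P<SearchAll_Till>.+) (?P<FourDigits>\d{4}) (?P<ThenFindTheRest>.+)"
)

match = pattern.fullmatch(name)

# Named groups still get numbers, so \1 and \3 keep working.
new_name = pattern.sub(r"\1 \3", name)

print(match.groupdict())
print(new_name)
```

The names act as built-in comments, while the replacement can still refer to the groups by number.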
Stefan
 
Posts: 508
Joined: Fri Mar 11, 2005 7:46 pm
Location: Germany, EU

Re: Getting help with Regular Expressions

Post by Stefan » Tue May 22, 2012 9:56 pm

Since this thread is sticky, I use it to link to other posts:


First, I mention here the two very first posts of this regex sub-forum (by now on the last page):

Getting Started - Overview over the RegEx syntax (Regular expressions, regexes. RegExp, RE)
http://www.bulkrenameutility.co.uk/forum/viewtopic.php?f=3&t=5

Code:
    Regular expressions (regexes) are like super-intelligent wildcards.
    If you learn regexes, you too can become super-intelligent (and have more fun using BRU).

    WILDCARDS
    ---------
    Wildcards are probably familiar to all BRU users.
    You can experiment with wild cards using Windows Explorer:
    Windows Explorer > "Search" dialog box > "Search for files or folders named"
    Summary:
    ? means "any single character"
    * means "zero or more characters"

    ------------------------------------------------------------
    Expression  Matches                      Doesn't match
    ----------  ---------------------------  -------------------
    Notes*      Notes Notes_2005_0302.txt    aNotes
    *Notes*     Notes Notes_2005_0302.txt
    ?Notes*     aNotes aNotes_2005_0302.txt  Notes_2005_0302.txt
    ------------------------------------------------------------

    REGULAR EXPRESSIONS
    -------------------
    Regexes are the same general idea as wildcards, but are considerably more powerful.
    Please glance at the table, then refer to the discussion, below.
    -----------------------------------------------
           Equivalent
    Regex  Wildcard    Matches
    -----  ----------  ----------------------------
    cat    cat         The literal characters "cat"
    .      ?           Any single character
    ..     ??          Any two characters
    ...    ???         Any three characters
    .*     *           Zero or more characters
    ..*    ?*          One or more characters
    ...*   ??*         Two or more characters
    .+     ?*          One or more characters
    ..+    ??*         Two or more characters

    Discussion:
    * means "the preceding character occurs zero or more times"
    + means "the preceding character occurs one  or more times"
    Why would you ever want a character to occur zero or more times? It means that the character is optional. For example: ru.*n matches run, ruin, and ruffian
    On the other hand, ru.+n matches ruin and ruffian, but not run (because we need at least one character between the "u" and "n".)
    Note that there are two equivalent ways of saying "one or more characters": ..* and .+
    -----------------------------------------------

    There are no wildcard equivalents to the following regexes (at least, not in Windows Explorer).
    ----------------------------------------------------------------
    Regex          Matches
    -------------  -------------------------------------------------
    \.             A period.
    \t             A tab character
    \n             A newline character
    ca?t           "c" followed by zero or one  "a", followed by "t"
    ca*t           "c" followed by zero or more "a", followed by "t"
    ca+t           "c" followed by one  or more "a", followed by "t"
    [efgh]         any one of efgh
    [e-h]          any one of efgh
    [a-cF-H]       any one of abcFGH
    [e-h]*         any one of efgh, occurring zero or more times
    [e-h]+         any one of efgh, occurring one or more times
    [a-c][e-h]+    any one of abc; followed by any one of efgh, occurring one or more times
    ([a-c][e-h])+  any one of abc followed by any one of efgh; that pair occurring one or more times.

    Discussion:
    \   in front of a regex operator changes it to an ordinary ascii character.
    \   also introduces non-printable ascii characters such as tab and newline (\t and \n).
    ?   means "the preceding character occurs zero or one time"
    ? * + are called "quantifiers" because they specify the number of times a regex expression must occur
    []  always refers to a single character, picked from all those in the square brackets
    ()  parentheses are used for grouping expressions together.
    ()+ means the expression in the parentheses occurs one or more times
    ----------------------------------------------------------------

    BACKREFERENCING!!!!
    -------------------

    In addition to "grouping", there is a second, more powerful use for parentheses, called "backreferencing". The idea is that you can save the matching characters to be used later. For example, suppose you want to change date format from 12-31-2005 to 2005_1231.
    Use this as your "search-regex":
    (12)-(31)-(2005)
    and use this  as your "replace-regex":
    \3_\1\2
    In backreferencing, \1 always refers to the contents of the first pair of parentheses in the search-regex, \2 refers to the contents of the second pair, and \3 to the contents of the third pair.

    Understanding and using backreferencing is essential if you want to take advantage of the powerful regex capability of BRU.

    OTHER PROGRAMS
    --------------
    Here are some programs that can help you get comfortable with regexes, before you start changing your filenames with BRU.

    1. TextPad is shareware with an unlimited trial duration [url]http://www.textpad.com/[/url]
    This is my favorite text editor. The main thing you need to know is that the grouping symbol is \( \) instead of (). Otherwise, regex-gurus-in-training can assume Textpad regexes are identical to BRU.
    To get started:
    - Open a text file
    - Search menu > Find...
    -   Make sure that you've selected the "Regular expressions" check box.
    -   Type in a regex, and click the Find Next button.

    To try out the above backreferencing example:
    In a text file, type 12-31-2005
    - Search menu > Replace...
    -   Make sure that you've selected the "Regular expressions" check box.
    -   Find what: \(12\)-\(31\)-\(2005\)
    -   Replace with: \3_\1\2
    -   Click the "Find Next" button
    -   Click the "Replace" button.


    2. Visual Regex [url]http://laurent.riesterer.free.fr/regexp/[/url]
    Visual Regex is unique because it highlights each regex group () with a different color, then highlights the matching text in the same color. This lets you see what group is matching what text, helps you debug the regex, and helps you learn more about regular expressions.

    3. Regex Buddy [url]http://www.regexbuddy.com/[/url]

    4. Regex Coach [url]http://weitz.de/regex-coach/[/url]

    5. Regex Designer [url]http://www.radsoftware.com.au/regexdesigner/[/url]

    6. The Regulator [url]http://regex.osherove.com/[/url]
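
The backreferencing date example from the overview above can also be tried in Python. This is a minimal sketch, assuming Python's `re` module behaves like BRU for such a simple pattern. The generalised pattern with \d is an addition for illustration and not part of the overview.

```python
import re

# Literal version from the overview: 12-31-2005 becomes 2005_1231.
literal = re.sub(r"(12)-(31)-(2005)", r"\3_\1\2", "12-31-2005")

# Generalised version (an assumption, not from the overview):
# any MM-DD-YYYY date inside a longer name.
general = re.sub(r"(\d\d)-(\d\d)-(\d{4})", r"\3_\1\2", "Notes 03-02-2005.txt")

print(literal)
print(general)
```

Note that re.sub only replaces the matched part and keeps the rest of the text.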


Go ahead - Some interesting sides about reg ex,
http://www.bulkrenameutility.co.uk/forum/viewtopic.php?f=3&t=27


Other threads with examples of common interest will follow:




My RegEx hints:
Expression:
. --> any one character (letter, digit, punctuation, blank)
a --> the char "a" itself literally
abc --> the string "abc" itself literally
(aa|bb) --> one of the alternatives "aa" or "bb", whichever is found first
(aa|bb|cc) --> one of the alternatives "aa" or "bb" or "cc", whichever is found first
[ab3-] --> one from the list ("a" or "b" or "3" or "-") NOTE: a literal hyphen must be at the very beginning or at the very end of the list, NOT in between (there it would define a range)!
[^ab3-] --> one character, but none of those in this list (no "a", no "b", no "3" and no hyphen)
[^-] --> one character (letter, digit, whitespace, punctuation) but not a hyphen
[a-z] --> one from the range "a", "b", "c", "d"... to "z"
[a-d] --> one from the range "a", "b", "c" or "d"
[A-Z] --> the same, but matches upper case letters.
Note: all these A-Z/a-z ranges will only match plain 7-bit ASCII chars (English alphabet), no umlauts or accents or such.
3 --> the digit "3"
2013 --> the number "2013" literally
[6-9] --> any one digit from the range "6", "7", "8" and "9"
[^6-9] --> any one character (letter, digit, whitespace, punctuation) but not a "6", "7", "8" or "9"
\w --> any one letter, digit or underscore
\d --> any one digit from the range "0", "1", "2", "3" to "9"
\s --> one whitespace character (blank, tab, ...)
- --> one hyphen literally
_ --> one underscore literally
\. --> one dot literally
\\ --> one backslash literally
(...) --> group an expression to apply operators or for backreference. Instead of the three dots write your expression.
(Note: those groups are counted from left to right and can be nested too)
\W, \D, \S --> opposite of lower case \w \d \s
\W, \D, \S mean: match any one character, but NOT one from the character class \w or \d or \s
\W --> match one character but NOT a word character, \D --> match one character but NOT a digit, \S --> match one character but NOT a whitespace
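
A quick way to get a feeling for these classes is to test them character by character. Here is a minimal sketch in Python. The helper chars_matching is hypothetical and exists only for this illustration. Also note that Python's \w accepts non-ASCII letters for ordinary strings, which may differ from other flavours.

```python
import re

def chars_matching(pattern: str, text: str) -> str:
    """Return the characters of text that the one-character pattern matches."""
    return "".join(ch for ch in text if re.fullmatch(pattern, ch))

sample = "ab3-Z_ 7."

in_list = chars_matching(r"[ab3-]", sample)       # hyphen at the end is literal
not_in_list = chars_matching(r"[^ab3-]", sample)  # everything NOT in the list
high_digits = chars_matching(r"[6-9]", sample)    # digits 6 to 9 only
word_chars = chars_matching(r"\w", sample)        # letters, digits, underscore
digits = chars_matching(r"\d", sample)            # digits only
blanks = chars_matching(r"\s", sample)            # whitespace only
non_word = chars_matching(r"\W", sample)          # opposite of \w

for label, value in [("[ab3-]", in_list), ("[^ab3-]", not_in_list),
                     ("[6-9]", high_digits), (r"\w", word_chars),
                     (r"\d", digits), (r"\s", blanks), (r"\W", non_word)]:
    print(f"{label:8} -> {value!r}")
```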



Note: the single-character expressions above match exactly one character each!

To match more than one, just repeat them:
aa --> match two 'a' literally
\d\d\d\d --> match exactly four single digits like '1962' or '2013'
... --> (three dots) match any three characters (they may differ)
\s\s --> match two blanks

or use another meta character as a quantifier.


Quantifier:
* --> match greedy zero or more times the previous expression
+ --> match greedy one or more times the previous expression
{3} --> match exactly 3 times the previous expression
{3,} --> match greedy at least 3 times the previous expression
{0,5} --> match greedy zero up to 5 times the previous expression (the short form {,5} is not understood by every flavour; classic PCRE treats it as literal text)
{3,5} --> match greedy 5, or 4, or 3 times the previous expression
? --> behind *, + or {,} limits the match to as few as possible (non-greedy)
? --> behind an expression matches zero or one occurrence
Example:
\d+ --> match one-or-two-or-three-or....-or-as-many-as-possible digits. Like '3', or '42', or '123', or '5782332'
\d* --> match zero(none)-or-one-or-two-or....-or-as-many-as-possible digits. Like '' (nothing), or '3', or '42', or '123', or '5782332'
\d{4} --> match exactly four digits. Like '1962' or '2013' or '1234'
\d{2,4} --> match two, or three, or four digits. Like '08' or '2013' or '123'. Works greedy, will rather get you '2013' than '08'
\d{2}|\d{4} --> match exactly two or four digits. But it tries to match two first and then stops; even on '2013' it will get you only '20' and will never try to match four digits
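
The quantifier examples can be checked with a few lines of Python. This is a sketch, assuming Python's `re` handles these quantifiers the same way as BRU's engine. The sample text and the variable names are made up for illustration.

```python
import re

text = "Holiday 2013 part 08"

# \d+ : every run of one or more digits.
one_or_more = re.findall(r"\d+", text)

# \d* may match nothing at all: at the start of "abc" it matches the empty string.
zero_ok = re.match(r"\d*", "abc").group()

# \d{2,4} is greedy: it takes four digits when it can.
two_to_four = re.search(r"\d{2,4}", text).group()

# A trailing ? makes the quantifier non-greedy: as few as possible.
lazy_two_to_four = re.search(r"\d{2,4}?", text).group()

# Alternation tries the left side first, so \d{2} wins on '2013'.
two_or_four = re.search(r"\d{2}|\d{4}", text).group()

# Putting the longer alternative first gives the four-digit match.
four_or_two = re.search(r"\d{4}|\d{2}", text).group()

print(one_or_more, repr(zero_ok), two_to_four, lazy_two_to_four,
      two_or_four, four_or_two)
```

If you really want "two or four digits", put the longer alternative first, as in the last line.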



Boundaries:
\b --> Match at word boundary. Example: "\bfun\b" on "my fun function" will match 'fun' only.
\A or ^ --> at start of file name. Example: "^fun" on "fun function" will match first 'fun' only.
\Z or $ --> at end of file name. Example: "on$" on "onto my fun function" will match last 'on' only.
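
The three boundary examples can be verified by looking at where the matches start. This is a minimal sketch in Python and covers only \b, ^ and $. The variable names are for illustration only.

```python
import re

# Without boundaries, 'fun' is found twice: the word and inside 'function'.
all_fun = [m.start() for m in re.finditer(r"fun", "my fun function")]

# \b...\b keeps only the stand-alone word.
whole_word = [m.start() for m in re.finditer(r"\bfun\b", "my fun function")]

# ^ anchors the match to the start of the name.
at_start = [m.start() for m in re.finditer(r"^fun", "fun function")]

# $ anchors the match to the end of the name.
at_end = [m.start() for m in re.finditer(r"on$", "onto my fun function")]

print(all_fun, whole_word, at_start, at_end)
```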


Meta signs:
\ --> put the escape character "\" in front of a meta sign to match that meta sign itself
Meta signs are: ., \, (, ), [, {, }, +, *, ?, |, ^, $
Example:
\. --> one dot literally
\\ --> one backslash literally
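
Here is a short Python sketch of what escaping changes. The function re.escape is a Python helper that adds the backslashes for you. It is not part of BRU and is shown only to illustrate the idea.

```python
import re

# Unescaped dot matches any character, so "abc" is accepted.
loose = re.fullmatch(r"a.c", "abc") is not None

# Escaped dot matches only a real dot.
strict = re.fullmatch(r"a\.c", "abc") is not None
literal = re.fullmatch(r"a\.c", "a.c") is not None

# Text full of meta signs does not match itself as a pattern...
raw = "1+1=2?"
unescaped_ok = re.fullmatch(raw, raw) is not None

# ...but it does once the meta signs are escaped (Python helper).
escaped_ok = re.fullmatch(re.escape(raw), raw) is not None

print(loose, strict, literal, unescaped_ok, escaped_ok)
```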


backreference on replacement:
\1 - insert here what was matched by first (...)-group
\2 - insert here what was matched by second (...)-group
\3 ... \9 - insert what was matched by third, fourth, fifth,... till ninth group
(Note: some flavours use $1 syntax instead of \1)
(Note: those backreference groups are counted from left to right and can be nested too)
(1 ... (2 ... (3... ))) (4 ... ) (5 ... ) (6 ... (7 ... ) )
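
The counting rule for nested groups is easy to check: the number of a group follows the position of its opening bracket. Here is a small Python sketch with a made-up pattern.

```python
import re

# Opening brackets from left to right: 1=((a)(b(c))), 2=(a), 3=(b(c)), 4=(c), 5=(d)
m = re.fullmatch(r"((a)(b(c)))(d)", "abcd")

numbered = m.groups()
print(numbered)
```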


NOTE: for BRU, depending on what you want to do, you usually have to match the whole file name, not only the part you are interested in.
Example: "Interpret 2013 - Song title.mp3"
Right way:
Match: "(.+) \d\d\d\d - (.+)"
Replace: "\1 \2"
Wrong way:
Match: "\d\d\d\d"
Replace: ""
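
Why the "wrong way" fails can be illustrated with a small model. The function bru_style_rename below is hypothetical. It assumes, based on the note above, that RegEx(1) builds the new name from the replacement alone and drops whatever the pattern did not capture. The real behaviour of your BRU version may differ, so treat this as a sketch only.

```python
import re

def bru_style_rename(name: str, search: str, replace: str) -> str:
    """Hypothetical model: if the pattern matches, the new name is the
    expanded replacement only; otherwise the name stays unchanged."""
    m = re.search(search, name)
    return m.expand(replace) if m else name

# Name without extension, as with "Include Ext." unchecked.
name = "Interpret 2013 - Song title"

right = bru_style_rename(name, r"(.+) \d\d\d\d - (.+)", r"\1 \2")
wrong = bru_style_rename(name, r"\d\d\d\d", "")

print(repr(right))
print(repr(wrong))
```

Under this assumption the "wrong way" leaves you with an empty name, because nothing was captured to put back.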


Greed, greedy, OR non-greedy, reluctant:
By default, *, ?, +, and {min,max} are greedy because they consume all characters up through the last possible one that still satisfies the entire pattern.
To instead have them stop at the first possible character, follow them with a question mark. For example, the pattern <.+> (which lacks a question mark)
means: "search for a <, followed by one or more of any character, followed by a >". To stop this pattern from matching the entire string <em>text</em>,
append a question mark to the plus sign: <.+?>. This causes the match to stop at the first '>' and thus it matches only the first tag <em>
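
The difference can be seen directly in a minimal Python sketch. It uses the tag example from the paragraph above and, in addition, a file name like "Album1987".

```python
import re

html = "<em>text</em>"

# Greedy: .+ runs to the last '>' that still allows a match.
greedy_tag = re.search(r"<.+>", html).group()

# Non-greedy: .+? stops at the first '>'.
lazy_tag = re.search(r"<.+?>", html).group()

# The same effect on a file name: which two digits end up in group 1?
greedy_digits = re.search(r".+(\d\d)", "Album1987").group(1)
lazy_digits = re.search(r".+?(\d\d)", "Album1987").group(1)

print(greedy_tag, lazy_tag, greedy_digits, lazy_digits)
```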

Example:
".+(\d\d)" on "Album1987" will get you "87", because ".+" is greedy and matches "19" too.
".+?(\d\d)" on "Album1987" will get you "19", because the '?' in ".+?" makes that expression non-greedy, so it matches only until the first two digits are found.


Find more information:
http://www.regular-expressions.info/reference.html
http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm




########################## my template for my answers #########################

BEFORE (origin name):
Interpret 2013 - Song title.ext


AFTER (wanted new name):
Interpret - Song title.ext


SOLUTION (our way to success):


USE (these rules from BRU):
RegEx(1)
Search: "(.+) \d\d\d\d - (.+)"
Replace: "\1 - \2"
"[__] Include Ext." is unchecked.
Don't use the quotes "", they are only there for clarifying where the pattern begins and ends.


INSTRUCTIONS (how to use and which option to set):
= This solution is based on my tests, or on assumptions drawn from my past experience.
I can give no guarantee that your computer will not explode and delete all your files.
The solution is based on the provided information and may not work for other file name patterns.
= Remember to test this with some test files first. And always do a backup before you manipulate your important real files!
= Select a few files in the Name column to see what happens in the NewName column.
= Menu "Options > Ignore... > File Extensions" is unchecked.
= My pattern '.ext' stands for any file extension like '.mp3' or '.txt', as that often doesn't matter.
= Sometimes I use the sign '~' instead of a real space for better recognizability, like: "Interpret~~-~~Song.mp3" to "Interpret~-~Song.mp3"
= More about RegEx can be found here >>> http://www.bulkrenameutility.co.uk/forum/viewtopic.php?f=3&t=96
(that's: Board index ‹ Bulk Rename Utility ‹ Regular Expressions > "Getting help with Regular Expressions")


EXPLANATION (what have we done here step-by-step?):



HTH? :D

########################## /my template for my answers #########################
.
Stefan
 
Posts: 508
Joined: Fri Mar 11, 2005 7:46 pm
Location: Germany, EU

