Yes, I really did not understand word boundaries and word characters completely until I conducted this experiment.
It's very unfortunate that the regex administrators decided to make the underscore a word character, but not the apostrophe.
To me, this seems like it should be the opposite, so I really do hope they will reconsider this decision soon.
I can understand that "contraction" means "joined words", but a contraction is still just one word.
Also, some languages shorten long words with an apostrophe, like Mönchengladbach ===> M'gladbach.
But I never found any language that uses the underscore to spell its words anywhere!
It almost seems like aliens studied our language to invent regex, but then made mistakes with the underscore and the apostrophe.
So I spent many internet searches looking for groups like (?\w:') or (?\w:!_) that would say things like...
"Grant the apostrophe as a word character" -or- "Forbid the underscore as a word character".
But everyone said lookarounds were the only way, so my first experiments took a very long time.
The problem was that I never wanted to alter anything inside of
(boundary)(RomanGroups)(boundary).
So instead, all of my experiments looked like...
(Lookahead)(boundary)(RomanGroups)(boundary)(Lookbehind).
The lookahead worked properly with (?!_*[c-x]+'\w) to say 'the apostrophe can't touch a word character on the right'.
But when I tried a lookbehind with (?<!\w'[c-x]+) to say 'the apostrophe can't touch a word character on the left', it always failed.
The regex would always report that using + was invalid, and that would break my entire expression!
(The [c-x] could have been \w, but I am just using c-x to make it easy to see where I am matching the Romans.)
So don't ask me why, but the lookbehind only wants to look behind one exact number of characters (never 1-or-2, or 1-or-more, etc).
I even experimented with things like {1,15} to say 1-through-15, but it always forbade ranges and only granted one exact number.
I have not found anything about this in the manuals, so maybe it's the regex administrators, or maybe it's a BRU limitation??
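The fixed-length rule is easy to reproduce outside BRU. Here is a minimal sketch using Python's re module, which is only a stand-in for whatever engine BRU uses (that the two behave alike here is an assumption). The helper name compiles is made up for this example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Python reports "look-behind requires fixed-width pattern"
        return False

# One exact length inside the lookbehind: accepted.
print(compiles(r"(?<!\w'[c-x]{2})"))

# Variable length inside the lookbehind: rejected.
print(compiles(r"(?<!\w'[c-x]+)"))
print(compiles(r"(?<!\w'[c-x]{1,15})"))

# The same quantifier is fine inside a lookahead.
print(compiles(r"(?!_*[c-x]+'\w)"))
```

In this engine the restriction applies only to the lookbehind; the lookahead accepts + and * without complaint, which matches what I saw above.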
This makes the experiments very difficult, because instead of using just one lookbehind, I was using very many, like...
(?<!\w'[c-x])(?<!\w'[c-x]{2})(?<!\w'[c-x]{3})(?<!\w'[c-x]{4})(?<!\w'[c-x]{5})(?<!\w'[c-x]{6})(?<!\w'[c-x]{7})(?<!\w'[c-x]{8})
Except I was going all the way to {15}, just in case I needed to match the very longest possible Roman numeral.
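Typing fifteen lookbehinds by hand is error-prone, so a chain like this can be generated instead. This is only a sketch in Python (a stand-in engine, not BRU itself); lookbehind_chain and MAX_LEN are names invented for the example, and the first group is written as {1} rather than with no quantifier, which means the same thing.

```python
import re

MAX_LEN = 15  # longest run of numeral letters the chain should cover

def lookbehind_chain(max_len: int = MAX_LEN) -> str:
    """Build one fixed-width negative lookbehind per length 1..max_len."""
    return "".join(rf"(?<!\w'[c-x]{{{n}}})" for n in range(1, max_len + 1))

chain = lookbehind_chain()

# Old layout from this post: (boundary)(RomanGroups)(boundary)(Lookbehind),
# with [c-x]+ standing in for the real Roman groups.
pattern = re.compile(r"\b[c-x]+\b" + chain)

print(bool(pattern.search("iii")))    # plain numeral
print(bool(pattern.search("a'iii")))  # word character + apostrophe on the left
```

The first search should succeed and the second should fail, because the {3} link of the chain sees a'iii behind the end of the match.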
Then I tried to add exceptions, but it was getting much longer and more complicated, so then I asked this question...
viewtopic.php?f=4&t=5494
Finally, I learned to stop using (Lookahead)(boundary)(RomanGroups)(boundary)(Lookbehind), because both lookarounds must match the Romans all over again!
But with
(boundary)(Lookbehind)(RomanGroups)(Lookahead)(boundary),
now the lookarounds only worry about matching outside the boundaries.
Hopefully this logic can help other users, because the hardest part for me to understand was where best to place the lookarounds in the expression.
And especially how lookbehinds refuse to look behind anything other than "one exact number of characters", so they always forbid things like *, +, or {1,15}.
Also, you can change both \w ===> [a-z\d] to forbid matching underscores, because \w behaves like...
iii _'iii'_ _'iii' 'iii'_ a'iii'_ iii ===> III _'iii'_ _'iii' 'iii'_ a'iii'_ III (because \w matches the underscore as part of a word).
SomeUnicode'iii iii'SomeUnicode iii ===> SomeUnicode'iii iii'SomeUnicode III (because \w matches Unicode word characters).
I prefer using \w like above, because [a-z\d] can't match Unicode characters, and [a-z\d] behaves like...
iii _'iii'_ _'iii' 'iii'_ a'iii'_ iii ===> III _'III'_ _'III' 'III'_ a'iii'_ III (because [a-z\d] does not match _ as a word character).
SomeUnicode'iii iii'SomeUnicode iii ===> SomeUnicode'III III'SomeUnicode III (because [a-z\d] does not match Unicode characters!)
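The two behaviours above can be compared side by side with a small sketch. It uses Python as a stand-in engine, a simplified hypothetical ROMAN group instead of my real RomanGroups, and the letter é as an example of "SomeUnicode"; build and upper_romans are names invented here.

```python
import re

ROMAN = r"[ivxlcdm]+"  # hypothetical stand-in for the real RomanGroups

def build(word_char: str) -> re.Pattern:
    """Put the chosen 'word character' class into both lookarounds."""
    return re.compile(rf"\b(?<!{word_char}')({ROMAN})(?!'{word_char})\b")

def upper_romans(pattern: re.Pattern, text: str) -> str:
    return pattern.sub(lambda m: m.group(1).upper(), text)

SAMPLE = "iii _'iii'_ _'iii' 'iii'_ a'iii'_ iii"
UNICODE_SAMPLE = "\u00e9'iii iii'\u00e9 iii"  # é stands in for SomeUnicode

for word_char in (r"\w", r"[a-z\d]"):
    pattern = build(word_char)
    print(word_char, "=>", upper_romans(pattern, SAMPLE))
    print(word_char, "=>", upper_romans(pattern, UNICODE_SAMPLE))
```

With \w the underscore and é both protect the numeral next to the apostrophe; with [a-z\d] neither does, so those numerals get uppercased.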
So for me, I prefer \w, for a regex that does "not enough", instead of [a-z\d], which does "too much".
And you could always add another (?X)expression to fix the underscores, but for now I am still studying the Unicode characters.
I am hoping to discover a way to match Unicode characters inside of [a-z\d here], so that everything can be inside just one expression??
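One idiom that might do this is the negated class [^\W_], which reads as "not a non-word character and not an underscore", in other words every \w character except the underscore. The sketch below tries it in Python as a stand-in engine; whether BRU treats it the same way is an assumption I have not verified. ROMAN is again a simplified hypothetical group, and é stands in for "SomeUnicode".

```python
import re

ROMAN = r"[ivxlcdm]+"            # hypothetical stand-in for the real RomanGroups
WORD_NO_UNDERSCORE = r"[^\W_]"   # any \w character except the underscore

PATTERN = re.compile(
    rf"\b(?<!{WORD_NO_UNDERSCORE}')({ROMAN})(?!'{WORD_NO_UNDERSCORE})\b"
)

def upper_romans(text: str) -> str:
    return PATTERN.sub(lambda m: m.group(1).upper(), text)

print(upper_romans("iii _'iii'_ _'iii' 'iii'_ a'iii'_ iii"))
print(upper_romans("\u00e9'iii iii'\u00e9 iii"))
```

If the engine's \w is Unicode-aware, this should treat the underscore like [a-z\d] does while still treating Unicode letters like \w does, all in one expression.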