Quote:
Originally Posted by DiapDealer
\b doesn't really "match" any characters—or more technically, its match is zero-length. It matches word boundaries. Which can be:
* Before the first character in the string, if the first character is a word character.
* After the last character in the string, if the last character is a word character.
* Between two characters in the string, where one is a word character and the other is not a word character.
A word character—without the (*UCP) flag—is [a-zA-Z0-9_] or \w
"There's"—for better or worse—is not one word in the eyes of regex. Because an apostrophe is not a word character. "There" would be one word and "s" would be another.
What are you wishing ’\b would find?
|
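For anyone who wants to poke at the rules in the quote, here is a minimal Python sketch. It is not the PCRE engine discussed above: it assumes Python's re module, with the re.ASCII flag standing in for "no (*UCP)" so that \w means [a-zA-Z0-9_]. The helper names words and boundaries are made up for illustration.

```python
import re

# re.ASCII limits \w to [a-zA-Z0-9_], roughly like PCRE without (*UCP).
WORD = re.compile(r"\b\w+\b", re.ASCII)
BOUNDARY = re.compile(r"\b", re.ASCII)

def words(text):
    """Return every run of word characters delimited by word boundaries."""
    return WORD.findall(text)

def boundaries(text):
    """Return the offsets at which the zero-length \\b assertion succeeds."""
    return [m.start() for m in BOUNDARY.finditer(text)]

print(words("There's"))       # ['There', 's']
print(boundaries("There's"))  # [0, 5, 6, 7]
```

The four offsets are the start of the string, either side of the apostrophe, and the end of the string, which is why "There" and "s" come out as separate words.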
Again, thanks for the tutorial. Why is it that when an MR poster explains something it makes complete sense, but when I try to read an official regex tutorial I actually feel my brain cells dying and my life expectancy withering?
The scenario I'm trying to catch is instances in which OCR software interpreted a ” as a ’. My guess is that the appropriate regex would be something like ’\b(?!\p{Ll}). I could also probably add a negative lookbehind to exclude common instances of the ’ functioning as a (plural) possessive or denoting an omitted character (maybe something like (?<!s|in)). Mostly my question was academic: just a way for me to get a better understanding of how and why regex behaves the way it does.
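A rough way to try that guess out is sketched below in Python. It is an approximation, not the same expression: Python's re module has no \p{Ll}, so (?![a-z]) is used as an ASCII-only stand-in, and its lookbehinds must be fixed-width, so (?<!s|in) is written as two separate lookbehinds. The name suspect_positions and the sample strings are invented for illustration.

```python
import re

# "\u2019" is U+2019, the right single quotation mark.
# Assumptions: (?![a-z]) approximates (?!\p{Ll}) for ASCII text only, and
# (?<!s)(?<!in) replaces (?<!s|in) because re needs fixed-width lookbehinds.
SUSPECT = re.compile("(?<!s)(?<!in)\u2019" + r"\b(?![a-z])")

def suspect_positions(text):
    """Return the offsets of right single quotes the pattern flags."""
    return [m.start() for m in SUSPECT.finditer(text)]

print(suspect_positions("don\u2019t"))              # []  lowercase letter follows
print(suspect_positions("He said \u2019Hello"))     # [8] uppercase letter follows
print(suspect_positions("the end.\u2019 Next"))     # []  a space follows, so \b fails
print(suspect_positions("wins\u2019Then"))          # []  excluded by (?<!s)
```

The third sample is the one to watch: since ’ is not a word character, \b right after it only succeeds when a word character comes next, so a ’ followed by a space or the end of the line is never flagged by this pattern.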