MobileRead Forums - View Single Post

ElMiko · 06-14-2012, 12:16 PM

Quote:

Originally Posted by DiapDealer

\b doesn't really "match" any characters—or more technically, its match is zero-length. It matches word boundaries. Which can be:

* Before the first character in the string, if the first character is a word character.
* After the last character in the string, if the last character is a word character.
* Between two characters in the string, where one is a word character and the other is not a word character.

A word character—without the (*UCP) flag—is [a-zA-Z0-9_] or \w

"There's"—for better or worse—is not one word in the eyes of regex. Because an apostrophe is not a word character. "There" would be one word and "s" would be another.

What are you wishing ’\b would find?
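
For anyone who wants to poke at the word-boundary rules quoted above, here is a minimal Python sketch. It is only an illustration: it assumes Python's built-in re module rather than the PCRE engine under discussion, and in Python \w is Unicode-aware by default, so it is broader than [a-zA-Z0-9_]. For these samples the two behave the same.

```python
import re

text = "There's"

# The apostrophe is not a word character, so \w+ sees two separate "words".
words = re.findall(r"\w+", text)
print(words)  # ['There', 's']

# \b consumes nothing: every match is an empty string at a boundary position.
boundaries = [m.span() for m in re.finditer(r"\b", text)]
print(boundaries)  # [(0, 0), (5, 5), (6, 6), (7, 7)]

# ’\b needs a word character right after the ’ to form a boundary.
inside_word = re.search(r"’\b", "There’s") is not None
before_space = re.search(r"’\b", "dogs’ bowls") is not None
print(inside_word, before_space)  # True False
```

The last two checks show why ’\b only fires when a letter, digit, or underscore immediately follows the ’.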

Again, thanks for the tutorial. Why is it that when an MR poster explains something it makes complete sense, but when I try to read an official regex tutorial I actually feel my brain cells dying and my life expectancy withering?

The scenario I'm trying to catch is instances in which OCR software interpreted a ” as a ’. My guess is that the appropriate regex would be something like ’\b(?!\p{Ll}). I could also probably add a negative lookbehind to exclude common instances of the ’ functioning as a (plural) possessive or denoting an omitted character (maybe something like (?<!s|in)). Mostly my question was academic: just a way for me to get a better understanding of how and why regex behaves the way it does.
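
A rough way to see what that guess actually matches, sketched in Python. Two assumptions to flag: Python's built-in re module has no \p{Ll}, so [a-z] stands in for it here (ASCII only), and re insists on fixed-width lookbehinds, so (?<!s|in) is written as two separate lookbehinds. The sample strings are made up for illustration.

```python
import re

# Approximation of ’\b(?!\p{Ll}) plus the (?<!s|in) lookbehind.
# Assumption: [a-z] replaces \p{Ll}; the lookbehind is split for Python's re.
pattern = re.compile(r"(?<!s)(?<!in)’\b(?![a-z])")

samples = [
    "“Stop!’He ran.",    # ’ directly before an uppercase letter
    "don’t",             # lowercase letter follows the ’
    "“Stop!’ he said.",  # space follows the ’, so \b finds no boundary
    "boys’Club",         # preceded by s, excluded by the lookbehind
]

hits = [pattern.search(s) is not None for s in samples]
print(hits)  # [True, False, False, False]
```

Note the third sample: because \b needs a word character right after the ’, a misread closing quote followed by a space or punctuation slips past this pattern.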