MobileRead Forums - View Single Post

Serpentine · 12-01-2011, 07:17 PM

Quote:

Originally Posted by capnm

But I'm curious -- why the leading (?:^|\s+) instead of \s*? Is there a functional difference?

If you're not using the unicode support and don't have the locale flag set, some non-whitespace characters (and punctuation you want to avoid) will be seen as a break in a word. If you were to use \s*, the next letter, which may well be in the middle of a word, would then be used as an initial.

Requiring that the match begin either at the start of the string (careful of multiline issues) or after one or more whitespace characters removes this situation, since a word can then only be separated from the previous one by whitespace.
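To make the difference concrete, here is a rough sketch in Python's re module. The patterns are cut-down versions written for illustration, not the exact ones from this thread, and the sample name is made up. re.ASCII stands in for "no unicode or locale flag", since Python 3 treats str patterns as Unicode by default.

```python
import re

# Hypothetical sample name; \u00e9 is "é", which ASCII-only \w does not
# treat as a word character, so it acts as an accidental word break.
NAME = "Am\u00e9lie Nothomb"

def initials(pattern, text, flags=0):
    """Return the captured first character of every match."""
    return re.findall(pattern, text, flags)

# \s* can match nothing at all, so a match may start in the middle of a word.
loose = initials(r"\s*(\w)\w*", NAME, re.ASCII)

# (?:^|\s+) only lets a match start at the string start or after whitespace.
anchored = initials(r"(?:^|\s+)(\w)\w*", NAME, re.ASCII)

# With Unicode-aware \w the loose pattern also behaves, because "é" is a letter.
unicode_aware = initials(r"\s*(\w)\w*", NAME)

print(loose)          # ['A', 'l', 'N']
print(anchored)       # ['A', 'N']
print(unicode_aware)  # ['A', 'N']
```

The stray 'l' in the first result is the letter after the "é", picked up as if it started a new word.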

If you want to use it for replacement, as you wanted, the pattern would be:

Code:

find: (?iu)(?:^|\s+)((?:\d+\.?\d*?)|(?:[\D]))[\w]+
replace: \1
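As a quick way to try the find/replace pair outside the search dialog, something like the following should work, assuming Python's re module semantics. The function name and the sample strings are made up for the example.

```python
import re

# The find/replace pair from above, copied verbatim.
FIND = r"(?iu)(?:^|\s+)((?:\d+\.?\d*?)|(?:[\D]))[\w]+"
REPLACE = r"\1"

def to_initials(text):
    """Replace each whitespace-separated word with its captured first character."""
    return re.sub(FIND, REPLACE, text)

# The leading whitespace sits outside group 1, so it is dropped by the replacement.
print(to_initials("Am\u00e9lie Nothomb"))  # AN
print(to_initials("john smith"))           # js
```

Because the separating whitespace is matched but not captured, the initials come out run together; move the \s+ into the replacement if you want them spaced.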

Note that it then uses the unicode flag, though: a trade-off between being robust and matching things easily.