Thread: RegEx & Unicode
Old 12-01-2011, 07:17 PM   #14
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by capnm View Post
But I'm curious -- why use the leading (?:^|\s+) instead of \s*; is there a functional difference?
If you're not using the unicode support or the locale flag, some non-whitespace characters (including punctuation you want to avoid) will be seen as a break in a word. Because \s* can also match zero characters, the match could then start right after such a character, and the next letter, which may well be in the middle of a word, would be used as an initial.

By specifying that the starting point has to be either the start of the string (careful of multiline issues) or one or more whitespace characters, this situation is avoided, as words can only be separated by whitespace.
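To make the difference concrete, here is a minimal sketch in Python's re module. Python is an assumption here, as the thread does not say which regex engine is in use, and re.ASCII only stands in for "no unicode flag". The initial-grabbing patterns are simplified for illustration and are not the full pattern from this thread.

```python
import re

# "naïve café", written with escapes so the accented letters are precomposed.
text = "na\u00efve caf\u00e9"

# re.ASCII restricts \w to [a-zA-Z0-9_], so the accented letters
# are no longer word characters and look like breaks inside a word.
loose = re.findall(r"\s*(\w)\w*", text, flags=re.ASCII)
anchored = re.findall(r"(?:^|\s+)(\w)\w*", text, flags=re.ASCII)

print(loose)     # ['n', 'v', 'c']  ('v' is picked up from the middle of "naïve")
print(anchored)  # ['n', 'c']
```

With \s* the match is free to restart after the unrecognised letter, which is where the bogus initial comes from. The anchored version can only start at the beginning of the string or after whitespace.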

If you want to use it for replacement, as you wanted, the pattern would be:
Code:
find: (?iu)(?:^|\s+)((?:\d+\.?\d*?)|(?:[\D]))[\w]+
replace: \1
Though it then relies on the unicode flag, which is a trade-off between being robust and matching things easily.
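As a rough check of the find/replace pair above, here is a sketch using Python's re.sub. This assumes an engine that behaves like Python's re, so the tool you are actually using may differ in details. The helper name initials and the sample strings are made up for illustration.

```python
import re

# Pattern copied from the post; (?iu) turns on case-insensitive and unicode matching.
PATTERN = re.compile(r"(?iu)(?:^|\s+)((?:\d+\.?\d*?)|(?:[\D]))[\w]+")


def initials(text):
    """Replace each whitespace-separated word with its first character (group 1)."""
    return PATTERN.sub(r"\1", text)


print(initials("John Ronald Reuel Tolkien"))  # JRRT
print(initials("\u00c9mile Zola"))            # ÉZ
```

Note that the leading whitespace is part of the match, so it is dropped by the replacement and the initials end up run together.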

Last edited by Serpentine; 12-01-2011 at 07:43 PM.