Thread: RegEx & Unicode
12-01-2011, 06:33 PM   #12
capnm
Quote:
Originally Posted by ldolse
I think re.UNICODE only causes \w & \W to match non-ASCII characters, at least practically speaking. That would be okay except that \w also includes numbers. If you're okay with matching numbers then \w+ should be ok.

I've always wished it would make [a-zA-Z] work the way capnm wants. I suppose you might be able to mix it with a digit-exclusion lookahead:

(?u)(?=[^\d]+)(\w+)

But it's going to get tricky.

And what I really want is some form of (?u)[a-z] or (?u)[A-Z] to work, but I think I'm out of luck on that one.

I played/poked around a bit and here's what I found (which may even be correct):

This flavor of Python regex supports (?u), which makes \w, \d, and \b Unicode-aware.
It doesn't support \unnnn or \Unnnnnnnn.
It doesn't support upper/lower Unicode properties (such as \p{Lu} and \p{Ll}) or POSIX-style character classes (such as [[:upper:]]).
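
For anyone who wants to check the (?u) behaviour themselves, here is a minimal sketch. It assumes Python 3, where str patterns are already Unicode-aware by default, so re.ASCII is used to reproduce the old ASCII-only matching for comparison. The sample text and variable names are made up for illustration, and the \u escapes are ordinary string-literal escapes, not regex syntax.

```python
import re

# Hypothetical sample text: "naïve café 42 foo_bar"
text = "na\u00efve caf\u00e9 42 foo_bar"

# ASCII-only matching: \w stops at the accented letters.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware matching: (?u) lets \w match the accented letters too.
# Note that digits and the underscore still count as \w.
unicode_words = re.findall(r"(?u)\w+", text)

print(ascii_words)
print(unicode_words)
```

The second list keeps the accented words whole, but "42" and "foo_bar" are still matched, which is the \w problem ldolse mentioned.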


Revising your lookahead idea, I think this will emulate a Unicode-aware [a-zA-Z]:

(?u)\w(?!(?<=[\d_]))
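
A quick way to sanity-check that pattern, sketched under the assumption of Python 3's re module. The second pattern, [^\W\d_], is not from this thread; it is a commonly used alternative spelling of "word character that is neither a digit nor an underscore", included only for comparison. The sample string and names are hypothetical.

```python
import re

# The pattern from the post: match one \w character, then look back at it
# and reject the match if it was a digit or an underscore.
letter = re.compile(r"(?u)\w(?!(?<=[\d_]))")

# Assumed alternative (not from the thread): a negated class that excludes
# non-word characters, digits, and the underscore, leaving only letters.
letter_alt = re.compile(r"(?u)[^\W\d_]")

# Hypothetical sample: "été_2011 Ж" (Latin letters with accents, an
# underscore, digits, a space, and one Cyrillic letter).
sample = "\u00e9t\u00e9_2011 \u0416"

found = letter.findall(sample)
found_alt = letter_alt.findall(sample)

print(found)
print(found_alt)
```

Both patterns match one letter at a time, so they behave like a single [a-zA-Z] and can take a + or * quantifier when wrapped in a group.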

but that doesn't solve my wish ...

Oh, well. This was supposed to be a quick exercise in tweaking some template code. Now I'm just being stubborn.

Since I don't foresee any great inspiration on how to make a Unicode [a-z], I'll probably settle for adding [à-ÿ] to at least make it Latin-1 aware ...
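
A rough sketch of that fallback, again assuming Python 3. The range is written with string-literal escapes (\u00e0 is à, \u00ff is ÿ) so the example does not depend on file encoding; the names are hypothetical. One caveat worth knowing: the range also covers U+00F7, the division sign, which is not a letter, and it stops at U+00FF, so letters outside Latin-1 are still missed.

```python
import re

# [a-z] extended with the Latin-1 lowercase range à (U+00E0) to ÿ (U+00FF).
# Not a raw string, so Python itself expands the \u escapes.
latin1_lower = re.compile("[a-z\u00e0-\u00ff]+")

# "déjà" is covered by the extended range.
matches_deja = latin1_lower.fullmatch("d\u00e9j\u00e0") is not None

# Caveat: the division sign (U+00F7) sits inside the range and matches too.
matches_division_sign = latin1_lower.fullmatch("\u00f7") is not None

# "ž" (U+017E) lies outside Latin-1 and is not matched.
matches_z_caron = latin1_lower.fullmatch("\u017e") is not None

print(matches_deja, matches_division_sign, matches_z_caron)
```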