MobileRead Forums - View Single Post

capnm · 12-01-2011, 07:33 PM

Quote:

Originally Posted by ldolse

I think re.UNICODE only causes \w & \W to match non-ASCII characters, at least practically speaking. That would be okay except that \w also includes digits. If you're okay with matching numbers, then \w+ should be fine.

I've always wished it would make [a-zA-Z] work the way capnm wants. I suppose you might be able to mix it with a digit-exclusion lookahead:

(?u)(?=[^\d]+)(\w+)

But it's going to get tricky.

And what I really want is some form of (?u)[a-z] or (?u)[A-Z] to work, but I think I'm out of luck on that one.

I played/poked around a bit and here's what I found (which may even be correct):

This flavor of Python regex supports (?u), which makes \w, \d, and \b Unicode-aware.
It doesn't support \unnnn or \Unnnnnnnn.
It doesn't support upper/lower properties or character classes.
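
A quick way to check the first finding is to compare Unicode-aware and ASCII-only matching. This is only a sketch, assuming Python 3's standard re module, where str patterns are already Unicode-aware and (?u) is accepted but redundant; the flavor tested above is older and may behave differently. The sample string and variable names are made up for illustration.

```python
import re

SAMPLE = "café 123 naïve"

# (?u) asks for Unicode-aware \w; on Python 3 str patterns this is the default.
unicode_words = re.findall(r"(?u)\w+", SAMPLE)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", SAMPLE, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

The first list keeps "café" and "naïve" whole, while the ASCII-only run breaks them at the accented letters. Both lists still contain "123", which is the digit problem discussed in the quote.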

Revising your lookahead idea, I think this will emulate a Unicode-aware [a-zA-Z]:

(?u)\w(?!(?<=[\d_]))

but that doesn't solve my wish ...
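
To see how the two patterns above differ, here is a small sketch, again assuming Python 3's standard re module rather than the exact flavor tested. The names ORIGINAL, LETTER, LETTER_RUN and letter_runs are hypothetical; wrapping the single-character pattern in a repeated group is my own addition so that it matches whole runs the way [a-zA-Z]+ would.

```python
import re

# Quoted original idea: the lookahead only checks the next character,
# so digits later in the same word still get through.
ORIGINAL = re.compile(r"(?u)(?=[^\d]+)(\w+)")

# Revised idea: accept one \w character, then reject it if the character
# just consumed was a digit or an underscore.
LETTER = r"\w(?!(?<=[\d_]))"

# Repeat the single-character pattern to collect whole runs of letters.
LETTER_RUN = re.compile(r"(?u)(?:" + LETTER + r")+")


def letter_runs(text):
    """Return maximal runs of word characters that are not digits or underscores."""
    return LETTER_RUN.findall(text)


print(ORIGINAL.findall("abc123 42"))
print(letter_runs("naïve café_42 Łódź"))
```

The original lookahead lets "abc123" through as one match, while the revised pattern stops at the underscore and digits and returns only "naïve", "café" and "Łódź". It still cannot tell upper case from lower case, which is the unsolved wish.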

Oh, well. This was supposed to be a quick exercise in tweaking some template code. Now I'm just being stubborn.

Since I don't foresee any great inspiration on how to make a Unicode [a-z], I'll probably settle for adding [à-ÿ] to at least make it Latin-1 aware ...
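
A rough sketch of that Latin-1 fallback, assuming Python 3's re module and a UTF-8 source file. One thing to watch: the range à-ÿ covers U+00E0 through U+00FF, which also contains the division sign ÷ (U+00F7). The second pattern, with the range split around it, is a hypothetical variant and not something from the template code.

```python
import re

# Range as proposed: U+00E0 (à) through U+00FF (ÿ). It also contains ÷ (U+00F7).
LATIN1_LOWER = re.compile("[a-zà-ÿ]+")

# Hypothetical variant: split the range at ö (U+00F6) and ø (U+00F8) to skip ÷.
LATIN1_LOWER_NO_DIVISION = re.compile("[a-zà-öø-ÿ]+")

SAMPLE = "crème brûlée ÷ Ünder"

print(LATIN1_LOWER.findall(SAMPLE))
print(LATIN1_LOWER_NO_DIVISION.findall(SAMPLE))
```

Both patterns pick up "crème" and "brûlée" and skip the capital Ü, but only the first one also matches the stray ÷.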