MobileRead Forums - View Single Post

macr0t0r · 12-02-2009, 06:37 PM

Quote:

Originally Posted by user_none

A simpler way to do this is:

Code:

text = re.sub('[^\x00-\x7f]', lambda x: '\\U%04x' % ord(x.group()), text)

Hmmm... I like that it's small and fast, but it has a couple of issues. First off, the \U tag has proven unreliable with some fonts, and it's a train wreck on Symbian devices. Also, I prefer how "\a000" translates directly to "& #000;" in HTML (the space is only there to prevent htmlizing).
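To illustrate that last point, here is a rough sketch of how a "\aNNN" code could be rewritten as an HTML numeric reference. The function name pml_a_to_html is made up for this example, and it assumes the three decimal digits carry over to HTML unchanged:

```python
import re

def pml_a_to_html(text):
    # Hypothetical helper: rewrite each \aNNN (exactly three decimal digits)
    # as the HTML numeric character reference &#NNN;
    return re.sub(r'\\a(\d{3})', r'&#\1;', text)

print(pml_a_to_html(r'caf\a233'))  # prints: caf&#233;
```

Nothing comparable falls out of the \U form this easily, since its digits are hexadecimal and would need an "&#x...;" reference instead.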

Second, I don't believe there are extended codes for \x80 and \x81. However, this is a fascinating little trick. Perhaps this could work?

Code:

text = re.sub('[\x82-\xff]', lambda x: '\\a%03d' % ord(x.group()), text)

Then, perhaps I could fall back to Unicode for whatever is left:

Code:

text = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), text)
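Put together, the two passes might look like the sketch below. It is untested against real devices; the function name escape_non_ascii is just for illustration, and it assumes text is a Unicode string (Python 3 str) rather than raw bytes:

```python
import re

def escape_non_ascii(text):
    # Pass 1: code points 0x82-0xff become \a plus a three-digit decimal code.
    text = re.sub('[\x82-\xff]', lambda m: '\\a%03d' % ord(m.group()), text)
    # Pass 2: anything above 0xff falls back to \U plus (at least) four hex digits.
    text = re.sub('[^\x00-\xff]', lambda m: '\\U%04x' % ord(m.group()), text)
    return text

print(escape_non_ascii('caf\xe9 \u03b1'))  # prints: caf\a233 \U03b1
```

Note that \x80 and \x81 pass through both substitutions untouched, and a code point above 0xffff would come out with more than four hex digits, so both cases would still need their own handling.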

- Jim