Quote:
Originally Posted by user_none
A simpler way to do this is:
Code:
text = re.sub('[^\x00-\x7f]', lambda x: '\\U%04x' % ord(x.group()), text)
Hmmm... I like that it's small and fast, but it has a couple of issues. First off, the \U tag has proven unreliable with some fonts, and it's a train wreck on Symbian devices. Also, I prefer how "\a000" translates directly to "& #000;" in HTML (space added to prevent htmlizing).
Second, I don't believe there are extended codes for \x80 and \x81. However, this is a fascinating little trick. Perhaps this could work?
Code:
text = re.sub('[\x82-\xff]', lambda x: '\\a%03d' % ord(x.group()), text)
Then perhaps I could fall back to Unicode for whatever is left:
Code:
text = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), text)
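Put together, the two passes might look something like the sketch below. This is untested against real devices and assumes Python 3 with str input. The function name escape_non_ascii is made up for illustration, and I've used raw-string patterns so the re module handles the \x escapes itself.

```python
import re

def escape_non_ascii(text):
    """Hypothetical helper: apply the \\a pass first, then the \\U fallback."""
    # Pass 1: code points 0x82-0xff become \aNNN (three decimal digits).
    text = re.sub(r'[\x82-\xff]',
                  lambda m: '\\a%03d' % ord(m.group()), text)
    # Pass 2: anything above 0xff becomes \UXXXX (at least four hex digits).
    text = re.sub(r'[^\x00-\xff]',
                  lambda m: '\\U%04x' % ord(m.group()), text)
    return text

print(escape_non_ascii('caf\xe9'))   # \xe9 is 233 decimal
print(escape_non_ascii('\u20ac5'))   # euro sign is above 0xff
```

Note that \x80 and \x81 fall through both patterns and come out unchanged, which matches my guess that they have no extended codes.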
- Jim