12-02-2009, 06:37 PM   #57
macr0t0r
Quote:
Originally Posted by user_none
A simpler way to do this is:

Code:
text = re.sub('[^\x00-\x7f]', lambda x: '\\U%04x' % ord(x.group()), text)
Hmmm... I like that it's small and fast, but it has a couple of issues. First off, the \U tag has proven to be unreliable with some fonts, and it's a train wreck on Symbian devices. Also, I prefer how "\a000" translates directly to "& #000;" in HTML (the space is only there to keep the forum from rendering the entity).
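
To illustrate that direct mapping, here is a minimal sketch, assuming the \a tag is always followed by exactly three decimal digits. The function name pml_a_to_html is hypothetical and not part of any existing converter:

```python
import re

def pml_a_to_html(text):
    # Hypothetical helper: rewrite each \aNNN (three decimal digits)
    # as the HTML numeric entity with the same digits.
    return re.sub(r'\\a(\d{3})', lambda m: '&#%s;' % m.group(1), text)
```

The digits are copied across unchanged, so no lookup table is needed for this direction.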

Second, I don't believe there are extended codes for \x80 and \x81. However, this is a fascinating little trick. Perhaps this could work?
Code:
text = re.sub('[\x82-\xff]', lambda x: '\\a%03d' % ord(x.group()), text)
Then, perhaps I could fall back to Unicode (\U) for anything above \xff:
Code:
text = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), text)
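
Put together, the two passes might look like the sketch below. It assumes Python 3 str input (so ord() returns the code point), and the function name escape_non_ascii is hypothetical. Note that \x80 and \x81 are matched by neither pattern and pass through unchanged, and that code points above 0xFFFF would produce more than four hex digits:

```python
import re

def escape_non_ascii(text):
    # Pass 1: code points 0x82-0xFF become \aNNN (three decimal digits).
    text = re.sub('[\x82-\xff]', lambda m: '\\a%03d' % ord(m.group()), text)
    # Pass 2: anything above 0xFF becomes \Uxxxx (at least four hex digits).
    # The output of pass 1 is pure ASCII, so this pass leaves it alone.
    text = re.sub('[^\x00-\xff]', lambda m: '\\U%04x' % ord(m.group()), text)
    return text
```

For example, escape_non_ascii('caf\xe9') gives the nine characters caf\a233.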
- Jim

Last edited by macr0t0r; 12-02-2009 at 06:40 PM.