Old 04-22-2009, 04:00 AM   #24
ldolse
Matching special punctuation, etc

Still working on getting a dev environment up and running for the pluginize issues, but in the meantime I was testing some refinements in the released code.

I keep running into situations where I want to handle special characters such as:
“, ”, ‗, ’, ‖, and …

The problem is that they don't get matched when I use simple regexes like the ones below:

Code:
# Fix quotes mangled by pdftohtml
(re.compile(r'‗([^‗]+)‘(?!\w)'), lambda match: '‘'+match.group(1)+'’'),
(re.compile(r'―([^―]+)‖'), lambda match: '‘'+match.group(1)+'’'),
As a next step I tried a unicode match:
Code:
(re.compile(ur'\u2015([^―]+)\u2016', re.UNICODE), lambda match: '‘'+match.group(1)+'’'),
But if I attempt that I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

I believe this is telling me that the whole PDFTOHTML function is actually being handled as ASCII.
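
One thing that does fit: 0xe2 is the first byte of the UTF-8 encoding of characters like ― (U+2015), so the message is what you'd expect if UTF-8 bytes were being decoded with the ascii codec somewhere. Here is a minimal sketch that produces the same message outside the converter. The sample string and variable names are made up for illustration, and I'm only guessing that this is the same failure:

```python
# Hypothetical sample, not taken from the real pdftohtml output.
sample = u'\u2015quoted\u2016'

# UTF-8 encodes U+2015 as the three bytes e2 80 95.
raw = sample.encode('utf-8')
lead_byte = bytearray(raw)[0]

try:
    # Decoding those bytes with the ascii codec fails on the very first byte.
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(hex(lead_byte))
print(message)
```

If that is what's happening, the plain '‘' and '’' byte strings in the replacement lambda are one possible culprit, since they also start with 0xe2, but I haven't confirmed it.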

I found this page with some info on the error:
http://www.amk.ca/python/howto/unicode
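
As far as I understand it, the advice there amounts to decoding the bytes once and then using unicode strings for everything: the text, the pattern and the replacement. Below is a standalone sketch of that idea. It assumes the input arrives as UTF-8 bytes, and the sample sentence and the names raw, text and rule are made up for illustration rather than taken from the converter:

```python
import re

# Hypothetical input: UTF-8 bytes as they might arrive from the converter.
raw = u'He said \u2015hello\u2016 and left.'.encode('utf-8')

# Decode once at the boundary so everything downstream is unicode text.
text = raw.decode('utf-8')

# Pattern and replacement are both unicode; the \u escapes avoid any
# dependence on the encoding of the source file itself.
rule = (
    re.compile(u'\u2015([^\u2015]+)\u2016', re.UNICODE),
    lambda match: u'\u2018' + match.group(1) + u'\u2019',
)

pattern, replacement = rule
fixed = pattern.sub(replacement, text)
print(fixed)
```

The (pattern, function) tuple mirrors the shape of the rules above, so in principle it could go into the same list.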

But I can't figure out how to apply this advice so that these regexes treat the document as the UTF-8 data it is supposed to be, rather than as ASCII. Any suggestions?

Last edited by ldolse; 04-22-2009 at 05:37 AM.