I'm still working on getting a dev environment up and running for the pluginize issues, but in the meantime I've been testing some refinements against the released code.
I keep running into situations where I want to handle special characters such as:
“, ”, ‗, ’, ‖, and …
The problem is that they don't get matched if I use simple regexes like the ones below:
Code:
# Fix quotes mangled by pdftohtml
(re.compile(r'‗([^‗]+)‘(?!\w)'), lambda match: '‘'+match.group(1)+'’'),
(re.compile(r'―([^―]+)‖'), lambda match: '‘'+match.group(1)+'’'),
As a next step, I tried a Unicode match:
Code:
(re.compile(ur'\u2015([^―]+)\u2016', re.UNICODE), lambda match: '‘'+match.group(1)+'’'),
But if I attempt that I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
I believe this is telling me that the whole PDFTOHTML function is actually handling the text as ASCII.
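As a sanity check on where that message can come from, here is a minimal sketch that reproduces the same error text by decoding the UTF-8 bytes of ‘ (U+2018) as ASCII. Python 2 performs this kind of decode implicitly when a byte string such as '‘' is combined with a unicode string, so the replacement lambda is one possible trigger. That the lambda is the actual culprit here is an assumption, not something confirmed from the traceback.

```python
# UTF-8 encoding of U+2018 (left single quotation mark): three bytes,
# the first of which is 0xe2.
left_quote_bytes = u'\u2018'.encode('utf-8')
first_byte = bytearray(left_quote_bytes)[0]

try:
    # Explicit ASCII decode; Python 2 does the same thing implicitly
    # when a byte string is concatenated with a unicode string.
    left_quote_bytes.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(hex(first_byte))
print(error_message)
```

If this is the cause, the "position 0" in the message would refer to the first byte of the short replacement string rather than to the start of the document.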
I found this page with some info on the error:
http://www.amk.ca/python/howto/unicode
But I can't figure out how to apply that advice so that these regexes operate on the UTF-8 data the document is supposed to be encoded in, rather than on ASCII. Any suggestions?
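For anyone comparing notes, here is a minimal sketch of one direction that might work: decode the bytes to Unicode once, then keep the pattern and the replacement strings Unicode as well, so nothing gets coerced through ASCII. The sample input, the name `fix_quotes`, and the assumption that the data arrives as UTF-8 bytes are all hypothetical and not taken from the actual preprocessing code. The pattern uses a plain `u'...'` literal with `\u` escapes instead of `ur'...'` so the sketch also runs on Python 3.

```python
import re

# Hypothetical sample standing in for pdftohtml output: UTF-8 bytes for
# U+2015, some quoted text, then U+2016.
raw_bytes = b'He said \xe2\x80\x95quoted text\xe2\x80\x96 and left.'

# Decode once so all later matching works on code points, not bytes.
text = raw_bytes.decode('utf-8')

# Unicode pattern; every special character is written as an escape,
# including the one inside the character class.
pattern = re.compile(u'\u2015([^\u2015]+)\u2016', re.UNICODE)

def fix_quotes(match):
    # Unicode literals here too, so no implicit ASCII decode can occur.
    return u'\u2018' + match.group(1) + u'\u2019'

fixed = pattern.sub(fix_quotes, text)

# Encode back to UTF-8 only when writing the result out.
output_bytes = fixed.encode('utf-8')
print(repr(fixed))
```

If the text is already a unicode object by the time the rules run, the decode step would be unnecessary and only the literals in the pattern and the lambda would need the `u` prefix.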