MobileRead Forums - View Single Post - Regexes to improve pdf to epub conversion

ldolse · 04-22-2009, 04:00 AM

Still working on getting a dev environment up and running for the pluginize issues, but in the meantime I was testing some refinements in the released code.

I keep running into situations where I want to handle special characters such as:
“, ”, ‗, ’, ‖, and …

The problem is that they don't get matched when I use simple regexes like the ones below:

Code:

                  # Fix quotes mangled by pdftohtml
                  (re.compile(r'‗([^‗]+)‘(?!\w)'), lambda match: '‘'+match.group(1)+'’'), 
                  (re.compile(r'―([^―]+)‖'), lambda match: '‘'+match.group(1)+'’'),
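
One possible explanation, assuming the text reaching these rules has already been decoded to Unicode: a pattern written as a plain byte string holds the three UTF-8 bytes of U+2015 (0xe2 0x80 0x95), and each byte is compared as if it were a character of its own, so it never lines up with the single character U+2015 in the text. A rough sketch of that effect in Python 3, with the byte-per-character view emulated through latin-1 (an illustration only, not what the converter does internally; the sample text is made up):

```python
import re

# Decoded text as a str: U+2015 and U+2016 standing in for mangled quotes.
text = '\u2015quoted\u2016'

# The same pattern written two ways.
text_pattern = '\u2015([^\u2015]+)\u2016'
# Emulation (assumption): every UTF-8 byte of the pattern is treated as
# one character, which is what a byte-string pattern effectively contains.
byte_view_pattern = text_pattern.encode('utf-8').decode('latin-1')

byte_view_match = re.search(byte_view_pattern, text)  # bytes vs. characters
text_match = re.search(text_pattern, text)            # characters vs. characters

print(byte_view_match)
print(text_match.group(1))
```

If this is what is happening, the pattern and the data have to be the same kind of string before the special characters can match.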

Next, I tried a Unicode match:

Code:

 
 (re.compile(ur'\u2015([^―]+)\u2016', re.UNICODE), lambda match: '‘'+match.group(1)+'’'),

But if I attempt that I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

I believe this is telling me that the whole PDFTOHTML function is actually being handled as ASCII.
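
The byte 0xe2 at position 0 is also consistent with a narrower cause. The UTF-8 encoding of U+2018 starts with 0xe2, so the message is what one would expect if the plain quote literal in the lambda were a byte string being joined to a Unicode match.group(1), with Python 2 decoding the bytes as ASCII behind the scenes. A minimal sketch under that assumption, written for Python 3, where the decode has to be spelled out (`ascii_decode_error` is a hypothetical helper, not part of the conversion code):

```python
# UTF-8 bytes of the left single quote U+2018, as a byte string holds them.
quote_bytes = '\u2018'.encode('utf-8')  # three bytes, the first is 0xe2


def ascii_decode_error(data):
    """Return the error message from decoding data as ASCII, or None."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


message = ascii_decode_error(quote_bytes)
print(message)
```

Under this reading the error comes from the replacement side of the rule rather than from the pattern, which would suggest the matched text is already Unicode.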

I found this page with some info on the error:
http://www.amk.ca/python/howto/unicode

But I can't figure out how to apply this advice so that these regexes operate on the UTF-8 data the document is supposed to be encoded in, rather than on ASCII. Any suggestions?
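
One direction that may be worth trying is to keep everything on the Unicode side: decode the pdftohtml output once, then use Unicode strings for both the patterns and the replacement text. The sketch below is Python 3, where every string literal is already Unicode; in Python 2 the rough equivalent would be a `u` prefix on the replacement literals as well as the patterns. `RULES` and `fix_quotes` are hypothetical names, and the assumption that the input is UTF-8 comes from the post, not from the converter's source.

```python
import re

# Hypothetical rule table in the same (pattern, replacement) shape as above.
# The re module itself understands \uXXXX escapes in text patterns.
RULES = [
    (re.compile(r'\u2017([^\u2017]+)\u2018(?!\w)'),
     lambda match: '\u2018' + match.group(1) + '\u2019'),
    (re.compile(r'\u2015([^\u2015]+)\u2016'),
     lambda match: '\u2018' + match.group(1) + '\u2019'),
]


def fix_quotes(raw, encoding='utf-8'):
    """Decode raw bytes once, then apply every rule to the decoded text."""
    text = raw.decode(encoding)
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


sample = '\u2015hello\u2016'.encode('utf-8')  # made-up input bytes
print(fix_quotes(sample))
```

Because the pattern, the matched group and the replacement are all the same string type here, nothing has to be converted through ASCII along the way.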

Post #24 · Matching special punctuation, etc · Last edited by ldolse; 04-22-2009 at 05:37 AM.