#16 · Wizard · Posts: 1,337 · Karma: 123455 · Join Date: Apr 2009 · Location: Malaysia · Device: PRS-650, iPhone
Just found the solution to the line breaks; it's easier to do with two regexes. Here's the code for the area I changed:
Code:
def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital+' '

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>'+chap+'</h1><br/>'
    else:
        return '<h1>'+chap+'<br/>'+title+'</h1><br/>'

class PreProcessor(object):
    PREPROCESS = [
        # Some idiotic HTML generators (Frontpage I'm looking at you)
        # put all sorts of crap into <head>. This messes up lxml
        (re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE|re.DOTALL),
         sanitize_head),
        # Convert all entities, since lxml doesn't handle them well
        (re.compile(r'&(\S+?);'), convert_entities),
        # Remove the <![if/endif tags inserted by everybody's darling, MS Word
        (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE),
         lambda match: ''),
    ]

    # Fix pdftohtml markup
    PDFTOHTML = [
        # Replace <hr> tags with line breaks
        (re.compile(r'<hr[^>]*>', re.IGNORECASE), lambda match: '<br />'),
        # Remove page links
        (re.compile(r'<a name=\d+></a>', re.IGNORECASE), lambda match: ''),
        # Remove page numbers
        (re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE),
         lambda match: ''),
        # Remove <br> and replace <br><br> with <p>
        (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE),
         lambda match: '<p>'),
        (re.compile(r'(.*)<br[^>]*>', re.IGNORECASE),
         lambda match: match.group()
             if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40
             else match.group(1)),
        # Un-wrap wrapped lines -- uses two regexes
        (re.compile(r'(?<=[a-z,I])\s*(?P<ital></i>)?\s*(</p><p>)?\n\r?\s?(?=(<i>)?\w)',
                    re.DOTALL), wrap_lines),
        (re.compile(r'(?<=.{85}[a-z,I])\s*(<p[^>]*>|<br[^>]*>)\s*(?=\w)',
                    re.UNICODE), lambda match: ' '),
        # Add space before italics
        (re.compile(r'<i>'), lambda match: '<i> '),
        # Remove hyphenation
        (re.compile(r'-\n\r?'), lambda match: ''),
        # Remove gray background
        (re.compile(r'<BODY[^<>]+>'), lambda match: '<BODY>'),
        # Remove non-breaking spaces
        (re.compile(ur'\u00a0'), lambda match: ' '),
        # Detect chapters
        (re.compile(r'(<br[^>]*>)?(</?p[^>]*>)?\s*(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)(</?p[^>]*>|<br[^>]*>)\n?((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?',
                    re.IGNORECASE), chap_head),
        # Have paragraphs show better
        (re.compile(r'(?<=.{85})<br[^>]*>\n'), lambda match: '<p>\n'),
        # Terminate unterminated lines
        (re.compile(r'(?<!>)\s*\n'), lambda match: '<br/>\n'),
        (re.compile(r'</i>\s*\n'), lambda match: '</i><br/>\n'),
    ]
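To see what the first un-wrap rule does, here is a minimal, self-contained sketch (the sample input is hypothetical, not from the thread):

```python
import re

def wrap_lines(match):
    # Re-join a wrapped line; keep a split-off closing </i> if present
    ital = match.group('ital')
    if not ital:
        return ' '
    return ital + ' '

# Join a physical line that ends mid-sentence (lowercase letter, comma,
# or "I") with the following line, eating any </p><p> pdftohtml inserted
unwrap = re.compile(
    r'(?<=[a-z,I])\s*(?P<ital></i>)?\s*(</p><p>)?\n\r?\s?(?=(<i>)?\w)',
    re.DOTALL)

sample = 'This sentence was wrapped by\npdftohtml in the middle.'
print(unwrap.sub(wrap_lines, sample))
# -> This sentence was wrapped by pdftohtml in the middle.
```

The named group is what lets one rule handle both plain and italicized wraps: when the break falls right after a `</i>`, the tag is preserved and only the newline becomes a space.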
#17 · Wizard
Sorry to keep posting changes. Turns out that this regexp isn't needed anymore (it's gone from the updated list below):
Code:
# Remove <br> and replace <br><br> with <p>
(re.compile(r'(.*)<br[^>]*>', re.IGNORECASE),
 lambda match: match.group()
     if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40
     else match.group(1)),

I've also modified the page link stripping regexp further to completely get rid of the breaks. Here are the latest changes, and this creates the cleanest-looking results yet:

Code:
def sanitize_head(match):
    x = match.group(1)
    x = _span_pat.sub('', x)
    return '<head>\n'+x+'\n</head>'

def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital+' '

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>'+chap+'</h1><br/>'
    else:
        return '<h1>'+chap+'<br/>'+title+'</h1><br/>'

class PreProcessor(object):
    PREPROCESS = [
        # Some idiotic HTML generators (Frontpage I'm looking at you)
        # put all sorts of crap into <head>. This messes up lxml
        (re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE|re.DOTALL),
         sanitize_head),
        # Convert all entities, since lxml doesn't handle them well
        (re.compile(r'&(\S+?);'), convert_entities),
        # Remove the <![if/endif tags inserted by everybody's darling, MS Word
        (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE),
         lambda match: ''),
    ]

    # Fix pdftohtml markup
    PDFTOHTML = [
        # Replace <hr> tags with line breaks
        (re.compile(r'<hr[^>]*>', re.IGNORECASE), lambda match: '<br />'),
        # Remove page breaks & links
        (re.compile(r'<br[^>]*>\s*<br[^>]*>\s*\n?\s*<a name=\d+></a>',
                    re.IGNORECASE), lambda match: '<br>\n'),
        # Remove page numbers
        (re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE),
         lambda match: ''),
        # Replace <br><br> with <p>
        (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE),
         lambda match: '<p>'),
        # Un-wrap wrapped lines
        (re.compile(r'(?<=.{85}[a-z,IA])\s*(?P<ital></i>)?\s*(<p[^>]*>|<br[^>]*>)?\n\r?\s?(?=(<i>)?\w)',
                    re.UNICODE), wrap_lines),
        # Add space before and after italics
        (re.compile(r'(?<!“)<i>'), lambda match: ' <i>'),
        (re.compile(r'</i>(?=\w)'), lambda match: '</i> '),
        # Remove hyphenation
        (re.compile(r'-\n\r?'), lambda match: ''),
        # Remove gray background
        (re.compile(r'<BODY[^<>]+>'), lambda match: '<BODY>'),
        # Remove non-breaking spaces
        (re.compile(ur'\u00a0'), lambda match: ' '),
        # Detect chapters, to match the default XPath in the GUI
        (re.compile(r'(<br[^>]*>)?(</?p[^>]*>)?\s*(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)(</?p[^>]*>|<br[^>]*>)\n?((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?',
                    re.IGNORECASE), chap_head),
        # Have paragraphs show better
        (re.compile(r'(?<=.{100})<br[^>]*>\n'), lambda match: '<p>\n'),
    ]

Last edited by ldolse; 04-13-2009 at 07:09 AM. Reason: Removed redundant line un-wrap rule, changed paragraph min length, adjusted italics regexps
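For reference, the new combined page-break rule behaves like this on a hypothetical pdftohtml fragment:

```python
import re

# Collapse a page break (<br><br>) together with the <a name=N></a>
# page anchor pdftohtml emits, leaving a single <br>
page_break = re.compile(
    r'<br[^>]*>\s*<br[^>]*>\s*\n?\s*<a name=\d+></a>',
    re.IGNORECASE)

html = 'end of page 3.<br><br>\n<a name=4></a>start of page 4.'
print(page_break.sub('<br>\n', html))
# -> end of page 3.<br>
#    start of page 4.
```

Handling the anchor and the double break in one pattern is what "completely gets rid of the breaks": stripping the anchor separately would leave a stray `<br><br>` that the later `<p>` rule would turn into a bogus paragraph break at every page boundary.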
#18 · Sigil & calibre developer · Posts: 2,487 · Karma: 1063785 · Join Date: Jan 2009 · Location: Florida, USA · Device: Nook STR
#19 · Wizard
We could make the 85-character check smaller. I'll try to find some PDFs with smaller page sizes and see where they break.
I just added another tweak to that last post to fix the spacing for italicized text. Note that '“' doesn't match for some reason; I haven't had time to figure that one out.
#20 · Sigil & calibre developer
#21 · creator of calibre · Posts: 45,386 · Karma: 27756918 · Join Date: Oct 2006 · Location: Mumbai, India · Device: Various
The correct way to do this would be to calculate the average line length.
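A hypothetical sketch of that suggestion (the function name and tag stripping are illustrative, not calibre's actual code): average the visible-text length of the non-empty lines, and treat lines well below that average as genuine paragraph endings rather than wraps.

```python
import re

def average_line_length(html):
    # Mean visible-text length per physical line, ignoring blank lines.
    # Lines much shorter than this average are likely real paragraph ends.
    stripped = (re.sub(r'<[^>]*>', '', line).strip()
                for line in html.splitlines())
    lengths = [len(line) for line in stripped if line]
    if not lengths:
        return 0.0
    return sum(lengths) / len(lengths)

# averages 8 visible characters over 3 non-empty lines
print(average_line_length('<p>aaaa</p>\nbb\n\ncc'))
```

This adapts to the source PDF's page width instead of hard-coding 85, which is exactly where the fixed check was breaking on smaller page sizes.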
#22 · Sigil & calibre developer
@ldolse, I've modified it to use a line-length algorithm: it takes the median length, omitting zero-length lines, and removes outliers at double the average. I haven't tested the results yet because PDF input in pluginize is a bit broken at the moment. I'll let you know how it works once I get that sorted out.
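A sketch of the described algorithm as I read it (my interpretation, not the actual calibre code): drop empty lines, discard outliers longer than twice the average, then take the median of what remains.

```python
def typical_line_length(lengths):
    # Drop empty lines first
    lengths = [n for n in lengths if n > 0]
    if not lengths:
        return 0
    # Discard outliers: anything longer than double the average
    avg = sum(lengths) / len(lengths)
    kept = sorted(n for n in lengths if n <= 2 * avg)
    # Median of the remaining lengths
    mid = len(kept) // 2
    if len(kept) % 2:
        return kept[mid]
    return (kept[mid - 1] + kept[mid]) / 2

print(typical_line_length([0, 80, 82, 84, 300]))
# -> 82  (300 exceeds double the 136.5 average and is discarded)
```

The median resists skew from the occasional joined line, and the outlier pass keeps one very long run-on from dragging the average up before the median is taken.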
#23 · Sigil & calibre developer
While testing I've run into two issues: the regexes are not properly enclosing short lines that should not be wrapped in <p></p> tags, and they're not wrapping any lines for the test PDFs I'm giving them. Emailing me directly (john@nachtimwald.com) would probably speed up getting the rules working.
#24 · Wizard
Matching special punctuation, etc
Still working on getting a dev environment up and running for the pluginize issues, but in the meantime I was testing some refinements in the released code.
I keep running into situations where I want to handle special characters such as “, ”, ‗, ’, ‖, and …. The problem is that they don't get matched if I use a simple regex like the one below:
Code:
# Fix quotes mangled by pdftohtml
(re.compile(r'‗([^‗]+)‘(?!\w)'), lambda match: '‘'+match.group(1)+'’'),
(re.compile(r'―([^―]+)‖'), lambda match: '‘'+match.group(1)+'’'),

I also tried writing the characters as unicode escapes instead:

Code:
(re.compile(ur'\u2015([^―]+)\u2016', re.UNICODE), lambda match: '‘'+match.group(1)+'’'),

That gives me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

which I believe is telling me that the whole PDFTOHTML list is actually being handled as ASCII. I found this page with some info on the error: http://www.amk.ca/python/howto/unicode. But I can't figure out how to apply this advice so that these regexes look at the UTF-8 data the doc is supposed to be encoded in, instead of ASCII. Any suggestions?

Last edited by ldolse; 04-22-2009 at 05:37 AM.
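That error typically comes from mixing byte strings and unicode: in Python 2, when a plain str containing UTF-8 bytes (like '‘') meets a unicode pattern, Python implicitly decodes the str as ASCII and fails on any byte above 0x7f. The fix is to make both the patterns and the document text unicode before the regexes run. A minimal sketch, written in Python 3 syntax where every str is already unicode (under Python 2 you would .decode('utf-8') the input and use u'' literals throughout):

```python
import re

# \u2015 (HORIZONTAL BAR) and \u2016 (DOUBLE VERTICAL LINE) are what
# pdftohtml emits for the mangled quotes; rewrite as curly single quotes
fix_quotes = re.compile('\u2015([^\u2015]+)\u2016')

text = '\u2015Hello\u2016 he said.'   # already-decoded text, not raw bytes
print(fix_quotes.sub('\u2018\\g<1>\u2019', text))
# -> ‘Hello’ he said.
```

The key point is that pattern, replacement, and input must all be the same string type; once everything is unicode, characters like '“' match exactly as you'd expect.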