MobileRead Forums - View Single Post - Regexes to improve pdf to epub conversion

ldolse · 04-12-2009, 10:19 PM

Cool, found the changes now. I've just started using look-around functions, so I'm still getting used to the use cases. I tried out your modifications and I'm seeing two problems: in many cases lines aren't getting un-wrapped, and lines which shouldn't be joined are now being joined.

Many lines don't get un-wrapped because not every line feed is accompanied by <br> or <p>.

The second issue is lines getting joined that shouldn't be. This primarily affects poetry, as poetry lines are relatively short and lack punctuation. It can also affect conversations, depending on how they are formatted. That was why I originally added the check for 85 characters to the second expression. That way only lines approaching the full width of the page are joined, since those would most likely have been wrapped by the page layout anyway.
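
To illustrate the length check, here is a minimal sketch that runs the second un-wrapping expression on its own. The `unwrap_long` name and the sample strings are made up for this example; only the regex comes from the list in this post.

```python
import re

# Second un-wrapping expression: the look-behind demands 85 arbitrary
# characters plus one of [a-z,I] immediately before the tag.
UNWRAP_LONG = re.compile(
    r'(?<=.{85}[a-z,I])\s*(<p[^>]*>|<br[^>]*>)\s*(?=\w)', re.UNICODE)

def unwrap_long(html):
    """Replace a line-ending tag with a space, but only after long lines."""
    return UNWRAP_LONG.sub(' ', html)

long_line = 'word ' * 17 + 'end'      # 88 characters, ends in a lowercase letter
short_line = 'Short poetry line'      # well under the 86-character threshold

joined = unwrap_long(long_line + '<br>next line')   # tag replaced by a space
kept = unwrap_long(short_line + '<br>next line')    # left untouched
```

Because `.` does not match a newline without re.DOTALL, the 85 characters all have to sit on the same physical line as the tag.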

I've merged your optimizations with mine. Here's my complete PDFTOHTML list, along with the helper functions it uses.

Code:

def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital + ' '

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>' + chap + '</h1><br/>'
    else:
        return '<h1>' + chap + '<br/>' + title + '</h1><br/>'
    
    
class PreProcessor(object):
    PREPROCESS = [
                  # Some idiotic HTML generators (Frontpage I'm looking at you)
                  # Put all sorts of crap into <head>. This messes up lxml
                  (re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE|re.DOTALL), 
                   sanitize_head),
                  # Convert all entities, since lxml doesn't handle them well
                  (re.compile(r'&(\S+?);'), convert_entities),
                  # Remove the <![if/endif tags inserted by everybody's darling, MS Word
                  (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE), 
                   lambda match: ''),
                  ]
                                          
    # Fix pdftohtml markup
    PDFTOHTML  = [
                  # Replace <hr> tags with <br />
                  (re.compile(r'<hr[^>]*>', re.IGNORECASE), lambda match: '<br />'),
                  
                  # Remove page links
                  (re.compile(r'<a name=\d+></a>', re.IGNORECASE), lambda match: ''),
                  
                  # Remove page numbers
                  (re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE), lambda match: ''),
                  # Remove <br> and replace <br><br> with <p>
                  (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE), lambda match: '<p>'),
                  (re.compile(r'(.*)<br[^>]*>', re.IGNORECASE), 
                   lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40 
                                else match.group(1)),
                  
                  # un-wrap wrapped lines - uses two regexes
                  (re.compile(r'(?<=[a-z,I])\s*(?P<ital></i>)?\s*(</p><p>)?\n\r?\s?(?=(<i>)?\w)', re.DOTALL), wrap_lines),
                  (re.compile(r'(?<=.{85}[a-z,I])\s*(<p[^>]*>|<br[^>]*>)\s*(?=\w)', re.UNICODE), lambda match: ' '),
           
                  # Add a space after each opening <i> tag
                  (re.compile(r'<i>'), lambda match: '<i> '),
                  
                  # Remove hyphenation
                  (re.compile(r'-\n\r?'), lambda match: ''),
                  
                  # Remove gray background
                  (re.compile(r'<BODY[^<>]+>'), lambda match : '<BODY>'),
                  
                  # Remove non breaking spaces
                  (re.compile(ur'\u00a0'), lambda match : ' '),
                  
                  # Detect Chapters
                  (re.compile(r'(<br[^>]*>)?(</?p[^>]*>)?\s*(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)(</?p[^>]*>|<br[^>]*>)\n?((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?', re.IGNORECASE), chap_head),
                  
                  # Have paragraphs show better
                  (re.compile(r'(?<=.{85})<br[^>]*>$'), lambda match : '<p>'),

                  # terminate unterminated lines.
                  # (re.compile(r'(?<!r/>|<p>|/p>|ad>|dy>|ml>)\n'), lambda match : '<br/>\n'),
                  (re.compile(r'(\.|”|\")\s*\n'), lambda match : match.group(1)+'<br/>\n'),
                  
                  ]

Other changes included here:

Modified the page number stripping regexp so that it doesn't strip chapter numbers
Added a Chapter Detection regexp so the XPATH in the GUI will find chapters surrounded by <h1></h1>
Added a space after each occurrence of <i> so that italics don't run into neighboring words.
Changed uses of .*? in tags to [^>]*
Attempted to add <br/> to unterminated lines
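
As a rough check of the page number change, this sketch runs that one expression against a made-up fragment. The `strip_page_numbers` name and the sample text are assumptions for illustration. The negative look-behind `(?<!\w)` is what spares "Chapter 12": the space in front of the number follows a word character, so the match is rejected there.

```python
import re

# Page number expression from the PDFTOHTML list.
PAGE_NUMBER = re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE)

def strip_page_numbers(html):
    """Drop a bare number (plus its trailing tag) that is not preceded by a word."""
    return PAGE_NUMBER.sub('', html)

sample = 'end of the page.<br>\n 42 <br>\nChapter 12<br>\n'
result = strip_page_numbers(sample)
# ' 42 <br>' is removed; 'Chapter 12<br>' survives.
```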

I've been struggling a bit to get the regexes right, since the HTML goes through another stage of processing somewhere (lxml??), so I'm working a bit blindly. Chapters are represented very differently at the time the HTML hits the PDFTOHTML function compared to the final output. Also, for some reason adding the chapter detection regexp causes <p></p> to come out differently in the final version, with margin and border settings all set to 0 points.
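
For reference, this is roughly what the chapter detection does in isolation, before any later processing stage touches the markup. It is only a sketch on made-up input: the expression is the one from the list (written here with `\s*` before the chapter keyword and split across lines for readability), and the sample fragments are assumptions.

```python
import re

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>' + chap + '</h1><br/>'
    return '<h1>' + chap + '<br/>' + title + '</h1><br/>'

CHAPTER = re.compile(
    r'(<br[^>]*>)?(</?p[^>]*>)?\s*'
    r'(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)'
    r'(</?p[^>]*>|<br[^>]*>)\n?'
    # The look-ahead only accepts a title of one or two words before the tag.
    r'((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))'
    r'((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?',
    re.IGNORECASE)

# Heading followed by a short title line: both end up inside <h1>.
with_title = CHAPTER.sub(chap_head, 'Chapter 1<br>\nThe Beginning<br>\n')

# Heading with nothing after it: only the chapter word is wrapped.
without_title = CHAPTER.sub(chap_head, 'Epilogue<br>\n')

# Next line is longer than two words, so it is not treated as a title.
long_title = CHAPTER.sub(
    chap_head, 'Chapter 2<br>\nA much longer first sentence here.<br>\n')
```

Note that the line feed after the heading tag is consumed by the match, so the `<h1>` block runs straight into whatever follows.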

Lastly, I can't seem to get the regex that terminates unterminated lines to work. This is the regex I'd like to use:

Code:

(re.compile(r'(?<!r/>|<p>|/p>|ad>|dy>|ml>)\n'), lambda match : '<br/>\n'),

It works perfectly when modifying the final output embedded in the epub, but it causes some corruption when used in PDFTOHTML which prevents conversion to epub.
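
A minimal sketch of that expression on a made-up fragment, outside the conversion pipeline (the `terminate_lines` name and the sample text are only for illustration). It does not reproduce the corruption; it just shows which line feeds the look-behind skips.

```python
import re

# Every look-behind alternative is exactly three characters wide, which
# Python's re module requires for alternation inside a look-behind.
TERMINATE = re.compile(r'(?<!r/>|<p>|/p>|ad>|dy>|ml>)\n')

def terminate_lines(html):
    """Add <br/> before any line feed that does not follow a listed tag ending."""
    return TERMINATE.sub('<br/>\n', html)

sample = '<p>First line\nSecond line<br/>\nThird line</p>\n'
result = terminate_lines(sample)
# Only the line feed after 'First line' gains a <br/>.

# A bare <br> ends in 'br>', which is not in the look-behind list.
doubled = terminate_lines('Line<br>\n')
```

One thing the sketch makes visible: a line that already ends in a bare `<br>` (no slash) is not excluded, so it picks up a second break.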

I'm using this one as an alternate:

Code:

(re.compile(r'(\.|”|\")\s*\n'), lambda match : match.group(1)+'<br/>\n'),

But it only gets 75% of the unterminated lines, and I can't figure out why.
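
For comparison, a small sketch of what the alternate expression does and does not match, again on made-up input (the `terminate_after_punctuation` name is hypothetical). Lines ending in `?`, `!`, a plain letter, or anything else outside the three listed characters are left alone. That might account for some of the misses, though it is unverified against a real conversion.

```python
import re

# Alternate expression: only a period, a closing curly quote, or a straight
# double quote directly before the line feed triggers the replacement.
ALT_TERMINATE = re.compile(r'(\.|”|\")\s*\n')

def terminate_after_punctuation(html):
    return ALT_TERMINATE.sub(lambda m: m.group(1) + '<br/>\n', html)

sample = 'He stopped.\nWhere to?\nShe said, "Home."\nThe end\n'
result = terminate_after_punctuation(sample)
# 'He stopped.' and 'She said, "Home."' are terminated;
# 'Where to?' and 'The end' are not.
```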