MobileRead Forums > E-Book Software > Calibre
Old 04-12-2009, 11:15 PM   #16
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Just found the solution to the line breaks; it's easier to do with two regexes. Here's the code for the area I changed:

Code:
def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital + ' '

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>'+chap+'</h1><br/>'
    else:
        return '<h1>'+chap+'<br/>'+title+'</h1><br/>'
    
    
class PreProcessor(object):
    PREPROCESS = [
                  # Some idiotic HTML generators (Frontpage I'm looking at you)
                  # Put all sorts of crap into <head>. This messes up lxml
                  (re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE|re.DOTALL), 
                   sanitize_head),
                  # Convert all entities, since lxml doesn't handle them well
                  (re.compile(r'&(\S+?);'), convert_entities),
                  # Remove the <![if/endif tags inserted by everybody's darling, MS Word
                  (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE), 
                   lambda match: ''),
                  ]
                     
    # Fix pdftohtml markup
    PDFTOHTML  = [
                  # Replace <hr> tags with a line break
                  (re.compile(r'<hr[^>]*>', re.IGNORECASE), lambda match: '<br />'),
                  
                  # Remove page links
                  (re.compile(r'<a name=\d+></a>', re.IGNORECASE), lambda match: ''),
                  
                  # Remove page numbers
                  (re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE), lambda match: ''),
                  # Remove <br> and replace <br><br> with <p>
                  (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE), lambda match: '<p>'),
                  (re.compile(r'(.*)<br[^>]*>', re.IGNORECASE), 
                   lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40 
                                else match.group(1)),
                  
                  # un-wrap wrapped lines - uses two regexes
                  (re.compile(r'(?<=[a-z,I])\s*(?P<ital></i>)?\s*(</p><p>)?\n\r?\s?(?=(<i>)?\w)', re.DOTALL), wrap_lines),
                  (re.compile(r'(?<=.{85}[a-z,I])\s*(<p[^>]*>|<br[^>]*>)\s*(?=\w)', re.UNICODE), lambda match: ' '),
           
                  # Add space before italics
                  (re.compile(r'<i>'), lambda match: '<i> '),
                  
                  # Remove hyphenation
                  (re.compile(r'-\n\r?'), lambda match: ''),
                  
                  # Remove gray background
                  (re.compile(r'<BODY[^<>]+>'), lambda match : '<BODY>'),
                  
                  # Remove non breaking spaces
                  (re.compile(ur'\u00a0'), lambda match : ' '),
                  
                  # Detect Chapters
                  (re.compile(r'(<br[^>]*>)?(</?p[^>]*>)?\s*(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)(</?p[^>]*>|<br[^>]*>)\n?((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?', re.IGNORECASE), chap_head),
                  
                  # Have paragraphs show better
                  (re.compile(r'(?<=.{85})<br[^>]*>\n'), lambda match : '<p>\n'),

                  # terminate unterminated lines.
                  (re.compile(r'(?<!>)\s*\n'), lambda match : '<br/>\n'),
                  (re.compile(r'</i>\s*\n'), lambda match : '</i><br/>\n'),
                  
                  ]
I've only been testing with a single book so far; I'll test across some more to see if there are any problems.
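For anyone who wants to see what the un-wrap rule actually does, here's a minimal standalone sketch of the first of the two un-wrap regexes above (the sample input is made up):

```python
import re

def wrap_lines(match):
    # Preserve a closing </i> if the wrapped line ended inside italics
    ital = match.group('ital')
    return ' ' if not ital else ital + ' '

# First of the two un-wrap regexes from the code above
unwrap = re.compile(
    r'(?<=[a-z,I])\s*(?P<ital></i>)?\s*(</p><p>)?\n\r?\s?(?=(<i>)?\w)',
    re.DOTALL)

print(unwrap.sub(wrap_lines, 'the quick brown\nfox jumps'))
# -> the quick brown fox jumps
print(unwrap.sub(wrap_lines, 'ends here</i>\nnext'))
# -> ends here</i> next
```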
Old 04-13-2009, 05:06 AM   #17
ldolse
Wizard
Sorry to keep posting changes. Turns out that this regexp:
Code:
                  # Remove <br> and replace <br><br> with <p>
                  (re.compile(r'(.*)<br[^>]*>', re.IGNORECASE), 
                   lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40 
                                else match.group(1)),
was causing most of the problems with lines wrapping in the final ebook, and the regexes I was struggling to create were basically just putting those wraps back. Based on the length < 40 logic, I suspect this was a simple attempt to stop already-wrapped lines from being forced to wrap again, so the line un-wrapping regexp replaces this function anyway.

I've also modified the page-link stripping regexp further to completely get rid of the breaks.

Here are the latest changes, and this creates the cleanest looking results yet:
Code:
def sanitize_head(match):
    x = match.group(1)
    x = _span_pat.sub('', x)
    return '<head>\n'+x+'\n</head>'

def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital + ' '

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>'+chap+'</h1><br/>'
    else:
        return '<h1>'+chap+'<br/>'+title+'</h1><br/>'
    
    
class PreProcessor(object):
    PREPROCESS = [
                  # Some idiotic HTML generators (Frontpage I'm looking at you)
                  # Put all sorts of crap into <head>. This messes up lxml
                  (re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE|re.DOTALL), 
                   sanitize_head),
                  # Convert all entities, since lxml doesn't handle them well
                  (re.compile(r'&(\S+?);'), convert_entities),
                  # Remove the <![if/endif tags inserted by everybody's darling, MS Word
                  (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE), 
                   lambda match: ''),
                  ]
                     
    # Fix pdftohtml markup
    PDFTOHTML  = [
                  # Replace <hr> tags with a line break
                  (re.compile(r'<hr[^>]*>', re.IGNORECASE), lambda match: '<br />'),
                  
                  # Remove page breaks & links
                  (re.compile(r'<br[^>]*>\s*<br[^>]*>\s*\n?\s*<a name=\d+></a>', re.IGNORECASE), lambda match: '<br>\n'),
                  
                  # Remove page numbers
                  (re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE), lambda match: ''),

                  # Replace <br><br> with <p>
                  (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE), lambda match: '<p>'),
                  
                  # un-wrap wrapped lines
                  (re.compile(r'(?<=.{85}[a-z,IA])\s*(?P<ital></i>)?\s*(<p[^>]*>|<br[^>]*>)?\n\r?\s?(?=(<i>)?\w)', re.UNICODE), wrap_lines),
           
                  # Add space before and after italics
                  (re.compile(r'(?<!“)<i>'), lambda match: ' <i>'),
                  (re.compile(r'</i>(?=\w)'), lambda match: '</i> '),
                  
                  # Remove hyphenation
                  (re.compile(r'-\n\r?'), lambda match: ''),
                  
                  # Remove gray background
                  (re.compile(r'<BODY[^<>]+>'), lambda match : '<BODY>'),
                  
                  # Remove non breaking spaces
                  (re.compile(ur'\u00a0'), lambda match : ' '),
                  
                  # Detect Chapters to match default XPATH in GUI
                  (re.compile(r'(<br[^>]*>)?(</?p[^>]*>)?\s*(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)(</?p[^>]*>|<br[^>]*>)\n?((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?', re.IGNORECASE), chap_head),
                  
                  # Have paragraphs show better
                  (re.compile(r'(?<=.{100})<br[^>]*>\n'), lambda match : '<p>\n'),
                  
                  ]

Last edited by ldolse; 04-13-2009 at 07:09 AM. Reason: Removed redundant line un-wrap rule, changed paragraph min length, adjusted italics regexps
Old 04-13-2009, 06:56 AM   #18
user_none
Sigil & calibre developer
user_none ought to be getting tired of karma fortunes by now.
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by ldolse
That was why I originally added the check for 85 characters to the second expression. That way only lines which are approaching the length of the entire page are wrapped, as they would likely have been wrapped anyway.
The only thing I worry about with this is that it assumes the source PDF is an 8.5x11 page and the font is a certain size (12 point, I'm assuming). Do you have any ideas for cases such as a PDF sized for an e-reader? The Cybook's page size would be 3.5x4.7 inches.
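For a rough sense of scale (just back-of-the-envelope arithmetic, not anything from the converter): a common rule of thumb is that the average glyph of a proportional face is about half the point size wide. On that assumption, a letter-size page at 12pt holds roughly 90 characters per line, while a Cybook-size page holds only about 36 — so a fixed 85-character check really would never fire on the small page:

```python
def chars_per_line(page_width_in, font_pt, margin_in=0.5):
    """Rough characters-per-line estimate. Assumes the average glyph
    is about half the point size wide (72 points per inch)."""
    char_width_in = (font_pt / 2.0) / 72.0
    return int(round((page_width_in - 2 * margin_in) / char_width_in))

print(chars_per_line(8.5, 12))        # ~90 for a letter-size page
print(chars_per_line(3.5, 12, 0.25))  # ~36 for a Cybook-size page
```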

Quote:
Originally Posted by ldolse
Here are the latest changes, and this creates the cleanest looking results yet
I'll merge them soon.
Old 04-13-2009, 07:06 AM   #19
ldolse
Wizard
We could make the 85-character check smaller. I'll try to find some PDFs with smaller page sizes and see where they break.

I just added another tweak to that last post to fix the spacing for italicized text. Note that '“' doesn't match for some reason, but I haven't had time to figure that one out.
Old 04-13-2009, 08:18 AM   #20
user_none
Sigil & calibre developer
Quote:
Originally Posted by ldolse
We could make the 85 character check smaller. I'll try to find some pdfs with smaller page sizes and see where they break.
I can generate a number of PDFs with varying font sizes for different device profiles; one of the things I added to pluginize is PDF output. I'll do this later today when I get home from work, and I can either post them here or send them to you directly.
Old 04-13-2009, 12:04 PM   #21
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 45,386
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
the correct way to do this would be to calculate the average line length
Old 04-15-2009, 08:35 AM   #22
user_none
Sigil & calibre developer
@ldolse, I've modified it to use a line-length algorithm. It takes the median length, omitting zero-length lines, and removes outliers longer than double the average. I haven't tested the results yet because PDF input in pluginize is a bit broken at the moment; I'll let you know how it works once I get that sorted out.
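The actual implementation is in the calibre source; as described, though, the approach (median of non-empty line lengths, discarding outliers beyond twice the mean) might look something like this sketch:

```python
import re

def estimate_line_length(html):
    """Median length of non-empty text lines, after dropping
    outliers longer than twice the mean."""
    lengths = [len(re.sub(r'<[^>]*>', '', line).strip())
               for line in html.splitlines()]
    lengths = [n for n in lengths if n > 0]
    if not lengths:
        return 0
    mean = sum(lengths) / float(len(lengths))
    lengths = sorted(n for n in lengths if n <= 2 * mean)
    return lengths[len(lengths) // 2]

# Five 60-character lines plus one 300-character outlier -> 60
print(estimate_line_length('\n'.join(['x' * 60] * 5 + ['x' * 300])))
```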
Old 04-15-2009, 08:57 PM   #23
user_none
Sigil & calibre developer
While testing I've run into two issues: the regexes are not properly enclosing short lines (which should not be wrapped) in <p></p> tags, and they're not un-wrapping any lines for the test PDFs I'm giving them. Emailing me directly (john@nachtimwald.com) would probably speed up getting the rules working.
Old 04-22-2009, 04:00 AM   #24
ldolse
Wizard
Matching special punctuation, etc

Still working on getting a dev environment up and running for the pluginize issues, but in the meantime I was testing some refinements in the released code.

I keep running into situations where I want to handle special characters such as:
“, ”, ‗, ’, ‖, and …

The problem is that they don't get matched if I use a simple regex like the ones below:

Code:
                  # Fix quotes mangled by pdftohtml
                  (re.compile(r'‗([^‗]+)‘(?!\w)'), lambda match: '‘'+match.group(1)+'’'), 
                  (re.compile(r'―([^―]+)‖'), lambda match: '‘'+match.group(1)+'’'),
Next, I tried to use a unicode match:
Code:
 
 (re.compile(ur'\u2015([^―]+)\u2016', re.UNICODE), lambda match: '‘'+match.group(1)+'’'),
But if I attempt that I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

I believe this is telling me that the whole PDFTOHTML list is actually being processed as ASCII.

I found this page with some info on the error:
http://www.amk.ca/python/howto/unicode

But I can't figure out how to apply this advice so that these regexes see the UTF-8 data the doc is supposed to be encoded in, instead of ASCII. Any suggestions?
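For reference, applying the HOWTO's advice here would presumably look something like the following: decode the raw bytes to unicode before running the regexes, and keep the pattern and the replacement strings unicode as well (the '‘' literals in the lambdas above are UTF-8 byte strings, which is what triggers the implicit ASCII decode). The sample input is made up:

```python
# -*- coding: utf-8 -*-
import re

# Made-up sample of pdftohtml output as raw UTF-8 bytes:
# HORIZONTAL BAR (U+2015) ... DOUBLE VERTICAL LINE (U+2016)
raw = b'\xe2\x80\x95quoted text\xe2\x80\x96'

# Decode to a unicode string BEFORE the regexes run ...
html = raw.decode('utf-8')

# ... and keep the pattern and the replacement unicode as well
pat = re.compile(u'\u2015([^\u2015]+)\u2016', re.UNICODE)
fixed = pat.sub(lambda m: u'\u2018' + m.group(1) + u'\u2019', html)
print(fixed)  # 'quoted text' wrapped in curly single quotes
```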

Last edited by ldolse; 04-22-2009 at 05:37 AM.