#16 · Wizard · Posts: 1,337 · Karma: 123455 · Join Date: Apr 2009 · Location: Malaysia · Device: PRS-650, iPhone
Just found the solution to the line breaks; it's easier to do with two regexes. Here's the code for the area I changed:
Code:
def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital+' '

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>'+chap+'</h1><br/>'
    else:
        return '<h1>'+chap+'<br/>'+title+'</h1><br/>'

class PreProcessor(object):
    PREPROCESS = [
        # Some idiotic HTML generators (Frontpage I'm looking at you)
        # put all sorts of crap into <head>. This messes up lxml
        (re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE|re.DOTALL),
         sanitize_head),
        # Convert all entities, since lxml doesn't handle them well
        (re.compile(r'&(\S+?);'), convert_entities),
        # Remove the <![if/endif tags inserted by everybody's darling, MS Word
        (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE),
         lambda match: ''),
    ]

    # Fix pdftohtml markup
    PDFTOHTML = [
        # Replace <hr> tags with line breaks
        (re.compile(r'<hr[^>]*>', re.IGNORECASE), lambda match: '<br />'),
        # Remove page links
        (re.compile(r'<a name=\d+></a>', re.IGNORECASE), lambda match: ''),
        # Remove page numbers
        (re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE),
         lambda match: ''),
        # Remove <br> and replace <br><br> with <p>
        (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE),
         lambda match: '<p>'),
        (re.compile(r'(.*)<br[^>]*>', re.IGNORECASE),
         lambda match: match.group()
             if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40
             else match.group(1)),
        # Un-wrap wrapped lines -- uses two regexes
        (re.compile(r'(?<=[a-z,I])\s*(?P<ital></i>)?\s*(</p><p>)?\n\r?\s?(?=(<i>)?\w)',
                    re.DOTALL), wrap_lines),
        (re.compile(r'(?<=.{85}[a-z,I])\s*(<p[^>]*>|<br[^>]*>)\s*(?=\w)',
                    re.UNICODE), lambda match: ' '),
        # Add space before italics
        (re.compile(r'<i>'), lambda match: '<i> '),
        # Remove hyphenation
        (re.compile(r'-\n\r?'), lambda match: ''),
        # Remove gray background
        (re.compile(r'<BODY[^<>]+>'), lambda match: '<BODY>'),
        # Remove non-breaking spaces
        (re.compile(ur'\u00a0'), lambda match: ' '),
        # Detect chapters
        (re.compile(r'(<br[^>]*>)?(</?p[^>]*>)?\s*(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)(</?p[^>]*>|<br[^>]*>)\n?((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?',
                    re.IGNORECASE), chap_head),
        # Have paragraphs show better
        (re.compile(r'(?<=.{85})<br[^>]*>\n'), lambda match: '<p>\n'),
        # Terminate unterminated lines
        (re.compile(r'(?<!>)\s*\n'), lambda match: '<br/>\n'),
        (re.compile(r'</i>\s*\n'), lambda match: '</i><br/>\n'),
    ]
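To see what the first un-wrap rule does, here is a minimal, self-contained sketch (the sample input is hypothetical, not from the thread):

```python
import re

def wrap_lines(match):
    # Re-join a wrapped line; keep a split-off closing </i> if present
    ital = match.group('ital')
    if not ital:
        return ' '
    return ital + ' '

# Join a physical line that ends mid-sentence (lowercase letter, comma,
# or "I") with the following line, eating any </p><p> pdftohtml inserted
unwrap = re.compile(
    r'(?<=[a-z,I])\s*(?P<ital></i>)?\s*(</p><p>)?\n\r?\s?(?=(<i>)?\w)',
    re.DOTALL)

sample = 'This sentence was wrapped by\npdftohtml in the middle.'
print(unwrap.sub(wrap_lines, sample))
# -> This sentence was wrapped by pdftohtml in the middle.
```

The named group is what lets one rule handle both plain and italicized wraps: when the break falls right after a `</i>`, the tag is preserved and only the newline becomes a space.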
#17 · Wizard
Sorry to keep posting changes. Turns out that this regexp isn't needed anymore (it's gone from the updated list below):
Code:
# Remove <br> and replace <br><br> with <p>
(re.compile(r'(.*)<br[^>]*>', re.IGNORECASE),
 lambda match: match.group()
     if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40
     else match.group(1)),

I've also modified the page link stripping regexp further to completely get rid of the breaks. Here are the latest changes, and this creates the cleanest-looking results yet:

Code:
def sanitize_head(match):
    x = match.group(1)
    x = _span_pat.sub('', x)
    return '<head>\n'+x+'\n</head>'

def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital+' '

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>'+chap+'</h1><br/>'
    else:
        return '<h1>'+chap+'<br/>'+title+'</h1><br/>'

class PreProcessor(object):
    PREPROCESS = [
        # Some idiotic HTML generators (Frontpage I'm looking at you)
        # put all sorts of crap into <head>. This messes up lxml
        (re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE|re.DOTALL),
         sanitize_head),
        # Convert all entities, since lxml doesn't handle them well
        (re.compile(r'&(\S+?);'), convert_entities),
        # Remove the <![if/endif tags inserted by everybody's darling, MS Word
        (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE),
         lambda match: ''),
    ]

    # Fix pdftohtml markup
    PDFTOHTML = [
        # Replace <hr> tags with line breaks
        (re.compile(r'<hr[^>]*>', re.IGNORECASE), lambda match: '<br />'),
        # Remove page breaks & links
        (re.compile(r'<br[^>]*>\s*<br[^>]*>\s*\n?\s*<a name=\d+></a>',
                    re.IGNORECASE), lambda match: '<br>\n'),
        # Remove page numbers
        (re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE),
         lambda match: ''),
        # Replace <br><br> with <p>
        (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE),
         lambda match: '<p>'),
        # Un-wrap wrapped lines
        (re.compile(r'(?<=.{85}[a-z,IA])\s*(?P<ital></i>)?\s*(<p[^>]*>|<br[^>]*>)?\n\r?\s?(?=(<i>)?\w)',
                    re.UNICODE), wrap_lines),
        # Add space before and after italics
        (re.compile(r'(?<!“)<i>'), lambda match: ' <i>'),
        (re.compile(r'</i>(?=\w)'), lambda match: '</i> '),
        # Remove hyphenation
        (re.compile(r'-\n\r?'), lambda match: ''),
        # Remove gray background
        (re.compile(r'<BODY[^<>]+>'), lambda match: '<BODY>'),
        # Remove non-breaking spaces
        (re.compile(ur'\u00a0'), lambda match: ' '),
        # Detect chapters, to match the default XPath in the GUI
        (re.compile(r'(<br[^>]*>)?(</?p[^>]*>)?\s*(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)(</?p[^>]*>|<br[^>]*>)\n?((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?',
                    re.IGNORECASE), chap_head),
        # Have paragraphs show better
        (re.compile(r'(?<=.{100})<br[^>]*>\n'), lambda match: '<p>\n'),
    ]

Last edited by ldolse; 04-13-2009 at 07:09 AM. Reason: Removed redundant line un-wrap rule, changed paragraph min length, adjusted italics regexps
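For reference, the new combined page-break rule behaves like this on a hypothetical pdftohtml fragment:

```python
import re

# Collapse a page break (<br><br>) together with the <a name=N></a>
# page anchor pdftohtml emits, leaving a single <br>
page_break = re.compile(
    r'<br[^>]*>\s*<br[^>]*>\s*\n?\s*<a name=\d+></a>',
    re.IGNORECASE)

html = 'end of page 3.<br><br>\n<a name=4></a>start of page 4.'
print(page_break.sub('<br>\n', html))
# -> end of page 3.<br>
#    start of page 4.
```

Handling the anchor and the double break in one pattern is what "completely gets rid of the breaks": stripping the anchor separately would leave a stray `<br><br>` that the later `<p>` rule would turn into a bogus paragraph break at every page boundary.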
#18 · Sigil & calibre developer · Posts: 2,487 · Karma: 1063785 · Join Date: Jan 2009 · Location: Florida, USA · Device: Nook STR
#19 · Wizard
We could make the 85-character check smaller. I'll try to find some PDFs with smaller page sizes and see where they break.
I just added another tweak to that last post to fix the spacing for italicized text. Note that '“' doesn't match for some reason; I haven't had time to figure that one out.
#20 · Sigil & calibre developer
#21 · creator of calibre · Posts: 45,386 · Karma: 27756918 · Join Date: Oct 2006 · Location: Mumbai, India · Device: Various
The correct way to do this would be to calculate the average line length.
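A hypothetical sketch of that suggestion (the function name and tag stripping are illustrative, not calibre's actual code): average the visible-text length of the non-empty lines, and treat lines well below that average as genuine paragraph endings rather than wraps.

```python
import re

def average_line_length(html):
    # Mean visible-text length per physical line, ignoring blank lines.
    # Lines much shorter than this average are likely real paragraph ends.
    stripped = (re.sub(r'<[^>]*>', '', line).strip()
                for line in html.splitlines())
    lengths = [len(line) for line in stripped if line]
    if not lengths:
        return 0.0
    return sum(lengths) / len(lengths)

# averages 8 visible characters over 3 non-empty lines
print(average_line_length('<p>aaaa</p>\nbb\n\ncc'))
```

This adapts to the source PDF's page width instead of hard-coding 85, which is exactly where the fixed check was breaking on smaller page sizes.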
#22 · Sigil & calibre developer
@ldolse, I've modified it to use a line-length algorithm: it takes the median length, omitting zero-length lines, and removes outliers at double the average. I haven't tested the results yet because PDF input in pluginize is a bit broken at the moment. I'll let you know how it works once I get that sorted out.
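A sketch of the described algorithm as I read it (my interpretation, not the actual calibre code): drop empty lines, discard outliers longer than twice the average, then take the median of what remains.

```python
def typical_line_length(lengths):
    # Drop empty lines first
    lengths = [n for n in lengths if n > 0]
    if not lengths:
        return 0
    # Discard outliers: anything longer than double the average
    avg = sum(lengths) / len(lengths)
    kept = sorted(n for n in lengths if n <= 2 * avg)
    # Median of the remaining lengths
    mid = len(kept) // 2
    if len(kept) % 2:
        return kept[mid]
    return (kept[mid - 1] + kept[mid]) / 2

print(typical_line_length([0, 80, 82, 84, 300]))
# -> 82  (300 exceeds double the 136.5 average and is discarded)
```

The median resists skew from the occasional joined line, and the outlier pass keeps one very long run-on from dragging the average up before the median is taken.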
#23 · Sigil & calibre developer
While testing I've run into two issues: the regexes are not properly enclosing short lines that should not be wrapped in <p></p> tags, and they're not wrapping any lines for the test PDFs I'm giving them. Emailing me directly (john@nachtimwald.com) would probably speed up getting the rules working.
#24 · Wizard
Matching special punctuation, etc
Still working on getting a dev environment up and running for the pluginize issues, but in the meantime I was testing some refinements in the released code.
I keep running into situations where I want to handle special characters such as “, ”, ‗, ’, ‖, and …. The problem is that they don't get matched if I use a simple regex like the one below:
Code:
# Fix quotes mangled by pdftohtml
(re.compile(r'‗([^‗]+)‘(?!\w)'), lambda match: '‘'+match.group(1)+'’'),
(re.compile(r'―([^―]+)‖'), lambda match: '‘'+match.group(1)+'’'),

I also tried writing the characters as unicode escapes instead:

Code:
(re.compile(ur'\u2015([^―]+)\u2016', re.UNICODE), lambda match: '‘'+match.group(1)+'’'),

That gives me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

which I believe is telling me that the whole PDFTOHTML list is actually being handled as ASCII. I found this page with some info on the error: http://www.amk.ca/python/howto/unicode. But I can't figure out how to apply this advice so that these regexes look at the UTF-8 data the doc is supposed to be encoded in, instead of ASCII. Any suggestions?

Last edited by ldolse; 04-22-2009 at 05:37 AM.
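That error typically comes from mixing byte strings and unicode: in Python 2, when a plain str containing UTF-8 bytes (like '‘') meets a unicode pattern, Python implicitly decodes the str as ASCII and fails on any byte above 0x7f. The fix is to make both the patterns and the document text unicode before the regexes run. A minimal sketch, written in Python 3 syntax where every str is already unicode (under Python 2 you would .decode('utf-8') the input and use u'' literals throughout):

```python
import re

# \u2015 (HORIZONTAL BAR) and \u2016 (DOUBLE VERTICAL LINE) are what
# pdftohtml emits for the mangled quotes; rewrite as curly single quotes
fix_quotes = re.compile('\u2015([^\u2015]+)\u2016')

text = '\u2015Hello\u2016 he said.'   # already-decoded text, not raw bytes
print(fix_quotes.sub('\u2018\\g<1>\u2019', text))
# -> ‘Hello’ he said.
```

The key point is that pattern, replacement, and input must all be the same string type; once everything is unicode, characters like '“' match exactly as you'd expect.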