04-07-2009, 03:35 AM | #1 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Regexes to improve pdf to epub conversion
I've just started using Calibre to convert some PDF novels to epub. I was a bit disappointed with the output at first, but after digging into the XHTML file in the epub I came up with a few regex replacement expressions which massively improve the readability of the converted novels. I thought these may be of use to some people, so here they are:
Fixing Line Wrapping
The first issue is that all the lines are wrapped based on the original page size of the PDF, so the goal here was to write a regex which detects wrapped lines and 'un-wraps' them:
Code:
Search Pattern: ([a-z,I])\s?(</i>)?\s?(</p><p>)?\r(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-z])
Replacement Expression: \1\2\ \5
Code:
Search Pattern: (?=.{85})(.*)([a-z,I])\s?(</i>)?(</p><p>|<br/>)?\r(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-zA-Z])(?!hapter|HAPTER|PILOGUE|pilogue|rologue|ROLOGUE|bout|BOUT)
Replacement Expression: \1\2\3 \6
Forcing Line Breaks
The second issue is that Calibre doesn't reliably add <br/> line breaks to every line termination. These need to be added for paragraphs and conversations to break correctly. Note that this expression assumes that you have first run the line un-wrapping expressions above; running this expression by itself will generally make readability worse.
Code:
Search Pattern: (?<!br/>|p>|head>|body>|html>)$
Replacement Expression: <br/>
I'd love to see something like this built directly into Calibre, so that it happens automatically when converting PDF files and eliminates the manual processing.
Instructions
For anyone reading the above who is interested in continuing but has no idea how to proceed, here are some high level instructions:
Caveats
Last edited by ldolse; 04-07-2009 at 03:41 AM. |
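A minimal sketch of how the first un-wrapping expression above could be tried outside Calibre with Python's re module. The name unwrap and the sample strings are made up for illustration. The replacement is written as \1\2 \5 on the assumption that the backslash before the space in the post only escapes the space, and the samples use bare \r line endings because that is what the pattern looks for.

```python
import re

# First un-wrapping search pattern from the post, unchanged.
UNWRAP = re.compile(
    r'([a-z,I])\s?(</i>)?\s?(</p><p>)?\r'
    r'(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-z])'
)

def unwrap(text):
    # Keep the letter before the break (plus any closing </i>) and the
    # letter after it, and join them with a single space.
    # Unmatched optional groups expand to '' on Python 3.5 and later.
    return UNWRAP.sub(r'\1\2 \5', text)

sample = 'She walked to the\rdoor and knocked.\rNobody answered.'
result = unwrap(sample)
print(repr(result))
```

Only the first break is joined here. The second one follows a full stop and is followed by a capital letter, so the pattern leaves it alone, which is what makes this expression fairly conservative.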
04-07-2009, 12:10 PM | #2 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
calibre already runs regexps on the output of pdftohtml to detect line endings, but it's been a while since I optimized them, they're in the file html.py in the calibre source code, so you're welcome to suggest enhancements to them
|
04-07-2009, 12:47 PM | #3 |
Member
Posts: 11
Karma: 10
Join Date: Apr 2009
Device: Calibre
|
ok, what should I do when calibre detects the page numbers of a pdf file as chapters instead of the actual chapters???
it happens when I convert the file to epub |
04-07-2009, 12:48 PM | #4 |
Member
Posts: 11
Karma: 10
Join Date: Apr 2009
Device: Calibre
|
or is there anyone who would be willing to fix that for me and send me the finished epub file?? lol
just asking |
04-07-2009, 01:45 PM | #5 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Cool, I'll check out html.py and open some bugs. I think the first and third regexes are pretty conservative, so I'll see how they work across some other novels. The second one is a bit more questionable: it works well with what I've tested so far, but I'm not sure it's worth the time expense for most users with the existing greedy matches.
@StefTeamEdward, haven't run into a PDF with page numbers that need dealing with yet, so haven't put much thought there, sorry. So far Calibre doesn't recognize any chapters at all in the pdfs I've converted. |
04-07-2009, 04:17 PM | #6 | |
Member
Posts: 11
Karma: 10
Join Date: Apr 2009
Device: Calibre
|
Quote:
|
|
04-07-2009, 07:35 PM | #7 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Where is html.py? I'm on OS X, I browsed through the App package directories, but I couldn't find it.
|
04-07-2009, 10:13 PM | #8 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
http://bazaar.launchpad.net/%7Ekovid...ebooks/html.py
You can make changes then "install" them in your calibre by running calibre-debug --update-module calibre.ebooks.html,/path/to/new/html.py |
04-08-2009, 10:05 AM | #9 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
hmm... never coded in Python before.
I found the correct function I believe:
Code:
# Fix pdftohtml markup
PDFTOHTML = [
    # Remove <hr> tags
    (re.compile(r'<hr.*?>', re.IGNORECASE), lambda match: '<br />'),
    # Remove page numbers
    (re.compile(r'\d+<br>', re.IGNORECASE), lambda match: ''),
    # Remove <br> and replace <br><br> with <p>
    (re.compile(r'<br.*?>\s*<br.*?>', re.IGNORECASE), lambda match: '<p>'),
    (re.compile(r'(.*)<br.*?>', re.IGNORECASE),
     lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40 else match.group(1)),
    # Remove hyphenation
    (re.compile(r'-\n\r?'), lambda match: ''),
    # Remove gray background
    (re.compile(r'<BODY[^<>]+>'), lambda match : '<BODY>'),
    # Remove non breaking spaces
    (re.compile(ur'\u00a0'), lambda match : ' '),
]
Code:
(re.compile(r'(.*)<br.*?>', re.IGNORECASE),
 lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40 else match.group(1)),
Any advice to point me in the right direction? |
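For anyone else puzzling over that last rule, here is a small sketch of what the lambda appears to do when run on its own: the <br> is kept when the text before it starts with a tag or is shorter than 40 characters, and dropped otherwise. The helper apply_rule and the sample lines are assumptions for illustration, not calibre code.

```python
import re

# Same expression as the rule quoted above, kept as a (pattern, function) pair.
BR_RULE = (
    re.compile(r'(.*)<br.*?>', re.IGNORECASE),
    lambda match: match.group() if re.match('<', match.group(1).lstrip())
    or len(match.group(1)) < 40 else match.group(1),
)

def apply_rule(rule, html):
    # Hypothetical helper: run one (pattern, function) pair over a string.
    pattern, func = rule
    return pattern.sub(func, html)

short_line = 'Chapter One<br>'
long_line = 'The rain had been falling since early morning and showed<br>'
tagged_line = '<i>The rain had been falling since early morning and showed</i><br>'

for line in (short_line, long_line, tagged_line):
    print(repr(apply_rule(BR_RULE, line)))
```

The short line and the tagged line come back unchanged, while the long line loses its trailing <br>. Note that nothing is put in place of the removed tag, so joining with the next line depends on whatever whitespace follows.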
04-08-2009, 11:01 AM | #10 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You don't have to use a lambda function; you can use a normal function as well.
For example:
Code:
def my_function(match):
    # total length of the text captured by all groups in the pattern
    length_of_backrefs = len(''.join([match.group(i) for i in range(1, match.re.groups + 1)]))
    full_string = match.group()
    # do something to full_string here
    substituted_string = full_string
    return substituted_string

(re.compile(r'(.*)<br.*?>', re.IGNORECASE), my_function), |
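A runnable sketch of the same idea, with a named function passed to re.sub in place of the lambda. The rule itself (keep the break after sentence-ending punctuation, otherwise join with a space) is a made-up example and not what calibre does, and the names join_unless_sentence_end and apply_rule are hypothetical.

```python
import re

def join_unless_sentence_end(match):
    text_before = match.group(1)
    full_string = match.group()
    # Hypothetical rule: keep the <br> after ., !, ? or a closing quote.
    if text_before.rstrip().endswith(('.', '!', '?', '"')):
        return full_string
    # Otherwise drop the <br> and leave a space so the words do not run together.
    return text_before + ' '

RULE = (re.compile(r'(.*)<br.*?>', re.IGNORECASE), join_unless_sentence_end)

def apply_rule(rule, html):
    # Hypothetical helper: run one (pattern, function) pair over a string.
    pattern, func = rule
    return pattern.sub(func, html)

html = 'He opened the<br>\ndoor.<br>\n'
print(repr(apply_rule(RULE, html)))
```

A named function is easier to read than a long conditional lambda, and it leaves room for comments and intermediate variables.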
04-09-2009, 05:35 PM | #11 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
@ldolse, I used your regexes as a base and simplified them for Calibre's pdf input in the development branch (pluginize).
|
04-10-2009, 01:51 AM | #12 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
@user_none, where can I go to check out what you've done? I tried searching for pluginize through Launchpad, but didn't find anything. If pluginize refers to some sort of plugin architecture where users can enable or disable functions, that sounds like a better place to maintain them than html.py anyway.
Kovid's advice was enough to get me going, so I already started making some changes to my local copy of html.py. I don't want to duplicate any effort when it comes to submitting changes. I discovered some problems with at least one of the regexps there which causes some anomalies in some books, so I was going to submit those fixes at a minimum.
btw, I've tested on a number of other books. While the first regex doesn't get every wrapped line, I haven't seen any scenario where it makes things worse. I've found the second regex will join things like page headers and footers to the surrounding text (since they lack punctuation), which just winds up making those even more difficult to take out later. |
04-10-2009, 01:58 AM | #13 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I found pluginize, so no need to educate me on the branch, but still trying to find the link to the changes...
|
04-10-2009, 06:57 AM | #14 | |||
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
/src/calibre/ebooks/conversion/preprocess.py is where the processing rules for pdftohtml have moved to in pluginize. Right now you can find the changes in my driver-dev branch because Kovid hasn't merged them into the main branch yet. If you have any questions about the branches or what's going on with development, just ask, off board or on.
One thing to realize about the regex rules. They are not being applied to the raw output of pdftohtml. They build on one another so don't forget to take into account the rules before and how they change the markup. Quote:
Quote:
Quote:
|
|||
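A small sketch of why rule order matters when the rules build on one another. It uses two rules quoted earlier in the thread from html.py. The helper apply_rules and the sample string are assumptions for illustration, and applying the list strictly top to bottom is also an assumption about how the rules are run.

```python
import re

RULES = [
    # Remove <hr> tags
    (re.compile(r'<hr.*?>', re.IGNORECASE), lambda match: '<br />'),
    # Replace <br><br> with <p>
    (re.compile(r'<br.*?>\s*<br.*?>', re.IGNORECASE), lambda match: '<p>'),
]

def apply_rules(rules, html):
    # Hypothetical helper: each rule sees the output of the one before it.
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html

raw = 'End of scene.<hr><br>Next scene.'
in_order = apply_rules(RULES, raw)
reversed_order = apply_rules(list(reversed(RULES)), raw)
print(in_order)
print(reversed_order)
```

In the listed order the <hr> first becomes a <br />, which then pairs with the following <br> and turns into a <p>. With the order reversed the pair never forms, so the same input ends up with two line breaks instead of a paragraph.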
04-12-2009, 10:19 PM | #15 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Cool, found the changes now. I've just started using look-arounds, so I'm still getting used to the use cases. I tried out your modifications, and I'm seeing a couple of things: in many cases lines aren't getting un-wrapped, and lines which shouldn't be joined are now being joined.
Many lines don't get un-wrapped because not every line feed is accompanied by <br> or <p>. The second issue is lines that are getting joined that shouldn't be. This primarily affects poetry, as poetry lines are relatively short and lack punctuation. It can also affect conversations, depending on how they are formatted. That was why I originally added the check for 85 characters to the second expression. That way only lines which are approaching the length of the entire page are joined, as they would likely have been wrapped anyway. I've merged your optimizations with mine; here's my complete PDFTOHTML function.
Code:
def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital+' '

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>'+chap+'</h1><br/>'
    else:
        return '<h1>'+chap+'<br/>'+title+'</h1><br/>'

class PreProcessor(object):
    PREPROCESS = [
        # Some idiotic HTML generators (Frontpage I'm looking at you)
        # Put all sorts of crap into <head>. This messes up lxml
        (re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE|re.DOTALL), sanitize_head),
        # Convert all entities, since lxml doesn't handle them well
        (re.compile(r'&(\S+?);'), convert_entities),
        # Remove the <![if/endif tags inserted by everybody's darling, MS Word
        (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE), lambda match: ''),
    ]

    # Fix pdftohtml markup
    PDFTOHTML = [
        # Remove <hr> tags
        (re.compile(r'<hr[^>]*>', re.IGNORECASE), lambda match: '<br />'),
        # Remove page links
        (re.compile(r'<a name=\d+></a>', re.IGNORECASE), lambda match: ''),
        # Remove page numbers
        (re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE), lambda match: ''),
        # Remove <br> and replace <br><br> with <p>
        (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE), lambda match: '<p>'),
        (re.compile(r'(.*)<br[^>]*>', re.IGNORECASE),
         lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40 else match.group(1)),
        # un-wrap wrapped lines - uses two regexes
        (re.compile(r'(?<=[a-z,I])\s*(?P<ital></i>)?\s*(</p><p>)?\n\r?\s?(?=(<i>)?\w)', re.DOTALL), wrap_lines),
        (re.compile(r'(?<=.{85}[a-z,I])\s*(<p[^>]*>|<br[^>]*>)\s*(?=\w)', re.UNICODE), lambda match: ' '),
        # Add space before italics
        (re.compile(r'<i>'), lambda match: '<i> '),
        # Remove hyphenation
        (re.compile(r'-\n\r?'), lambda match: ''),
        # Remove gray background
        (re.compile(r'<BODY[^<>]+>'), lambda match : '<BODY>'),
        # Remove non breaking spaces
        (re.compile(ur'\u00a0'), lambda match : ' '),
        # Detect Chapters
        (re.compile(r'(<br[^>]*>)?(</?p[^>]*>)?\s*(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)(</?p[^>]*>|<br[^>]*>)\n?((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?', re.IGNORECASE), chap_head),
        # Have paragraphs show better
        (re.compile(r'(?<=.{85})<br[^>]*>$'), lambda match : '<p>'),
        # terminate unterminated lines.
        # (re.compile(r'(?<!r/>|<p>|/p>|ad>|dy>|ml>)\n'), lambda match : '<br/>\n'),
        (re.compile(r'(\.|”|\")\s*\n'), lambda match : match.group(1)+'<br/>\n'),
    ]
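To illustrate the 85-character check, here is a sketch that runs the second un-wrapping rule from the listing on its own. The name join_long_lines and the sample strings are invented for the example. The long line is just a repeated filler word, chosen so that its length is easy to verify.

```python
import re

# Second un-wrapping rule from the listing, copied unchanged.
LONG_LINE_JOIN = re.compile(
    r'(?<=.{85}[a-z,I])\s*(<p[^>]*>|<br[^>]*>)\s*(?=\w)', re.UNICODE)

def join_long_lines(html):
    # Replace the break with a space, but only where at least 86
    # characters on the same line precede it.
    return LONG_LINE_JOIN.sub(' ', html)

long_line = ' '.join(['word'] * 20)   # 99 characters, ends in a lowercase letter
prose = long_line + '<br>next'
poem = 'roses are red<br>violets are blue<br>'

print(join_long_lines(prose) == long_line + ' next')
print(join_long_lines(poem) == poem)
```

The break after the long line is replaced with a space, while the short verse lines keep their breaks because the look-behind cannot find 85 characters before the final letter.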
I've been struggling a bit getting the regexes right, as there is another stage of processing that the HTML goes through somewhere (lxml??), so I'm working a bit blindly. Chapters are represented very differently at the time the HTML hits the PDFTOHTML function compared to the final output. Also, for some reason adding the chapter detection regexp causes <p></p> to come out differently in the final version, with margin and border settings all set to 0 points. Lastly, I can't seem to get the regex to terminate line endings working. This is the regex I'd like to use:
Code:
(re.compile(r'(?<!r/>|<p>|/p>|ad>|dy>|ml>)\n'), lambda match : '<br/>\n'),
I'm using this one as an alternate:
Code:
(re.compile(r'(\.|”|\")\s*\n'), lambda match : match.group(1)+'<br/>\n'),
|
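A sketch of the preferred line-terminating regex run in isolation, in case it helps with debugging. Python's re module only accepts fixed-width look-behinds, so the variable-width pattern from the first post is rejected, while the version with every alternative trimmed to three characters compiles. The name terminate_lines and the sample strings are made up, and this does not claim to explain why the rule fails inside calibre. One thing that may be worth checking is whether the lines actually end in \r\n, because then the characters before \n never match any of the alternatives.

```python
import re

# The rule from the post: add <br/> before a newline unless the line
# already ends in one of the listed three-character tag endings.
TERMINATE = re.compile(r'(?<!r/>|<p>|/p>|ad>|dy>|ml>)\n')

def terminate_lines(html):
    return TERMINATE.sub('<br/>\n', html)

# The search pattern from the first post has alternatives of different
# lengths, which Python's re module refuses to compile.
try:
    re.compile(r'(?<!br/>|p>|head>|body>|html>)$')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

unix_sample = '<p>First line\nSecond line<br/>\n</p>\n'
dos_sample = 'Second line<br/>\r\n'

print(repr(terminate_lines(unix_sample)))
print(repr(terminate_lines(dos_sample)))
```

With plain \n endings only the unterminated first line gains a <br/>. With \r\n endings the already terminated line gets a second <br/>, because the look-behind sees the \r and not the tag.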
|