MobileRead Forums - View Single Post - Regexes to improve pdf to epub conversion

ldolse · 04-07-2009, 03:35 AM

I've just started using Calibre to convert some PDF novels to epub. I was a bit disappointed with the output at first, but after digging into the XHTML file in the epub I came up with a few regex replacement expressions which massively improve the readability of the converted novels. I thought these might be of use to some people, so here they are:

Fixing Line Wrapping
The first issue is that all the lines are wrapped based on the original page size of the pdf, so the goal here was to write a regex which detects wrapped lines and 'un-wraps' them:

Code:

Search Pattern:
([a-z,I])\s?(</i>)?\s?(</p><p>)?\r(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-z])

Replacement Expression:
\1\2 \5
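For anyone who would rather script this than use a text editor, here is a rough Python sketch of the same search and replace. It is an adaptation rather than the exact expression: the function name unwrap_simple is made up for illustration, and the "\r" in the pattern is widened to accept "\n" and "\r\n" as well, on the assumption that the file may not use bare carriage returns.

```python
import re

# Port of the first search pattern. Assumption: "\r" is widened to
# "\r?\n|\r" so the pattern works whatever line ending the file uses.
# The added (?:...) group is non-capturing, so group numbers are unchanged.
SIMPLE_UNWRAP = re.compile(
    r'([a-z,I])\s?(</i>)?\s?(</p><p>)?(?:\r?\n|\r)'
    r'(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-z])'
)

def unwrap_simple(xhtml: str) -> str:
    # \1 = last character of the wrapped line, \2 = optional </i>,
    # \5 = first character of the next line (with an optional <i>).
    return SIMPLE_UNWRAP.sub(r'\1\2 \5', xhtml)

sample = 'the quick brown fox jumped over</p><p>\nthe lazy dog.'
print(unwrap_simple(sample))
# prints: the quick brown fox jumped over the lazy dog.
```

A line that ends in a full stop, or whose successor starts with a capital letter, is left alone by this pattern.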

The above is a relatively inexpensive regex, but it doesn't catch every break - it captures around 90% of the wrapped lines in the couple of books I've tested so far. This can be followed up with this expression:

Code:

Search Pattern:
(?=.{85})(.*)([a-z,I])\s?(</i>)?(</p><p>|<br/>)?\r(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-zA-Z])(?!hapter|HAPTER|PILOGUE|pilogue|rologue|ROLOGUE|bout|BOUT)

Replacement Expression:
\1\2\3 \6

Running both expressions sequentially appears to catch nearly 100% of the wrapped lines. For some reason, running the second by itself doesn't match 100% of the wrapped lines either, so the best result comes from running both regex replacements: the simple one first, the expensive one second.
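The two-pass approach could be scripted along these lines in Python. This is only a sketch under some assumptions: the names LINE_BREAK, SIMPLE, EXPENSIVE and unwrap are invented for the example, and the "\r" from both patterns is widened so that "\n" and "\r\n" line endings also match.

```python
import re

LINE_BREAK = r'(?:\r?\n|\r)'  # assumption: accept any line-ending style

# First (cheap) pattern: next line must start with a lowercase letter.
SIMPLE = re.compile(
    r'([a-z,I])\s?(</i>)?\s?(</p><p>)?' + LINE_BREAK +
    r'(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-z])'
)

# Second (expensive) pattern: only lines with at least 85 characters,
# next line may start with a capital unless it looks like a heading.
EXPENSIVE = re.compile(
    r'(?=.{85})(.*)([a-z,I])\s?(</i>)?(</p><p>|<br/>)?' + LINE_BREAK +
    r'(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-zA-Z])'
    r'(?!hapter|HAPTER|PILOGUE|pilogue|rologue|ROLOGUE|bout|BOUT)'
)

def unwrap(xhtml: str) -> str:
    # Simple pass first, expensive pass second, as described above.
    xhtml = SIMPLE.sub(r'\1\2 \5', xhtml)
    return EXPENSIVE.sub(r'\1\2\3 \6', xhtml)

long_line = 'word ' * 18 + 'and then'   # 98 characters, ends in lowercase
print(unwrap(long_line + '\nParis was lovely.'))
print(unwrap(long_line + '\nChapter 2'))  # heading: left on its own line
```

The 85-character look-ahead is what stops short lines, such as dialogue or headings, from being joined to a following line that starts with a capital.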

Forcing Line Breaks
The second issue is that Calibre doesn't reliably add <br/> line breaks to every line termination. In order for paragraphs and conversations to break correctly, these need to be added. Note that this expression assumes that you have first run the line un-wrapping expressions above; running this expression by itself will generally make readability worse.

Code:

Search Pattern:
(?<!br/>|p>|head>|body>|html>)$

Replacement Expression:
<br/>
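If you try this one in Python, note that the re module only accepts fixed-width look-behind, so the alternation inside (?<!...) is rejected as written. A possible workaround, sketched here with made-up names (NEEDS_BREAK, force_breaks) and assuming the file uses "\n" line endings, is to chain one look-behind per ending:

```python
import re

# One negative look-behind per ending, since Python's re module does not
# allow alternatives of different lengths inside a single look-behind.
# Assumption: the file uses "\n" line endings.
NEEDS_BREAK = re.compile(
    r'(?<!br/>)(?<!p>)(?<!head>)(?<!body>)(?<!html>)$',
    re.MULTILINE,
)

def force_breaks(xhtml: str) -> str:
    # Append <br/> to every line that does not already end in one of
    # the listed tags.
    return NEEDS_BREAK.sub('<br/>', xhtml)

sample = 'Hello there.\n<p>Next</p>\nDone<br/>'
print(force_breaks(sample))
```

As with the original expression, blank lines also match and pick up a <br/>, so it is worth checking the result on a small excerpt first.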

I hope some people find these useful, and if any regex experts have some advice for improvements it would be appreciated. I couldn't find any way to make the second expression non-greedy because of the look-ahead pattern. I'll integrate any improvements into this post.

I'd love to see something like this built directly into Calibre to automatically do this when converting pdf files to eliminate the manual processing one needs to do.

Instructions
For anyone who has read the above and is interested in continuing, but has no idea how to proceed, here are some high level instructions:

Get a text editor that supports regex replacements - EmEditor or TextPad on Windows, Text Wrangler or Smultron on Mac. Unix experts could use a shell script
Unzip the original epub file output by Calibre using any archive utility. The utility may require the .epub extension to be changed to .zip
From the extracted contents, open the file "/content/index.xhtml" in your text editor
Using the find and replace function, specify the search pattern is a regex, and use the search and replace patterns from this post in order
If you have no interest in chapter splits, save the file and re-zip the archive, delete the original and give the new archive the same name. If you're interested in having chapters properly split, read on.
Each chapter heading needs to be surrounded by <h1></h1> or <h2></h2> tags to be detected as a chapter using Calibre's XPath expression. Each book will have a slightly different layout of chapters and chapter titles, but it's simple to write a search and replace regex to surround all the chapter headings with these tags.
Save the file after adjusting each chapter or section header. Now go back to Calibre, edit the book, and click the button to add a new format.
Navigate to the index.xhtml file you just saved and have Calibre import that as an additional ebook.
Right click on the book, select convert e-books -> Convert individually. Select the zip archive from the list of formats and proceed through the conversion dialogs. Calibre will then create an ebook with proper chapter splits.
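The unzip, edit and re-zip steps above could also be scripted. Below is a minimal Python sketch using the standard zipfile module; rewrite_epub and tag_chapters are names invented for this example, and the "Chapter N alone on a line" heading layout is purely hypothetical, since every book lays out its headings differently.

```python
import re
import zipfile

def rewrite_epub(src, dst, member, transform):
    """Copy the epub at src to dst, passing one member's text through transform."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, 'w') as zout:
        # infolist() keeps the archive order, so "mimetype" stays first
        # if it was first in the original.
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == member:
                data = transform(data.decode('utf-8')).encode('utf-8')
            # The epub container expects "mimetype" to be stored
            # uncompressed; everything else can be deflated.
            method = (zipfile.ZIP_STORED if info.filename == 'mimetype'
                      else zipfile.ZIP_DEFLATED)
            zout.writestr(info.filename, data, compress_type=method)

def tag_chapters(xhtml):
    # Hypothetical layout: headings look like "Chapter 12" alone on a line.
    return re.sub(r'^(Chapter \d+)$', r'<h2>\1</h2>', xhtml,
                  flags=re.MULTILINE)

# Example call (file names are placeholders):
# rewrite_epub('book.epub', 'book-fixed.epub',
#              'content/index.xhtml', tag_chapters)
```

Any of the regex replacements from this post could be passed as the transform in the same way, one after another or combined into a single function.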

Caveats

I used Text Wrangler on OS X to create these expressions - other regex implementations may require slightly different regexes.
Many text editors supporting replacement expressions use different syntaxes for the replacements. For example in some cases "\1\2 \5" would be "$(1)$(2) $(5)" or something similar.
There are a few scenarios where these regexes may unwrap lines that shouldn't be unwrapped, but these should be minimal - most of the complexity in the expressions is there to prevent this from occurring.
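On the replacement syntax caveat: if your editor wants dollar-style references, the conversion is mechanical. Here is a small Python sketch (backrefs_to_dollar is a hypothetical helper, and it only handles single-digit references like the ones used in this post):

```python
import re

def backrefs_to_dollar(replacement: str, parens: bool = False) -> str:
    # Turn "\1" into "$1", or into "$(1)" when parens is True.
    # Only single-digit group references are handled.
    template = r'$(\1)' if parens else r'$\1'
    return re.sub(r'\\(\d)', template, replacement)

print(backrefs_to_dollar(r'\1\2 \5'))               # $1$2 $5
print(backrefs_to_dollar(r'\1\2 \5', parens=True))  # $(1)$(2) $(5)
```

Check your editor's documentation for which form it expects before pasting the converted expression.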

Last edited by ldolse; 04-07-2009 at 03:41 AM.