MobileRead Forums > E-Book Software > Calibre

Old 04-07-2009, 03:35 AM   #1
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Regexes to improve pdf to epub conversion

I've just started using Calibre to convert some PDF novels to epub. I was a bit disappointed with the output at first, but after digging into the XHTML file in the epub I came up with a few regex replacement expressions which massively improve the readability of the novels I've converted. I thought these might be of use to some people, so here they are:

Fixing Line Wrapping
The first issue is that all the lines are wrapped based on the original page size of the pdf, so the goal here was to write a regex which detects wrapped lines and 'un-wraps' them:

Code:
Search Pattern:
([a-z,I])\s?(</i>)?\s?(</p><p>)?\r(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-z])

Replacement Expression:
\1\2 \5
The above is a relatively inexpensive regex, but it doesn't catch every break: it captures around 90% of the wrapped lines in the couple of books I've tested so far. It can be followed up with this expression:
Code:
Search Pattern:
(?=.{85})(.*)([a-z,I])\s?(</i>)?(</p><p>|<br/>)?\r(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-zA-Z])(?!hapter|HAPTER|PILOGUE|pilogue|rologue|ROLOGUE|bout|BOUT)

Replacement Expression:
\1\2\3 \6
Running both expressions sequentially appears to catch nearly 100% of the wrapped lines. For some reason, running the second by itself doesn't match 100% of the wrapped lines either, so the best result comes from running both replacements: the simple one first, the expensive one second.
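
For anyone who would rather script this than use a text editor, here is a minimal Python sketch of running the two patterns in sequence. It makes a few assumptions that are not in the post: the text uses "\n" line endings (so "\r" in the patterns is swapped for "\n"), and Python 3.5 or later is used so that optional groups which did not match are substituted as empty strings. The function name unwrap is only illustrative.

```python
import re

# First (cheap) pattern from the post, with "\r" swapped for "\n"
# (assumption: Unix line endings once the text is loaded into Python).
SIMPLE = re.compile(
    r'([a-z,I])\s?(</i>)?\s?(</p><p>)?\n(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-z])'
)

# Second (expensive) pattern, with the same "\r" -> "\n" substitution.
GREEDY = re.compile(
    r'(?=.{85})(.*)([a-z,I])\s?(</i>)?(</p><p>|<br/>)?\n'
    r'(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-zA-Z])'
    r'(?!hapter|HAPTER|PILOGUE|pilogue|rologue|ROLOGUE|bout|BOUT)'
)


def unwrap(text):
    """Run the cheap pattern first, then the expensive one."""
    text = SIMPLE.sub(r'\1\2 \5', text)
    return GREEDY.sub(r'\1\2\3 \6', text)


sample = "\n".join([
    "the quick brown fox",        # joined to the next line by SIMPLE
    "jumped over the lazy dog.",  # ends in a period, so it stays terminated
    "a" * 90,                     # long line, joined to "And more." by GREEDY
    "And more.",
    "b" * 90,                     # long line, but "Chapter" is protected
    "Chapter 2",
])
print(unwrap(sample))
```

The second pattern only fires on lines with at least 85 characters, which is why the short "dog." line is left alone and the chapter heading keeps its own line.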

Forcing Line Breaks
The second issue is that Calibre doesn't reliably add <br/> line breaks at every line termination. These need to be added for paragraphs and conversations to break correctly. Note that this expression assumes that you have first run the line un-wrapping expressions above; running it by itself will generally make readability worse.
Code:
Search Pattern:
(?<!br/>|p>|head>|body>|html>)$

Replacement Expression:
<br/>
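
A note for anyone trying this pattern in Python rather than a text editor: Python's re module rejects look-behinds whose alternatives differ in length, so the pattern above will not compile there as written. A minimal sketch of one workaround, assuming "\n" line endings, is to chain one negative look-behind per ending. The name add_breaks is only illustrative.

```python
import re

# One fixed-width negative look-behind per excluded ending, chained together.
# re.MULTILINE makes "$" match at the end of every line, not just the string.
LINE_END = re.compile(
    r'(?<!br/>)(?<!p>)(?<!head>)(?<!body>)(?<!html>)$',
    re.MULTILINE,
)


def add_breaks(text):
    """Append <br/> to every line that does not already end in a listed tag."""
    return LINE_END.sub('<br/>', text)


sample = '<p>"Hello," she said.\n"Goodbye," he replied.<br/>\nThe end.</p>'
print(add_breaks(sample))
```

With re.MULTILINE, "$" also matches on blank lines and after a trailing newline at the very end of the text, so real input may need some extra handling for those cases.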
I hope some people find these useful, and if any regex experts have some advice for improvements it would be appreciated. I couldn't find any way to make the second expression non-greedy because of the look-ahead pattern. I'll integrate any improvements into this post.

I'd love to see something like this built directly into Calibre to automatically do this when converting pdf files to eliminate the manual processing one needs to do.

Instructions
For anyone who has read the above and is interested in continuing, but has no idea how to proceed, here are some high-level instructions:
  1. Get a text editor that supports regex replacements - emeditor or textpad on Windows, Text Wrangler or Smultron on Mac, Unix experts could use a shell script
  2. Unzip the original epub file output by Calibre using any archive utility, the utility may require the .epub extension be changed to .zip
  3. From the extracted contents, open the file "/content/index.xhtml" in your text editor
  4. Using the find and replace function, specify the search pattern is a regex, and use the search and replace patterns from this post in order
  5. If you have no interest in Chapter Splits, Save the file and re-zip the archive, delete the original and give the new archive the same name. If you're interested in having Chapters properly split, read on.
  6. Each chapter heading needs to be surrounded by <h1></h1> or <h2></h2> tags to be detected as a chapter using Calibre's XPath expression. Each book will have a slightly different layout of chapters and chapter titles, but it's simple to write a search and replace regex to surround all the chapter headings with these tags.
  7. Save the file after adjusting each chapter or section header. Now go back to Calibre, edit the book, and click the button to add a new format.
  8. Navigate to the index.xhtml file you just saved and have Calibre import that as an additional ebook.
  9. Right click on the book, select convert e-books -> Convert individually. Select the zip archive from the list of formats and proceed through the conversion dialogs. Calibre will then create an ebook with proper chapter splits.
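
Steps 2 to 5 above can also be scripted. The following is a rough Python sketch rather than a tested workflow: the function name rewrite_epub and its parameters are made up for illustration, the member name "content/index.xhtml" is taken from step 3 without the leading slash (zip member names don't carry one), and the file is assumed to be UTF-8.

```python
import re
import zipfile


def rewrite_epub(src_path, dst_path, rules, member="content/index.xhtml"):
    """Copy an epub, applying (pattern, replacement) rules to one member."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                text = data.decode("utf-8")  # assumption: UTF-8 content
                for pattern, replacement in rules:
                    text = re.sub(pattern, replacement, text)
                data = text.encode("utf-8")
            # The epub container expects "mimetype" to be stored uncompressed;
            # iterating infolist() keeps the original entry order.
            if info.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=method)
```

Usage would look something like rewrite_epub("book.epub", "book-fixed.epub", [(search_pattern, replacement)]), with the rules listed in the order they should run. Writing to a new file leaves the original archive untouched in case a pattern misfires.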


Caveats
  • I used Text Wrangler on OS X to create these expressions; other regex implementations may require slightly different regexes.
  • Many text editors supporting replacement expressions use different syntaxes for the replacements. For example in some cases "\1\2 \5" would be "$(1)$(2) $(5)" or something similar.
  • There are a few scenarios where these regexes may unwrap lines that shouldn't be unwrapped, but these should be minimal - most of the complexity in the expressions is there to prevent this from occurring.

Last edited by ldolse; 04-07-2009 at 03:41 AM.
Old 04-07-2009, 12:10 PM   #2
kovidgoyal
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
calibre already runs regexps on the output of pdftohtml to detect line endings, but it's been a while since I optimized them. They're in the file html.py in the calibre source code, so you're welcome to suggest enhancements to them.
Old 04-07-2009, 12:47 PM   #3
StefTeamEdward
Member
Posts: 11
Karma: 10
Join Date: Apr 2009
Device: Calibre
OK, what should I do when calibre detects the page numbers of a PDF file as chapters instead of the actual chapters?
It happens when I convert the file to epub.
Old 04-07-2009, 12:48 PM   #4
StefTeamEdward
Member
Posts: 11
Karma: 10
Join Date: Apr 2009
Device: Calibre
Or is there anyone who would be willing to fix that for me and send me the finished epub file? lol
Just asking.
Old 04-07-2009, 01:45 PM   #5
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Cool, I'll check out html.py and open some bugs. I think the first and third regexes are pretty conservative, so I'll see how they work across some other novels. The second one is a bit more questionable: it works well with what I've tested so far, but I'm not sure it's worth the time expense for most users with the existing greedy matches.

@StefTeamEdward, haven't run into a PDF with page numbers that need dealing with yet, so haven't put much thought there, sorry. So far Calibre doesn't recognize any chapters at all in the pdfs I've converted.
Old 04-07-2009, 04:17 PM   #6
StefTeamEdward
Member
Posts: 11
Karma: 10
Join Date: Apr 2009
Device: Calibre
Quote:
Originally Posted by ldolse View Post
Cool, I'll check out html.py and open some bugs. I think the first and third regexes are pretty conservative, so I'll see how they work across some other novels. The second one is a bit more questionable: it works well with what I've tested so far, but I'm not sure it's worth the time expense for most users with the existing greedy matches.

@StefTeamEdward, haven't run into a PDF with page numbers that need dealing with yet, so haven't put much thought there, sorry. So far Calibre doesn't recognize any chapters at all in the pdfs I've converted.
ok thanks for replying
Old 04-07-2009, 07:35 PM   #7
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Where is html.py? I'm on OS X, I browsed through the App package directories, but I couldn't find it.
Old 04-07-2009, 10:13 PM   #8
kovidgoyal
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
http://bazaar.launchpad.net/%7Ekovid...ebooks/html.py


You can make changes then "install" them in your calibre by running

calibre-debug --update-module calibre.ebooks.html,/path/to/new/html.py
Old 04-08-2009, 10:05 AM   #9
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
hmm... never coded in Python before.

I found the correct function I believe:

Code:
    # Fix pdftohtml markup
    PDFTOHTML  = [
                  # Remove <hr> tags
                  (re.compile(r'<hr.*?>', re.IGNORECASE), lambda match: '<br />'),
                  # Remove page numbers
                  (re.compile(r'\d+<br>', re.IGNORECASE), lambda match: ''),
                  # Remove <br> and replace <br><br> with <p>
                  (re.compile(r'<br.*?>\s*<br.*?>', re.IGNORECASE), lambda match: '<p>'),
                  (re.compile(r'(.*)<br.*?>', re.IGNORECASE), 
                   lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40 
                                else match.group(1)),
                  # Remove hyphenation
                  (re.compile(r'-\n\r?'), lambda match: ''),
                  
                  # Remove gray background
                  (re.compile(r'<BODY[^<>]+>'), lambda match : '<BODY>'),
                  
                  # Remove non breaking spaces
                  (re.compile(ur'\u00a0'), lambda match : ' '),
                  
                  ]
Looks straightforward for replacing basic expressions, however I can't figure out how to use backreferences with lambda match. It looks to me like this snippet from above is close to what I want:
Code:
 (re.compile(r'(.*)<br.*?>', re.IGNORECASE), 
                   lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40 
                                else match.group(1)),
Looks to me like group(1) is the back reference, and the function is checking to see if the length of the match is less than 40 characters before replacing the text. But in my case I've got 2-4 backreferences I need to concatenate, probably using the join() function somehow. I searched around a bit on the net, but I didn't see any good examples, at least not ones that are remotely similar to how you've structured the functions above.

Any advice to point me in the right direction?
Old 04-08-2009, 11:01 AM   #10
kovidgoyal
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You don't have to use a lambda function; you can use a normal function as well.


For example

Code:
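def my_function(match):
   # Join every captured group; groups that did not match are None, so use '' for them
   backrefs = ''.join([group or '' for group in match.groups()])
   length_of_backrefs = len(backrefs)
   full_string = match.group()
   # do something to full string (placeholder: return the match unchanged)
   substituted_string = full_string
   return substituted_string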
And in the regexp list

(re.compile(r'(.*)<br.*?>', re.IGNORECASE), my_function),
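
Applied to the first un-wrapping pattern from post #1, a named replacement function that concatenates several backreferences could look like the sketch below. It assumes "\n" line endings in place of "\r"; the names join_wrapped and unwrap are only illustrative and are not taken from html.py.

```python
import re

# First un-wrapping pattern from post #1, with "\r" swapped for "\n".
PATTERN = re.compile(
    r'([a-z,I])\s?(</i>)?\s?(</p><p>)?\n(<a name="\d+" id="\d+"/>)?\s?((<i>)?[a-z])'
)


def join_wrapped(match):
    """Equivalent of the replacement expression \\1\\2 \\5."""
    before = match.group(1)
    # Optional groups that took no part in the match come back as None.
    closing_italic = match.group(2) or ''
    after = match.group(5)
    return before + closing_italic + ' ' + after


def unwrap(text):
    return PATTERN.sub(join_wrapped, text)


print(unwrap("he said <i>no</i>\nand left"))
```

Plain string concatenation is enough here; join() only becomes handy when the number of groups varies.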
Old 04-09-2009, 05:35 PM   #11
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
@ldolse, I used your regexes as a base and simplified them for Calibre's pdf input in the development branch (pluginize).
Old 04-10-2009, 01:51 AM   #12
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
@user_none, where can I go to check out what you've done? I tried searching for pluginize through Launchpad, but didn't find anything. If pluginize refers to some sort of plugin architecture where users can enable or disable functions, that sounds like a better place to maintain them than html.py anyway.

Kovid's advice was enough to get me going, so I already started making some changes to my local copy of html.py. I don't want to duplicate any effort when it comes to submitting changes. I discovered some problems with at least one of the regexps there which causes some anomalies in some books, so I was going to submit those fixes at a minimum.

btw, I've tested on a number of other books. While the first regex doesn't get every wrapped line, I haven't seen any scenario where it makes things worse. I've found the second regex will wrap things like page headers and footers (since they lack punctuation), which just winds up making those even more difficult to take out later.
Old 04-10-2009, 01:58 AM   #13
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I found pluginize, so no need to educate me on the branch, but still trying to find the link to the changes...
Old 04-10-2009, 06:57 AM   #14
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
/src/calibre/ebooks/conversion/preprocess.py is where the processing rules for pdftohtml have moved to in pluginize. Right now you can find the changes in my driver-dev branch because Kovid hasn't merged them into the main branch yet. If you have any questions about the branches or what's going on with development, just ask, off board or on.

One thing to realize about the regex rules: they are not being applied to the raw output of pdftohtml. They build on one another, so don't forget to take into account the earlier rules and how they change the markup.
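
To illustrate the point about ordering, here is a small Python sketch of a rule list applied in sequence. The first rule is the <br><br> rule quoted earlier in the thread; the second rule and the apply_rules helper are hypothetical and only there to show that a later rule has to match the markup left behind by earlier ones.

```python
import re

RULES = [
    # Rule 1 (from the thread): collapse a double <br> into <p>.
    (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE), lambda match: '<p>'),
    # Rule 2 (hypothetical): runs on the output of rule 1, so it has to look
    # for <p> between two lowercase words, not for the original <br><br>.
    (re.compile(r'(?<=[a-z,])\s*<p>\s*(?=[a-z])'), lambda match: ' '),
]


def apply_rules(text, rules=RULES):
    """Apply each (pattern, replacement) pair in order, feeding results forward."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


raw = "the cat<br><br>sat down"
print(apply_rules(raw))                         # both rules take effect
print(apply_rules(raw, list(reversed(RULES))))  # rule 2 runs too early
```

In the reversed order the second rule finds nothing to match, so the stray <p> survives.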

Quote:
Originally Posted by ldolse
If pluginize refers to some sort of plugin architecture where users can enable or disable functions...
That is the general idea. Not everything is being moved into a plugin though. The processing rules for pdftohtml won't be a plugin but PDF input itself is a plugin.

Quote:
Originally Posted by ldolse
I've found the second regex will wrap things like page headers and footers (since they lack punctuation)
Indeed this happens. I'm happy to merge any fixes for this that you come up with.

Quote:
Originally Posted by ldolse
I don't want to duplicate any effort when it comes to submitting changes.
Don't worry about it. The better solution wins, especially in this case, where regexes can always be improved to take more cases into account. All I did was spend a few minutes fiddling with your rules to get them working with the other processing rules. I also simplified them to use a look-behind and look-ahead instead of match groups because I find them easier to work with. At the very least my changes will help you understand how the regexes work in the preprocessor.
Old 04-12-2009, 10:19 PM   #15
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Cool, found the changes now. I've just started using look-around functions, so I'm still getting used to the use cases. I tried out your modifications, and I'm seeing a couple of things: in many cases lines aren't getting un-wrapped, and lines which shouldn't be un-wrapped are now being joined.

Many lines don't get un-wrapped because not every line feed is accompanied by <br> or <p>.

The secondary issue is lines that are getting joined when they shouldn't be. This primarily affects poetry, as poetry lines are relatively short and lack punctuation. It can also affect conversations, depending on how they are formatted. That was why I originally added the check for 85 characters to the second expression. That way only lines which approach the full width of the page are joined, as they would likely have been wrapped anyway.
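
A quick way to see the effect of the 85-character check is to run the length-guarded rule on a short verse and on a long prose line. This is a minimal Python 3 sketch using the look-behind form of the rule from the list that follows; the sample text and the name unwrap_long are made up for illustration.

```python
import re

# Length-guarded rule: the look-behind demands 85 characters on the same line
# ("." does not cross a newline) followed by a lowercase letter, comma or "I".
LONG_LINE = re.compile(r'(?<=.{85}[a-z,I])\s*(<p[^>]*>|<br[^>]*>)\s*(?=\w)')


def unwrap_long(text):
    return LONG_LINE.sub(' ', text)


poem = "Roses are red<br>\nViolets are blue<br>\n"
long_line = ("lorem " * 15).strip()   # 89 characters, ends in a lowercase letter
prose = long_line + "<br>\nipsum dolor"

print(unwrap_long(poem))    # short lines keep their breaks
print(unwrap_long(prose))   # the long line is joined to the next one
```

The look-behind has a fixed width of 86 characters, which Python's re module accepts.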

I've merged your optimizations with mine, here's my complete PDFTOHTML function.

Code:
def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital+' '

def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>'+chap+'</h1><br/>'
    else:
        return '<h1>'+chap+'<br/>'+title+'</h1><br/>'
    
    
class PreProcessor(object):
    PREPROCESS = [
                  # Some idiotic HTML generators (Frontpage I'm looking at you)
                  # Put all sorts of crap into <head>. This messes up lxml
                  (re.compile(r'<head[^>]*>(.*?)</head>', re.IGNORECASE|re.DOTALL), 
                   sanitize_head),
                  # Convert all entities, since lxml doesn't handle them well
                  (re.compile(r'&(\S+?);'), convert_entities),
                  # Remove the <![if/endif tags inserted by everybody's darling, MS Word
                  (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE), 
                   lambda match: ''),
                  ]
                                          
    # Fix pdftohtml markup
    PDFTOHTML  = [
                  # Remove <hr> tags
                  (re.compile(r'<hr[^>]*>', re.IGNORECASE), lambda match: '<br />'),
                  
                  # Remove page links
                  (re.compile(r'<a name=\d+></a>', re.IGNORECASE), lambda match: ''),
                  
                  # Remove page numbers
                  (re.compile(r'(?<!\w)\s(\d+)\s?(<br>|<br/>|</p><p>)', re.IGNORECASE), lambda match: ''),
                  # Remove <br> and replace <br><br> with <p>
                  (re.compile(r'<br[^>]*>\s*<br[^>]*>', re.IGNORECASE), lambda match: '<p>'),
                  (re.compile(r'(.*)<br[^>]*>', re.IGNORECASE), 
                   lambda match: match.group() if re.match('<', match.group(1).lstrip()) or len(match.group(1)) < 40 
                                else match.group(1)),
                  
                  # un-wrap wrapped lines - uses two regexes
                  (re.compile(r'(?<=[a-z,I])\s*(?P<ital></i>)?\s*(</p><p>)?\n\r?\s?(?=(<i>)?\w)', re.DOTALL), wrap_lines),
                  (re.compile(r'(?<=.{85}[a-z,I])\s*(<p[^>]*>|<br[^>]*>)\s*(?=\w)', re.UNICODE), lambda match: ' '),
           
                  # Add space before italics
                  (re.compile(r'<i>'), lambda match: '<i> '),
                  
                  # Remove hyphenation
                  (re.compile(r'-\n\r?'), lambda match: ''),
                  
                  # Remove gray background
                  (re.compile(r'<BODY[^<>]+>'), lambda match : '<BODY>'),
                  
                  # Remove non breaking spaces
                  (re.compile(ur'\u00a0'), lambda match : ' '),
                  
                  # Detect Chapters
                  (re.compile(r'(<br[^>]*>)?(</?p[^>]*>)?\s*(?P<chap>(Chapter|Epilogue|Prologue|Book|Part)\s*(\d+|\w+)?)(</?p[^>]*>|<br[^>]*>)\n?((?=(<i>)?\s*\w+(\s+\w+)?(</i>)?(<br[^>]*>|</?p[^>]*>))((?P<title>.*)(<br[^>]*>|</?p[^>]*>)))?', re.IGNORECASE), chap_head),
                  
                  # Have paragraphs show better
                  (re.compile(r'(?<=.{85})<br[^>]*>$'), lambda match : '<p>'),

                  # terminate unterminated lines.
                  # (re.compile(r'(?<!r/>|<p>|/p>|ad>|dy>|ml>)\n'), lambda match : '<br/>\n'),
                  (re.compile(r'(\.|”|\")\s*\n'), lambda match : match.group(1)+'<br/>\n'),
                  
                  ]
Other changes included here:
  • Modified the page number stripping regexp so that it doesn't strip chapter numbers
  • Added a chapter detection regexp so the XPath in the GUI will find chapters surrounded by <h1></h1>
  • Added a space next to occurrences of <i> so that italics don't run into neighboring words
  • Changed uses of .*? in tags to [^>]*
  • Attempted to add <br/> to unterminated lines
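
As a smaller illustration of the chapter detection idea, the sketch below pairs the chap_head function with a deliberately simplified pattern. The pattern is hypothetical and much narrower than the one in the list: it assumes each heading sits on its own line ending in "<br/>" followed by a newline, and treats a following line of one or two words as the chapter title.

```python
import re


def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>' + chap + '</h1><br/>'
    return '<h1>' + chap + '<br/>' + title + '</h1><br/>'


# Simplified, hypothetical pattern: a heading word plus an optional number or
# word on a line of its own, optionally followed by a short title line.
CHAPTER = re.compile(
    r'^(?P<chap>(?:Chapter|Epilogue|Prologue|Book|Part)(?:[ \t]+\w+)?)<br/>\n'
    r'(?:(?P<title>\w+(?: \w+)?)<br/>\n)?',
    re.IGNORECASE | re.MULTILINE,
)


def mark_chapters(text):
    return CHAPTER.sub(chap_head, text)


print(mark_chapters("Chapter 1<br/>\nThe Beginning<br/>\nIt was a dark night.<br/>\n"))
```

A body line such as "Part of the reason was money.<br/>" is left alone because the heading has to fill the whole line.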

I've been struggling a bit to get the regexes right, as there is another stage of processing that the HTML goes through somewhere (lxml??), so I'm working a bit blindly. Chapters are represented very differently at the time the HTML hits the PDFTOHTML function compared to the final output. Also, for some reason adding the chapter detection regexp causes <p></p> to come out differently in the final version, with margin and border settings all set to 0 points.

Lastly, I can't seem to get the regex to terminate line endings working. This is the regex I'd like to use:
Code:
(re.compile(r'(?<!r/>|<p>|/p>|ad>|dy>|ml>)\n'), lambda match : '<br/>\n'),
It works perfectly when modifying the final output embedded in the epub, but it causes some corruption when used in PDFTOHTML which prevents conversion to epub.

I'm using this one as an alternate:
Code:
(re.compile(r'(\.|”|\")\s*\n'), lambda match : match.group(1)+'<br/>\n'),
But it only gets 75% of the unterminated lines, and I can't figure out why.
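
One thing that can be checked in isolation is which line endings each of the two patterns reacts to. The alternate pattern only fires on lines ending in a period or a double quote, so lines ending in "!", "?" or a bare word are skipped; this may account for part of the gap, though it is only a guess without seeing the actual input. A minimal Python 3 sketch with made-up sample text:

```python
import re

# Preferred pattern: add <br/> unless the line already ends in a listed tag.
# Every look-behind alternative is three characters wide, which Python accepts.
STRICT = re.compile(r'(?<!r/>|<p>|/p>|ad>|dy>|ml>)\n')

# Alternate pattern: only lines ending in ".", a curly closing quote or '"'.
LOOSE = re.compile(r'(\.|”|\")\s*\n')


def terminate_strict(text):
    return STRICT.sub('<br/>\n', text)


def terminate_loose(text):
    return LOOSE.sub(lambda match: match.group(1) + '<br/>\n', text)


sample = 'He paused.\n"Why?"\nShe left!\nDone<br/>\n'
print(terminate_strict(sample))
print(terminate_loose(sample))
```

On this sample the strict pattern terminates the first three lines, while the alternate one misses "She left!".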
All times are GMT -4.