#1 |
Enthusiast
Posts: 46
Karma: 602
Join Date: Oct 2009
Location: Hove, UK
Device: sony prs505
|
Improving wordwrap for Calibre and new PDF engine
Hi,
I've been playing around with the XML output of PDFtohtml (which is actually very useful) and ended up with a fairly good Python script that outputs properly formatted HTML for most text-based PDFs. It tolerates paragraphs, headers, etc. reasonably well, and it will even strip headers/footers if you know their row number. It doesn't do images, though that's easily doable with a kludge: basically, using the HTML output of PDFtohtml. Not nice but, hey, PDFs aren't nice.
Ideally I'd like to try incorporating this into Calibre's existing PDF importer plugins, but I understand that the importer is in the process of being rewritten, in which case I'll obviously wait until that is finished. However, is the new engine still going to use PDFtohtml, or does it use something else? And am I wasting my time, or is there still room for improvement? :-) |
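For readers who want to try the approach described in this post, here is a minimal sketch of it, not the poster's actual script. It assumes the XML layout that `pdftohtml -xml` usually emits (`<page>` elements containing `<text top=... height=...>` rows); the function names, the `skip_rows` parameter, the `gap_factor` threshold, and the sample data are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample in the style of `pdftohtml -xml` output.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="800" width="600">
<text top="20" left="50" width="200" height="12" font="0">Chapter One - Running Header</text>
<text top="100" left="50" width="500" height="12" font="0">The first paragraph starts here and</text>
<text top="114" left="50" width="480" height="12" font="0">continues on a second line.</text>
<text top="150" left="50" width="500" height="12" font="0">A second paragraph follows after a gap.</text>
<text top="780" left="290" width="20" height="12" font="0">1</text>
</page>
</pdf2xml>"""


def page_rows(page):
    """Return (top, height, text) triples for one <page>, top to bottom."""
    rows = []
    for node in page.iter("text"):
        # itertext() also collects text nested in <b>/<i> children.
        content = "".join(node.itertext()).strip()
        if content:
            rows.append((int(node.get("top")), int(node.get("height")), content))
    return sorted(rows, key=lambda r: r[0])


def to_paragraphs(xml_source, skip_rows=(), gap_factor=1.5):
    """Join rows into paragraphs; drop the given row indices on every page.

    skip_rows uses list-style indices, so (0, -1) drops the first and last
    row of each page (a typical running header and page number).
    A new paragraph starts when the vertical gap to the previous row is
    larger than gap_factor times that row's height.
    """
    paragraphs = []
    for page in ET.fromstring(xml_source).iter("page"):
        rows = page_rows(page)
        n = len(rows)
        drop = {i % n for i in skip_rows if -n <= i < n}
        rows = [r for i, r in enumerate(rows) if i not in drop]
        current, prev = [], None
        for top, height, text in rows:
            if prev is not None and top - prev[0] > gap_factor * prev[1]:
                paragraphs.append(" ".join(current))
                current = []
            current.append(text)
            prev = (top, height)
        # Simplification: a paragraph never continues across a page break.
        if current:
            paragraphs.append(" ".join(current))
    return paragraphs


def to_html(paragraphs):
    """Wrap each paragraph in <p> tags, escaping HTML special characters."""
    return "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)


print(to_html(to_paragraphs(SAMPLE_XML, skip_rows=(0, -1))))
```

The gap test compares against the row height rather than a fixed pixel value so that it scales with font size; real PDFs will likely need multi-column handling and a smarter rule at page boundaries.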
#2 |
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It uses a custom PDF-to-XML engine based on poppler (which is what pdftohtml uses as well). It deals with line wrapping automatically and fixes various shortcomings of pdftohtml, such as missing support for the Table of Contents and rotated images.
I'm currently too busy to work on this, so if you want to, be my guest. The code is in calibre/ebooks/pdf. You can invoke it like this:
Code:
ebook-convert file.pdf .epub -vvvv --debug-pipeline p --new-pdf-engine
The files of interest are index.html and index.xml. The XML file is generated by the new engine, and the HTML file is generated from the XML by the code in calibre.ebooks.pdf.reflow. Currently the engine is pretty much done; the code in reflow needs to be completed. |
#3 |
Enthusiast
Posts: 46
Karma: 602
Join Date: Oct 2009
Location: Hove, UK
Device: sony prs505
|
Sure does error out :-) But yes, that looks like something I can look at. Not sure if I'll get anything useful done soon, but what I've seen of the code makes sense to me, so maybe.
I take it the message "Fontconfig error: Cannot load default config file mask requested" is not terribly important. |
#4 |
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yeah, fontconfig errors can be ignored. Since the engine doesn't render the PDF, it doesn't need fontconfig.
|
#5 |
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2010
Device: iPad
|
I realize that the new PDF engine is still under development, but I'm having issues converting a PDF: the output has improperly wrapped lines. My guess is it's likely due to the source PDF having text blocks the same size as the original book.
Any suggestions on how to fix this? Here's an example of some generated output:
Code:
Friday, March 27, 2009, was a lovely day in Washington, D.C.—but not for the global economy. The U.S. stock market had fallen 40 percent in just seven months, while </p>
<p> Total world the U.S. economy had lost 4.1 million jobs. </p>
<p> output was shrinking for the first time since World War 3 II. </p>
<p> Despite three government bailouts, Citigroup stock was </p>
<p> trading below $3 per share, about 95 percent down from </p>
<p> its peak; stock in Bank of America, which had received </p>
<p> two bailouts, had lost 85 percent of its value. The public </p>
The generated output of the new pdf engine is here. |
#6 |
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's because the block detection algorithm in the new engine needs to be fine-tuned. Basically, it's interpreting every line as the start of a new paragraph/block.
|
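As a stopgap for output where every line has become its own paragraph, one could post-process the generated HTML. The sketch below is an illustration only and is not part of calibre: it re-joins consecutive `<p>` fragments until one ends in sentence-ending punctuation. The function names and the `TERMINATORS` list are hypothetical, and the sample is adapted from the output quoted above.

```python
import re

# Endings treated as "this fragment finishes a paragraph" (an assumption).
TERMINATORS = (".", "!", "?", '."', '!"', '?"')


def split_fragments(html):
    """Crudely split an HTML string on <p ...> and </p> tags."""
    return re.split(r"</?p\b[^>]*>", html)


def merge_fragments(fragments):
    """Join each fragment onto the previous one unless that one looks finished."""
    merged = []
    for frag in fragments:
        frag = " ".join(frag.split())  # collapse whitespace
        if not frag:
            continue
        if merged and not merged[-1].endswith(TERMINATORS):
            merged[-1] = merged[-1] + " " + frag
        else:
            merged.append(frag)
    return merged


sample = (
    "<p> Despite three government bailouts, Citigroup stock was </p> "
    "<p> trading below $3 per share, about 95 percent down from </p> "
    "<p> its peak; stock in Bank of America, which had received </p> "
    "<p> two bailouts, had lost 85 percent of its value. </p> "
    "<p> A new paragraph. </p>"
)

for para in merge_fragments(split_fragments(sample)):
    print(f"<p>{para}</p>")
```

This heuristic splits wrongly wherever a line happens to end on a full stop or an abbreviation such as "D.C.", and it cannot repair fragments that come out in the wrong order, as some do in the sample above.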
#7 |
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2010
Device: iPad
|
I take it there's no way to correct the block detection (in either the old or new PDF engine). Any recommendations on the best way to convert this output to EPUB?
|
#8 |
Wizard
Posts: 4,553
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
|
You can normally fix this sort of issue by playing with the Line Wrap factor under the PDF conversion settings.
|
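To give a feel for what a line-wrap factor can and cannot do, here is a rough sketch of a length-based unwrapping heuristic. It shows the general idea only and is not taken from calibre's code: a line counts as wrapped, and is joined to the next one, when its length is at least `factor` times the median line length. The `unwrap` function, the default of 0.5, and the sample lines are assumptions made for this example.

```python
from statistics import median


def unwrap(lines, factor=0.5):
    """Group lines into paragraphs using a length threshold.

    A line shorter than factor * median line length is taken to be the
    last line of a paragraph; longer lines are assumed to wrap onward.
    """
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []
    threshold = factor * median(len(ln) for ln in lines)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        if len(ln) < threshold:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


lines = [
    "Despite three government bailouts, Citigroup stock was",
    "trading below $3 per share, about 95 percent down from",
    "its peak.",
    "Stock in Bank of America, which had received two",
    "bailouts, had lost 85 percent of its value.",
]

for para in unwrap(lines):
    print(para)
```

With a single threshold, a paragraph whose last line is nearly full width gets merged into the next one, and a low factor merges almost everything. That may be one reason no single setting suits every paragraph in a book.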
#9 |
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2010
Device: iPad
|
I've tried the full range of settings for Line Wrap Factor, and while it does wrap some paragraphs properly, the majority are still cut off.
|