Improving wordwrap for Calibre and new PDF engine

cian · 02-24-2010, 10:48 AM

Hi,
I've been playing around with the XML output of PDFtohtml (which is actually very useful) and ended up with a fairly good Python script that outputs properly formatted HTML for most text-based PDFs (it will even strip headers/footers if you know their row number), with good tolerance for paragraphs, headers, etc. It doesn't do images, though that's easily doable with a kludge: basically, using the HTML output of PDFtohtml. Not nice but, hey, PDFs aren't nice.

Ideally I'd like to try incorporating this into Calibre's existing PDF importer plugins, but I understand that the importer is in the process of being rewritten, in which case I'll obviously wait until that is finished.

However, is this new engine going to still be using PDFtohtml, or does it use something else? And am I wasting my time? Or is there still room for improvement :-)
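
For readers who want to try the approach cian describes, here is a minimal sketch of that kind of script. It is not cian's code. It assumes the XML comes from `pdftohtml -xml`, where each `<page>` holds `<text>` elements carrying `top` and `left` attributes; the function names and the `skip_rows` parameter are made up for illustration. It groups fragments into rows by their `top` value and drops rows by index, which is one way to strip a header or footer when you know its row number.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def page_rows(page):
    """Group a page's <text> fragments into rows keyed by their 'top' value."""
    rows = defaultdict(list)
    for text in page.iter("text"):
        top = int(text.get("top"))
        left = int(text.get("left"))
        # itertext() also picks up text nested in <b>/<i> children.
        content = "".join(text.itertext()).strip()
        if content:
            rows[top].append((left, content))
    # Rows run top to bottom; fragments within a row run left to right.
    return [" ".join(frag for _, frag in sorted(rows[top])) for top in sorted(rows)]


def extract_rows(xml_text, skip_rows=()):
    """Return one list of row strings per page, minus the rows in skip_rows.

    skip_rows holds row indices per page; negative values count from the
    bottom, so (0, -1) drops the first and last row of every page.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        rows = page_rows(page)
        n = len(rows)
        skip = {i if i >= 0 else n + i for i in skip_rows}
        pages.append([row for idx, row in enumerate(rows) if idx not in skip])
    return pages


# Hand-written sample in the assumed format, not real converter output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="40" left="100" width="200" height="16" font="0">Chapter One</text>
<text top="100" left="100" width="300" height="16" font="0">Friday was a lovely</text>
<text top="100" left="410" width="100" height="16" font="0"><i>day</i></text>
<text top="120" left="100" width="300" height="16" font="0">in Washington.</text>
<text top="1100" left="450" width="20" height="16" font="0">3</text>
</page>
</pdf2xml>"""

pages = extract_rows(SAMPLE, skip_rows=(0, -1))
```

Matching rows on an exact `top` value is the crudest possible grouping; real PDFs often need a small tolerance, and deciding which rows belong to the same paragraph is a separate step.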

kovidgoyal · 02-24-2010, 10:58 AM

It uses a custom PDF-to-XML engine based on poppler (which is what pdftohtml uses as well). It deals with line wrapping automatically and fixes various shortcomings of pdftohtml, like the lack of support for the Table of Contents and rotated images.

I'm currently too busy to work on this, so if you want to, be my guest. The code is in calibre/ebooks/pdf.

You can invoke it like this:

Code:

ebook-convert file.pdf .epub -vvvv --debug-pipeline p --new-pdf-engine

It will error out, but before it does, it will create two files in p/input: index.html and index.xml.

The XML file is generated by the new engine, and the HTML file is generated from the XML by the code in calibre.ebooks.pdf.reflow.

Currently the engine is pretty much done; the code in reflow still needs to be completed.
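
If you want a quick look at what the engine put in index.xml before reading the reflow code, something like the following may help. It is only a sketch: the thread does not describe the XML schema, so this makes no assumptions about element names and simply counts whatever it finds. `tally_elements` is a hypothetical helper, not part of calibre.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tally_elements(xml_path):
    """Count how often each element name occurs in an XML file."""
    tree = ET.parse(xml_path)
    return Counter(elem.tag for elem in tree.iter())
```

Calling `tally_elements("p/input/index.xml")` after the command above and printing `.most_common()` would list the element names by frequency, which is a reasonable starting point for seeing what the reflow code has to consume.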

cian · 02-24-2010, 03:26 PM

Sure does error out :-) But yes, that looks like something I can look at. Not sure if I'll get anything useful done soon, but what I've seen of the code makes sense to me, so maybe.

I take it the message:
"Fontconfig error: Cannot load default config file
mask requested"
is not terribly important?

kovidgoyal · 02-24-2010, 03:31 PM

Yeah, fontconfig errors can be ignored. Since the engine doesn't render the PDF, it doesn't need fontconfig.

Zerocool · 04-29-2010, 10:56 PM

I realize that the new PDF engine is still under development, but I'm having issues converting a PDF: the output has improperly wrapped lines. My guess is it's likely due to the source PDF having text blocks the same size as those in the original book.

Any suggestions on how to fix this?

Here's an example of some generated output:

Code:

Friday, March 27, 2009, was a lovely day in Washington, 
D.C.—but not for the global economy. The U.S. stock 
market had fallen 40 percent in just seven months, while 
</p>
<p>
Total world 
the U.S. economy had lost 4.1 million jobs. 
</p>
<p>
output was shrinking for the first time since World War 
3 
II. 
</p>
<p>
Despite three government bailouts, Citigroup stock was 
</p>
<p>
trading below $3 per share, about 95 percent down from 
</p>
<p>
its peak; stock in Bank of America, which had received 
</p>
<p>
two bailouts, had lost 85 percent of its value. The public 
</p>

I've also uploaded the source pdf here.

The generated output of the new pdf engine is here.

kovidgoyal · 04-29-2010, 11:01 PM

That's because the block detection algorithm in the new engine needs to be fine-tuned. Basically, it's interpreting every line as the start of a new paragraph/block.
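
Until the block detection is tuned, one workaround is to post-process output like the fragment posted above. The sketch below is a crude heuristic, not anything from calibre: it assumes the text is split only by `<p>` and `</p>` tags, and it keeps joining blocks until one ends in sentence-final punctuation. `split_blocks` and `merge_blocks` are illustrative names.

```python
import re

# A block "ends a sentence" if it finishes with . ! or ?, optionally
# followed by closing quotes or a closing parenthesis.
SENTENCE_END = re.compile(r"[.!?][)\"'\u201d\u2019]*$")


def split_blocks(html_fragment):
    """Split a fragment on <p> and </p> tags (the only markup assumed here)."""
    return re.split(r"</?p>", html_fragment)


def merge_blocks(blocks):
    """Join consecutive blocks until one ends with sentence-final punctuation."""
    merged, current = [], []
    for block in blocks:
        text = " ".join(block.split())  # collapse newlines and runs of spaces
        if not text:
            continue
        current.append(text)
        if SENTENCE_END.search(text):
            merged.append(" ".join(current))
            current = []
    if current:  # leftover text that never reached closing punctuation
        merged.append(" ".join(current))
    return merged


fragment = (
    "The market had fallen 40 percent, while \n</p>\n<p>\n"
    "the economy had lost jobs. \n</p>\n<p>\n"
    "Despite three bailouts, stock was \n</p>\n<p>\n"
    "trading below $3 per share. \n</p>\n"
)
paragraphs = merge_blocks(split_blocks(fragment))
```

This will still break a paragraph wherever a sentence happens to end at the end of a block, and it cannot repair text that arrives out of order or stray footnote markers, both of which appear in the posted sample.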

Zerocool · 04-30-2010, 01:27 AM

I take it there's no way to correct the block detection (in either the old or new PDF engine)? Any recommendations on the best way to convert this output to EPUB?

itimpi · 04-30-2010, 01:43 AM

You can normally fix this sort of issue by playing with the Line Wrap factor under the PDF conversion settings.
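
To give a feel for what a setting like this does, here is a rough sketch of a length-based unwrapping heuristic. It is an assumption about the general idea, not calibre's implementation: the factor picks a threshold from the sorted line lengths, lines at least that long are treated as soft-wrapped and joined to the next line, and shorter lines end a paragraph. The `unwrap` function and its default of 0.4 are illustrative.

```python
def unwrap(lines, factor=0.4):
    """Join lines that look soft-wrapped into paragraphs.

    The threshold is the length found `factor` of the way through the
    sorted line lengths. Lines shorter than the threshold end a paragraph.
    """
    stripped = [ln.strip() for ln in lines if ln.strip()]
    if not stripped:
        return []
    lengths = sorted(len(ln) for ln in stripped)
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    threshold = lengths[index]

    paragraphs, current = [], []
    for ln in stripped:
        current.append(ln)
        if len(ln) < threshold:  # short line: treat as the end of a paragraph
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


lines = [
    "aaaaaaaaaa bbbbbbbbbb",
    "cccccccccc dddddddddd",
    "eeee.",
    "ffffffffff gggggggggg",
    "hhhhhhhhhh iiiiiiiiii",
    "jj.",
]
result = unwrap(lines)
```

A heuristic of this kind only sees line lengths, so it works when paragraph-final lines are noticeably shorter than the rest and has little to go on when the converter has already split the text into separate blocks.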

Zerocool · 05-07-2010, 10:43 PM

I've tried the full range of settings for Line Wrap Factor, and while it does wrap some paragraphs properly, the majority are still cut off.
