Old 02-24-2010, 10:48 AM   #1
cian
Enthusiast
Posts: 46
Karma: 602
Join Date: Oct 2009
Location: Hove, UK
Device: sony prs505
Improving wordwrap for Calibre and new PDF engine

Hi,
I've been playing around with the XML output of PDFtohtml (which is actually very useful) and ended up with a fairly good Python script that outputs properly formatted HTML for most text-based PDFs. It will even strip headers/footers if you know their row number, and it has good tolerance for paragraphs, headers, etc. It doesn't do images, though that's easily doable with a kludge: basically, using the HTML output of PDFtohtml. Not nice but, hey, PDFs aren't nice.
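
For readers following along, here is a minimal sketch of the kind of processing described above. It is not the script from this post: the sample XML, the function names, and the `gap_factor` knob are all assumptions made for illustration. It assumes the XML has `<page>` elements containing `<text>` rows with `top` and `height` attributes, starts a new paragraph when the vertical gap between rows is large, and drops rows by index to strip a known header.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample, shaped like XML with <page>/<text> rows.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="800" width="600">
<text top="20" left="50" width="200" height="12" font="0">Chapter One Header</text>
<text top="100" left="50" width="400" height="12" font="0">Friday was a lovely day in</text>
<text top="114" left="50" width="380" height="12" font="0">Washington, but not for the economy.</text>
<text top="150" left="50" width="400" height="12" font="0">Total world output was shrinking.</text>
</page>
</pdf2xml>"""


def page_paragraphs(page, skip_rows=(), gap_factor=1.5):
    """Group the <text> rows of one page into paragraphs.

    skip_rows: zero-based row indexes to drop (e.g. a running header).
    gap_factor: a vertical gap larger than gap_factor * row height
    starts a new paragraph (an illustrative tuning knob).
    """
    rows = sorted(page.iter("text"), key=lambda t: int(t.get("top")))
    paragraphs, current, prev_top = [], [], None
    for index, row in enumerate(rows):
        if index in skip_rows:
            continue
        text = "".join(row.itertext()).strip()
        if not text:
            continue
        top, height = int(row.get("top")), int(row.get("height"))
        # A large jump down the page closes the current paragraph.
        if prev_top is not None and top - prev_top > gap_factor * height:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        prev_top = top
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def xml_to_html(xml_text, skip_rows=()):
    """Convert every page of the XML into a run of <p> elements."""
    root = ET.fromstring(xml_text)
    out = []
    for page in root.iter("page"):
        for para in page_paragraphs(page, skip_rows):
            out.append("<p>%s</p>" % escape(para))
    return "\n".join(out)


if __name__ == "__main__":
    # Row 0 is the running header in the sample, so skip it.
    print(xml_to_html(SAMPLE_XML, skip_rows=(0,)))
```

A real PDF would need more than this (indentation, font changes, multi-column pages), so treat the gap test as a starting point only.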

Ideally I'd like to try incorporating this into Calibre's existing PDF importer plugins, but I understand that it is in the process of being rewritten. In which case I'll obviously wait until this is finished.

However, is this new engine going to still be using PDFtohtml, or does it use something else? And am I wasting my time? Or is there still room for improvement :-)
Old 02-24-2010, 10:58 AM   #2
kovidgoyal
creator of calibre
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
It uses a custom PDF-to-XML engine based on poppler (which is what pdftohtml uses as well). It deals with line wrapping automatically and fixes various shortcomings of pdftohtml, such as adding support for the Table of Contents and rotated images.

I'm currently too busy to work on this, so if you want to, be my guest. The code is in calibre/ebooks/pdf.

You can invoke it like this:

Code:
ebook-convert file.pdf .epub -vvvv --debug-pipeline p --new-pdf-engine
It will error out, but before it does, it will create two files in p/input: index.html and index.xml.

The XML file is generated by the new engine, and the HTML file is generated from the XML by the code in calibre.ebooks.pdf.reflow.

Currently the engine is pretty much done, but the code in reflow still needs to be completed.
Old 02-24-2010, 03:26 PM   #3
cian
Sure does error out :-) But yes, that looks like something I can look at. Not sure if I'll get anything useful done soon, but what I've seen of the code makes sense to me, so maybe.

I take it the message:
"Fontconfig error: Cannot load default config file
mask requested"
is not terribly important.
Old 02-24-2010, 03:31 PM   #4
kovidgoyal
Yeah, fontconfig errors can be ignored. Since the engine doesn't render the PDF, it doesn't need fontconfig.
Old 04-29-2010, 10:56 PM   #5
Zerocool
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2010
Device: iPad
I realize that the new PDF engine is still under development, but I'm having issues converting a PDF: the output has improperly wrapped lines. My guess is that it's due to the source PDF having text blocks the same size as the original book.

Any suggestions on how to fix this?

Here's an example of some generated output:

Code:
Friday, March 27, 2009, was a lovely day in Washington, 
D.C.—but not for the global economy. The U.S. stock 
market had fallen 40 percent in just seven months, while 
</p>
<p>
Total world 
the U.S. economy had lost 4.1 million jobs. 
</p>
<p>
output was shrinking for the first time since World War 
3 
II. 
</p>
<p>
Despite three government bailouts, Citigroup stock was 
</p>
<p>
trading below $3 per share, about 95 percent down from 
</p>
<p>
its peak; stock in Bank of America, which had received 
</p>
<p>
two bailouts, had lost 85 percent of its value. The public 
</p>
I've also uploaded the source pdf here.

The generated output of the new pdf engine is here.


Old 04-29-2010, 11:01 PM   #6
kovidgoyal
That's because the block detection algorithm in the new engine needs to be fine-tuned. Basically, it's interpreting every line as the start of a new paragraph/block.
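
Until the block detection is tuned, one possible workaround is to post-process the generated HTML. The following is a rough sketch, not part of calibre: the function names and the punctuation list are assumptions. It merges a `<p>` block into the previous one whenever the previous block does not end in sentence-closing punctuation.

```python
import re

# Characters treated as "this block really ends here" (an assumption).
TERMINATORS = (".", "!", "?", ":", '"', "\u201d")

# Shortened sample in the same shape as the output quoted earlier in the thread.
SAMPLE = """<p>
Despite three government bailouts, Citigroup stock was 
</p>
<p>
trading below $3 per share. 
</p>
<p>
The public 
</p>"""


def split_blocks(html):
    """Return the text of each <p>...</p> block with whitespace collapsed."""
    blocks = re.findall(r"<p>(.*?)</p>", html, flags=re.S)
    return [" ".join(b.split()) for b in blocks if b.strip()]


def merge_blocks(blocks):
    """Join a block onto the previous one if that one ends mid-sentence."""
    merged = []
    for block in blocks:
        if merged and not merged[-1].endswith(TERMINATORS):
            merged[-1] = merged[-1] + " " + block
        else:
            merged.append(block)
    return merged


if __name__ == "__main__":
    for para in merge_blocks(split_blocks(SAMPLE)):
        print("<p>%s</p>" % para)
```

This only repairs over-eager splitting. It will not fix lines that come out in the wrong order (such as the "Total world" fragment in the sample output), and it will wrongly merge headings that lack closing punctuation.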
Old 04-30-2010, 01:27 AM   #7
Zerocool
I take it there's no way to correct the block detection (in either the old or new PDF engine)? Any recommendations on the best way to convert this output to EPUB?
Old 04-30-2010, 01:43 AM   #8
itimpi
Wizard
Posts: 4,553
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
You can normally fix this sort of issue by playing with the Line Wrap factor under the PDF conversion settings.
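
To give a feel for what a length-based unwrap setting does, here is a small illustrative sketch. It is not calibre's implementation: the threshold rule (a fraction of the longest line), the function name, and the sample lines are all assumptions. The idea is that a line that is "long enough" and does not end a sentence is taken to be soft-wrapped and is joined to the next line.

```python
# Hypothetical sample: three wrapped lines, a short heading, one more line.
LINES = [
    "Friday, March 27, 2009, was a lovely day in Washington,",
    "D.C. but not for the global economy. The U.S. stock",
    "market had fallen 40 percent.",
    "Chapter 2",
    "Total world output was shrinking for the first time",
]


def unwrap(lines, factor=0.5):
    """Join soft-wrapped lines into paragraphs.

    factor: fraction of the longest line's length; lines at least this
    long that do not end a sentence are joined to the following line.
    """
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []
    threshold = factor * max(len(ln) for ln in lines)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        # A short line, or one that ends a sentence, closes the paragraph.
        if len(ln) < threshold or ln.endswith((".", "!", "?")):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    for factor in (0.0, 0.5, 1.0):
        print(factor, len(unwrap(LINES, factor)), "paragraphs")
```

Trying a few values of `factor` on your own text shows why the setting needs experimenting: too low and headings get glued to the following paragraph, too high and ordinary wrapped lines are left as separate paragraphs.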
Old 05-07-2010, 10:43 PM   #9
Zerocool
I've tried the full range of settings for Line Wrap Factor, and while it does wrap some paragraphs properly, the majority are still cut off.