02-24-2010, 10:48 AM   #1
cian
Enthusiast
Posts: 46
Karma: 602
Join Date: Oct 2009
Location: Hove, UK
Device: sony prs505
Improving wordwrap for Calibre and new PDF engine

Hi,
I've been playing around with the XML output of PDFtohtml (which is actually very useful) and have ended up with a fairly good Python script that outputs properly formatted HTML for most text-based PDFs, with good tolerance for paragraphs, headers, etc. It will even strip headers/footers if you know their row number. It doesn't do images, though that's easily doable with a kludge: basically, using the HTML output of PDFtohtml. Not nice but, hey, PDFs aren't nice.
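
To give a rough idea of the approach, here is a minimal sketch rather than the actual script. It assumes the XML that `pdftohtml -xml` produces has `<page>` elements containing `<text>` elements with `top` and `height` attributes, and it embeds a tiny hand-written sample so it runs without a PDF. The function names, the `skip_rows` parameter (row indexes counted per page, negative from the bottom), the 1.5x line-gap threshold, and the hyphen rejoining are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hand-written sample in the shape pdftohtml -xml is assumed to emit.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="40" left="100" width="200" height="15" font="0">Book Title</text>
<text top="100" left="100" width="600" height="15" font="0">This is the first line of a para-</text>
<text top="118" left="100" width="600" height="15" font="0">graph that wraps onto a second line.</text>
<text top="160" left="100" width="400" height="15" font="0">A new paragraph starts here.</text>
<text top="1100" left="450" width="20" height="15" font="0">1</text>
</page>
</pdf2xml>"""


def page_rows(page):
    """Return (top, height, text) tuples for one page, sorted top to bottom."""
    rows = []
    for node in page.iter("text"):
        text = "".join(node.itertext()).strip()  # also picks up text inside <b>/<i>
        if text:
            rows.append((int(node.get("top")), int(node.get("height")), text))
    rows.sort(key=lambda row: row[0])  # stable sort keeps document order on ties
    return rows


def rows_to_paragraphs(rows, skip_rows=(), gap_factor=1.5):
    """Drop header/footer rows by index, then merge wrapped lines into paragraphs."""
    skip = {i % len(rows) for i in skip_rows} if rows else set()
    kept = [row for i, row in enumerate(rows) if i not in skip]

    paragraphs, current, prev_top = [], "", None
    for top, height, text in kept:
        # Heuristic: a vertical gap well beyond one line height starts a new paragraph.
        if prev_top is not None and top - prev_top > gap_factor * height:
            paragraphs.append(current)
            current = ""
        if not current:
            current = text
        elif current.endswith("-"):
            # Heuristic: treat a trailing hyphen as a word split across lines.
            current = current[:-1] + text
        else:
            current = current + " " + text
        prev_top = top
    if current:
        paragraphs.append(current)
    return paragraphs


def xml_to_paragraphs(xml_text, skip_rows=()):
    """Collect paragraphs from every page of the XML document."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for page in root.iter("page"):
        paragraphs.extend(rows_to_paragraphs(page_rows(page), skip_rows))
    return paragraphs


def to_html(paragraphs):
    """Wrap each paragraph in <p> tags, escaping HTML special characters."""
    return "\n".join("<p>" + escape(p) + "</p>" for p in paragraphs)


if __name__ == "__main__":
    # Row 0 is the running header and row -1 the page number in the sample.
    print(to_html(xml_to_paragraphs(SAMPLE_XML, skip_rows=(0, -1))))
```

A real document would need per-font handling for headings and a smarter hyphen rule (this one would also join genuinely hyphenated words), but the vertical-gap test alone already goes a long way on plain text PDFs.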

Ideally I'd like to try incorporating this into Calibre's existing PDF importer plugins, but I understand that the PDF engine is in the process of being rewritten, in which case I'll obviously wait until that is finished.

However, will this new engine still use PDFtohtml, or will it use something else? And am I wasting my time, or is there still room for improvement? :-)