MobileRead Forums - View Single Post - Improving wordwrap for Calibre and new PDF engine

cian · 02-24-2010, 11:48 AM

Hi,
I've been playing around with the XML output of PDFtohtml (which is actually very useful) and ended up with a fairly good Python script that outputs properly formatted HTML for most text-based PDFs, with good tolerance for paragraphs, headers, etc. It will even strip page headers/footers if you know their row number. It doesn't do images, though that's easily doable with a kludge: basically, by using the HTML output of PDFtohtml. Not nice but, hey, PDFs aren't nice.
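
To give a rough idea of the approach, here is a minimal sketch, not the actual script. It assumes the XML contains `<page>` elements holding `<text>` elements with integer `top` and `height` attributes, and it treats each `<text>` element as one "row". The function names, the `header_rows`/`footer_rows` parameters and the `gap_factor` threshold are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def page_rows(page):
    """Return (top, height, text) for each non-empty <text> row on a page."""
    rows = []
    for el in page.iter("text"):
        # itertext() also picks up text inside inline tags such as <b> or <i>.
        text = "".join(el.itertext()).strip()
        if text:
            rows.append((int(el.get("top")), int(el.get("height")), text))
    rows.sort(key=lambda row: row[0])  # top-to-bottom; stable for equal tops
    return rows


def extract_paragraphs(xml_source, header_rows=0, footer_rows=0, gap_factor=1.5):
    """Merge wrapped lines into paragraphs.

    Assumption: a vertical step larger than gap_factor * line height
    means a new paragraph. A paragraph may continue across a page break.
    """
    root = ET.fromstring(xml_source)
    paragraphs, current = [], []
    for page in root.iter("page"):
        rows = page_rows(page)
        # Drop the first/last N rows of every page (headers and footers).
        rows = rows[header_rows:len(rows) - footer_rows]
        prev_top = None
        for top, height, text in rows:
            if current and prev_top is not None and top - prev_top > gap_factor * height:
                paragraphs.append(" ".join(current))
                current = []
            current.append(text)
            prev_top = top
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def to_html(paragraphs):
    """Wrap each paragraph in <p> tags, escaping special characters."""
    return "\n".join("<p>%s</p>" % escape(p) for p in paragraphs)
```

Usage would be something like `to_html(extract_paragraphs(xml_string, header_rows=1, footer_rows=1))`. A real script needs more than this (hyphenated line ends, indentation, font sizes for headings), but the line-gap test is the core of the rewrapping.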

Ideally I'd like to try incorporating this into Calibre's existing PDF importer plugins, but I understand that the PDF importer is in the process of being rewritten, in which case I'll obviously wait until that is finished.

However, is the new engine still going to use PDFtohtml, or will it use something else? Am I wasting my time, or is there still room for improvement? :-)
