OK Kovid, I'd like to confirm a couple of things with you, please.
The new pdf engine:
1. Takes the PDF file and passes it to the C plugin implementation of PDF reflow. That returns XML containing the PDF's draw commands (a PDF in XML, if you will)
2. PDFDocument takes the xml and generates the html that's used as a base for conversion
3. The rest of ebook conversion takes the html into whatever other format is needed
My plan would then be to hack into PDFDocument, take the xml, do the unwrapping and header+footer detection, and end up making the html there.
Is that what you had in mind? Or did you intend the reflow plugin to, you know, reflow (i.e. unwrap) the PDF? I personally prefer pdfreflow being a PDF-to-XML-that-we-can-work-on-in-Python converter.
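To make the plan concrete, here is a rough sketch of the kind of Python pass I mean. Everything in it is an assumption for illustration: the XML layout (`<pdf>`, `<page>`, `<line>`) is made up and is not the real pdfreflow output, the function names are hypothetical and not part of PDFDocument, and both heuristics are deliberately naive (a header/footer is a first/last line repeated on most pages once digits are masked; a paragraph ends at a line that finishes with sentence punctuation).

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from html import escape

# Hypothetical XML layout, NOT the actual pdfreflow schema.
SAMPLE = """<pdf>
  <page><line>My Book</line><line>This line is wrapped</line><line>across two lines.</line><line>Page 1</line></page>
  <page><line>My Book</line><line>A paragraph that runs</line><line>Page 2</line></page>
  <page><line>My Book</line><line>over a page break.</line><line>Page 3</line></page>
</pdf>"""


def page_lines(xml_text):
    """Return one list of text lines per <page> element."""
    root = ET.fromstring(xml_text)
    return [[(ln.text or '').strip() for ln in page.iter('line')]
            for page in root.iter('page')]


def mask_digits(text):
    """Make 'Page 1' and 'Page 2' compare equal by masking numbers."""
    return re.sub(r'\d+', '#', text)


def strip_headers_footers(pages, threshold=0.6):
    """Drop a first/last line if its masked form recurs on most pages."""
    total = len(pages) or 1
    first = Counter(mask_digits(p[0]) for p in pages if p)
    last = Counter(mask_digits(p[-1]) for p in pages if p)
    headers = {t for t, c in first.items() if c / total >= threshold}
    footers = {t for t, c in last.items() if c / total >= threshold}
    cleaned = []
    for p in pages:
        if p and mask_digits(p[0]) in headers:
            p = p[1:]
        if p and mask_digits(p[-1]) in footers:
            p = p[:-1]
        cleaned.append(p)
    return cleaned


def unwrap(lines):
    """Join lines into paragraphs; a paragraph ends at sentence punctuation."""
    paragraphs, current = [], []
    for line in lines:
        if not line:
            continue
        current.append(line)
        if line.endswith(('.', '!', '?', ':')):
            paragraphs.append(' '.join(current))
            current = []
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


def xml_to_html(xml_text):
    """Strip headers/footers, unwrap across pages, emit <p> elements."""
    pages = strip_headers_footers(page_lines(xml_text))
    lines = [ln for page in pages for ln in page]
    return '\n'.join('<p>%s</p>' % escape(par) for par in unwrap(lines))


if __name__ == '__main__':
    print(xml_to_html(SAMPLE))
```

The point of doing it at this stage is that headers and footers are removed before unwrapping, so a paragraph split by a page break can be joined back together. A real version would presumably use the position and font information in the XML rather than punctuation alone.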
Did I get it right? What did you have in mind?
Thanks!