Request/Idea: Approach to converting complex documents like PDFs
I am often in need of converting complex documents to EPUB. They are so heavily formatted that no amount of automation is going to give acceptable results. The only acceptable output in these cases is to convert each page into an image. I have been using Dongsoft PDF to EPUB Converter (yes, I said Dongsoft) because they will do the image conversion, retain the table of contents of the PDF, and convert to a fixed layout EPUB in one process.
I can do roughly the same thing in Calibre, but I have to convert the pages to images, zip the folder, rename it to a CBZ, convert that to an EPUB, and then build the table of contents by hand (although I'm open to suggestions from anyone who knows a simpler way).
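For anyone repeating the manual route, the zip-and-rename step can be scripted. This is only a minimal sketch: it assumes the page images have already been exported to one folder with zero-padded names (page_001.png, page_002.png, ...) so that filename order equals page order. The function name build_cbz and the list of image extensions are my own choices for illustration, not anything Calibre provides.

```python
import zipfile
from pathlib import Path

# Extensions treated as page images (an assumption; adjust to your export format).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}

def build_cbz(image_dir, cbz_path):
    """Pack the page images found in image_dir into a CBZ archive.

    Pages are added in filename order, so names should be zero-padded.
    Returns the list of page names that were written.
    """
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # A CBZ is a plain zip file; the images are already compressed,
    # so store them without recompressing.
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return [page.name for page in pages]
```

Calling build_cbz("pages", "book.cbz") should give a file that can be added to Calibre or passed to its ebook-convert command-line tool. The table of contents still has to be rebuilt separately.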
OK, so here is the idea:
Converting to an image is the most reliable method of retaining the formatting of the source document. The only problem with this (other than the file size) is that you can no longer search or highlight. One way around this would be to overlay a layer of transparent text. With a fixed layout EPUB you could replicate the layout of the original PDF fairly precisely. Do you think a similar feature could be added to Calibre?
If a pixel-perfect text overlay were possible, you could go a step further: make the text of the PDF transparent before capturing the image of each page, then overlay normal, opaque, appropriately colored text in the EPUB. I realize that even with font-scaling options this might be unlikely to work.
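To make the transparent-overlay idea concrete, here is a rough sketch of what one fixed-layout page might look like: the page image fills the page and each piece of text is absolutely positioned on top of it with a transparent color. The function name overlay_page and the (text, x, y, font size) input format are hypothetical. The positions would have to come from the PDF's text layer or from OCR, which is not shown, and I have not checked how well individual readers select or search invisible text.

```python
from html import escape

def overlay_page(image_href, width, height, words):
    """Return XHTML for one fixed-layout page: image plus invisible text.

    words is a list of (text, x, y, font_size) tuples, in pixels measured
    from the top-left corner of the page image (a hypothetical format).
    """
    spans = "\n".join(
        f'    <span style="left:{x}px; top:{y}px; font-size:{size}px">{escape(text)}</span>'
        for text, x, y, size in words
    )
    return f"""<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Page</title>
  <meta name="viewport" content="width={width}, height={height}"/>
  <style>
    body {{ margin: 0; }}
    .page {{ position: relative; width: {width}px; height: {height}px; }}
    .page img {{ position: absolute; left: 0; top: 0; width: 100%; height: 100%; }}
    .page span {{ position: absolute; white-space: pre; color: transparent; }}
  </style>
</head>
<body>
  <div class="page">
    <img src="{escape(image_href, quote=True)}" alt=""/>
{spans}
  </div>
</body>
</html>
"""
```

For example, overlay_page("page_001.png", 1200, 1600, [("Chapter 1", 100, 200, 24)]) returns a single page document. A real EPUB would also need the package file to declare the book as pre-paginated. The opaque-text variant would amount to dropping the transparent color and using the original text colors instead.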
Anyway, this would solve a lot of problems when converting documents with advanced formatting. It could also open up some additional options for comics, if good OCR were applied first.
Is this something that could be pursued?