MobileRead Forums - Pandoc and Tesseract to keep images and TOC

Hitch · 11-24-2018, 11:34 AM

Quote:

Originally Posted by j.p.s

Hi gg4u,

In addition to marking the title and chapter heading with "=" characters, it is necessary to insert references to images yourself. asciidoc is best for quickly making a nice finished document in multiple formats starting from nothing. It was my thought that it could help with a part of your process.

Discussions of all kinds of subjects on mobileread can be very contentious, but across all the various forums on mobileread there is widespread agreement that conversion from PDF to any other format has all kinds of problems and that there is no good way to automate it.

(Bold emphasis added)

And that's the bottom line. There is, quite simply, NO GOOD WAY to automate conversion from PDF.

We do this professionally. Let me tell you what we do, after hundreds of experiments and thousands of books:

We scan the PDF using ABBYY FineReader.
We run OCR.
We clean the resulting Word file generated by ABBYY, using the red warning indicators as a guide.
We export a PDF from the cleaned Word file.
We run a compare against the original PDF.
We fix any differences between the two, in the Word file.
We then do a 2nd export to PDF, and lather-rinse-repeat with the PDF compare.
When we have a "perfect" pair of PDFs, we stop with the OCR/scan work.
We then clean the Word file as we would a typical source Word file.
At that point, we're at the exact same spot we would have been if a client had walked in the door with a Word file to begin with.
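
For anyone who wants to see what the "compare" step amounts to, here is a rough sketch in Python. It is not the tool we use (the steps above don't depend on any particular compare tool), just an illustration of the idea. It assumes you have already pulled the text out of both PDFs, one string per page, by whatever means you like; the function names compare_pages and normalize are made up for this example. It only looks at text, so it won't catch layout or image differences.

```python
import difflib


def normalize(text):
    """Collapse runs of whitespace and drop blank lines, so that
    layout-only changes are not reported as differences."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


def compare_pages(original_pages, exported_pages):
    """Return {page_number: [diff lines]} for every page whose text differs.

    Both arguments are lists of strings, one string per page (assumed to be
    extracted from the PDFs beforehand). Page numbers start at 1.
    """
    report = {}
    page_count = max(len(original_pages), len(exported_pages))
    for i in range(page_count):
        a = normalize(original_pages[i]) if i < len(original_pages) else []
        b = normalize(exported_pages[i]) if i < len(exported_pages) else []
        # Keep only removed ("- ") and added ("+ ") lines; skip unchanged
        # lines and the "? " hint lines that ndiff also emits.
        diff = [d for d in difflib.ndiff(a, b) if d[:2] in ("- ", "+ ")]
        if diff:
            report[i + 1] = diff
    return report


# Made-up sample text: page 2 of the export has an OCR error ("ram" for "rain").
original = ["Chapter 1\nIt was a dark and stormy night.", "The rain fell in torrents."]
exported = ["Chapter 1\nIt was  a dark and stormy night.", "The ram fell in torrents."]

report = compare_pages(original, exported)
for page, lines in report.items():
    print(f"page {page}:")
    for line in lines:
        print("  " + line)
```

An empty report is the "perfect pair" condition for the text. Anything else goes back to the Word file for another round, and a human still has to check the pages visually.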

That's what we do. We've tried EVERY possible automated process, from those suggested by others, to some we've devised and created ourselves. This is the fastest, most accurate way we've found. I wish it weren't so, but this is the bottom line.

FWIW. I know it's not what you wanted to hear, but...there it is.

Hitch