MobileRead Forums - View Single Post - Pandoc and Tesseract to keep images and TOC

Hitch · 11-26-2018, 09:23 AM

Quote:

Originally Posted by kso

Why don't you try pdftotext? It is part of xpdf and a standard application on Linux (and probably other systems). It extracts whatever text is in the PDF and writes it to a plain text file, avoiding the OCR/proofreading steps. You can even specify a crop area by giving it the top-left coordinate plus the width and height of the area to work on.

klaus
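For readers who want to try the crop option klaus describes, here is a minimal sketch in Python. It assumes the Poppler build of pdftotext, whose man page documents `-x`, `-y`, `-W` and `-H` for the crop area and `-f`/`-l` for the page range; other builds may use different options. The helper name `pdftotext_crop_args` and the file names are hypothetical, chosen only for illustration.

```python
import subprocess


def pdftotext_crop_args(pdf_path, txt_path, x, y, width, height,
                        first_page=None, last_page=None):
    """Build a pdftotext command line that extracts text from a crop area.

    Assumes Poppler's pdftotext: -x/-y give the top-left corner of the
    crop area, -W/-H its width and height, -f/-l the page range.
    """
    args = ["pdftotext"]
    if first_page is not None:
        args += ["-f", str(first_page)]
    if last_page is not None:
        args += ["-l", str(last_page)]
    args += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    args += [pdf_path, txt_path]
    return args


def extract_cropped_text(pdf_path, txt_path, x, y, width, height, **pages):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    cmd = pdftotext_crop_args(pdf_path, txt_path, x, y, width, height, **pages)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; prints the command instead of running it.
    print(" ".join(pdftotext_crop_args("book.pdf", "book.txt", 50, 60, 500, 650,
                                       first_page=1, last_page=10)))
```

Whatever the crop settings, the output is still plain text, so italics, bold and other formatting are not carried over.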

For the very reason you mentioned: it exports plain text. I suppose if we received a simple PDF that was relatively plain text, and I didn't mind investing all the time needed to go back in and recode all the text formatting, that might be a way forward. But in our experience, and we've done quite literally thousands of PDF-to-ePUB jobs, it takes longer to proof the extracted text against the PDF, line by line, and add the text formatting back in than it does to scan/OCR the file in the first place and do the work in the order we do it.

In other words, you do not avoid the proofreading step. You actually make it longer and worse, because you have to proof line by line to find italics, bold, underscored text, blockquotes, etc. It's faster and easier to run two PDF Compare functions, which find the differences between two PDFs, than it is to manually read the source PDF against the (now reformatted) text to find and replace all the text formatting. Replacing all that formatting is long, tedious work in terms of proofing.
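To illustrate why an automated compare beats reading line by line, here is a rough sketch using Python's standard `difflib`. It is a text-level stand-in, not the PDF Compare function mentioned above: it assumes you already have text extracted from both the source and the converted file, and the helper name `changed_lines` is hypothetical.

```python
import difflib


def changed_lines(source_text, converted_text):
    """Return only the runs of lines that differ between two extractions.

    Each entry is (tag, source_lines, converted_lines), where tag is
    'replace', 'delete' or 'insert' as reported by SequenceMatcher.
    """
    src = source_text.splitlines()
    out = converted_text.splitlines()
    matcher = difflib.SequenceMatcher(a=src, b=out, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # skip everything that already matches
            changes.append((tag, src[i1:i2], out[j1:j2]))
    return changes


if __name__ == "__main__":
    source = "Call me Ishmael.\nSome years ago"
    converted = "Call me Ishmael.\nSome yars ago"
    for change in changed_lines(source, converted):
        print(change)
```

A compare like this only surfaces what differs, so the proofreader looks at a handful of flagged lines rather than every line. It sees wording only, though: formatting such as italics would show up only if the extraction encoded it in the text.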

And that assumes that it's something simple, like a novel. Once you move past novels, of course, it gets progressively worse.

As I stated in my post, we've tried pretty much every variant. We've tried "save to Word" from within Acrobat. We've tried a few of those "save your PDF to Word!" websites. We've tried many, if not all, of the "PDF2XXXX" programs or apps out there. All of them "work" to one extent or another, but the bottom line is that, for the level of accuracy we need as commercial formatters, the scanning/OCR method still works best, both in terms of time expended and quality of result.

If we only had to do one once in a while, then something like what you suggest would make sense, I suppose. But we probably have 50-100 PDF-to-ePUB/MOBI projects in production as I type this, and as I said, in our experiments, that approach has not been viable for us.

Hitch