07-28-2014, 12:00 PM   #13
Tex2002ans
 
Quote:
Originally Posted by u238110
Tex2002ans, what would you say is the best guide on how to handle these complicated elements? For EPUB...
Also toss images + tables onto that "complex list" too.

Finereader does a pretty decent job of separating images from the text, and it is pretty dang good at figuring out tables. (Let me tell you, doing tables manually will make you want to kill yourself.)
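To see why manual tables are so painful, here is a minimal Python sketch (the function name and sample data are invented for illustration, this is not any OCR program's output) that turns already-extracted cell data into the XHTML markup an EPUB table needs. Every one of those tags is what you would otherwise be typing cell by cell:

```python
# Minimal sketch: turn already-extracted table cells into the XHTML
# markup an EPUB needs. Typing this out by hand, cell by cell, is the
# tedium that a good OCR program spares you.
from html import escape

def rows_to_xhtml_table(header, rows):
    """Build a simple <table> string from a header row and data rows."""
    parts = ["<table>"]
    parts.append("<thead><tr>"
                 + "".join(f"<th>{escape(h)}</th>" for h in header)
                 + "</tr></thead>")
    parts.append("<tbody>")
    for row in rows:
        parts.append("<tr>"
                     + "".join(f"<td>{escape(c)}</td>" for c in row)
                     + "</tr>")
    parts.append("</tbody></table>")
    return "\n".join(parts)

table = rows_to_xhtml_table(["Year", "Title"], [["2012", "Example & Co."]])
```

Note the `escape()` call: OCR output routinely contains `&` and `<`, which must be entity-encoded or the EPUB will not validate.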

Here is a list of a bunch of different OCR programs: https://en.wikipedia.org/wiki/Compar...ition_software

There isn't really a "guide"; in my experience, the free OCR tools (Tesseract, FreeOCR, etc.) simply do not recognize a lot of that "complex" formatting as accurately as something like Finereader.

And it is exactly as Toxaris stated:

Quote:
Originally Posted by Toxaris
By checking. Some of them are handled correctly by the OCR, but not all. That is one of the reasons that I created my add-on to automate a lot of tasks to fix these (and a lot of other) OCR mistakes. In case of doubt, manual intervention is required.
There is just nothing you can do besides manual checking/fixing. PDF was designed as a final/output format, and is dreadful as an INPUT format.

Also, another disadvantage of the free stuff: you are most likely going to have to do A LOT of your own training. For example, here is the training manual for Tesseract:

https://code.google.com/p/tesseract-...ningTesseract3

While the default training included with the program probably works perfectly fine for basic things like novels and cleaner scans, you will probably need to do more training if the book has older/more obscure fonts, or if you are dealing with non-English languages. (Even a lot of "English" books contain plenty of accented characters and letters outside the usual A-Z subset.)
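One quick way to gauge how much "non-basic" material a book contains before deciding whether extra training is worth it: count the alphabetic characters that fall outside plain A-Z. This helper is hypothetical, not part of any OCR tool:

```python
from collections import Counter

# Count alphabetic characters outside plain A-Z/a-z: a rough gauge of how
# many accented letters the default training data would have to handle.
def unusual_letters(text):
    return Counter(ch for ch in text
                   if ch.isalpha() and not ("a" <= ch.lower() <= "z"))

counts = unusual_letters("café naïve résumé")
```

Run it on a sample chapter of raw OCR output; a long tail of accented letters is a hint that the default English training may stumble.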

In Finereader, you are also paying for the massive amount of training that THEY have already done for you (on the millions and millions of documents they process). This, again, will lead to more accurate results than you would get otherwise.

Remember, the more accurate the OCR is, the less time you have to spend actually cleaning up the wrong output.

So with free, sure, it might cost you $0 up front, but then you spend many more hours double-checking/cleaning up the output.

Edit: Actually, now that I reread u238110's post, he MAY have meant how I handle coding those things in actual EPUB.

I explained Tables/Footnotes/Formulas/Figures/Images towards the bottom of this post (with links to the specific topics/real-life examples):

https://www.mobileread.com/forums/sho...68&postcount=8

Headers/Footers can just be trashed. Finereader does a good job of recognizing them in the document and lets you easily export without them included (again, this is an area where the free stuff might fall short, and you would have to spend time manually removing them).
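If you do end up stripping them yourself, one scripted approach (a sketch of the general idea, not Finereader's actual method) is to drop any line that repeats across most pages, plus bare page numbers:

```python
from collections import Counter

# Sketch: remove running headers/footers from per-page OCR text by dropping
# lines that repeat on more than `threshold` of the pages, plus lines that
# are nothing but a page number.
def strip_repeated_lines(pages, threshold=0.6):
    counts = Counter()
    for page in pages:
        # Count each distinct (stripped) line once per page.
        for line in {ln.strip() for ln in page.splitlines()}:
            counts[line] += 1
    cutoff = threshold * len(pages)
    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines()
                if counts[ln.strip()] <= cutoff and not ln.strip().isdigit()]
        cleaned.append("\n".join(kept))
    return cleaned

pages = ["MY BOOK TITLE\nChapter text one\n12",
         "MY BOOK TITLE\nChapter text two\n13"]
cleaned = strip_repeated_lines(pages)
```

The threshold matters: headers sometimes alternate (book title on verso, chapter title on recto), so each variant may only appear on half the pages.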