MobileRead Forums - View Single Post - Can you OCR the images inside of .pdf files?

Tex2002ans · 09-13-2014, 03:34 AM

Quote:

Originally Posted by shevirsy

But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?

I already linked to a Wikipedia article showing off a comparison of many different OCR programs in Post #13 right in this topic:

https://www.mobileread.com/forums/sho...2&postcount=13

Here is the Wikipedia link again:

https://en.wikipedia.org/wiki/Compar...ition_software

Most likely the only free OCR engine of note is Tesseract, and most of the free OCR programs out there use a (most likely outdated) version of Tesseract in the backend.

I already explained many of the disadvantages of the free solutions above, although you are free to read the Tesseract documentation and do much of the training/tweaking needed yourself.

I personally would just err on the side of the paid OCR programs, ESPECIALLY when dealing with non-English works, or works with lots of accented characters. While the proprietary OCR programs are not zero dollars initially, they will save you A TON of time in all of your post-OCR processing steps (which is where you WILL spend most of your time). The more accurate/clean you can get your input, the less time you will have to spend cleaning it and getting the document into a readable state.
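To give a rough idea of what a "post-OCR processing step" looks like, here is a minimal Python sketch of one cleanup pass. The function name clean_ocr_text and the three rules in it are my own assumptions for illustration, not part of any OCR program, and a real cleanup would need many more rules plus manual proofreading.

```python
import re
import unicodedata


def clean_ocr_text(text):
    """Apply a few common cleanup rules to raw OCR output (hypothetical helper)."""
    # Rejoin words that were hyphenated across a line break: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # NFKC normalization folds typographic ligatures such as U+FB01 ("fi") into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs, which often come from column layouts.
    return re.sub(r"[ \t]+", " ", text)


if __name__ == "__main__":
    print(clean_ocr_text("exam-\nple  \ufb01le"))
```

Note that the first rule will also glue together real compounds that happen to break at the hyphen ("well-\nknown" becomes "wellknown"), which is exactly the kind of thing you end up fixing by hand. The dirtier the OCR, the more of these rules and exceptions you need.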

Besides that, you can use GIMP/Inkscape/ImageMagick to manipulate the images just fine.
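If you do want to try the free route, a minimal sketch of the idea in Python would be: clean up the page image with ImageMagick, then hand it to Tesseract. The helper names, the file names, and the 50% threshold are all assumptions on my part; the sketch also assumes the ImageMagick 6 `convert` command (ImageMagick 7 uses `magick`) and that the Tesseract language data you ask for is installed.

```python
import shutil
import subprocess


def preprocess_cmd(src, dst, tool="convert"):
    """Build an ImageMagick command that grayscales and binarizes a page image."""
    # The 50% threshold is only a starting guess; scans usually need tuning.
    return [tool, src, "-colorspace", "Gray", "-threshold", "50%", dst]


def ocr_cmd(image, out_base, lang="eng"):
    """Build a Tesseract command; Tesseract writes its text to out_base + '.txt'."""
    return ["tesseract", image, out_base, "-l", lang]


def run_pipeline(src, out_base, lang="eng"):
    """Run both steps in order, failing early if a tool is not installed."""
    cleaned = out_base + "-clean.png"
    for cmd in (preprocess_cmd(src, cleaned), ocr_cmd(cleaned, out_base, lang)):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
    return out_base + ".txt"


if __name__ == "__main__":
    # Only prints the commands, so it works without either tool installed.
    print(preprocess_cmd("page-001.png", "page-001-clean.png"))
    print(ocr_cmd("page-001-clean.png", "page-001", lang="fra"))
```

Calling run_pipeline("page-001.png", "page-001", lang="fra") would then leave the recognized text in page-001.txt. You would still have to get the page images out of the PDF first, and proofread the result afterwards.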

I prefer using all free software over proprietary whenever I can, but sadly, OCR is just one area where the free solutions don't hold a candle to the paid ones.