Quote:
Originally Posted by MarjaE
I've been experimenting with different OCR tools: the built-in OCR in k2pdfopt, Elucidate, and ocrmypdf.
All of these use Tesseract, but the k2pdfopt version often misses text that the others convert.
Unfortunately, running OCR in either Elucidate or ocrmypdf and then converting in either k2pdfopt or Ghostscript often leads to an unreadable mess.
Is there any way to OCR and convert in k2pdfopt while getting the OCR quality of the other Tesseract-based tools? After setting up the tessdata folder, is it just a matter of downloading from tessdata_best instead of just tessdata?
The issue is that k2pdfopt uses its own algorithms to find words in the document and then passes only single words to Tesseract for OCR. The other two programs, I'm guessing, use Tesseract's own algorithms to find the words. At present k2pdfopt has no way to use Tesseract's word-finding algorithms, so I think your best bet is still to do the OCR with one of the other programs first and then process the OCR'd result with k2pdfopt, even though you said that currently gives you an unreadable mess.
It would help if you could post a file that you OCR'd with Elucidate or ocrmypdf so I could try k2pdfopt on it myself. I presume that, as before, you are working with Russian (Cyrillic) documents?
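To make the single-word versus whole-page distinction concrete, here is a rough sketch using the stock Tesseract command line. It is only an analogy: k2pdfopt does not shell out like this, and the file names, the `tesseract_cmd` helper, and the `rus` language choice are assumptions made for illustration. The script only builds the command lines; it does not run Tesseract.

```python
import shlex

# Tesseract page segmentation modes (see `tesseract --help-psm`).
PSM_AUTO = 3         # fully automatic page segmentation, no OSD (the default)
PSM_SINGLE_WORD = 8  # treat the image as a single word


def tesseract_cmd(image, out_base, lang="rus", psm=PSM_AUTO):
    """Build (but do not run) a Tesseract CLI call as an argv list.

    Hypothetical helper for illustration; `rus` is assumed because the
    documents are presumed to be Russian.
    """
    return ["tesseract", image, out_base, "-l", lang, "--psm", str(psm)]


# Whole page: Tesseract does its own line and word finding.
page_cmd = tesseract_cmd("page.png", "page")

# Pre-cropped word images (hypothetical names): the caller has already
# done the word finding, so any word it missed is never sent to Tesseract.
word_cmds = [
    tesseract_cmd(f"word_{i:03d}.png", f"word_{i:03d}", psm=PSM_SINGLE_WORD)
    for i in range(3)
]

if __name__ == "__main__":
    print(shlex.join(page_cmd))
    for cmd in word_cmds:
        print(shlex.join(cmd))
```

In the second mode the recognition quality depends on how well the words were cropped beforehand, which would be consistent with text going missing even though the same engine is used.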
Edit: Please run Elucidate or ocrmypdf on the attached document and post the resulting PDF.