Old 03-12-2018, 10:10 PM   #1528
willus
Fuzzball, the purple cat
Quote:
Originally Posted by MarjaE
I've been experimenting with different OCR tools: the built-in OCR in k2pdfopt, Elucidate, and ocrmypdf.

All of these use Tesseract, but the k2pdfopt version often misses text that the other two convert.

Unfortunately, running OCR in either Elucidate or ocrmypdf and then converting in either k2pdfopt or Ghostscript often leads to an unreadable mess.

Is there any way to OCR and convert in k2pdfopt while getting the OCR quality of the other tools that use Tesseract? After setting up the tessdata folder, is it just a matter of downloading from tessdata-best instead of just tessdata?
The issue is that k2pdfopt uses its own algorithms to find words in the document and then passes only single words to Tesseract for OCR. The other two programs, I'm guessing, use Tesseract's own algorithms to find the words. At present k2pdfopt has no way to use Tesseract's word-finding algorithms, so I'd think your best bet is to do the OCR with one of the other programs first and then process the OCR'd result with k2pdfopt (which you said gives you an unreadable mess).

It would help if you could post a file that you OCR'd with Elucidate or ocrmypdf so I could try k2pdfopt on it myself. I presume that, as before, you are working with Russian (Cyrillic) documents?
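
To make the difference concrete, here is a rough Python sketch. It is not k2pdfopt's code. It only builds two command lines for the standalone Tesseract 4+ program: one that lets Tesseract segment a whole page itself, and one that tells it the image holds a single word. Mapping k2pdfopt's behavior onto the single-word mode is my assumption for illustration, and the file names and the `rus` language code are placeholders.

```python
# Tesseract page segmentation modes, as listed by `tesseract --help-psm`.
PSM_AUTO = 3         # fully automatic page segmentation (Tesseract's default)
PSM_SINGLE_WORD = 8  # treat the whole image as a single word


def tesseract_command(image_path, lang="rus", psm=PSM_AUTO):
    """Build a Tesseract 4+ command that prints the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


# Hypothetical inputs: a full page image, and one word cropped out of that
# page by some other program's own word finder.
whole_page = tesseract_command("page.png")
one_word = tesseract_command("word_crop.png", psm=PSM_SINGLE_WORD)

# Only print the commands; nothing is executed here.
print(" ".join(whole_page))
print(" ".join(one_word))
```

In the second case Tesseract never sees the page layout, so any word the cropping step fails to find is never sent for recognition at all, which would be consistent with the missing text described in the quote.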

Edit: Please run Elucidate or ocrmypdf on the attached document and post the resulting PDF.
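
For the ocrmypdf route, something like the following Python sketch should do. It assumes ocrmypdf is installed and on your PATH along with the Russian Tesseract language data. The `rus` language code and the output name `cyrillic2_ocr.pdf` are my placeholders; adjust them to your setup.

```python
import shutil
import subprocess


def ocrmypdf_command(input_pdf, output_pdf, lang="rus"):
    """Build the ocrmypdf command line: ocrmypdf -l LANG input.pdf output.pdf"""
    return ["ocrmypdf", "-l", lang, input_pdf, output_pdf]


def run_ocr(input_pdf, output_pdf, lang="rus"):
    """Run ocrmypdf, failing early if the program cannot be found."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(ocrmypdf_command(input_pdf, output_pdf, lang), check=True)


# Show the command without running it; call run_ocr(...) to actually OCR.
print(" ".join(ocrmypdf_command("cyrillic2.pdf", "cyrillic2_ocr.pdf")))
```

Calling `run_ocr("cyrillic2.pdf", "cyrillic2_ocr.pdf")` would then produce the PDF to post here.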
Attached Files
File Type: pdf cyrillic2.pdf (3.08 MB, 192 views)
File Type: pdf cyrillic2_corrected_page2.pdf (3.08 MB, 234 views)

Last edited by willus; 03-17-2018 at 12:26 PM. Reason: Added corrected attachment