MobileRead Forums - View Single Post

roger64 · 06-11-2020, 11:43 AM

@willus

Thanks for your reply. I still have to learn how to use k2pdfopt properly and shall study your example.

I shall look for a better viewer on Linux... Sumatra works well with Wine.

As far as Tesseract is concerned, I get consistently better OCR results when the file is first processed with Scan Tailor (which does not work with PDF files). Tesseract is a small piece of software (about 1/30 the size of ABBYY FineReader) which needs to be complemented with pre- and post-processing to optimize its results.

pre-processing: I noticed, for example, that straightening the pages, selecting black and white mode and darkening a little with Scan Tailor very often improves the result (of course, it depends on the quality of the scan).
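To make the "black and white plus a little darkening" step concrete, here is a minimal pure-Python sketch. It is not how Scan Tailor works internally; the function name `darken_and_binarize`, the `darken` factor and the fixed `threshold` of 128 are assumptions chosen for illustration, and the image is just a grid of grayscale values (0 = black, 255 = white).

```python
def darken_and_binarize(pixels, darken=0.9, threshold=128):
    """Darken a grayscale image slightly, then convert it to pure black and white.

    pixels    -- list of rows, each a list of ints in 0..255 (hypothetical input format)
    darken    -- factor below 1.0 makes every pixel darker before thresholding
    threshold -- darkened values below this become black (0), the rest white (255)
    """
    result = []
    for row in pixels:
        new_row = []
        for value in row:
            darkened = int(round(value * darken))
            new_row.append(0 if darkened < threshold else 255)
        result.append(new_row)
    return result


# A faint stroke (gray level 140) stays white without darkening
# but becomes black once the page is darkened a little.
page = [[200, 140, 100],
        [130, 255, 0]]
plain = darken_and_binarize(page, darken=1.0)
darker = darken_and_binarize(page, darken=0.9)
```

The point of the sketch is only that a small darkening factor moves faint strokes to the black side of the threshold, which is one plausible reason thin or washed-out letters are recognized better afterwards.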

post-processing: many "obvious" mistakes could be corrected, for example when only one letter is missing from a word, but Tesseract does not do this kind of post-analysis. True, such corrections also open the door to some false positives.
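The one-missing-letter case can be sketched as a small dictionary lookup. This is only an illustration under stated assumptions, not something Tesseract provides: the function `restore_missing_letter` and the word list passed as `vocabulary` are hypothetical, and only lowercase ASCII letters are tried.

```python
import string


def restore_missing_letter(word, vocabulary):
    """Return the single vocabulary word reachable by inserting one letter.

    If the word is already known, or if zero or several candidates match,
    the word is returned unchanged to limit wrong corrections.
    """
    if word in vocabulary:
        return word
    candidates = set()
    # Try every lowercase letter at every position, including both ends.
    for i in range(len(word) + 1):
        for letter in string.ascii_lowercase:
            candidate = word[:i] + letter + word[i:]
            if candidate in vocabulary:
                candidates.add(candidate)
    if len(candidates) == 1:
        return candidates.pop()
    return word


vocabulary = {"tesseract", "result", "scan"}
fixed = [restore_missing_letter(w, vocabulary) for w in ["tesseact", "rsult", "scan"]]
```

The false-positive risk shows up as soon as the word list is incomplete: a correct but unlisted word such as "bard" would be turned into "board" if only "board" is in the vocabulary. Refusing to correct when several candidates match reduces this, but does not remove it.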

Last edited by roger64; 06-11-2020 at 09:08 PM. Reason: optimize