06-12-2020, 06:23 AM   #13
willus
Fuzzball, the purple cat
Quote:
Originally Posted by roger64
@willus

Thanks for your reply. I still have to learn how to use k2pdfopt properly and shall study your example.

I shall look for a better viewer on Linux... Sumatra works well with Wine.

As far as Tesseract is concerned, I get consistently better OCR results when the file is first processed with Scan Tailor (which does not work with PDF). Tesseract is a small piece of software (about 1/30 the size of ABBYY FineReader) which needs to be complemented with pre- and post-processing to optimize its results.

pre-processing: I noticed, for example, that straightening the files, selecting black and white mode, and darkening a little with Scan Tailor very often improves the result (of course, it depends on the quality of the scan).

post-processing: many "obvious" mistakes can be corrected, for example when only one letter is missing. But Tesseract does not do post-analysis. True, this also opens the door to some false positives.
Just so you know, you can do all of those pre-processing steps directly in k2pdfopt. The -cmax option adjusts contrast, the -as option will auto-straighten / de-skew, the -g option will adjust the gamma factor, which can be used to darken the text, and the -bpc option selects bits-per-color. You can set this to 2 for black and white.
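
If it helps, here is a rough sketch of how those four options might be combined into one command line, written as a small Python helper so the pieces are easy to tweak. The helper name build_k2pdfopt_args, the file name scan.pdf, and the numeric values for -cmax and -g are only placeholders I picked for illustration, not recommended settings; check the k2pdfopt usage text for the exact meaning and range of each value before relying on them.

```python
import shlex


def build_k2pdfopt_args(src, cmax=2.0, gamma=0.5, bpc=2, straighten=True):
    """Build an argument list using the four options named in the post.

    The default values are illustrative assumptions, not tuned settings.
    """
    args = ["k2pdfopt", src]
    if straighten:
        args.append("-as")              # auto-straighten / de-skew
    args += ["-cmax", str(cmax)]        # contrast adjustment
    args += ["-g", str(gamma)]          # gamma factor (used to darken text)
    args += ["-bpc", str(bpc)]          # bits per color
    return args


if __name__ == "__main__":
    # Print the command so it can be reviewed or pasted into a terminal.
    print(shlex.join(build_k2pdfopt_args("scan.pdf")))
```

Printing the command instead of running it keeps the sketch safe to try; once the values look right for your scans, the same list can be handed to subprocess.run.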