Quote:
Originally Posted by roger64
@willus
Thanks for your reply. I still have to learn how to use k2pdfopt properly and shall study your example.
I shall look for a better viewer on Linux... Sumatra works well with Wine.
As far as Tesseract is concerned, I get consistently better OCR results when the file is first processed with Scan Tailor (which does not work with PDF). Tesseract is a small piece of software (about 1/30 the size of ABBYY FineReader) which needs to be complemented with pre- and post-processing to optimize its results.
pre-processing: I have noticed, for example, that straightening the pages, selecting black and white mode and darkening a little with Scan Tailor very often improves the result (of course it depends on the quality of the scan).
post-processing: many "obvious" mistakes can be corrected, for example when only one letter is missing. But Tesseract does not do post-analysis. True, this also opens the door to some false positives.
Just so you know, you can do all of those pre-processing steps directly in k2pdfopt. The -cmax option adjusts contrast, the -as option auto-straightens (de-skews) the pages, the -g option adjusts the gamma, which can be used to darken the text, and the -bpc option selects bits-per-color. You can set this to 2 for black and white.
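
To make that concrete, here is a rough Python sketch that only assembles a k2pdfopt command line with those four options. The helper name build_k2pdfopt_command, the file name scan.pdf, and the -cmax and -g values are placeholders I made up for illustration, not recommended settings; -bpc 2 is simply the value mentioned above. I am also assuming the options can go before the input file, so check the k2pdfopt usage text for the valid ranges before relying on it.

```python
import shlex


def build_k2pdfopt_command(input_pdf, cmax=2.0, gamma=0.7, bpc=2):
    """Assemble (but do not run) a k2pdfopt argument list.

    The numeric defaults are illustrative placeholders, not tuned values.
    """
    return [
        "k2pdfopt",
        "-as",                # auto-straighten / de-skew
        "-cmax", str(cmax),   # contrast adjustment (placeholder value)
        "-g", str(gamma),     # gamma, used here to darken the text (placeholder value)
        "-bpc", str(bpc),     # bits per color, 2 as suggested in the post
        input_pdf,
    ]


# Print the command so it can be inspected or pasted into a terminal.
print(shlex.join(build_k2pdfopt_command("scan.pdf")))
```

The function returns a plain list, so it can be handed to subprocess.run once the values look right for your scans.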