Disclaimer: I use Tesseract myself [on a Linux Mint computer] for the occasional OCR of a book that I have as a PDF and want to read on my e-ink reader.
Quote:
Originally Posted by Sarmat89
It does diacritics?
|
Yes, it does. You need to tell it what the language is.
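To illustrate what "telling it the language" looks like, here is a minimal sketch in Python that just builds the command line. The `-l` option and the `+` syntax for combining languages are Tesseract's own. The function name, the file names, and the `ces` language code are only my illustration, and the matching language data has to be installed for it to work.

```python
def tesseract_command(image_path, output_base, languages):
    """Build the argument list for one Tesseract run.

    Several languages can be combined with '+', e.g. 'ces+eng'.
    Tesseract writes its text to <output_base>.txt.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Hypothetical file names, Czech text with some English mixed in:
# tesseract_command("page.png", "page", ["ces", "eng"])
# -> ['tesseract', 'page.png', 'page', '-l', 'ces+eng']
```

The resulting list can be handed to `subprocess.run(..., check=True)` to actually run the OCR.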
Quote:
Originally Posted by Sarmat89
It does italics?
|
It recognizes the text, but it does not preserve italic (or bold) formatting in the output. This is the biggest shortcoming, IMHO.
Quote:
Originally Posted by Sarmat89
It strips headers/footers?
|
No. I use pdfscissors to pre-format [cut] the pdf for OCR.
Then I use regular expressions on the finished text to do some cleanup, including getting rid of page breaks and any headers or footers that pdfscissors could not remove.
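The kind of regex cleanup described above could look roughly like this in Python. This is only a sketch under assumptions: that page breaks show up as form feed characters (which, as far as I know, recent Tesseract versions put at the end of each page), that footers are lines containing nothing but a page number, and that the running header can be described by a regex you pass in. The function and parameter names are made up for the example.

```python
import re


def clean_ocr_text(text, header_regex=None):
    """Remove page breaks, bare page-number lines, and an optional running header."""
    # Assumption: page breaks are form feeds; turn them into plain newlines.
    text = text.replace("\f", "\n")
    # Assumption: a footer is a line holding only a 1-4 digit page number.
    text = re.sub(r"(?m)^[ \t]*\d{1,4}[ \t]*\n", "", text)
    # Drop every line that consists solely of the running header, if given.
    if header_regex:
        text = re.sub(rf"(?m)^[ \t]*(?:{header_regex})[ \t]*\n", "", text)
    # Collapse the runs of blank lines left behind by the removals.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

Note that the page-number rule also deletes any genuine line that contains only a short number, so it is worth checking the result on a few pages before trusting it on a whole book.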
Quote:
Originally Posted by Sarmat89
It recognizes custom words?
|
Haven't tried that yet.
I wrote (stole most of the code from Stack Overflow and similar sites) a bash script that uses an ImageMagick command to create a bitmap from each PDF page and then runs the bitmap through Tesseract. The image is saved to a ramdisk, so I do not cause unnecessary wear to my SSD.
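For anyone who wants to try the same idea, here is a rough sketch of such a pipeline in Python rather than bash. It is not the actual script. The assumptions are all mine: `/dev/shm` as the ramdisk, PNG as the bitmap format, 300 dpi, ImageMagick available as `convert` (newer versions call it `magick`), and a page count supplied by the caller. The function names are hypothetical.

```python
import os
import subprocess


def page_commands(pdf_path, page_index, ramdisk_dir="/dev/shm", dpi=300, lang="eng"):
    """Return (convert_cmd, ocr_cmd, text_path) for one zero-based PDF page."""
    stem = f"{ramdisk_dir}/page-{page_index:04d}"
    image_path = stem + ".png"
    # ImageMagick: 'file.pdf[N]' selects page N; -density sets the render dpi.
    convert_cmd = ["convert", "-density", str(dpi), f"{pdf_path}[{page_index}]", image_path]
    # Tesseract: writes its recognized text to <stem>.txt.
    ocr_cmd = ["tesseract", image_path, stem, "-l", lang]
    return convert_cmd, ocr_cmd, stem + ".txt"


def ocr_pdf(pdf_path, page_count, **options):
    """Render and OCR each page in turn, keeping temporary files on the ramdisk."""
    pages = []
    for index in range(page_count):
        convert_cmd, ocr_cmd, text_path = page_commands(pdf_path, index, **options)
        subprocess.run(convert_cmd, check=True)
        subprocess.run(ocr_cmd, check=True)
        with open(text_path, encoding="utf-8") as handle:
            pages.append(handle.read())
        # Clean up so the ramdisk never holds more than one page at a time.
        os.remove(convert_cmd[-1])
        os.remove(text_path)
    return "".join(pages)
```

Calling `ocr_pdf("book.pdf", page_count=250, lang="ces")` would then return the text of the whole book as one string, with only one page image sitting in RAM at any moment.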
It is not as nice, neat, or interactive a solution as FineReader and similar software such as Recognita or Readiris (I used all of them on Windows at work), but it is good enough for my needs at home. I would not be willing to fork over money for FineReader for my very limited use, and this way I do not need to use pirated software.