07-07-2021, 09:21 AM   #22
kacir
Wizard
Posts: 3,463
Karma: 10684861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Disclaimer: I use Tesseract myself (on a Linux Mint computer) for the occasional OCR of a book that I have as a PDF and want to read on my e-ink reader.
Quote:
Originally Posted by Sarmat89 View Post
It does diacritics?
Yes, it does. You need to tell it what the language is.
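As a rough illustration of "telling it the language": the Tesseract command line takes the language with the `-l` option, and several languages can be combined with `+`. The sketch below only builds and runs that command from Python. The helper names and the file names are made up for the example, and it assumes the matching language data files (for example `ces` for Czech) are installed.

```python
import subprocess
from typing import List, Sequence


def tesseract_command(image_path: str, output_base: str,
                      langs: Sequence[str] = ("eng",)) -> List[str]:
    """Build the command line: tesseract IMAGE OUTPUTBASE -l LANG[+LANG...].

    Hypothetical helper for this example; it only assembles the arguments.
    """
    if not langs:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(langs)]


def run_ocr(image_path: str, output_base: str,
            langs: Sequence[str] = ("eng",)) -> None:
    """Run Tesseract; the text ends up in OUTPUTBASE.txt."""
    subprocess.run(tesseract_command(image_path, output_base, langs), check=True)


if __name__ == "__main__":
    # Example file names are placeholders, not from the original post.
    print(" ".join(tesseract_command("page.png", "page", ["ces", "eng"])))
```

Without the right language, accented characters tend to come out as their closest unaccented look-alikes, so this option matters most for books with diacritics.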
Quote:
Originally Posted by Sarmat89 View Post
It does italics?
It recognizes the text, but does not format it as italics (or bold). This is the biggest shortcoming, IMHO.
Quote:
Originally Posted by Sarmat89 View Post
It strips headers/footers?
No. I use pdfscissors to crop the PDF before OCR.
Then I use regular expressions on the finished text to do some cleanup, including getting rid of page breaks and of any headers or footers that pdfscissors could not remove.
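The kind of regex cleanup described above could look roughly like the following Python sketch. It is not the actual set of expressions used here; the function name, the header pattern, and the sample text are invented for the example. It assumes page breaks show up as form-feed characters in the OCR text and that page numbers sit alone on a line.

```python
import re


def clean_ocr_text(text: str, header_pattern: str) -> str:
    """Remove page breaks, running headers/footers and bare page numbers.

    header_pattern is a regular expression that must match a whole
    (stripped) header or footer line; it is specific to each book.
    """
    # Turn form-feed page breaks into ordinary line breaks.
    text = text.replace("\f", "\n")

    header_re = re.compile(header_pattern)
    page_number_re = re.compile(r"^\s*\d{1,4}\s*$")

    kept = []
    for line in text.splitlines():
        if header_re.fullmatch(line.strip()):
            continue  # running header or footer
        if page_number_re.match(line):
            continue  # a line holding only a page number
        kept.append(line)

    cleaned = "\n".join(kept)
    # Collapse runs of empty lines left behind by the removals.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip() + "\n"


if __name__ == "__main__":
    sample = ("My Book Title\nFirst paragraph line.\n12\n\f"
              "My Book Title\nSecond page text.\n13\n\f")
    print(clean_ocr_text(sample, r"My Book Title"))
```

Dropping every number-only line is a blunt rule and can eat real content such as table cells, so it is worth checking the result against the original for books with a lot of numeric data.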
Quote:
Originally Posted by Sarmat89 View Post
It recognizes custom words?
Haven't tried that yet.

I wrote (stole most of the code from Stack Overflow and similar sites) a bash script that uses an ImageMagick command to create a bitmap from each PDF page and then runs the bitmap through Tesseract. The image is saved to a ramdisk, so I do not cause unnecessary wear to my SSD.
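The original script is bash and is not reproduced here. The following is only a Python sketch of the same idea, under several assumptions: ImageMagick is available as `convert` (newer versions use `magick`), a page is selected with the `file.pdf[N]` syntax, 300 DPI is a reasonable resolution, `/dev/shm` serves as the ramdisk, and the page count is passed in by the caller. All function names and paths are hypothetical.

```python
import os
import shutil
import subprocess
import tempfile
from typing import List


def rasterize_command(pdf_path: str, page_index: int, image_path: str,
                      dpi: int = 300) -> List[str]:
    """ImageMagick command that renders one PDF page (0-based) to an image."""
    return ["convert", "-density", str(dpi),
            f"{pdf_path}[{page_index}]", image_path]


def ocr_command(image_path: str, output_base: str, lang: str) -> List[str]:
    """Tesseract command; the text is written to OUTPUTBASE.txt."""
    return ["tesseract", image_path, output_base, "-l", lang]


def pick_scratch_dir() -> str:
    """Create a scratch directory on the ramdisk if one is available."""
    ramdisk = "/dev/shm"  # assumption: typical tmpfs mount on Linux
    usable = os.path.isdir(ramdisk) and os.access(ramdisk, os.W_OK)
    # With dir=None, tempfile falls back to the default temp directory.
    return tempfile.mkdtemp(prefix="ocr_", dir=ramdisk if usable else None)


def ocr_pdf(pdf_path: str, page_count: int, out_dir: str,
            lang: str = "eng") -> None:
    """Render each page to the scratch directory and OCR it."""
    scratch = pick_scratch_dir()
    image_path = os.path.join(scratch, "page.png")
    try:
        for index in range(page_count):
            subprocess.run(rasterize_command(pdf_path, index, image_path),
                           check=True)
            out_base = os.path.join(out_dir, f"page-{index + 1:04d}")
            subprocess.run(ocr_command(image_path, out_base, lang), check=True)
    finally:
        # The bitmaps are throwaway; only the text files are kept.
        shutil.rmtree(scratch, ignore_errors=True)


if __name__ == "__main__":
    # Placeholder file name; prints the command for the first page only.
    print(" ".join(rasterize_command("book.pdf", 0, "/dev/shm/page.png")))
```

Reusing one image file per page keeps the ramdisk footprint small, and the per-page text files can be concatenated afterwards for the regex cleanup.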

It is not as nice, neat, or interactive a solution as FineReader and similar software such as Recognita or Readiris (I used all of them on Windows at work), but it is good enough for my needs at home. I would not be willing to pay for FineReader for my very limited use, and this way I do not need to use pirated software.

Last edited by kacir; 07-07-2021 at 09:26 AM.