MobileRead Forums - View Single Post

Pajamaman · 07-07-2021, 10:57 AM

Quote:

Originally Posted by kacir

does not format it in italics (or bold)... No. I use pdfscissors to pre-format [cut] the PDF for OCR.
Then I use regular expressions on the finished text to do some cleanup, including getting rid of page breaks, headers or footers (if pdfscissors couldn't be used successfully to remove them).
Haven't tried that yet.
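
For anyone who wants to see what that kind of regex cleanup might look like, here is a rough Python sketch. It is not kacir's actual set of expressions (those weren't posted). The function name clean_ocr_text, the header text "Sample Book Title", and the assumption that page numbers sit alone on a line are all made up for illustration.

```python
import re

# Assumption: page numbers appear alone on a line, e.g. "12" or "Page 12".
PAGE_NUMBER = re.compile(r"[ \t]*(?:Page[ \t]+)?\d+[ \t]*")
# A word split by a hyphen at the end of a line, e.g. "jum-\nped".
HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")


def clean_ocr_text(text, header=r"[ \t]*Sample Book Title[ \t]*"):
    """Remove page breaks, running headers and page-number lines from OCR text.

    `header` is a hypothetical pattern for the running header; adjust it
    to whatever actually repeats at the top of each page.
    """
    header_re = re.compile(header)
    # OCR tools commonly emit a form feed between pages; treat it as a newline.
    text = text.replace("\f", "\n")
    # Keep only the lines that are neither a header nor a bare page number.
    kept = [
        line
        for line in text.split("\n")
        if not header_re.fullmatch(line) and not PAGE_NUMBER.fullmatch(line)
    ]
    text = "\n".join(kept)
    # Rejoin words that were hyphenated across a line break.
    text = HYPHEN_BREAK.sub(r"\1\2", text)
    # Collapse runs of blank lines left behind by the removals.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that this would also delete a genuine line consisting only of a number, and it would join real hyphenated compounds that happen to break across lines, so the result still needs a read-through.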

I wrote (stole most of the code from Stack Overflow and similar sites) a bash script that uses an ImageMagick command to create a bitmap from each PDF page and then runs the bitmap through Tesseract. The image is saved to a ramdisk, so I do not cause unnecessary wear to my SSD.
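
The script itself wasn't posted, so what follows is only a guess at the general shape, written in Python rather than bash. The assumptions: /dev/shm is used as the ramdisk (common on Linux), ImageMagick is invoked as convert (ImageMagick 7 installs usually call it magick), pages are rendered at 300 dpi, and the helper names build_commands and ocr_page are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(pdf, page_index, workdir="/dev/shm", dpi=300, lang="eng"):
    """Build the two command lines for one page: render to PNG, then OCR.

    `workdir` is assumed to be a RAM-backed directory so the temporary
    bitmap never touches the SSD.
    """
    pdf = Path(pdf)
    name = f"{pdf.stem}-{page_index:04d}"
    image = Path(workdir) / f"{name}.png"
    # Tesseract appends ".txt" to the output base on its own.
    out_base = pdf.with_name(name)
    # "file.pdf[N]" selects a single zero-based page in ImageMagick.
    render = ["convert", "-density", str(dpi), f"{pdf}[{page_index}]", str(image)]
    ocr = ["tesseract", str(image), str(out_base), "-l", lang]
    return render, ocr, image


def ocr_page(pdf, page_index, **kwargs):
    """Render one PDF page to the ramdisk, OCR it, then delete the bitmap."""
    render, ocr, image = build_commands(pdf, page_index, **kwargs)
    subprocess.run(render, check=True)
    try:
        subprocess.run(ocr, check=True)
    finally:
        # Remove the temporary image even if Tesseract fails.
        image.unlink(missing_ok=True)
```

Calling ocr_page("book.pdf", 0) would leave book-0000.txt next to the PDF, provided both tools are installed and ImageMagick is allowed to read PDFs (it needs Ghostscript for that).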

Not as nice, neat or interactive a solution as FineReader and similar software such as Recognita or Readiris.

Again, exactly. If I had to do all that, I just wouldn't OCR. It's too much work for my personal non-professional needs. It would just take me too long to make the tools needed to get the job done, so I wouldn't bother doing the job. I am a tool-user, not a tool-maker.