07-07-2021, 09:57 AM   #26
Pajamaman
Wizard
Posts: 2,861
Karma: 10700629
Join Date: May 2016
Location: Canada
Device: Onyx Nova
Quote:
Originally Posted by kacir
does not format it italics (or bold)...
No. I use pdfscissors to pre-format [cut] the PDF for OCR.
Then I use regular expressions on the finished text to do some cleanup, including getting rid of page breaks, headers, or footers (if pdfscissors couldn't remove them).
Haven't tried that yet.
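
For anyone wondering what that kind of regex cleanup looks like, here is a rough Python sketch. It is not kacir's actual set of expressions, which aren't shown in the thread. The function name `clean_ocr_text`, the `running_header` parameter, and the specific patterns are all made up for illustration, and real OCR output would likely need patterns tuned to the book in question.

```python
import re

def clean_ocr_text(text, running_header=None):
    """Hypothetical cleanup pass over OCR output (illustrative patterns only)."""
    # Many OCR tools mark page breaks with a form feed; turn them into newlines.
    text = text.replace("\f", "\n")

    # Optionally drop lines that consist only of a known running header/footer.
    if running_header:
        text = re.sub(
            rf"(?m)^[ \t]*{re.escape(running_header)}[ \t]*$\n?", "", text
        )

    # Drop lines holding only a page number, e.g. "12", "- 12 -", "Page 12".
    # Caveat: this also removes any legitimate line that is just a number.
    text = re.sub(
        r"(?mi)^[ \t]*(?:page[ \t]+)?-?[ \t]*\d+[ \t]*-?[ \t]*$\n?", "", text
    )

    # Rejoin words split by a hyphen at a line end ("para-\ngraph").
    # Caveat: genuine hyphenated compounds broken at that spot get joined too.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Collapse runs of three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "First para-\ngraph ends here.\n\f12\nSecond page text.\n\n\n\nPage 13\nEnd."
    print(clean_ocr_text(sample))
```

The two caveats in the comments are the usual reason this step needs a human to glance over the result afterwards.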

I wrote (stole most of the code from Stack Overflow and similar sites) a bash script that uses an ImageMagick command to create a bitmap from each PDF page and then runs the bitmap through Tesseract. The image is saved to a ramdisk, so I do not cause unnecessary wear to my SSD.
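
The script itself isn't posted, so the following is only a guess at the general shape of such a pipeline, written in Python rather than bash. Assumptions: ImageMagick is available as `convert` (ImageMagick 7 calls it `magick`) and is allowed to read PDFs, `tesseract` is on the PATH, `/dev/shm` is a RAM-backed directory as on most Linux systems, and the caller already knows the page count. The function names and the file naming scheme are hypothetical.

```python
import subprocess
from pathlib import Path

def build_page_commands(pdf_path, page_index, work_dir, dpi=300, lang="eng"):
    """Return (render_cmd, ocr_cmd, text_path) for one zero-based PDF page."""
    work_dir = Path(work_dir)
    stem = f"page-{page_index:04d}"
    image = work_dir / f"{stem}.png"
    out_base = work_dir / stem  # tesseract appends ".txt" to this base name

    # ImageMagick selects a single PDF page with the "file.pdf[N]" syntax.
    render = ["convert", "-density", str(dpi), f"{pdf_path}[{page_index}]", str(image)]
    ocr = ["tesseract", str(image), str(out_base), "-l", lang]
    return render, ocr, out_base.with_suffix(".txt")


def ocr_pdf(pdf_path, page_count, work_dir="/dev/shm/ocr"):
    """Render and OCR each page in turn, keeping temporary files in RAM."""
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    pages = []
    for index in range(page_count):
        render, ocr, text_path = build_page_commands(pdf_path, index, work_dir)
        subprocess.run(render, check=True)
        subprocess.run(ocr, check=True)
        pages.append(text_path.read_text(encoding="utf-8"))
        Path(render[-1]).unlink()  # delete the bitmap once it has been read
    # Join pages with form feeds so page breaks can be found (or stripped) later.
    return "\f".join(pages)
```

Calling `ocr_pdf("book.pdf", page_count=3)` would run the two external commands once per page; nothing here touches the SSD except reading the source PDF.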

Not as nice, neat, or interactive a solution as FineReader and similar software such as Recognita or Readiris.
Again, exactly. If I had to do all that, I just wouldn't OCR. It's too much work for my personal, non-professional needs. It would take me too long to make the tools needed to get the job done, so I wouldn't bother doing the job. I am a tool-user, not a tool-maker.