03-12-2017, 07:43 PM   #10
Tex2002ans
Quote:
Originally Posted by Difermo
I do not have original pdf of all books. I'm taking pictures of them. So to be able to select text, they must be OCR.
I think OCR will be very bad and destroy lines tables etc. So the work to fix all will probably be huge.

As PeterT said, you could create a PDF with the image layer on top and the invisible text layer (OCR) on the bottom.

For example, that is how you can search through all the books on Archive.org:

https://archive.org/details/engineeringbook00yeom

The most accurate Open Source program is probably tesseract:

https://github.com/tesseract-ocr/tesseract

but it is command-line only (there are a few programs built on top of it that do have a GUI).

I haven't tested it in years, but when I last did, there were serious inaccuracies with formatted text (carrying over italics/bold/small caps/superscript/subscript), and you had to do a ton of finagling with dictionaries + training. I also have no idea how well it handles complex formatting like tables or charts/graphs with captions.
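
If you want to try the "image on top, invisible text underneath" PDF with tesseract, here is a rough sketch in Python that just wraps the command line. As far as I know, adding `pdf` as the last argument tells tesseract to write a searchable PDF instead of plain text, and `-l` picks the language. The function names and file names (`page_001.png`, etc.) are made up for illustration, and it handles one photo at a time, so you would still have to merge the pages into one book afterwards.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_pdf_command(image: Path, out_base: Path, lang: str = "eng") -> List[str]:
    """Build the command: tesseract <image> <output base> -l <lang> pdf.

    tesseract appends ".pdf" to the output base on its own.
    """
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image: Path, out_base: Path, lang: str = "eng") -> Path:
    """Run tesseract on one page image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH; install it first")
    subprocess.run(build_tesseract_pdf_command(image, out_base, lang), check=True)
    # Assumption: the output lands at <out_base>.pdf
    return Path(str(out_base) + ".pdf")


if __name__ == "__main__":
    # Hypothetical file names; replace with your own photos.
    print(build_tesseract_pdf_command(Path("page_001.png"), Path("page_001")))
```

Building the command separately from running it means you can print it and check it by hand before letting it loose on a few hundred pages.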

The most accurate Proprietary OCR is ABBYY FineReader (this is what I use):

https://www.abbyy.com/en-us/finereader/

It costs a bit of money ($199 for the latest version), but if you value your time, it will save you A TON of headaches.

The examples you gave of written Maths or complex equations are just not going to work well with ANY OCR program... but at least you would be able to have all of the normal text in a book OCRed/searchable + accurate. :P

Quote:
Originally Posted by Difermo
I'm still searching the best way to create PDF from pictures. They are not all same size since hand is not always on same distance. I will have to make some diy book scanner

The worse your input, the worse the OCR... and the worse your output will be.

Taking pictures with your shaky hand/phone is not ideal because you would most likely get very fuzzy text. This is ok if you are a human trying to quickly read the image, but disastrous for OCR.
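
One way to weed out the fuzzy shots before wasting OCR time on them is a sharpness score. The sketch below uses a common heuristic (variance of the Laplacian), which is my own suggestion and not something any of the OCR programs above require. It assumes you have already loaded the photo as a grid of grayscale values (0-255); the function name and the tiny test images are made up for illustration, and any cutoff between "sharp enough" and "retake it" is something you would have to tune on your own photos.

```python
def laplacian_variance(gray):
    """Return a sharpness score for a grayscale image.

    gray is a list of rows, each row a list of 0-255 values.
    Crisp edges give large Laplacian responses, so a higher
    variance suggests a sharper image.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")

    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # 4-neighbour Laplacian: neighbours minus 4x the centre pixel
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)

    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


if __name__ == "__main__":
    # Toy images: hard black/white stripes vs. a smooth gradient.
    sharp = [[0, 255, 0, 255, 0] for _ in range(5)]
    soft = [[0, 64, 128, 192, 255] for _ in range(5)]
    print(laplacian_variance(sharp) > laplacian_variance(soft))
```

The score only means something when you compare photos of similar pages taken at a similar size, so treat it as a "retake the worst ones" filter rather than an absolute measure.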

The DIY Book Scanner forums discuss quite a few designs people have rigged up + their workflows:

https://forum.diybookscanner.org/

and we also discussed quite a lot of this in the topic, "Delicate text digitalizing + scanning issues":

https://www.mobileread.com/forums/sh...d.php?t=234146