Quote:
Originally Posted by Difermo
I do not have original pdf of all books. I'm taking pictures of them. So to be able to select text, they must be OCR.
I think OCR will be very bad and destroy lines tables etc. So the work to fix all will probably be huge.
|
As PeterT said, you could create a PDF with the image layer on top and the invisible text layer (OCR) on the bottom.
For example, that is how you can search through all the books on Archive.org:
https://archive.org/details/engineeringbook00yeom
The most accurate Open Source program is probably tesseract:
https://github.com/tesseract-ocr/tesseract
but it is command-line only (there are a few programs based off of it that do have a GUI).
I haven't tested it in years, but last I tested there were serious inaccuracies with Formatted Text (carrying over Italics/Bold/Smallcaps/Superscript/Subscript), and you had to do a ton of finagling with dictionaries + training. I also have no idea how well it handles complex formatting like Tables or Charts/Graphs with captions.
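To give a rough idea of the "image on top, invisible text underneath" workflow with tesseract, here is a minimal Python sketch that just wraps the command line. It assumes tesseract is installed and on your PATH and that your photos are .jpg/.png files in one folder. The function names, folder names, and the "eng" language default are made up for illustration, so adjust to taste:

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argv list for a searchable-PDF run of tesseract.

    The trailing "pdf" selects tesseract's PDF output config, which writes
    <output_base>.pdf with the page image visible and the OCR text hidden.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_folder(image_dir, out_dir, lang="eng"):
    """OCR every .jpg/.jpeg/.png in image_dir into one searchable PDF each."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(Path(image_dir).iterdir()):
        if image.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        output_base = out_dir / image.stem
        # tesseract appends ".pdf" to the output base by itself.
        subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
        written.append(Path(str(output_base) + ".pdf"))
    return written


if __name__ == "__main__":
    # Hypothetical folder names, purely for illustration.
    for pdf in ocr_folder("photos", "ocr_output"):
        print(pdf)
```

This gives you one PDF per page photo; merging them into a single book is a separate step, and none of this fixes the accuracy caveats above.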
The most accurate Proprietary OCR is ABBYY Finereader (this is what I use):
https://www.abbyy.com/en-us/finereader/
It costs a bit of money ($199 for the latest version), but if you value your time, it will save you A TON of headaches.
The examples you gave of written Maths or complex equations are just not going to work well with ANY OCR program... but at least you would be able to have all of the normal text in a book OCRed/searchable + accurate. :P
Quote:
Originally Posted by Difermo
I'm still searching the best way to create PDF from pictures. They are not all same size since hand is not always on same distance. I will have to make some diy book scanner
|
The worse your input, the worse the OCR... and the worse your output will be.
Taking pictures with your shaky hand/phone is not ideal because you would most likely get very fuzzy text. This is ok if you are a human trying to quickly read the image, but disastrous for OCR.
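If you want to weed out the fuzziest photos before wasting OCR time on them, one common trick (my suggestion, not something any OCR program requires) is to score each image by the variance of its Laplacian: crisp text edges give a high score, blurry pages a low one. Below is a rough pure-Python sketch. It assumes you have already loaded the photo as a grid of grayscale values (0-255) with whatever image library you use, and any cutoff between "sharp" and "blurry" is something you would have to tune on your own pictures:

```python
def laplacian_variance(gray):
    """Return the variance of the 4-neighbour Laplacian over interior pixels.

    gray is a list of equal-length rows of grayscale values (0-255).
    Higher values mean stronger edges, i.e. a sharper-looking image.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    if height < 3 or width < 3:
        raise ValueError("need at least a 3x3 image")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


if __name__ == "__main__":
    # Toy data: a hard black/white edge versus a smooth gradient.
    sharp = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
    soft = [[0, 51, 102, 153, 204, 255] for _ in range(5)]
    print(laplacian_variance(sharp) > laplacian_variance(soft))
```

Scores are only comparable between photos taken at roughly the same resolution, so treat this as a way to rank your own shots, not as an absolute quality measure.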
The DIY Book Scanner forums discuss quite a few designs people have rigged up + their workflows:
https://forum.diybookscanner.org/
and we also discussed quite a lot of this in the topic, "Delicate text digitalizing + scanning issues":
https://www.mobileread.com/forums/sh...d.php?t=234146