Quote:
Originally Posted by sp313
If you have a look at books about maths and physics which have their official Kindle versions on Amazon, e.g. this one (click the cover to see a preview), the regular text is OCR'd but the mathematical symbols and equations are left intact, as images.
Would it be possible to do something similar with k2pdfopt? Right now the two options I've found are to leave the original text in and deal with a huge PDF file, or to leave only the OCR'd text and lose all equations.
The issue here is how to tell k2pdfopt what should be left as an image and what should be OCR'd. I don't have an easy way to tell a bitmapped equation from bitmapped regular text. I would have to look further into the Tesseract OCR library. If it reports a confidence value for its OCR conversion, I could leave the low-confidence regions as bitmaps.
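To make the confidence idea concrete, here is a rough Python sketch. It is not k2pdfopt code and does not call Tesseract. It assumes the OCR step has already produced a list of words, each with a bounding box and a 0-100 confidence score (Tesseract can report per-word confidence, for example through the `conf` column of pytesseract's `image_to_data`). The `Word` class, the `split_by_confidence` function, and the threshold of 60 are all made up for illustration, and whether a confidence score really separates equations from regular text is untested.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


@dataclass
class Word:
    """One OCR result. Hypothetical container, not a Tesseract type."""
    text: str
    conf: float  # assumed scale: 0-100, higher means more confident
    box: Box


def split_by_confidence(
    words: List[Word], threshold: float = 60.0
) -> Tuple[List[Word], List[Box]]:
    """Separate words to keep as OCR text from regions to keep as bitmaps.

    The threshold is an arbitrary starting point and would need tuning
    on real scanned pages.
    """
    as_text: List[Word] = []
    as_bitmap: List[Box] = []
    for word in words:
        if word.conf >= threshold and word.text.strip():
            as_text.append(word)
        else:
            # Low confidence or empty text: keep the original pixels.
            as_bitmap.append(word.box)
    return as_text, as_bitmap


# Made-up OCR output for one line containing an inline equation.
words = [
    Word("The", 96.0, (10, 10, 30, 12)),
    Word("energy", 93.0, (45, 10, 52, 12)),
    Word("is", 95.0, (102, 10, 14, 12)),
    Word("E=rnc2", 31.0, (121, 6, 60, 18)),  # equation misread by OCR
]

as_text, as_bitmap = split_by_confidence(words)
print([w.text for w in as_text])
print(as_bitmap)
```

In this sketch the decision is made per word. A real implementation would probably want to merge neighbouring low-confidence boxes into one region, so that an equation is kept as a single image rather than as several fragments.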