03-06-2015, 08:36 AM   #1016
willus
Fuzzball, the purple cat
Posts: 1,305
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by sp313
If you look at maths and physics books that have official Kindle versions on Amazon, e.g. this one (click the cover to see a preview), the regular text is OCR'd but the mathematical symbols and equations are left intact, as images.

Would it be possible to do something similar with k2pdfopt? Right now the two options I've found are to leave the original text in and deal with a huge PDF file, or to leave only the OCR'd text and lose all equations.
The issue here is how to tell k2pdfopt what should be left as an image and what should be OCR'd. I don't have an easy way to tell a bitmapped equation from bitmapped regular text, so I would have to look further into the Tesseract OCR library. It may report a confidence factor for its OCR conversion, in which case I could leave regions with low confidence as bitmaps.
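
To make the low-confidence idea concrete, here is a rough sketch in Python. It is not how k2pdfopt works and it does not call Tesseract. It assumes the OCR step has already produced a list of regions, each with a confidence on a 0-100 scale. The `Region` type, the `split_by_confidence` function, the threshold of 60 and the sample values are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """Hypothetical OCR result for one region of the page."""
    text: str                        # what the OCR engine thinks it read
    conf: float                      # assumed confidence, 0 (worst) to 100 (best)
    bbox: Tuple[int, int, int, int]  # left, top, width, height in pixels


def split_by_confidence(
    regions: List[Region], threshold: float = 60.0
) -> Tuple[List[Region], List[Region]]:
    """Split regions into (keep as OCR text, keep as bitmap).

    The threshold is an arbitrary assumption and would need tuning
    against real pages.
    """
    as_text: List[Region] = []
    as_bitmap: List[Region] = []
    for region in regions:
        if region.conf >= threshold:
            as_text.append(region)
        else:
            as_bitmap.append(region)
    return as_text, as_bitmap


# Made-up sample: three ordinary words and one badly recognized equation.
sample = [
    Region("The", 96.0, (10, 10, 40, 20)),
    Region("energy", 93.0, (55, 10, 70, 20)),
    Region("is", 95.0, (130, 10, 20, 20)),
    Region("E=rnc2", 31.0, (155, 6, 80, 28)),
]

text_regions, bitmap_regions = split_by_confidence(sample)
print("OCR text:", [r.text for r in text_regions])
print("Bitmaps: ", [r.bbox for r in bitmap_regions])
```

Whether equations reliably score lower than regular text is exactly the open question, so the cutoff would have to be checked on real pages before relying on it.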