MobileRead Forums - View Single Post - k2pdfopt: optimizes PDFs for viewing on e-readers

willus · 10-26-2015, 11:09 PM

Quote:

Originally Posted by crankypants

Google Books often has PDF files which are just a set of images of scans from an old book. Does this software convert those scanned images (inside the PDF) to text or EPUB? Calibre does this but only with 98% accuracy, and Calibre doesn't support ligatures (like "f" and "i" next to each other, which then become one electronic character). ...

K2pdfopt generates PDF output only, but it will add an OCR layer to the scanned text, or you can output the OCR'd text directly to an ASCII text file. It uses the Tesseract OCR engine, so that will govern its accuracy. I don't know if it is better than calibre, since I'm not sure which OCR engine calibre uses. I'm also not sure what you mean by "supports ligatures." Do you mean you want it to generate a special "ligature" character code, or do you want it to correctly break ligatures into their separate letters? To be honest, I don't recall Tesseract's behavior on ligatures at the moment, either way. It's easy enough to try it out.
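To make the two readings of "supports ligatures" concrete, here is a minimal Python sketch. It is not part of k2pdfopt or Tesseract; the helper names `find_ligatures` and `split_ligatures` are made up for illustration. It assumes the OCR'd text has already been saved and read in as a Unicode string, and it only looks at the Latin ligature code points U+FB00 through U+FB06. The first helper reports where special ligature characters occur, and the second breaks them back into their separate letters.

```python
import unicodedata

# Latin ligature code points (ff, fi, fl, ffi, ffl, long-s t, st) in the
# Unicode "Alphabetic Presentation Forms" block. Restricting the check to
# this range is an assumption made for this illustration.
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def find_ligatures(text):
    """Return (index, character) pairs for each ligature code point in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in LIGATURES]


def split_ligatures(text):
    """Replace each ligature code point with its separate letters.

    Only the characters in LIGATURES are normalized, so other characters
    that NFKC would also alter (superscripts, for example) are left alone.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in LIGATURES else ch
        for ch in text
    )


if __name__ == "__main__":
    # Hypothetical OCR output containing the "fi" and "ffi" ligature characters.
    sample = "The \ufb01rst o\ufb03ce"
    print(find_ligatures(sample))
    print(split_ligatures(sample))  # The first office
```

Running something like this on the text file that the OCR step produces would show which behavior you are getting: an empty list from `find_ligatures` means the engine already emitted separate letters.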

PS. Are you sure calibre is doing the OCR and the OCR layer isn't already in the scanned file? As far as I can tell, calibre does not have integrated OCR capability unless you are using it with a third-party tool. If the OCR is in the scanned file, it's probably done with Tesseract already, since Tesseract is supported by Google.