11-01-2015, 10:56 AM   #1207
eschwartz
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
Originally Posted by willus
K2pdfopt generates PDF output only, but it will add an OCR layer to the scanned text, or you can output the OCR'd text directly to an ASCII text file. It uses the Tesseract OCR engine, so that will govern its accuracy. I don't know if it is better than calibre--I'm not sure which OCR engine calibre uses. I'm also not sure what you mean by "supports ligatures." Do you mean you want it to generate a special "ligature" character code, or you want it to correctly break ligatures into their two separate letters? To be honest, I don't recall Tesseract's behavior on ligatures at the moment, either way. It's easy enough to try it out.

PS. Are you sure calibre is doing the OCR and the OCR layer isn't already in the scanned file? As far as I can tell, calibre does not have integrated OCR capability unless you are using it with a third-party tool. If the OCR is in the scanned file, it's probably done with Tesseract already, since Tesseract is supported by Google.
Yup -- calibre relies on an OCR layer that is already present in the file; when there is none, it simply adds the page images as they are.
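
On the ligature question in the quote: if the OCR'd text does turn out to contain single ligature code points (for example U+FB01 for "fi"), one possible way to break them into separate letters afterwards is Unicode compatibility decomposition. This is only a rough sketch in Python, not something k2pdfopt, Tesseract, or calibre is known to do; the function name `split_ligatures` is made up for illustration, and it only touches the Latin ligature range U+FB00 to U+FB06.

```python
import unicodedata


def split_ligatures(text: str) -> str:
    """Replace Latin ligature code points (U+FB00..U+FB06) with plain letters.

    Hypothetical helper for post-processing OCR text; it is not part of
    k2pdfopt, Tesseract, or calibre.
    """
    return "".join(
        # NFKD applies the compatibility decomposition, e.g. U+FB01 -> "fi".
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


if __name__ == "__main__":
    sample = "\ufb01nal o\ufb03ce"  # "final office" written with fi and ffi ligatures
    print(split_ligatures(sample))  # prints: final office
```

Restricting the decomposition to that small range is a deliberate choice here: normalizing the whole string with NFKC or NFKD would also rewrite unrelated characters such as superscripts or full-width forms.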