MobileRead Forums - View Single Post

KevinH · 06-24-2011, 11:34 AM

Hi,

Quote:

Originally Posted by ATimson

Topaz is designed to be an easy format for scanned books - it's got the images on the page, combined with an often-not-very-good OCR'd copy of the text for searching. Those scripts only get you that OCR'd text.

If your resulting book was the same, with no blatant OCR errors and with formatting intact, you're lucky.

That is not quite correct. The calibre plugin will only get you the OCR'd text version because a plugin can pass along only one ebook file, not two. So it passes along the HTML version (as htmlz), which you can convert to epub and then on to whatever other format you like after fixing any errors that bother you.

The other tools in "tools" (KindleBooks, DeDRM) can provide you with the OCR'd HTML plus the complete set of page images (exact copies), written out as SVG images embedded in XHTML pages, so that you can read the book with any modern browser that understands SVG (that is, Safari, Firefox 4, Firefox 5, etc.).

You can also easily modify the tool so that it does not embed the SVG in XHTML and instead creates pure SVG images (one per page). From those you can easily convert the book to an exact set of PNG or JPEG images, or create an image-only PDF file (it will be quite large!) and then use Acrobat Pro to OCR it yourself to make it searchable.
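If you would rather not touch the tool itself, a rough alternative is to post-process the XHTML pages it already writes. The sketch below is not code from the tools. It assumes each page is well-formed XHTML containing one inline svg element in the standard SVG namespace and no undefined entities; the function names and the "*.xhtml" file pattern are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def extract_svg(xhtml):
    """Return the first inline <svg> of an XHTML page as a standalone
    SVG document string, or None if the page has no <svg> element."""
    # Serialize SVG as the default namespace so the output starts with
    # a plain <svg xmlns="..."> instead of a generated prefix.
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("xlink", XLINK_NS)
    root = ET.fromstring(xhtml)
    # iter() also checks the root element itself.
    svg = next(root.iter("{%s}svg" % SVG_NS), None)
    if svg is None:
        return None
    body = ET.tostring(svg, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>\n' + body


def split_pages(src_dir, dst_dir, pattern="*.xhtml"):
    """Write one .svg file per XHTML page found in src_dir.
    The file pattern is an assumption, adjust it to your output."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for page in sorted(Path(src_dir).glob(pattern)):
        svg_text = extract_svg(page.read_bytes())
        if svg_text is None:
            continue  # page without an inline SVG, skip it
        out = dst / (page.stem + ".svg")
        out.write_text(svg_text, encoding="utf-8")
        written.append(out)
    return written
```

The resulting per-page SVG files can then be fed to whatever SVG-to-PNG or SVG-to-PDF converter you prefer.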

Until someone creates some sort of character recognition program that maps SVG glyphs to outline-font characters, there is no other way to deal with the issue.

Hope this makes things clearer.