Quote:
Originally Posted by loyola
I haven't seen this issue treated anywhere. Some Amazon kindle titles have all their text saved as vector graphics (in the bad sense - they scanned the book and all the text is vector - not text with certain fonts, but actual vector images, in other words different instances of the letter "e" can look different.)
|
The format is referred to as Topaz, and these books display perfectly on Kindle readers and in Kindle apps.
Quote:
Originally Posted by loyola
These books also contain a "text layer" which was added by OCR. The OCR is not perfect of course, it is just for searching purposes. This is Amazon being lazy, and not obtaining the original digital data from the publisher.
|
This is not Amazon being lazy. For the most part, no digital source data exists for these books. Amazon is contracted to scan the books and turn them into digital editions, and it does a great job of replicating the book in a form the Kindle can handle. You can read up on the creation of Topaz from the person who created it (screen name: Fluffy) in posts 800-812 and on his blog here. Very interesting read.
Quote:
Originally Posted by loyola
So here's the problem: when you add such books to Calibre, it creates an HTMLZ copy of the book (as usual) in the Calibre collection folder.
|
As pointed out, that is the result of a third-party plugin. The same set of tools from Apprentice Alf's site, used apart from calibre, should also create a perfect HTML version of the book using the SVG glyphs, which can be viewed in Firefox. You can open the OCR version in Sigil and the SVG HTML version in Firefox, then do a side-by-side (A/B) comparison to correct any formatting errors.
Any way you view it, Topaz-formatted books should be handled outside of calibre, using Sigil and the original-glyph HTML to spell check and error check the conversion before adding the book to calibre.
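If you want a quick first pass before the manual compare, something like the rough Python sketch below could help. It is not part of Apprentice Alf's tools or of calibre. It only assumes you already have the OCR text as an HTML string, and the function name `suspicious_tokens` is made up for this example. It flags words that mix letters and digits (a common OCR slip such as "c0me"), so you know where to look first in the glyph version. Expect false positives such as "3rd".

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


# Runs of ASCII letters and digits; punctuation and spaces split tokens.
WORD = re.compile(r"[A-Za-z0-9]+")


def suspicious_tokens(html_text):
    """Return tokens that contain both letters and digits (hypothetical heuristic)."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = " ".join(parser.chunks)
    flagged = []
    for token in WORD.findall(text):
        has_alpha = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        if has_alpha and has_digit:
            flagged.append(token)
    return flagged
```

Feed it the contents of the OCR HTML file and check each flagged word against the SVG glyph page in Firefox. A real spell checker in Sigil will catch far more; this only narrows down the obvious digit-for-letter mistakes.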