Kindle AZW vector data lost upon adding to Calibre

loyola · 07-29-2012, 07:38 AM

I haven't seen this issue treated anywhere. Some Amazon Kindle titles have all their text saved as vector graphics, in the bad sense: the book was scanned and every character is an actual vector image rather than text set in a font, so different instances of the letter "e" can look different. These books also contain a "text layer" which was added by OCR. The OCR is not perfect of course; it is just for searching purposes. This is Amazon being lazy, and not obtaining the original digital data from the publisher.

So here's the problem: when you add such books to Calibre, it creates an HTMLZ copy of the book (as usual) in the Calibre collection folder. The HTML inside is already bad (before any explicit conversions even!) because its text is based on the OCR layer of the AZW, not on the original vector graphics. (Typically such titles also have JPG/SVG/whatever graphics in various places, and as usual Calibre imports them perfectly, but this is unrelated.) In those places where the OCR was not perfect, the actual text of the book is mangled and incorrect in the HTMLZ, sometimes with severe misprints. Since mobi/epub/html do not support (as far as I know) this kind of format structure (i.e., ALL the text is vector graphics, with an invisible for-search-only OCR layer of text), I was wondering if it is possible to convert the book somehow so that this information (the correct text) is preserved. There is one format, albeit a dreadful one, which supports this weird way of saving ebooks: PDF. I've seen many scanned books that were saved as vector graphics in a PDF (instead of just leaving the scan as monochrome TIFFs and converting to DJVU). I never understood why people do this, but apparently Amazon does too.

Calibre, as it stands, cannot properly handle all those AZW titles that are saved in this weird way (typically these are scientific books, because the OCR cannot handle the special symbols, so they save all the text as vector so that nothing is lost in the final product - only the search will not be perfect).

[edit:] I should have added that in such books it is not the case that each page is one big vector image. Rather, each character is saved as a vector image in itself, and the AZW format coordinates these images into strings of words, spaces, etc. (but not in the way that DJVU saves pagewise coordinates for each character). In these weird AZW vector-text books one can, for example, change the width of the rows by resizing the reader window, and everything shuffles around just like in "good" regular font-based-text ebooks. One can see this by observing that there are several different copies of each letter, and that it just looks like a scan (some letters fused together, some ink noise, etc.). And it is definitely vector (seen by zooming in the AZW reader).

kovidgoyal · 07-29-2012, 08:29 AM

The format is called Topaz and calibre does not handle it at all. You have installed some third-party DeDRM plugins that convert the Topaz to HTML for you.

DoctorOhh · 07-29-2012, 11:39 PM

Quote:

Originally Posted by loyola

I haven't seen this issue treated anywhere. Some Amazon Kindle titles have all their text saved as vector graphics, in the bad sense: the book was scanned and every character is an actual vector image rather than text set in a font, so different instances of the letter "e" can look different.

The format is referred to as Topaz and these books view perfectly in Kindle readers and Kindle apps.

Quote:

Originally Posted by loyola

These books also contain a "text layer" which was added by OCR. The OCR is not perfect of course, it is just for searching purposes. This is Amazon being lazy, and not obtaining the original digital data from the publisher.

This is not Amazon being lazy. For the most part these books do not have digital data. Amazon is contracted to scan the books and turn them into digital editions, and it does a great job of replicating the book in a form the Kindle can handle. You can read up on the creation of Topaz from the person who created it (screen name: Fluffy) in posts 800-812 and on his blog here. Very interesting read.

Quote:

Originally Posted by loyola

So here's the problem: when you add such books to Calibre, it creates an HTMLZ copy of the book (as usual) in the Calibre collection folder.

As pointed out, that is the result of a third-party plugin. The same set of tools from Apprentice Alf's site, used apart from calibre, should also create a perfect HTML version of the book using the SVG glyphs, which can be viewed in Firefox. You can open the OCR version in Sigil and the SVG glyph version in Firefox, then do an A/B comparison between the two to correct any formatting errors in the OCR version.

Any way you view it, Topaz-formatted books should be handled outside of calibre, using Sigil and the original-glyphs HTML to spell-check and error-check the conversion before adding the book to calibre.
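
To decide which books need this out-of-calibre treatment, it helps to know which of your Kindle files are Topaz in the first place. Here is a rough Python sketch that sniffs the first bytes of a file. It rests on two assumptions that are not stated anywhere in this thread: that Topaz files begin with the ASCII signature "TPZ", and that Mobipocket-style files carry "BOOKMOBI" at byte offset 60 of the Palm database header. The function name `sniff_kindle_format` is made up for illustration and is not part of calibre or of any plugin.

```python
def sniff_kindle_format(path):
    """Guess whether a Kindle file is Topaz or Mobipocket from its header.

    Assumptions (not confirmed in this thread):
      - Topaz files start with the bytes b"TPZ"
      - Mobipocket files have b"BOOKMOBI" at offset 60 (PalmDB type/creator)
    Returns "topaz", "mobi", or "unknown".
    """
    with open(path, "rb") as f:
        head = f.read(68)  # enough to cover both signatures

    if head[:3] == b"TPZ":
        return "topaz"
    if head[60:68] == b"BOOKMOBI":
        return "mobi"
    return "unknown"
```

Calling `sniff_kindle_format` on each file before import lets you set aside anything reported as "topaz" for the Sigil/Firefox check. The file extension alone is not a reliable guide, since the book in this thread was a Topaz title with an .azw name.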

loyola · 08-04-2012, 11:19 AM

Thanks! I was not aware of the existence of this format. Converting to PDF via Calibre is a good solution (I want to avoid the annoying proprietary viewer).
