MobileRead Forums - View Single Post - Kindle AZW vector data lost upon adding to Calibre

loyola · 07-29-2012, 08:38 AM

I haven't seen this issue discussed anywhere. Some Amazon Kindle titles store all their text as vector graphics, in the bad sense: the book was scanned and every character is an actual vector image rather than text set in a font, so different instances of the letter "e" can look different. These books also contain a "text layer" that was added by OCR. The OCR is of course not perfect; it is there only for searching. This looks like Amazon being lazy and not obtaining the original digital data from the publisher.

So here's the problem: when you add such books to Calibre, it creates an HTMLZ copy of the book (as usual) in the Calibre collection folder. The HTML inside is already bad, even before any explicit conversion, because its text is based on the OCR layer of the AZW and not on the original vector graphics. (Such titles typically also have JPG/SVG/whatever graphics in various places, and Calibre imports those perfectly as usual, but that is unrelated.) Wherever the OCR was not perfect, the actual text of the book is mangled and incorrect in the HTMLZ, sometimes with severe misprints. As far as I know, mobi/epub/html do not support this kind of structure (i.e., ALL the text is vector graphics, with an invisible, search-only OCR text layer), so I was wondering whether it is possible to convert the book somehow so that the correct text is preserved. There is one format, albeit a dreadful one, which supports this weird way of saving ebooks: PDF. I've seen many scanned books saved as vector graphics in a PDF (instead of just leaving the scan as monochrome TIFFs and converting to DJVU). I never understood why people do this, but apparently Amazon do too.
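For anyone who wants to check this on their own copy: an HTMLZ file is just a zip archive with the HTML inside, so the text that ended up in it can be dumped and compared by eye against what the Kindle reader displays. Here is a minimal Python sketch, standard library only. The function name `htmlz_text` is my own, and it simply takes the first HTML file it finds in the archive, which is an assumption about the layout rather than something guaranteed.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text nodes, skipping <style> and <script> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def htmlz_text(source):
    """Return the plain text of the first HTML member of an HTMLZ archive.

    `source` may be a path or a binary file-like object.
    Assumption: the book text lives in the first .html/.htm member.
    """
    with zipfile.ZipFile(source) as zf:
        html_names = [n for n in zf.namelist()
                      if n.lower().endswith((".html", ".htm"))]
        if not html_names:
            raise ValueError("no HTML file found in archive")
        raw = zf.read(html_names[0]).decode("utf-8", errors="replace")
    parser = _TextCollector()
    parser.feed(raw)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `htmlz_text("book.htmlz")` (a placeholder path) and skimming the result for obvious OCR slips is a quick way to confirm that the text came from the OCR layer rather than from the glyph images.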

Calibre, as it stands, cannot properly handle AZW titles that are saved in this weird way. These are typically scientific books: the OCR cannot handle the special symbols, so all the text is saved as vector so that nothing is lost in the final product, and only the search is imperfect.

[edit:] I should have added that in such books it is not the case that each page is one big vector image. Rather, each character is saved as a vector image in itself, and the AZW format coordinates these images into strings of words, spaces, etc. (but not in the way that DJVU saves pagewise coordinates for each character). In these weird AZW vector-text books one can, for example, change the width of the rows by resizing the reader window, and everything shuffles around just like in "good" regular font-based-text ebooks. That the text comes from a scan can be seen by observing that there are several different copies of each letter and that it simply looks like a scan (some letters fused together, some ink noise, etc.). And it is definitely vector, as seen by zooming in the AZW reader.
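To make the structure described in the edit above concrete, here is a toy model in Python. It is purely illustrative: the names `Glyph`, `reflow` and `ocr_text` are hypothetical, and nothing here reflects the actual AZW byte layout. It only shows the idea that each character is its own scanned shape with an OCR guess attached, that words built from such shapes can still be re-wrapped to any line width, and that an export based on the OCR guesses inherits the OCR errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    shape_id: int   # which scanned outline gets drawn (hypothetical identifier)
    ocr_char: str   # what the OCR guessed for this outline; may be wrong
    width: int      # advance width in arbitrary units


def reflow(words, line_width, space=1):
    """Greedy line breaking over words, where each word is a list of Glyphs."""
    lines, current, used = [], [], 0
    for word in words:
        w = sum(g.width for g in word)
        needed = w if not current else used + space + w
        if current and needed > line_width:
            # Word does not fit on the current line: start a new one.
            lines.append(current)
            current, used = [word], w
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(current)
    return lines


def ocr_text(lines):
    """Text as an OCR-based export would see it, not as the glyphs look."""
    return "\n".join(
        " ".join("".join(g.ocr_char for g in word) for word in line)
        for line in lines
    )
```

In this model the reader draws `shape_id` outlines, so the page looks right at any width, while anything built from `ocr_char` carries the misreadings along.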

Last edited by loyola; 07-29-2012 at 08:57 AM. Reason: correction