07-29-2012, 07:38 AM | #1 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2012
Device: Kindle
|
Kindle AZW vector data lost upon adding to Calibre
I haven't seen this issue discussed anywhere. Some Amazon Kindle titles have all their text saved as vector graphics (in the bad sense: they scanned the book and all the text is vector, not text set in certain fonts but actual vector images, so different instances of the letter "e" can look different). These books also contain a "text layer" that was added by OCR. The OCR is not perfect, of course; it is just for searching purposes. This is Amazon being lazy and not obtaining the original digital data from the publisher.
So here's the problem: when you add such a book to Calibre, it creates an HTMLZ copy of the book (as usual) in the Calibre collection folder. The HTML inside is already bad, before any explicit conversion, because its text is based on the OCR layer of the AZW, not on the original vector graphics. (Typically such titles also have JPG/SVG/whatever graphics in various places, and as usual Calibre imports them perfectly, but this is unrelated.) Wherever the OCR was not perfect, the actual text of the book is mangled and incorrect in the HTMLZ, sometimes with severe misprints.

Since mobi/epub/html do not support (as far as I know) this kind of format structure (i.e., ALL the text is vector graphics, with an invisible, for-search-only OCR text layer), I was wondering whether it is possible to convert the book somehow so that this information (the correct text) is preserved. There is one format, albeit a dreadful one, which supports this weird way of saving ebooks: PDF. I've seen many scanned books that were saved as vector graphics in a PDF (instead of just leaving the scan as monochrome TIFFs and converting to DJVU). I never understood why people do this, but apparently Amazon does too.

Calibre, as it stands, cannot properly handle all those AZW titles that are saved in this weird way. (Typically these are scientific books: the OCR cannot handle the special symbols, so all the text is saved as vector so that nothing is lost in the final product; only the search will not be perfect.)

[edit:] I should have added that in such books it is not the case that each page is one big vector image. Rather, each character is saved as a vector image in itself, and the AZW format arranges these images into strings of words, spaces, etc. (but not in the way that DJVU saves page-wise coordinates for each character). In these weird AZW vector-text books one can, for example, change the width of the text lines by resizing the reader window, and everything reflows just like in "good" regular font-based-text ebooks. You can tell it is a scan because there are several different copies of each letter and it looks like one (some letters fused together, some ink noise, etc.). And it is definitely vector, as you can see by zooming in the AZW reader.
Last edited by loyola; 07-29-2012 at 07:57 AM. Reason: correction |
07-29-2012, 08:29 AM | #2 |
creator of calibre
Posts: 44,334
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The format is called Topaz, and calibre does not handle it at all. You have installed some third-party DeDRM plugins that convert the Topaz to HTML for you.
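
For anyone who wants to check whether a given Kindle file is in this format before adding it, here is a minimal sketch. It assumes that such files begin with the bytes `TPZ` (which is what the third-party tools are understood to check for) and that regular MOBI-based files carry `BOOKMOBI` at byte offset 60 of their PalmDB header. The function names are made up for illustration and are not part of calibre.

```python
# Sketch only: guess a Kindle file's container type from its first bytes.
# Assumption: files of the vector-glyph format start with b"TPZ";
# MOBI-based files have b"BOOKMOBI" at offset 60 (PalmDB type/creator).

TPZ_MAGIC = b"TPZ"
MOBI_MAGIC = b"BOOKMOBI"
MOBI_MAGIC_OFFSET = 60


def sniff_kindle_format(header: bytes) -> str:
    """Return 'topaz', 'mobi', or 'unknown' for the given leading bytes."""
    if header.startswith(TPZ_MAGIC):
        return "topaz"
    end = MOBI_MAGIC_OFFSET + len(MOBI_MAGIC)
    if header[MOBI_MAGIC_OFFSET:end] == MOBI_MAGIC:
        return "mobi"
    return "unknown"


def sniff_kindle_file(path: str) -> str:
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as f:
        return sniff_kindle_format(f.read(MOBI_MAGIC_OFFSET + len(MOBI_MAGIC)))
```

Calling `sniff_kindle_file` on a downloaded `.azw` file tells you whether the text in it is likely to be OCR-derived once converted; the file extension alone does not, since both kinds can carry the same extension.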
|
07-29-2012, 11:39 PM | #3 | |||
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Any way you view it, Topaz-formatted books should be handled outside of calibre, using Sigil and the original glyphs HTML to spell-check and error-check the conversion before adding the book to calibre. Last edited by DoctorOhh; 07-29-2012 at 11:42 PM. |
|||
08-04-2012, 11:19 AM | #4 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2012
Device: Kindle
|
Thanks! I was not aware of the existence of this format. Converting to PDF via Calibre is a good solution (I want to avoid the annoying proprietary viewer).
|