MobileRead Forums > E-Book Software > Calibre > Conversion

07-29-2012, 07:38 AM   #1
loyola
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2012
Device: Kindle
Kindle AZW vector data lost upon adding to Calibre

I haven't seen this issue treated anywhere. Some Amazon Kindle titles have all their text saved as vector graphics (in the bad sense: they scanned the book and all the text is vector, not text set in certain fonts but actual vector images, so different instances of the letter "e" can look different). These books also contain a "text layer" which was added by OCR. The OCR is not perfect of course, it is just for searching purposes. This is Amazon being lazy, and not obtaining the original digital data from the publisher.

So here's the problem: when you add such books to Calibre, it creates an HTMLZ copy of the book (as usual) in the Calibre collection folder. The HTML inside is already bad (before any explicit conversion!) because its text is based on the OCR layer of the AZW, not on the original vector graphics. (Typically such titles also have JPG/SVG/whatever graphics in various places, and as usual Calibre imports them perfectly, but this is unrelated.) Wherever the OCR was not perfect, the actual text of the book is mangled and incorrect in the HTMLZ, sometimes with severe misprints.

Since mobi/epub/html do not support (as far as I know) this kind of structure (i.e., ALL the text is vector graphics, with an invisible, search-only OCR text layer), I was wondering if it is possible to convert the book somehow so that the correct text is preserved. There is one format, albeit a dreadful one, which supports this weird way of saving ebooks: PDF. I've seen many scanned books that were saved as vector graphics in a PDF (instead of just leaving the scan as monochrome TIFs and converting to DJVU). I never understood why people do this, but apparently Amazon do too.

Calibre, as it stands, cannot properly handle all those AZW titles that are saved in this weird way (typically these are scientific books, because the OCR cannot handle the special symbols, so they save all the text as vector so that nothing is lost in the final product - only the search will not be perfect).

[edit:] I should have added that in such books it is not the case that each page is one big vector image. Rather, each character is saved as a vector image in itself, and the AZW format arranges these images into strings of words, spaces, etc. (but not in the way that DJVU saves pagewise coordinates for each character). In these weird AZW vector-text books one can, for example, change the width of the rows by resizing the reader window, and everything shuffles around just like in "good" regular font-based ebooks. One can tell the text is scanned because there are several different copies of each letter and it just looks like a scan (some letters fused together, some ink noise, etc.). And it is definitely vector (seen by zooming in the AZW reader).
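To make the description above concrete, here is a minimal Python sketch of how text stored as per-character glyph images can still reflow when the window is resized. It is only an illustration under assumptions: the `GlyphWord` type, its fields, and the greedy `reflow` function are hypothetical names invented here, not the real layout logic of the AZW/Topaz format or of any Kindle reader.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GlyphWord:
    """A word stored as a run of scanned glyph shapes, not font characters (hypothetical model)."""
    glyph_ids: List[int]  # indexes into a table of glyph shapes; two "e"s may use different ids
    width: float          # total width of the word, in arbitrary units
    ocr_text: str         # invisible OCR guess used only for search; may be wrong


def reflow(words: List[GlyphWord], line_width: float, space: float = 1.0) -> List[List[GlyphWord]]:
    """Greedily pack words into lines no wider than line_width."""
    lines: List[List[GlyphWord]] = []
    current: List[GlyphWord] = []
    used = 0.0
    for word in words:
        needed = word.width if not current else used + space + word.width
        if current and needed > line_width:
            # The word does not fit: close the current line and start a new one.
            lines.append(current)
            current, used = [word], word.width
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(current)
    return lines


words = [
    GlyphWord([3, 7], 4.0, "an"),
    GlyphWord([5, 9, 2], 6.0, "eye"),
    GlyphWord([5, 1], 4.0, "cl"),  # an OCR guess can be wrong while the glyphs stay correct
]
wide = reflow(words, line_width=20.0)    # everything fits on one line
narrow = reflow(words, line_width=10.0)  # each word ends up on its own line
```

The point of the sketch is that the glyph images carry the correct appearance while `ocr_text` is a separate, possibly wrong guess, which is why a conversion built from the OCR layer alone loses information.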

Last edited by loyola; 07-29-2012 at 07:57 AM. Reason: correction
07-29-2012, 08:29 AM   #2
kovidgoyal
creator of calibre
Posts: 25,685
Karma: 4998489
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The format is called Topaz and calibre does not handle it at all. You have installed some third-party DeDRM plugins that convert the Topaz to HTML for you.
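For anyone who wants to check which kind of file they have before importing it, a rough Python sketch follows. It assumes that Topaz files start with the bytes `TPZ` and that Mobipocket-based Kindle files carry `BOOKMOBI` at offset 60 of the header; the function names are made up for this example, and this is not how calibre or any plugin necessarily does the check.

```python
def sniff_kindle_format(header: bytes) -> str:
    """Guess the container type from the first bytes of a Kindle file.

    Assumptions: Topaz files begin with b"TPZ"; Mobipocket-based files
    have b"BOOKMOBI" at byte offset 60.
    """
    if header[:3] == b"TPZ":
        return "topaz"
    if header[60:68] == b"BOOKMOBI":
        return "mobi"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as f:
        return sniff_kindle_format(f.read(68))
```

Note that the file extension alone does not settle the question: the book in this thread is a Topaz file even though it carries an .azw extension, so looking at the header is more reliable.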
07-29-2012, 11:39 PM   #3
DoctorOhh
US Navy, Retired
Posts: 8,801
Karma: 12534285
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by loyola
I haven't seen this issue treated anywhere. Some Amazon kindle titles have all their text saved as vector graphics (in the bad sense - they scanned the book and all the text is vector - not text with certain fonts, but actual vector images, in other words different instances of the letter "e" can look different.)
The format is referred to as Topaz and these books view perfectly in Kindle readers and Kindle apps.

Quote:
Originally Posted by loyola
These books also contain a "text layer" which was added by OCR. The OCR is not perfect of course, it is just for searching purposes. This is Amazon being lazy, and not obtaining the original digital data from the publisher.
This is not Amazon being lazy. For the most part these books do not have digital data. Amazon is contracted to scan the books and digitize them, and it does a great job of replicating the book in a form the Kindle can handle. You can read up on the creation of Topaz from the person (screen-name: Fluffy) who created it in posts 800-812 and on his blog here. Very interesting read.

Quote:
Originally Posted by loyola
So here's the problem: when you add such books to Calibre, it creates an HTMLZ copy of the book (as usual) in the Calibre collection folder.
As pointed out, that is the result of a third-party plugin. The same set of tools from Apprentice Alf's site, used apart from calibre, should also create a perfect HTML version of the book using the SVG glyphs, which can be viewed in Firefox. You can view the OCR version in Sigil and the SVG HTML version in Firefox, and do an A/B comparison between the two to correct any formatting errors.

Any way you look at it, Topaz-formatted books should be handled outside of calibre, using Sigil and the original glyph HTML to spell check and error check the conversion before adding the book to calibre.
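As a possible aid for that error check, here is a small Python sketch that lists tokens in the OCR text that look like typical OCR damage, so you know where to look first in the glyph version. The heuristics (letters mixed with digits, unexpected symbols) and the name `suspicious_tokens` are assumptions made for this example, not part of Sigil, calibre, or the Apprentice Alf tools, and it will also flag legitimate things such as formulas.

```python
import re
from typing import List

_TOKEN = re.compile(r"\S+")
# A token that contains both a letter and a digit, e.g. "c0efficient".
_LETTER_DIGIT_MIX = re.compile(r"(?=.*[A-Za-z])(?=.*\d)")
# Any character that is not a word character or ordinary punctuation.
_ODD_CHARS = re.compile(r"[^\w\s.,;:!?'\"()\-]")


def suspicious_tokens(text: str) -> List[str]:
    """Return tokens that are worth checking against the glyph rendering."""
    flagged = []
    for token in _TOKEN.findall(text):
        core = token.strip(".,;:!?'\"()")  # ignore surrounding punctuation
        if not core:
            continue
        if _LETTER_DIGIT_MIX.match(core) or _ODD_CHARS.search(core):
            flagged.append(core)
    return flagged


sample = "The c0efficient of x is |arge, see eq. 12."
print(suspicious_tokens(sample))  # ['c0efficient', '|arge']
```

A real spell checker will catch far more than this; the sketch only narrows down where the side-by-side comparison is most likely to pay off.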

Last edited by DoctorOhh; 07-29-2012 at 11:42 PM.
08-04-2012, 11:19 AM   #4
loyola
Thanks! I was not aware of the existence of this format. Converting to PDF via Calibre is a good solution (I want to avoid the annoying proprietary viewer).
All times are GMT -4.