Topaz Reconsidered

01-21-2011, 10:55 PM
Topaz is an ebook format Amazon uses to quickly convert scanned paper books (e.g. those scanned for its "look inside the book" feature) to ebooks. It is in principle an interesting approach to ebooks, since it does reflow and can be searched even though it is very "close" to the paper original. However, in practice the quality of what you get on the screen is highly variable - sometimes beautiful typography but much more often unreadable (if you are at all sensitive to typeface blemishes).

So I have been avoiding Topaz ever since I got my Kindle 1. More recently I switched to non-Kindle devices and the weirdness of Topaz prevented DRM stripping and easy format conversion. Heroic efforts of reverse engineering made conversion to other formats possible over time, but I still considered them too "hands on" for routine use.

This is a major issue, because when there is a Topaz it is very often the only legal ebook version available.

I am delighted to report that, in my opinion, Topaz format shifting is now "good enough" so that you need not worry about whether a Kindle ebook is AZW or TPZ if you are buying it for the purpose of format shifting to an ePub-based reading device. As usual Apprentice Alf's Blog is the place to go to explore DRM issues. I use stand-alone DRM-stripping tools and the Calibre ebook-convert command line, but the 3rd party Calibre plugin for MOBI/AZW/TPZ will be the easiest approach for many. In fact, if the Topaz looks crappy on your Kindle device/app screen you may be better off having Calibre convert it to MOBI.
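For anyone who wants to script the command-line route, here is a minimal sketch of driving ebook-convert from Python. It assumes Calibre is installed with ebook-convert on the PATH and that the input is already DRM-free. The function names and any file names you pass in are made up for illustration. ebook-convert takes the input file followed by the output file and picks the output format from the output file's extension.

```python
import shutil
import subprocess


def build_convert_command(source, target):
    """Return the argument list for Calibre's ebook-convert CLI.

    The output format is chosen from the extension of `target`.
    """
    return ["ebook-convert", str(source), str(target)]


def convert(source, target):
    """Run the conversion; return True on success, False otherwise."""
    if shutil.which("ebook-convert") is None:
        return False  # Calibre is not installed or not on PATH
    result = subprocess.run(build_convert_command(source, target))
    return result.returncode == 0
```

Calling convert("mybook.zip", "mybook.epub") with a hypothetical input name would ask for an ePub, and changing the target extension to .mobi would ask for MOBI instead.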

The only downside of the conversion is that it produces three ZIP files, and the second and third are very big. I only use the first one, which is based on the OCRed text. The conversion isn't perfect: some bolding and italics are lost, chapter headings are often images (which prevents Calibre from detecting chapters for the TOC), and front matter is usually images and can be poorly laid out.

I bought half a dozen Topaz ebooks by mistake over the years, and they have sat unread. I also had a long list of Topaz ebooks on my wishlist, on the off chance that they got replaced by an AZW (which does happen occasionally). Many of these are now sitting on my ebook reader as ePubs. Spot checks have not unearthed any problems, and the two I have read were almost perfect (fewer OCR errors than the typical AZW from paper).

Have my low expectations got the better of me? I don't miss bold and italics much, and I hated bad typefaces. I would still buy the average AZW or ePub over the TPZ any day, but when the Topaz is the only choice and I really want the ebook it is now a viable option.

01-22-2011, 02:17 AM
I've always considered Topaz to be a format with potential, despite its very definite flaws and instability when actually used.

The problem with it is that people don't use it to make the sort of ebooks that aren't possible with Mobi:

stuff with carefully arranged images amongst the text like certain illustrated editions,
anything requiring fancy typographical layout like some forms of poetry,
foreign language and invented scripts not covered by the current set of fonts which would look very bad rendered in tiny little inline gifs,
books with complex data tables in them that turn into squinty unreadability under Mobi's max 128kb image rule (would need a zoom and pan function, though)

Also, the format is kind of lacking in standard e-reading features (I've never seen a Topaz book with flickable chapter markers and many seem to be missing Tables of Contents), and would also be vastly improved if the makers could only link a proper corresponding text version to the book instead of the auto-OCR.

In fact, it would be nice if Topaz were a bit more like DjVu and people could opt to see the scanned book as-is, and also view a proper plain e-text redaction if they choose. Maybe even be able to toggle the display between the two.

In any case, I'd like it if Amazon were to one day release tools for ordinary people to make their own Topaz books, perhaps starting from a PDF or SVG document.

Because there's a lot that could be potentially done with the format, especially if Amazon remains adamant about not supporting ePub while also not improving Mobi; it's just a shame Topaz mostly gets used for books which don't need it and its more useful features do no one any good.

01-22-2011, 05:48 PM

If you look in the you can actually see what the Topaz internal format looks like. The file format is XML-based, and even the glyphs themselves are described by XML files. As for reflowing, all but "fixed" regions can be reflowed. To save space, these XML files are encoded as binary data with offsets into a dictionary, which is stored as the dict0000.dat file, and the encoded files are stored as records in the Topaz file.
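To give a feel for the general idea of replacing repeated XML names with numbers that point into a shared dictionary, here is a rough sketch. It is not the actual Topaz byte layout, which is not described here. The token names, the use of list indices rather than byte offsets, and the 7-bit variable-length integer encoding are all assumptions chosen for illustration.

```python
def build_dictionary(tokens):
    """Assign each distinct token an index, in order of first appearance."""
    table = []
    index = {}
    for tok in tokens:
        if tok not in index:
            index[tok] = len(table)
            table.append(tok)
    return table, index


def encode_varint(n):
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode(tokens, index):
    """Replace every token by the varint of its dictionary index."""
    return b"".join(encode_varint(index[t]) for t in tokens)


def decode(data, table):
    """Rebuild the token sequence from the bytes and the dictionary."""
    tokens, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating
        else:
            tokens.append(table[value])
            value, shift = 0, 0
    return tokens
```

With a small dictionary every token costs a single byte, which is where the space saving over spelling out each tag name comes from.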

What I would like to do is combine the ocrText field info with the svg image of the page to create an image-based pdf file with text searching capabilities. Unfortunately, I know of no open source pdf creation software that will generate an image-based pdf with ocrText information embedded inside of it for searching.

If anyone knows of Perl, Python, or even C/C++ based code that can convert a sequence of svg images along with ocrText information into a searchable image-based pdf, I would love to hear about it.
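On the pdf side, the usual trick is to paint the page image and then draw the text over it in text rendering mode 3, which PDF defines as invisible (neither filled nor stroked), so it can be searched but not seen. Below is a minimal single-page sketch in plain Python. It assumes the svg has already been rasterized to raw 8-bit grayscale pixels, uses a US Letter page, and drops all the text at one arbitrary position rather than per word. The function name and those layout choices are mine, not anything from the Topaz tools.

```python
def build_searchable_page(gray_pixels, width, height, text):
    """Return PDF bytes: one page with a grayscale image and invisible text."""
    # Escape characters that are special inside a PDF literal string.
    safe = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content = (
        "q 612 0 0 792 0 0 cm /Im0 Do Q\n"  # stretch the image over the page
        "BT 3 Tr /F1 12 Tf 72 720 Td (%s) Tj ET\n" % safe  # 3 Tr = invisible
    ).encode("latin-1")  # sketch limitation: Latin-1 text only
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> /XObject << /Im0 5 0 R >> >> "
        b"/Contents 6 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace /DeviceGray /BitsPerComponent 8 /Length %d >>\n"
        b"stream\n%s\nendstream" % (width, height, len(gray_pixels), gray_pixels),
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content),
    ]
    pdf = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(pdf))  # byte offset of this object for the xref table
        pdf += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_start = len(pdf)
    pdf += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
    for off in offsets:
        pdf += b"%010d 00000 n \n" % off
    pdf += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, xref_start)
    return bytes(pdf)
```

A real converter would need one page object per scanned page, an image scaled to its true aspect ratio, and each word positioned at its own coordinates so that search highlights land in the right place.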

I do believe someone could take a regular ebook/xhtml file and use svg fonts to encode each letter (using a one to one mapping) and actually create an xml file describing each page of the book and an xml file for each glyph in the fonts and combine them to create a Topaz book of some sort.

It would take some code to do it but I do think it could be done. The nice thing is that the actual text of the xhtml file can be used in place of OCRText so there should be no errors.
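As a very rough sketch of the one-to-one mapping idea, the snippet below assigns a glyph id to each distinct character and writes a page description in which every word carries both its glyph ids and its exact text. The element and attribute layout is invented for illustration (only the ocrText name is borrowed from the field mentioned above) and it makes no claim to match the real Topaz page or glyph files.

```python
import xml.etree.ElementTree as ET


def build_glyph_table(text):
    """Map each distinct non-space character to a glyph id, in order of first use."""
    table = {}
    for ch in text:
        if not ch.isspace() and ch not in table:
            table[ch] = len(table)
    return table


def page_to_xml(text, table):
    """Describe one page as words made of glyph ids (hypothetical element names)."""
    page = ET.Element("page")
    for word in text.split():
        # The exact source text stands in for OCR output, so it cannot contain OCR errors.
        node = ET.SubElement(page, "word", ocrText=word)
        node.text = " ".join(str(table[ch]) for ch in word)
    return ET.tostring(page, encoding="unicode")
```

Each glyph id would then need its own outline description taken from the font, which is the part that would take the real work.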