|01-21-2011, 10:55 PM||#1|
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3 and Voyage
Topaz is an ebook format Amazon uses to quickly convert scanned paper books (e.g. from its Look Inside the Book feature) to ebooks. It is in principle an interesting approach to ebooks, since it reflows and can be searched even though it stays very "close" to the paper original. However, in practice the quality of what you get on the screen is highly variable: sometimes beautiful typography, but much more often unreadable (if you are at all sensitive to typeface blemishes).
So I have been avoiding Topaz ever since I got my Kindle 1. More recently I switched to non-Kindle devices and the weirdness of Topaz prevented DRM stripping and easy format conversion. Heroic efforts of reverse engineering made conversion to other formats possible over time, but I still considered them too "hands on" for routine use.
This is a major issue, because when there is a Topaz it is very often the only legal ebook version available.
I am delighted to report that, in my opinion, Topaz format shifting is now "good enough" so that you need not worry about whether a Kindle ebook is AZW or TPZ if you are buying it for the purpose of format shifting to an ePub-based reading device. As usual Apprentice Alf's Blog is the place to go to explore DRM issues. I use stand-alone DRM-stripping tools and the Calibre ebook-convert command line, but the 3rd party Calibre plugin for MOBI/AZW/TPZ will be the easiest approach for many. In fact, if the Topaz looks crappy on your Kindle device/app screen you may be better off having Calibre convert it to MOBI.
The only downside of the conversion is that it produces three ZIP files, and the 2nd and 3rd are very big. I only use the 1st one, which is based on the OCRed text. Conversion isn't perfect. Some bolding and italics are lost, chapter headings are often images (which prevents Calibre from detecting chapters for the TOC), and front matter is usually images and can be poorly laid out.
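For anyone who wants to script that last step, here is a rough sketch of driving Calibre's ebook-convert from Python. It only relies on the basic "input file, then output file" form of the command, with the output format taken from the file extension. The file name in the usage note is made up, and this is just one way to wrap the call, not a description of how any of the existing tools do it.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target_ext=".epub"):
    """Return the ebook-convert argument list for one input file.

    The output sits next to the input with the extension swapped;
    ebook-convert picks the output format from that extension.
    """
    src = Path(source)
    dst = src.with_suffix(target_ext)
    return ["ebook-convert", str(src), str(dst)]


def convert(source, target_ext=".epub"):
    """Run the conversion; raises if Calibre's CLI is not on PATH."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    cmd = build_convert_command(source, target_ext)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[2])
```

Calling `convert("mybook_nodrm.zip")` (hypothetical name for the 1st, OCR-based ZIP) would write `mybook_nodrm.epub` alongside it; pass `".mobi"` as the second argument for a MOBI instead.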
I bought half a dozen Topaz ebooks by mistake over the years, and they have sat unread. I also had a long list of Topaz ebooks on my wishlist, on the off chance that they got replaced by an AZW (which does happen occasionally). Many of these are now sitting on my ebook reader as ePubs. Spot checks have not unearthed any problems, and the two I have read were almost perfect (fewer OCR errors than the typical AZW from paper).
Have my low expectations got the better of me? I don't miss bold and italics much, and I hated bad typefaces. I would still buy the average AZW or ePub over the TPZ any day, but when the Topaz is the only choice and I really want the ebook it is now a viable option.
|01-22-2011, 02:17 AM||#2|
Join Date: Mar 2010
Device: Kindle 2 International & Sony PRS-T1 & BlackBerry PlayBook
I've always considered Topaz to be a format with potential, despite its very definite flaws and instability when actually used.
The problem with it is that people don't use it to make the sort of ebooks that aren't possible with Mobi:
In fact, it would be nice if Topaz were a bit more like DjVu and people could opt to see the scanned book as-is, and also view a proper plain e-text redaction if they choose. Maybe even be able to toggle the display between the two.
In any case, I'd like it if Amazon were to one day release tools for ordinary people to make their own Topaz books, perhaps starting from a PDF or SVG document.
Because there's a lot that could be potentially done with the format, especially if Amazon remains adamant about not supporting ePub while also not improving Mobi; it's just a shame Topaz mostly gets used for books which don't need it and its more useful features do no one any good.
|01-22-2011, 05:48 PM||#3|
Join Date: Nov 2009
If you look in the _XML.zip you can actually see what the Topaz internal format looks like. The file format is xml based and even the glyphs themselves are described by xml files. As for reflowing, all but "fixed" regions can be reflowed. To save space, these xml files are encoded as binary data with offsets into a dictionary (stored as the dict0000.dat file) and are then stored as records in the Topaz file.
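To give a feel for the dictionary idea, here is a toy sketch in Python. It is NOT the real Topaz byte layout: the fixed 16-bit little-endian indexes and the token names are assumptions made up for illustration. It only shows the general trick of replacing repeated xml names with small numbers that point into a shared string table.

```python
import struct


def encode(tokens, dictionary):
    """Pack each token as a 16-bit index into the shared dictionary."""
    index = {word: i for i, word in enumerate(dictionary)}
    return struct.pack(f"<{len(tokens)}H", *(index[t] for t in tokens))


def decode(blob, dictionary):
    """Reverse of encode: turn 16-bit indexes back into strings."""
    count = len(blob) // 2
    return [dictionary[i] for i in struct.unpack(f"<{count}H", blob)]


# Hypothetical string table and token stream, for illustration only.
dictionary = ["page", "region", "glyph", "fixed"]
tokens = ["page", "region", "glyph", "glyph", "glyph"]
blob = encode(tokens, dictionary)
```

Here five names shrink to ten bytes, and `decode(blob, dictionary)` gives the original list back. The saving grows with how often the same names repeat, which in page-description xml is a lot.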
What I would like to do is combine the ocrText field info with the svg image of the page to create an image-based pdf file with text searching capabilities. Unfortunately, I know of no open source pdf creation software that will generate an image-based pdf with ocrText information embedded inside of it for searching.
If anyone knows of perl, python, or even C/C++ based code that can convert a sequence of svg images along with ocrText information into a searchable image based pdf, I would love to hear about it.
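One piece of that job can be sketched without picking a pdf library: mapping a word's position on the page image to pdf coordinates, so the word could be drawn on top of the image as invisible text (text rendering mode 3 in the pdf spec). The sketch below assumes each OCR word comes with a pixel bounding box measured from the top-left corner of the page image; whether the Topaz data exposes exactly that is an assumption, and `WordBox` is a made-up helper type.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """Hypothetical OCR word with a pixel box (origin at top-left)."""
    text: str
    x: float  # left edge in pixels
    y: float  # top edge in pixels
    w: float  # width in pixels
    h: float  # height in pixels


def to_pdf_points(box, page_height_px, dpi):
    """Convert a pixel box to pdf user space (points, origin bottom-left).

    Returns (x, y, font_size): the left edge, the bottom of the box
    (used as an approximate baseline), and a font size equal to the
    box height.
    """
    scale = 72.0 / dpi  # 72 points per inch
    x_pt = box.x * scale
    y_pt = (page_height_px - box.y - box.h) * scale  # flip the y axis
    font_size = box.h * scale
    return x_pt, y_pt, font_size
```

For a word at pixel (300, 600), 30 pixels tall, on a 3300 pixel high page scanned at 300 dpi, this works out to about (72, 640.8) with a 7.2 point font. Actually writing the image and the hidden text into a pdf would still need a pdf library.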
I do believe someone could take a regular ebook/xhtml file and use svg fonts to encode each letter (using a one to one mapping) and actually create an xml file describing each page of the book and an xml file for each glyph in the fonts and combine them to create a Topaz book of some sort.
It would take some code to do it but I do think it could be done. The nice thing is that the actual text of the xhtml file can be used in place of OCRText so there should be no errors.
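A very rough sketch of the one-to-one mapping idea, in Python. The element and attribute names (`page`, `glyph`, `id`, `ocr`) are invented for illustration and are not the real Topaz schema; the point is only that each distinct character gets one glyph id, and a page becomes a sequence of glyph references that carries the true text along with it.

```python
import xml.etree.ElementTree as ET


def build_glyph_table(text):
    """Assign one glyph id per distinct character, in order of first use."""
    table = {}
    for ch in text:
        if ch not in table:
            table[ch] = len(table)
    return table


def page_to_xml(text, table):
    """Describe a page as glyph references (hypothetical element names)."""
    page = ET.Element("page")
    for ch in text:
        # The source character doubles as the "OCR" text, so it is exact.
        ET.SubElement(page, "glyph", id=str(table[ch]), ocr=ch)
    return ET.tostring(page, encoding="unicode")


table = build_glyph_table("abca")
xml = page_to_xml("ab", table)
```

A real converter would also need an xml file per glyph holding the svg outline, plus positions for each glyph on the page, which this sketch leaves out.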