Decrypted Topaz Support - time to revisit?

ldolse · 06-08-2010, 09:40 PM

I recently purchased/downloaded several free books from Amazon and was dismayed to discover a large percentage of them were in Topaz format. Apparently the format is gaining popularity on Amazon, and many of the Topaz titles are regular novels.

After checking out the regular sources on the DRM status I was delighted to discover that not only was the DRM cracked, but there are several python scripts to extract the metadata and convert the topaz formatting to SVG/HTML. It's not perfect yet, but it's pretty damned close.

I know decrypting topaz in Calibre is a non-starter, but the decryption script is separate from the scripts which convert topaz to html. Is it possible to make just the conversion scripts part of the official Calibre package? This way I can load decrypted topaz into Calibre and get the Metadata imported and convert to epub in a couple clicks.

DoctorOhh · 06-08-2010, 09:56 PM

I have no problem dragging the html in and converting it.

What I would like though is a way to convert the SVG output of the scripts to ePub, since the html output is usually an unproofed (error-laden) OCR copy designed for text search of the original document.

ldolse · 06-08-2010, 10:17 PM

I don't have any problems importing the html either from a technical perspective - it's more about convenience. Right now it requires running three separate scripts, creating temporary output directories, cleaning up the mess after you're done converting, manually editing the metadata/cover in Calibre, etc.
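
For what it's worth, the scratch-directory housekeeping could be wrapped up in a few lines of Python. This is only a rough sketch: `convert_in_scratch_dir` and its parameters are made-up names, and the actual script commands are supplied by the caller rather than assumed here.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def convert_in_scratch_dir(steps, wanted, dest_dir):
    """Run each command inside a throwaway directory and keep only `wanted` files.

    steps    -- list of argv lists, one per conversion script (caller-supplied)
    wanted   -- file names, relative to the scratch directory, worth keeping
    dest_dir -- where the kept files are copied before the scratch dir is deleted
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    kept = []
    # TemporaryDirectory removes the directory and its contents on exit.
    with tempfile.TemporaryDirectory() as scratch:
        for argv in steps:
            # check=True stops the chain if any script exits with an error.
            subprocess.run(argv, cwd=scratch, check=True)
        for name in wanted:
            src = Path(scratch) / name
            if src.is_file():
                target = dest / Path(name).name
                shutil.copy2(src, target)
                kept.append(target)
    return kept
```

Since every command runs with the scratch directory as its working directory, the script and input paths in `steps` would need to be absolute.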

That and once the conversion scripts are built into Calibre it's a simple matter of integrating a separate input plugin that's not supported by Kovid or Mobileread to handle the decryption so the whole process is drag and drop.

The SVG to epub is probably not a bad option (couldn't you just import all those svg xhtml files into Sigil?). I didn't spend any time examining the SVG output directly - but that functionality doesn't really exist yet, whereas the other scripts are already done. I don't fully understand how the Topaz format works in terms of displaying the SVG vs. OCR. On the iPhone most things appear to be rendered as text, allowing selecting, font resize, etc - but other content is clearly an image which allows zooming and panning.

kovidgoyal · 06-08-2010, 10:25 PM

It's certainly doable. Unfortunately, I have a lot of higher priority stuff I need to get through for the next little while.

This isn't very high priority for me, because I've never come across a topaz book you couldn't get in another format (I'm sure there are some), they just aren't in my reading list.

ldolse · 06-08-2010, 10:40 PM

No worries, if there are no objections then I'll just create an FR on your issue tracker.

I wouldn't have considered it a big deal a week ago either, I'd never downloaded a Topaz book aside from a guidebook I checked out a long time ago. Then over the last week I downloaded 23 books (3 or 4 purchased, the rest some of the latest free offerings). 6 of the 23 are Topaz. That high a percentage is making me think there has been some shifting in the back end conversion systems Amazon is using/recommending with the publishers....

JMikeD · 06-08-2010, 10:55 PM

Quote:

Originally Posted by ldolse

That high a percentage is making me think there has been some shifting in the back end conversion systems Amazon is using/recommending with the publishers....

That may be too small a sample to make that call. I know of several books that were originally in Topaz format that were then issued as a regular .azw file some months later. I emailed Amazon and they refunded the money for the Topaz books, and I then re-purchased them in standard format.

I always try to remember to get a sample before I purchase a book from Amazon, that way I can keep from getting a Topaz one.

ldolse · 06-08-2010, 11:14 PM

I used to download samples and check as well, but based on the last year or so of purchases I'd seen a clear trend - textbook/guidebook-like is often Topaz, novels were always mobi. That general rule of thumb has held up for the last year or so, until this last batch of books, which is just the most recent sample from a year of downloads covering ~70 books.

Anyway I'm less concerned about getting Topaz format as the quality of the html conversion from these scripts seems pretty decent at first glance. I'll know better once I actually read the converted content in full.

DoctorOhh · 06-08-2010, 11:47 PM

Quote:

Originally Posted by ldolse

Anyway I'm less concerned about getting Topaz format as the quality of the html conversion from these scripts seems pretty decent at first glance. I'll know better once I actually read the converted content in full.

The quality of the scripts is good. It pulls out the OCR'd text from the book which is pretty good considering it isn't proofed. When you read a Topaz book via any Kindle App it uses the SVG glyphs to present the body of the book and just uses the actual text on the side for searching the book.

You can read up on the creation of topaz from the person (screen-name: Fluffy) who created it in posts 800-812 and on his blog, archived here. Very interesting read.

ldolse · 06-09-2010, 01:36 AM

Hmm - I just went through some of the xhtml with the SVG data. For some of the content - title pages, contents, copyrights, etc - it would make a lot of sense to use this instead of the OCR. It looks like it might be as simple as dropping the xhtml files from the script into the appropriate points of the book using Sigil... Will give it a go.

There is some javascript stuff there for zooming in/out and changing pages - I'm not sure if this got added by the script or if the original content used it. Anyway I think the readers ignore javascript if I recall. The other speedbump is that a fair number of the xhtml files don't contain anything of value, at least in the one book I looked at.

It does seem like an ideal option in the long run would be to provide an option for two different types of epubs, one that bases the output on the OCR'd text, and another that bases it off the SVG output. Really curious if this sort of SVG content will wind up being fully compatible with the various epub renderers.

The info by the original developer was good - I'd already read his blog post, but I didn't see him participating in the other discussion before. I don't quite get some of the comments regarding dealing with layout, as these scripts do a great job of extracting images and putting them in the right places with html, and I also don't really understand how topaz is working from a reflow perspective - reading it on an iPhone or a Kindle you wouldn't have any idea that the native format/view for this data is the original scanned page.

ldolse · 06-09-2010, 06:44 AM

This is all off-topic to the original point of the thread, but I did a few tests with the svg xhtml files generated by the script. The Javascript and page changing svg objects all need to be removed, as they cause problems with Adobe DE and the reader. After that the inline css needs to be modified a bit, along with changing the svg size from a fixed 6x9 inches to 100% width/height. With those changes it gets somewhat usable on the PRS-505, but some of the more complicated content has some serious issues rendering.
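
Roughly, the script removal and resizing could be automated along these lines. Treat it as a sketch under assumptions, not what the existing scripts do: `clean_svg_page` is a made-up name, it assumes each page is well-formed XHTML with inline SVG, and it leaves the page changing objects and the css tweaks alone because how to identify them depends on the script's actual output.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Keep readable prefixes when the tree is written back out.
ET.register_namespace("", XHTML_NS)
ET.register_namespace("svg", SVG_NS)


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def clean_svg_page(xhtml_text):
    """Return the page with scripts removed and every <svg> sized to 100%.

    Hypothetical helper: assumes the input parses as XML.
    """
    root = ET.fromstring(xhtml_text)
    for element in list(root.iter()):
        # Drop <script> children, whichever namespace they were declared in.
        for child in list(element):
            if local_name(child.tag) == "script":
                element.remove(child)
        # Drop inline event handlers such as onload/onclick.
        for attr in [a for a in element.attrib if a.startswith("on")]:
            del element.attrib[attr]
    # Replace the fixed page size with a size relative to the viewport.
    for svg in root.iter("{%s}svg" % SVG_NS):
        svg.set("width", "100%")
        svg.set("height", "100%")
    return ET.tostring(root, encoding="unicode")
```

Real files with a DOCTYPE or HTML entities may need a more forgiving parser than `xml.etree.ElementTree`.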

I was using the title Infoquake from Amazon, which has a title page with the publisher's logo at the bottom of the page. This renders fine in Safari, Firefox, and Sigil, but Adobe DE somehow renders the logo at the top of the page instead of the bottom.

From a performance perspective there are also problems directly using the SVG files output by the script. Anything with a lot of elements - like a page of text - takes a long time to render on the reader. I tried this with the copyright page, as this was a bit of a disaster with the OCR converted version - rendering that probably took a good 30 seconds.

So it looks like the SVG version might be good for using as a reference when doing corrections on a regular computer, but I don't think the horsepower is there to use it on a device. That and you lose all the reflow/reformat features that exist in the Topaz format when it's converted to SVG.

DoctorOhh · 06-09-2010, 07:19 AM

Quote:

Originally Posted by ldolse

So it looks like the SVG version might be good for using as a reference when doing corrections on a regular computer, but I don't think the horsepower is there to use it on a device. That and you lose all the reflow/reformat features that exist in the Topaz format when it's converted to SVG.

That's what I had guessed. But it is a great original source for correcting any OCR errors.

ldolse · 06-09-2010, 07:45 AM

Quote:

Originally Posted by dwanthny

That's what I had guessed. But it is a great original source for correcting any OCR errors.

Absolutely agree - I don't think it would be much work for their current svg output script to output directly to epub. All that's really required is an OPF file listing out all the xhtml files that are already being created by the script (along with the couple other files to meet the epub spec) and zipping the whole package. On higher power devices like the iPad (which is also 'not' using ADE) this might actually be pretty good as is.
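
To make that concrete, here is a minimal sketch of the packaging step in Python. The names (`write_epub`, the `OEBPS/` folder, the `page<N>` ids) and the default language of `en` are my own assumptions, the extra files are the `mimetype` entry, `META-INF/container.xml` and a bare-bones `toc.ncx`, and I haven't run the result through a validator.

```python
import zipfile
from xml.sax.saxutils import escape, quoteattr

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

OPF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:identifier id="bookid">{book_id}</dc:identifier>
    <dc:language>{language}</dc:language>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
{manifest}
  </manifest>
  <spine toc="ncx">
{spine}
  </spine>
</package>
"""

NCX_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content={uid_attr}/></head>
  <docTitle><text>{title}</text></docTitle>
  <navMap>
    <navPoint id="nav0" playOrder="1">
      <navLabel><text>Start</text></navLabel>
      <content src={first_page}/>
    </navPoint>
  </navMap>
</ncx>
"""


def write_epub(target, title, book_id, pages, language="en"):
    """Zip the pages into an epub; pages is an ordered, non-empty list of
    (file_name, xhtml_text) tuples. Hypothetical helper, not part of the scripts."""
    names = [name for name, _ in pages]
    # One manifest item and one spine entry per xhtml page, in reading order.
    manifest = "\n".join(
        '    <item id="page{0}" href={1} media-type="application/xhtml+xml"/>'.format(
            i, quoteattr(name))
        for i, name in enumerate(names))
    spine = "\n".join(
        '    <itemref idref="page{0}"/>'.format(i) for i in range(len(names)))
    opf = OPF_TEMPLATE.format(title=escape(title), book_id=escape(book_id),
                              language=escape(language),
                              manifest=manifest, spine=spine)
    ncx = NCX_TEMPLATE.format(uid_attr=quoteattr(book_id), title=escape(title),
                              first_page=quoteattr(names[0]))
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry has to come first and be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/toc.ncx", ncx)
        for name, text in pages:
            zf.writestr("OEBPS/" + name, text)
```

Images or stylesheets referenced by the pages would also need their own manifest entries, which this sketch leaves out.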

DoctorOhh · 06-09-2010, 08:01 AM

Quote:

Originally Posted by ldolse

On higher power devices like the iPad (which is also 'not' using ADE) this might actually be pretty good as is.

True, but the iPad already reads topaz files via Kindle4iPad app.

Starson17 · 06-10-2010, 03:01 PM

Quote:

Originally Posted by dwanthny

The quality of the scripts is good. It pulls out the OCR'd text from the book which is pretty good considering it isn't proofed. When you read a Topaz book via any Kindle App it uses the SVG glyphs to present the body of the book and just uses the actual text on the side for searching the book.

You can read up on the creation of topaz ... Very interesting read.

Dwanthny,

Are the "SVG glyphs" character glyphs, word glyphs or something else? I don't have a Kindle, but trying to read the links you posted, it appeared that Topaz formats were really scanned books, broken down into small images of words or characters to achieve reflow for the Kindle, with OCR text linked to the word/char images for searching. Does it look like that's what it's doing when you read such a book on the Kindle?

DoctorOhh · 06-10-2010, 06:53 PM

Quote:

Originally Posted by Starson17

Dwanthny,

Are the "SVG glyphs" character glyphs, word glyphs or something else? I don't have a Kindle, but trying to read the links you posted, it appeared that Topaz formats were really scanned books, broken down into small images of words or characters to achieve reflow for the Kindle, with OCR text linked to the word/char images for searching. Does it look like that's what it's doing when you read such a book on the Kindle?

I think you have the right idea. I think they are mostly character glyphs, but I am unsure. I don't have a Kindle; when I purchase from Amazon I use Kindle4PC and remove the DRM. Higher up in this comment area the group discussed all aspects of topaz and glyphs as they tried to unravel the format.
