06-08-2010, 09:40 PM | #1 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Decrypted Topaz Support - time to revisit?
I recently purchased/downloaded several free books from Amazon and was dismayed to discover that a large percentage of them were in Topaz format. The format seems to be gaining popularity on Amazon, and many of the affected titles are regular novels.
After checking the usual sources on the DRM status, I was delighted to discover that not only has the DRM been cracked, but there are several Python scripts to extract the metadata and convert the Topaz formatting to SVG/HTML. It's not perfect yet, but it's pretty damned close. I know decrypting Topaz in Calibre is a non-starter, but the decryption script is separate from the scripts that convert Topaz to html. Is it possible to make just the conversion scripts part of the official Calibre package? That way I could load decrypted Topaz files into Calibre, get the metadata imported, and convert to epub in a couple of clicks. Last edited by ldolse; 06-08-2010 at 09:42 PM. |
06-08-2010, 09:56 PM | #2 |
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
I have no problem dragging the html in and converting it.
What I would like, though, is a way to convert the SVG output of the scripts to ePub, since the html output is usually an unproofed (error-laden) OCR copy designed for text search of the original document. |
06-08-2010, 10:17 PM | #3 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I don't have any problems importing the html either from a technical perspective; it's more about convenience. Right now it requires running three separate scripts, creating temporary output directories, cleaning up the mess after you're done converting, manually editing the metadata/cover in Calibre, etc.
On top of that, once the conversion scripts are built into Calibre, it becomes a simple matter to add a separate input plugin (one not supported by Kovid or MobileRead) that handles the decryption, so the whole process is drag and drop.
SVG to epub is probably not a bad option (couldn't you just import all those SVG xhtml files into Sigil?). I didn't spend any time examining the SVG output directly, but that functionality doesn't really exist yet, whereas the other scripts are already done. I don't fully understand how the Topaz format works in terms of displaying the SVG vs. the OCR. On the iPhone most things appear to be rendered as text, allowing selection, font resizing, etc., but other content is clearly an image, which allows zooming and panning. Last edited by ldolse; 06-08-2010 at 10:23 PM. |
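The "three scripts, temporary directories, cleanup" routine described in the post above could be wrapped roughly as follows. This is only a sketch: the thread never names the scripts or their arguments, so `run_pipeline`, its `{book}` / `{work}` placeholders, and the idea that the last step leaves a single result file in the work directory are all assumptions made for illustration.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_pipeline(book, steps, final_name, dest_dir):
    """Run each step in a throwaway work directory and keep one result file.

    steps is a list of argument lists. The placeholders "{book}" and
    "{work}" (hypothetical, not part of any real script) are replaced with
    the input file and the temporary work directory before each step runs.
    """
    book = Path(book).resolve()
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # TemporaryDirectory deletes the work directory on exit, even on errors.
    with tempfile.TemporaryDirectory() as work:
        for step in steps:
            cmd = [
                arg.replace("{book}", str(book)).replace("{work}", work)
                for arg in step
            ]
            # check=True stops the pipeline as soon as one step fails.
            subprocess.run(cmd, check=True, cwd=work)
        # ASSUMPTION: the last step writes final_name into the work directory.
        result = Path(work) / final_name
        return Path(shutil.copy2(result, dest_dir / final_name))
```

Each entry in `steps` would be something like `[sys.executable, "some_script.py", "{book}", "{work}"]`, where `some_script.py` stands in for whichever conversion script is being run. Only the copied result survives; the intermediate files disappear with the work directory.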
06-08-2010, 10:25 PM | #4 |
creator of calibre
Posts: 44,337
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It's certainly doable. Unfortunately, I have a lot of higher-priority stuff I need to get through for the next little while.
This isn't very high priority for me, because I've never come across a Topaz book you couldn't get in another format. I'm sure there are some; they just aren't on my reading list. |
06-08-2010, 10:40 PM | #5 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
No worries. If there are no objections, I'll just create an FR on your issue tracker.
I wouldn't have considered it a big deal a week ago either; I'd never downloaded a Topaz book aside from a guidebook I checked out a long time ago. Then over the last week I downloaded 23 books (3 or 4 purchased, the rest some of the latest free offerings), and 6 of the 23 are Topaz. A percentage that high makes me think there has been some shift in the back-end conversion systems Amazon is using or recommending to publishers. |
06-08-2010, 10:55 PM | #6 | |
Evangelist
Posts: 473
Karma: 15000
Join Date: Jul 2008
Device: Various and sundry
|
Quote:
I always try to remember to get a sample before I purchase a book from Amazon; that way I can avoid getting a Topaz one. |
|
06-08-2010, 11:14 PM | #7 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I used to download samples and check as well, but based on the last year or so of purchases I'd seen a clear trend: textbook- or guidebook-like titles were often Topaz, while novels were always mobi. That rule of thumb held up until this last batch of books, which is just the most recent sample in a year of downloads covering ~70 books.
Anyway, I'm less concerned about getting Topaz format now, as the quality of the html conversion from these scripts seems pretty decent at first glance. I'll know better once I actually read the converted content in full. |
06-08-2010, 11:47 PM | #8 | |
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
You can read up on the creation of topaz from the person (screen-name: Fluffy) who created it in posts 800-812 and on his blog, archived here. Very interesting read. Last edited by DoctorOhh; 03-11-2014 at 03:59 AM. |
|
06-09-2010, 01:36 AM | #9 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Hmm - I just went through some of the xhtml with the SVG data. For some of the content (title pages, contents, copyrights, etc.) it would make a lot of sense to use this instead of the OCR. It looks like it might be as simple as dropping the xhtml files from the script into the appropriate points of the book using Sigil... Will give it a go.
There is some JavaScript in there for zooming in/out and changing pages. I'm not sure whether it was added by the script or was in the original content. Anyway, I think the readers ignore JavaScript, if I recall correctly. The other speed bump is that a fair number of the xhtml files don't contain anything of value, at least in the one book I looked at.
In the long run, the ideal would seem to be an option for two different types of epub: one that bases the output on the OCR'd text, and another that bases it on the SVG output. I'm really curious whether this sort of SVG content will wind up being fully compatible with the various epub renderers.
The info from the original developer was good. I'd already read his blog post, but I hadn't seen him participating in the other discussion before. I don't quite get some of the comments about dealing with layout, as these scripts do a great job of extracting images and putting them in the right places with html. I also don't really understand how Topaz works from a reflow perspective: reading it on an iPhone or a Kindle, you wouldn't have any idea that the native format/view for this data is the original scanned page. Last edited by ldolse; 06-09-2010 at 01:47 AM. |
06-09-2010, 06:44 AM | #10 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
This is all off-topic to the original point of the thread, but I did a few tests with the SVG xhtml files generated by the script. The JavaScript and the page-changing SVG objects all need to be removed, as they cause problems with Adobe DE and the reader. After that the inline CSS needs to be modified a bit, and the svg size changed from a fixed 6x9 inches to 100% width/height. With those changes it gets somewhat usable on the PRS-505, but some of the more complicated content has serious rendering issues.
I was using the title Infoquake from Amazon, which has a title page with the publisher's logo at the bottom of the page. This renders fine in Safari, Firefox, and Sigil, but Adobe DE somehow renders the logo at the top of the page instead. There are also performance problems with directly using the SVG files output by the script. Anything with a lot of elements, like a page of text, takes a long time to render on the reader. I tried this with the copyright page, as it was a bit of a disaster in the OCR-converted version, and rendering it probably took a good 30 seconds. So it looks like the SVG version might be good as a reference when doing corrections on a regular computer, but I don't think the horsepower is there to use it on a device. On top of that, you lose all the reflow/reformat features of the Topaz format when it's converted to SVG. |
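The cleanup steps from the post above (drop the JavaScript and page-changing objects, resize the svg to 100%) might look something like this with Python's standard `xml.etree.ElementTree`. It is a sketch under stated assumptions, not the script's actual behaviour: the thread does not show the generated markup, so treating anything with an `onclick` attribute as a page-changing control is a guess, and the inline CSS changes are left out because the post does not say what they were.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Keep readable prefixes on output instead of ns0, ns1, ...
ET.register_namespace("", XHTML_NS)
ET.register_namespace("svg", SVG_NS)
ET.register_namespace("xlink", XLINK_NS)


def local_name(element):
    """Return the tag name without its {namespace} prefix."""
    return element.tag.rsplit("}", 1)[-1]


def clean_svg_page(xhtml_text):
    """Strip scripting from one SVG xhtml page and make the svg fill the view."""
    root = ET.fromstring(xhtml_text)
    # Snapshot the tree first so removing children does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            # ASSUMPTION: page-changing controls are the elements carrying
            # an onclick handler; the real markup may mark them differently.
            if local_name(child) == "script" or "onclick" in child.attrib:
                parent.remove(child)
    for element in root.iter():
        # Drop leftover event-handler attributes such as onload.
        for name in [n for n in element.attrib if n.startswith("on")]:
            del element.attrib[name]
        if local_name(element) == "svg":
            # Replace the fixed page size with a size relative to the screen.
            element.set("width", "100%")
            element.set("height", "100%")
    return ET.tostring(root, encoding="unicode")
```

One caveat with this approach: `ElementTree` rejects undeclared entities such as `&nbsp;`, so pages that rely on the XHTML DTD for those would need a different parser.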
06-09-2010, 07:19 AM | #11 | |
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
|
|
06-09-2010, 07:45 AM | #12 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Absolutely agree. I don't think it would be much work for their current SVG output script to output directly to epub: all that's really required is an OPF file listing all the xhtml files the script already creates (along with the couple of other files needed to meet the epub spec) and zipping the whole package. On higher-powered devices like the iPad (which is also not using ADE), this might actually be pretty good as is.
Last edited by ldolse; 06-09-2010 at 07:47 AM. |
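The packaging step described in the post above (an OPF listing the xhtml files, the couple of other required files, then zipping) can be sketched with Python's `zipfile` module. Assumptions are flagged in the comments: the function name `build_epub` is made up, the pages are taken to sort into reading order by file name, to have simple names that need no URL-escaping, and to be self-contained, and the result targets a bare-bones EPUB 2 layout that has not been run through a validator.

```python
import uuid
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

OPF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:language>{language}</dc:language>
    <dc:identifier id="bookid">{book_id}</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
{items}
  </manifest>
  <spine toc="ncx">
{itemrefs}
  </spine>
</package>
"""

NCX_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="{book_id}"/></head>
  <docTitle><text>{title}</text></docTitle>
  <navMap>
    <navPoint id="start" playOrder="1">
      <navLabel><text>Start</text></navLabel>
      <content src={first}/>
    </navPoint>
  </navMap>
</ncx>
"""


def build_epub(page_dir, epub_path, title, language="en"):
    """Zip the .xhtml pages in page_dir into a minimal EPUB 2 container."""
    # ASSUMPTION: file names sort into reading order (e.g. page0001.xhtml).
    pages = sorted(Path(page_dir).glob("*.xhtml"))
    if not pages:
        raise ValueError(f"no .xhtml files found in {page_dir}")
    book_id = uuid.uuid4().urn  # random placeholder identifier
    items = "\n".join(
        f'    <item id="page{i}" href={quoteattr(p.name)} media-type="application/xhtml+xml"/>'
        for i, p in enumerate(pages, 1)
    )
    itemrefs = "\n".join(
        f'    <itemref idref="page{i}"/>' for i in range(1, len(pages) + 1)
    )
    fields = dict(title=escape(title), language=escape(language), book_id=book_id)
    opf = OPF_TEMPLATE.format(items=items, itemrefs=itemrefs, **fields)
    ncx = NCX_TEMPLATE.format(first=quoteattr(pages[0].name), **fields)
    with zipfile.ZipFile(epub_path, "w", zipfile.ZIP_DEFLATED) as epub:
        # The mimetype entry must come first and must be stored uncompressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml", CONTAINER_XML)
        epub.writestr("OEBPS/content.opf", opf)
        epub.writestr("OEBPS/toc.ncx", ncx)
        for page in pages:
            epub.write(page, f"OEBPS/{page.name}")
    return Path(epub_path)
```

The table of contents here is a single entry pointing at the first page, which is about the least an EPUB 2 reader will accept; a real converter would presumably build it from the book's own chapter data.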
06-09-2010, 08:01 AM | #13 |
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
|
06-10-2010, 03:01 PM | #14 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Are the "SVG glyphs" character glyphs, word glyphs or something else? I don't have a Kindle, but trying to read the links you posted, it appeared that Topaz formats were really scanned books, broken down into small images of words or characters to achieve reflow for the Kindle, with OCR text linked to the word/char images for searching. Does it look like that's what it's doing when you read such a book on the Kindle? |
|
06-10-2010, 06:53 PM | #15 | |
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
|
|
|