Old 06-08-2010, 09:40 PM   #1
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Decrypted Topaz Support - time to revisit?

I recently purchased/downloaded several free books from Amazon and was dismayed to discover that a large percentage of them were in Topaz format. Apparently the format is gaining popularity on Amazon, and many of the titles are regular novels.

After checking out the regular sources on the DRM status I was delighted to discover that not only was the DRM cracked, but there are several python scripts to extract the metadata and convert the topaz formatting to SVG/HTML. It's not perfect yet, but it's pretty damned close.

I know decrypting Topaz in Calibre is a non-starter, but the decryption script is separate from the scripts which convert Topaz to HTML. Is it possible to make just the conversion scripts part of the official Calibre package? That way I could load a decrypted Topaz file into Calibre, get the metadata imported, and convert to epub in a couple of clicks.

Last edited by ldolse; 06-08-2010 at 09:42 PM.
Old 06-08-2010, 09:56 PM   #2
DoctorOhh
US Navy, Retired
 
Posts: 8,861
Karma: 12755553
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
I have no problem dragging the html in and converting it.

What I would like, though, is a way to convert the SVG output of the scripts to ePub, since the HTML output is usually an unproofed (error-laden) OCR copy designed for text search of the original document.
Old 06-08-2010, 10:17 PM   #3
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I don't have any problems importing the html either from a technical perspective - it's more about convenience. Right now it requires running three separate scripts, creating temporary output directories, cleaning up the mess after you're done converting, manually editing the metadata/cover in Calibre, etc.

That, and once the conversion scripts are built into Calibre, it's a simple matter of adding a separate input plugin (one that's not supported by Kovid or MobileRead) to handle the decryption, so the whole process becomes drag and drop.

SVG to epub is probably not a bad option (couldn't you just import all those svg xhtml files into Sigil?). I didn't spend any time examining the SVG output directly, but that functionality doesn't really exist yet, whereas the other scripts are already done. I don't fully understand how the Topaz format works in terms of displaying the SVG vs. OCR. On the iPhone most things appear to be rendered as text, allowing selecting, font resize, etc - but other content is clearly an image which allows zooming and panning.
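
For what it's worth, the manual routine described at the top of this post (several scripts, a temporary output directory, cleanup afterwards) could be wrapped roughly like this. It's only a sketch: the thread doesn't name the actual scripts or their arguments, so `run_pipeline`, `commands`, and `result_name` are made-up names and the caller has to supply the real command lines.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_pipeline(commands, result_name, dest_dir):
    """Run each command in one scratch directory and keep only the named result.

    `commands` is a list of argv lists. The real script names and arguments
    are not given in this thread, so they are left to the caller (assumption).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as work:
        for argv in commands:
            # check=True stops the chain as soon as one step fails.
            subprocess.run(argv, cwd=work, check=True)
        result = Path(work) / result_name
        target = dest / result.name
        if result.is_dir():
            shutil.copytree(result, target)
        else:
            shutil.copy2(result, target)
    # Leaving the with-block deletes the scratch directory and all leftovers.
    return target
```

Each command runs with the scratch directory as its working directory, so intermediate files never land next to the book, and the cleanup step disappears. Editing the metadata/cover in Calibre would still be manual.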

Last edited by ldolse; 06-08-2010 at 10:23 PM.
Old 06-08-2010, 10:25 PM   #4
kovidgoyal
creator of calibre
 
Posts: 26,126
Karma: 5381911
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
It's certainly doable. Unfortunately, I have a lot of higher priority stuff I need to get through for the next little while.

This isn't very high priority for me, because I've never come across a topaz book you couldn't get in another format. I'm sure there are some; they just aren't on my reading list.
Old 06-08-2010, 10:40 PM   #5
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
No worries. If there are no objections, I'll just create an FR on your issue tracker.

I wouldn't have considered it a big deal a week ago either, I'd never downloaded a Topaz book aside from a guidebook I checked out a long time ago. Then over the last week I downloaded 23 books (3 or 4 purchased, the rest some of the latest free offerings). 6 of the 23 are Topaz. That high a percentage is making me think there has been some shifting in the back end conversion systems Amazon is using/recommending with the publishers....
Old 06-08-2010, 10:55 PM   #6
JMikeD
Evangelist
 
Posts: 452
Karma: 15000
Join Date: Jul 2008
Device: Various and sundry
Quote:
Originally Posted by ldolse View Post
That high a percentage is making me think there has been some shifting in the back end conversion systems Amazon is using/recommending with the publishers....
That may be too small a sample to make that call. I know of several books that were originally in Topaz format that were then issued as a regular .azw file some months later. I emailed Amazon and they refunded the money for the Topaz books, and I then re-purchased them in standard format.

I always try to remember to get a sample before I purchase a book from Amazon; that way I can avoid getting a Topaz one.
Old 06-08-2010, 11:14 PM   #7
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I used to download samples and check as well, but over the last year or so of purchases I'd seen a clear trend: textbook/guidebook-like titles were often Topaz, while novels were always mobi. That rule of thumb held up across the ~70 books I downloaded over the past year, until this latest batch.

Anyway I'm less concerned about getting Topaz format as the quality of the html conversion from these scripts seems pretty decent at first glance. I'll know better once I actually read the converted content in full.
Old 06-08-2010, 11:47 PM   #8
DoctorOhh
US Navy, Retired
 
Posts: 8,861
Karma: 12755553
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by ldolse View Post
Anyway I'm less concerned about getting Topaz format as the quality of the html conversion from these scripts seems pretty decent at first glance. I'll know better once I actually read the converted content in full.
The quality of the scripts is good. It pulls out the OCR'd text from the book which is pretty good considering it isn't proofed. When you read a Topaz book via any Kindle App it uses the SVG glyphs to present the body of the book and just uses the actual text on the side for searching the book.

You can read up on the creation of topaz from the person (screen-name: Fluffy) who created it in posts 800-812 and on his blog, archived here. Very interesting read.

Last edited by DoctorOhh; 03-11-2014 at 03:59 AM.
Old 06-09-2010, 01:36 AM   #9
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Hmm - I just went through some of the xhtml with the SVG data. For some of the content - title pages, contents, copyrights, etc - it would make a lot of sense to use this instead of the OCR. It looks like it might be as simple as dropping the xhtml files from the script into the appropriate points of the book using Sigil... Will give it a go.

There is some javascript stuff there for zooming in/out and changing pages - I'm not sure if this got added by the script or if the original content used it. Anyway I think the readers ignore javascript if I recall. The other speedbump is that a fair number of the xhtml files don't contain anything of value, at least in the one book I looked at.

It does seem like an ideal option in the long run would be to provide an option for two different types of epubs, one that bases the output on the OCR'd text, and another that bases it off the SVG output. Really curious if this sort of SVG content will wind up being fully compatible with the various epub renderers.

The info by the original developer was good - I'd already read his blog post, but I hadn't seen him participating in the other discussion before. I don't quite get some of the comments about dealing with layout, as these scripts do a great job of extracting images and putting them in the right places with html. I also don't really understand how Topaz works from a reflow perspective - reading it on an iPhone or a Kindle, you wouldn't have any idea that the native format/view for this data is the original scanned page.

Last edited by ldolse; 06-09-2010 at 01:47 AM.
Old 06-09-2010, 06:44 AM   #10
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
This is all off-topic to the original point of the thread, but I did a few tests with the svg xhtml files generated by the script. The Javascript and the page-changing svg objects all need to be removed, as they cause problems with Adobe DE and the reader. After that the inline css needs to be modified a bit, and the svg size changed from a fixed 6x9 inches to 100% width/height. With those changes it gets somewhat usable on the PRS-505, but some of the more complicated content has serious rendering issues.

I was using the title Infoquake from Amazon, there was a title page with the publisher's logo at the bottom of the page. This renders fine in Safari, Firefox, and Sigil, but Adobe DE somehow renders the logo at the top of the page instead of the bottom.

From a performance perspective there are also problems directly using the SVG files output by the script. Anything with a lot of elements - like a page of text - takes a long time to render on the reader. I tried this with the copyright page, as this was a bit of a disaster with the OCR converted version - rendering that probably took a good 30 seconds.

So it looks like the SVG version might be good for using as a reference when doing corrections on a regular computer, but I don't think the horsepower is there to use it on a device. That and you lose all the reflow/reformat features that exist in the Topaz format when it's converted to SVG.
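
In case anyone wants to repeat the test, the mechanical part of the cleanup described at the top of this post (dropping the script elements and switching the svg size to 100%) can be sketched in Python with the standard library. This is an illustration only, not what the conversion scripts do: `clean_svg_page` is a made-up name, the page-changing svg objects and the inline css tweaks are not handled because the thread doesn't say how to identify them, and it assumes the files parse as well-formed XML.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Use conventional prefixes instead of ns0/ns1 when serializing.
ET.register_namespace("", XHTML_NS)
ET.register_namespace("svg", SVG_NS)


def local_name(tag):
    """Return the tag name without its '{namespace}' part."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def clean_svg_page(xhtml_text):
    """Strip <script> elements and make every <svg> element 100% wide/high."""
    root = ET.fromstring(xhtml_text)
    # ElementTree has no parent pointers, so visit parents and drop children.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == "script":
                parent.remove(child)
    for elem in root.iter():
        if local_name(elem.tag) == "svg":
            elem.set("width", "100%")
            elem.set("height", "100%")
    return ET.tostring(root, encoding="unicode")
```

Note that the output writes the svg elements with an `svg:` prefix, which is equivalent XML but untested against Adobe DE. Scaling to 100% also only behaves sensibly if the svg element carries a viewBox, which I haven't verified for the script output.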
Old 06-09-2010, 07:19 AM   #11
DoctorOhh
US Navy, Retired
 
Posts: 8,861
Karma: 12755553
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by ldolse View Post
So it looks like the SVG version might be good for using as a reference when doing corrections on a regular computer, but I don't think the horsepower is there to use it on a device. That and you lose all the reflow/reformat features that exist in the Topaz format when it's converted to SVG.
That's what I had guessed. But it is a great original source for correcting any OCR errors.
Old 06-09-2010, 07:45 AM   #12
ldolse
Wizard
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by dwanthny View Post
That's what I had guessed. But it is a great original source for correcting any OCR errors.
Absolutely agree. I don't think it would be much work for their current svg output script to output directly to epub: all that's really required is an OPF file listing all the xhtml files the script already creates (along with the couple of other files needed to meet the epub spec) and zipping the whole package. On higher power devices like the iPad (which is also 'not' using ADE) this might actually be pretty good as is.
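
To make that concrete, here is a minimal sketch of the packaging step in Python. It is not taken from the existing scripts: `build_epub` and its parameters are made-up names, the placeholder identifier and the hard-coded language "en" are assumptions, and the page file names are assumed to be unique, simple names. The "couple other files" are the `mimetype` entry, `META-INF/container.xml`, and (for EPUB 2) a `toc.ncx`.

```python
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(xhtml_files, out_path, title="Untitled",
               book_id="urn:uuid:00000000-0000-0000-0000-000000000000"):
    """Zip existing XHTML pages into a bare-bones EPUB 2 package."""
    pages = [Path(p) for p in xhtml_files]
    hrefs = [escape(p.name, {'"': "&quot;"}) for p in pages]
    manifest = "\n".join(
        f'    <item id="page{i}" href="{h}" media-type="application/xhtml+xml"/>'
        for i, h in enumerate(hrefs))
    spine = "\n".join(f'    <itemref idref="page{i}"/>' for i in range(len(pages)))
    nav = "\n".join(
        f'    <navPoint id="nav{i}" playOrder="{i + 1}">'
        f"<navLabel><text>Page {i + 1}</text></navLabel>"
        f'<content src="{h}"/></navPoint>'
        for i, h in enumerate(hrefs))
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{escape(title)}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">{escape(book_id)}</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
{manifest}
  </manifest>
  <spine toc="ncx">
{spine}
  </spine>
</package>
"""
    ncx = f"""<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="{escape(book_id, {'"': "&quot;"})}"/></head>
  <docTitle><text>{escape(title)}</text></docTitle>
  <navMap>
{nav}
  </navMap>
</ncx>
"""
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry has to be first in the archive and uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/toc.ncx", ncx)
        for page in pages:
            zf.write(page, f"OEBPS/{page.name}")
    return Path(out_path)
```

The order of `xhtml_files` becomes the reading order in the spine, so the caller would need to pass the pages sorted the way the script numbered them. Images or fonts referenced by the pages would also have to be added to the manifest, which this sketch leaves out.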

Last edited by ldolse; 06-09-2010 at 07:47 AM.
Old 06-09-2010, 08:01 AM   #13
DoctorOhh
US Navy, Retired
 
Posts: 8,861
Karma: 12755553
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by ldolse View Post
On higher power devices like the iPad (which is also 'not' using ADE) this might actually be pretty good as is.
True, but the iPad already reads topaz files via Kindle4iPad app.
Old 06-10-2010, 03:01 PM   #14
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by dwanthny View Post
The quality of the scripts is good. It pulls out the OCR'd text from the book which is pretty good considering it isn't proofed. When you read a Topaz book via any Kindle App it uses the SVG glyphs to present the body of the book and just uses the actual text on the side for searching the book.

You can read up on the creation of topaz ... Very interesting read.
Dwanthny,

Are the "SVG glyphs" character glyphs, word glyphs or something else? I don't have a Kindle, but trying to read the links you posted, it appeared that Topaz formats were really scanned books, broken down into small images of words or characters to achieve reflow for the Kindle, with OCR text linked to the word/char images for searching. Does it look like that's what it's doing when you read such a book on the Kindle?
Old 06-10-2010, 06:53 PM   #15
DoctorOhh
US Navy, Retired
 
Posts: 8,861
Karma: 12755553
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by Starson17 View Post
Dwanthny,

Are the "SVG glyphs" character glyphs, word glyphs or something else? I don't have a Kindle, but trying to read the links you posted, it appeared that Topaz formats were really scanned books, broken down into small images of words or characters to achieve reflow for the Kindle, with OCR text linked to the word/char images for searching. Does it look like that's what it's doing when you read such a book on the Kindle?
I think you have the right idea. I think they are mostly character glyphs, but I am unsure. I don't have a Kindle; when I purchase from Amazon I use Kindle4PC and remove the DRM. Higher up in this comment area the group discussed all aspects of Topaz and glyphs as they tried to unravel the format.