How can I convert topaz ebook from multiple xhtml's (SVG) to single pdf?

rglk · 03-20-2011, 09:56 AM

I purchased a Kindle ebook that turned out to be Topaz-formatted. I don't like reading Kindle ebooks in Kindle for PC (running in Wine on Linux), so I passed this book through the KindleBooks.pyw program from DRM_tools_v3.7 and then used Calibre to convert the resulting .zip file (a single html file plus .css, .opf and images) into a Mobi ebook. However, some formatting is lost in the process, and the text is corrupted by Amazon's OCR errors.

KindleBooks.pyw also produced a ...SVG.zip file that contains an "img" folder with many .jpg and .svg images, a folder "svg", and a file index_svg.xhtml. The "svg" folder holds all of the book's pages as individual xhtml files that I can inspect in my web browser (and navigate through with javascript); they represent the original scanned images of all the book's pages.

I would like now to convert and assemble all these individual pages to a single pdf file that I can read with Adobe Reader, e.g. as "single page continuous". How can I do that?

Many thanks,

Rob

rglk · 03-25-2011, 07:32 AM

I haven't found a satisfactory solution to this problem. I'd also posted my query to Apprentice Alf's blog, and some_updates responded as follows:

Quote:

You can do it via Calibre by importing the index_svg.xhtml. Calibre is smart enough to grab the svg images from the xhtml, and you can convert them to a pdf (image only). Alternatively, you can use software like Inkscape to automate the process of converting each svg page image into a cropped png file (cropped to remove the added navigational triangles and zoom info) and then compile them into a nicer pdf file. Inkscape takes command line options that can be used to automate the conversion and cropping process. You will lose all links and table of contents info since the pdf will simply be a set of images.

It might be easier to spellcheck and fix the original html version since it will have proper toc and links. A better solution would be to combine both into a dual layer text and image pdf to retain the benefits of both formats but there is no free software that does that.
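The Inkscape route described in the quote could be scripted roughly as follows. This is only a sketch under several assumptions: each page is available as a standalone .svg file with a zero-padded name (so that sorting by name gives page order), an `inkscape` binary is on the PATH, and the crop rectangle is a placeholder you would have to measure yourself. The function names are hypothetical. `--export-png` and `--export-area` are the options of the Inkscape 0.4x series; Inkscape 1.x uses `--export-filename` instead of `--export-png`, which the `legacy` switch accounts for. Assembling the resulting PNG files into a PDF is a separate step that is not shown.

```python
import subprocess
from pathlib import Path


def inkscape_command(svg_path, png_path, crop=None, legacy=True):
    """Build an Inkscape command line that renders one SVG page to PNG.

    crop is an optional (x0, y0, x1, y1) rectangle; the values are
    placeholders that depend on the actual page layout.
    """
    cmd = ["inkscape"]
    if crop is not None:
        x0, y0, x1, y1 = crop
        cmd.append(f"--export-area={x0}:{y0}:{x1}:{y1}")
    if legacy:
        # Inkscape 0.4x option
        cmd.append(f"--export-png={png_path}")
    else:
        # Inkscape 1.x option; output type is taken from the extension
        cmd.append(f"--export-filename={png_path}")
    cmd.append(str(svg_path))
    return cmd


def convert_pages(svg_dir, out_dir, crop=None, legacy=True,
                  run=subprocess.check_call):
    """Render every *.svg in svg_dir to a PNG in out_dir, in name order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pngs = []
    for svg in sorted(Path(svg_dir).glob("*.svg")):
        png = out_dir / (svg.stem + ".png")
        run(inkscape_command(svg, png, crop, legacy))
        pngs.append(png)
    return pngs
```

The `run` parameter exists so the command lines can be printed or inspected (for example with `run=print`) before anything is actually executed.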

To which I replied:

Thanks, some_updates, for your good suggestions.

1. I was able to import the SVG data by adding index_svg.xhtml to Calibre and then converting the resulting zip to pdf. After 40 min of grinding away, Calibre produced a 210 MB single pdf of the 300 page book that did contain the original scanned images of all the pages (before OCR) but also the javascript navigation triangles and zoom buttons plus a third of a blank page inserted after every book page. That’s not really what I wanted.

2. The …SVG.zip output from KindleBooks.pyw (in the SVG folder) contained xhtml images of all book pages, not svg images, and Inkscape couldn’t handle these. To crop these images and remove the javascript code, white space, etc., I would have had to edit every xhtml page file with an html editor. I played around with this a bit in Mozilla Seamonkey Composer but then gave up, just couldn’t handle it.

3. Spellchecking and fixing the html file produced by Amazon through OCR also wasn’t feasible, as the text contains numerous Sanskrit and Tibetan terms (transliterated into Roman script) many of which had been corrupted by the OCR process and would have to be fixed by hand.

So thanks again for your help, but I haven’t found a satisfactory solution to this problem. I’ll be very leery of purchasing another Kindle book that’s Topaz DRM’ed if that restricts me to reading it only in Kindle apps such as Kindle for PC. But then, how does one know beforehand whether a given Kindle book is Topaz-encrypted?

ATDrake · 03-25-2011, 01:38 PM

Quote:

Originally Posted by rglk

But then, how does one know beforehand whether a given Kindle book is Topaz-encrypted?

Look at the Product Details section of the Amazon description page. If it has only a Print Length and no accompanying File Size, then it's a Topaz book. If it has also/only a File Size, then it's Mobi.

Also, for step 2), it sounds like the cruft you have to strip out is auto-generated and probably fairly uniform, so if you have any scripting skills, you might be able to whip up an auto-converter to make your life easier.

Hope this helps, and welcome to MobileRead!
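A converter of the kind suggested above might start out like this. It is a rough sketch, not something tested against real KindleBooks.pyw output: it assumes each page file is well-formed XML (no undefined entities such as `&nbsp;`), that the page image is an inline `<svg>` element in the standard SVG namespace, and that the unwanted parts are `<script>` elements. The function name `extract_svg` and the `drop_tags` parameter are made up for illustration; if the navigation triangles turn out to be ordinary SVG shapes, they would need their own rule.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Serialize SVG elements without an "ns0:" prefix.
ET.register_namespace("", SVG_NS)
ET.register_namespace("xlink", XLINK_NS)


def extract_svg(xhtml, drop_tags=("script",)):
    """Return the first inline <svg> element as standalone SVG text.

    xhtml may be str or bytes. Elements whose local name is listed in
    drop_tags are removed from the SVG subtree. Returns None if the
    page contains no <svg> element.
    """
    root = ET.fromstring(xhtml)
    svg_tag = f"{{{SVG_NS}}}svg"
    svg = root if root.tag == svg_tag else root.find(f".//{svg_tag}")
    if svg is None:
        return None
    # Walk a snapshot of the subtree and detach unwanted children.
    for parent in list(svg.iter()):
        for child in list(parent):
            if not isinstance(child.tag, str):
                continue
            local_name = child.tag.rsplit("}", 1)[-1]
            if local_name in drop_tags:
                parent.remove(child)
    return ET.tostring(svg, encoding="unicode")
```

Reading each page file in binary mode and writing the returned text to a matching .svg file would then give Inkscape something it can open, assuming the assumptions above hold for your book.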

knever · 11-28-2011, 04:33 PM

I know this is an old thread, but here is something relevant that seems to work, so that I can use purchased Kindle content on either my Kindle or my eReader:

1) following instructions in Apprentice Alf's tools_v4.8.zip/ DeDRM_for_Mac_and_Win / WinApp_2.8 / ReadMe_DeDRM_WinApp.txt, using:
a) ActiveState ActivePython-2.7.2.5-win32-x86 (Community Edition)
b) pycrypto-2.3.win32-py2.7
c) DeDRM_WinApp_2.8 out of Apprentice Alf's tools_v4.8
d) Calibre 0.8.28

2) following instructions and specifying the encrypted Topaz file X.azw in DeDRM produced 3 outputs:
a) X_SVG (zipped folder)
b) X_XML (zipped folder)
c) X_nodrm.htmlz (unzipped file)

3) As reported in this thread, progress with the SVG and XML folders was tedious, but reimporting the X_nodrm.htmlz file into Calibre allows easy exporting (e.g. a PDF file with just adequate layout, or an ePub file with reasonable layout and all figures and the search functionality intact).

K.
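The re-export in step 3 can probably also be done without the Calibre GUI, using Calibre's `ebook-convert` command-line tool, which takes an input file and an output file and picks the formats from the extensions. The following is a minimal sketch that assumes `ebook-convert` is on the PATH; the function names and the choice of output formats are illustrative, and no conversion options are set.

```python
import subprocess
from pathlib import Path


def ebook_convert_command(source, target_format):
    """Build an ebook-convert command line for one output format.

    The output file is placed next to the source, with the extension
    replaced (X_nodrm.htmlz -> X_nodrm.epub).
    """
    source = Path(source)
    target = source.with_suffix("." + target_format.lstrip("."))
    return ["ebook-convert", str(source), str(target)]


def convert(source, formats=("epub", "pdf"), run=subprocess.check_call):
    """Convert source to each requested format; return the output paths."""
    outputs = []
    for fmt in formats:
        cmd = ebook_convert_command(source, fmt)
        run(cmd)
        outputs.append(cmd[-1])
    return outputs
```

Calling `convert("X_nodrm.htmlz")` would run two conversions; passing `run=print` shows the command lines without executing them.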

Last edited by knever; 11-28-2011 at 04:35 PM. Reason: correction
