03-25-2011, 07:32 AM   #2
rglk began at the beginning.
Posts: 12
Karma: 10
Join Date: Mar 2011
Device: PC (Linux)
I haven't found a satisfactory solution to this problem. I'd also posted my query to Apprentice Alf's blog, and some_updates responded as follows:

You can do it via Calibre by importing the index_svg.xhtml; Calibre is smart enough to grab the svg images from the xhtml, and you can then convert them to a pdf (image only). Alternatively, you can use software like Inkscape to automate the process of converting each svg page image into a cropped png file (cropped to remove the added navigational triangles and zoom info) and then compile them into a nicer pdf file. Inkscape takes command line options that can be used to automate the conversion and cropping process. You will lose all links and table of contents info, since the pdf will simply be a set of images.

It might be easier to spellcheck and fix the original html version, since it will have a proper toc and links. A better solution would be to combine both into a dual-layer text and image pdf to retain the benefits of both formats, but there is no free software that does that.

To which I replied:

Thanks, some_updates, for your good suggestions.

1. I was able to import the SVG data by adding index_svg.xhtml to Calibre and then converting the resulting zip to pdf. After 40 min of grinding away, Calibre produced a single 210 MB pdf of the 300-page book. It did contain the original scanned images of all the pages (before OCR), but also the JavaScript navigation triangles and zoom buttons, plus a third of a blank page inserted after every book page. That’s not really what I wanted.

2. The … output from KindleBooks.pyw (in the SVG folder) contained xhtml images of all book pages, not svg images, and Inkscape couldn’t handle these. To crop these images and remove the JavaScript code, white space, etc., I would have had to edit every xhtml page file with an html editor. I played around with this a bit in Mozilla SeaMonkey Composer but then gave up; I just couldn’t handle it.

3. Spellchecking and fixing the html file produced by Amazon through OCR also wasn’t feasible, as the text contains numerous Sanskrit and Tibetan terms (transliterated into Roman script), many of which had been corrupted by the OCR process and would have to be fixed by hand.

So thanks again for your help, but I haven’t found a satisfactory solution to this problem. I’ll be very leery of purchasing another Kindle book that’s Topaz DRM’ed if that restricts me to reading it only in Kindle apps such as Kindle for PC. But then, how does one know beforehand whether a given Kindle book is Topaz-encrypted?
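
For anyone whose SVG folder does contain genuine .svg page files, the Inkscape route suggested above might be scripted roughly like this. It is only a sketch under assumptions: the crop box values, the function names and the folder layout are made up for illustration, and the crop area would have to be measured from an actual page. The command-line options differ between Inkscape versions, so both the older (0.9x and earlier) and the newer (1.x) forms are shown.

```python
import subprocess
from pathlib import Path

# Hypothetical crop box "x0:y0:x1:y1" in SVG user units. The thread gives no
# real values; measure them from one page so the navigation triangles and
# zoom info fall outside the exported area.
CROP_AREA = "0:0:600:800"


def build_inkscape_command(svg_path, png_path, area=CROP_AREA, legacy=False):
    """Return the argument list that exports one cropped PNG page."""
    if legacy:
        # Option style used by Inkscape 0.9x and earlier.
        return [
            "inkscape",
            f"--file={svg_path}",
            f"--export-png={png_path}",
            f"--export-area={area}",
        ]
    # Option style used by Inkscape 1.x.
    return [
        "inkscape",
        str(svg_path),
        "--export-type=png",
        f"--export-filename={png_path}",
        f"--export-area={area}",
    ]


def convert_folder(svg_dir, out_dir, area=CROP_AREA, legacy=False):
    """Export every *.svg file in svg_dir to a cropped PNG in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for svg in sorted(Path(svg_dir).glob("*.svg")):
        png = out / (svg.stem + ".png")
        # check=True stops the batch at the first page Inkscape rejects.
        subprocess.run(
            build_inkscape_command(svg, png, area=area, legacy=legacy),
            check=True,
        )
```

Calling `convert_folder("svg", "png")` would then leave one cropped PNG per page, ready to be compiled into an image-only pdf with whatever tool is at hand. This does not help when the page files are xhtml wrappers rather than plain SVG, as in point 2 above.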

Last edited by rglk; 03-25-2011 at 07:37 AM.