MobileRead Forums - View Single Post - Guide for converting Kindle Topaz (xhtml with svg) to PDF

Fschumaur · 09-27-2011, 08:31 PM

So, recently I was converting a book I purchased for my Kindle into a PDF so I could read it on my computer without Kindle for PC. I used an automagic tool to strip the DRM and convert it to HTMLZ, but the OCR looked like crap, even though the file looked alright on my Kindle.

Fortunately, the unDRM tool also output a set of xhtml pages (one for each page of the book), and the xhtml used SVGs (Scalable Vector Graphics) to display the letters and images. I could view the book just fine using Firefox, but that was clumsy and required me to port the 300+ files around. Trying to use Calibre failed miserably (I got an error a quarter of the way through, and even at that point the book was 180MB in size).

So, I messed around with many different tools and finally arrived at a solution. It's not the most elegant solution in the world, but it is (roughly) cross platform.

Note: This guide assumes that you have already (and ethically) stripped the DRM from the Topaz book in question. This thread is not meant to discuss said stripping; there are other forums for that.

Say it with me now: I will NOT strip DRM from a book I do not own, nor will I "liberate" books for the intent of piracy. Got it? Good.

Requirements and things you need:
-The book. My tool outputs the following message and leaves three formats.

Code:

Book Successfully generated
Creating NoDRM HTMLZ Archive
Creating SVG HTMLZ Archive
Creating XML ZIP Archive

We need the SVG (HTMLZ) archive. Extract it. It's just a renamed zip file.
-An operating system that is either Windows XP, Ubuntu 10.04, or MacOS 10.6 or older. One of the programs (Prince) does not like Windows 7, so if that's all you've got, pray it works in compatibility mode or use a VM. Mac people, I have not tested your system, so cross your fingers.
-Notepad++ (Open Source) Download Here
-Prince (Free for non-commercial use) This is the most finicky program of the bunch
-PDF Split and Merge (Open Source) Download Here
-Java JRE An unattended Ninite Installer/Updater
-Briss (Open Source, requires JRE) Project Home Page
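
Since the SVG HTMLZ archive from the requirements above is just a renamed zip file, extracting it can also be scripted. This is only a minimal sketch using Python's standard zipfile module. The function name extract_htmlz and any file names you pass in are my own placeholders, not something the unDRM tool provides.

```python
import zipfile
from pathlib import Path


def extract_htmlz(archive_path, dest_dir):
    """Unpack an HTMLZ archive (a zip file with a different extension).

    Returns the sorted list of member names found in the archive.
    """
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    # Fail early if the file is not actually a zip archive.
    if not zipfile.is_zipfile(archive_path):
        raise ValueError(f"{archive_path} is not a zip archive")
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
        return sorted(zf.namelist())
```

Calling extract_htmlz("mybook_SVG.htmlz", "mybook") (both names made up for illustration) should leave the unzipped book in the mybook folder.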

Step 0: Install all of the above programs and verify that they run.

Step 1: Feel free to open the index_svg.xhtml page that is in the root of the unzipped book in Firefox or Safari. The book should look like it does on your ereader. Verify that in the same directory as index_svg.xhtml you see at least two folders, svg and img. You may see more.
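
If you would rather check the layout from Step 1 with a script than by eye, something like the following might do. It is a rough sketch: the function name check_book_layout is hypothetical, and it only looks for the three things named above (index_svg.xhtml, the svg folder and the img folder).

```python
from pathlib import Path


def check_book_layout(book_dir):
    """Return a list of problems found in the unzipped book directory.

    An empty list means index_svg.xhtml and the svg/img folders are present.
    """
    book_dir = Path(book_dir)
    problems = []
    if not (book_dir / "index_svg.xhtml").is_file():
        problems.append("missing index_svg.xhtml")
    for folder in ("svg", "img"):
        if not (book_dir / folder).is_dir():
            problems.append(f"missing folder: {folder}")
    return problems
```

Extra folders are ignored, since the book may contain more than these two.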

Step 2: Looking at your book in Firefox, you may notice some artifacts that are not part of your book, mainly the go forward and go back buttons and a zoom in/zoom out sort of thing. Let's get rid of those using Notepad++ (inspired by this site).
Open up page0000, page0001 and whatever your last page is in Notepad++. Hopefully, your page0000 and page0001 are not very busy (maybe a title page or something) because that makes identifying the artifacts easier. In between your documents' "body tags" (i.e. <body>[lots and lots of stuff]</body>) you should see at least 4 "a" tags, one that says

Code:

<a href="javascript:ppage();"><[more stuff]</a>

Two that say

Code:

<a href="javascript:npage();"><[more stuff]</a>

and one that says

Code:

<div><a href="javascript:zoomin();">zoom in</a> - <a href="javascript:zoomout();">zoom out</a></div>

For me the zoomin, zoomout one was very near the bottom of the text.
We want to remove all of the ppage() scripts and all of the zoom scripts, so let's do that now. You may notice that page0000 has a different ppage() than page0001, page0002 and all the rest of the pages. That's because at the beginning of the book you can't go back any further. Remove the entire ppage() script from page0000 manually, check in Firefox that you didn't delete something important, and then copy the ppage() script from page0001.

Go to Search>Find in Files, paste the script into Find, blank out Replace, and choose the directory that holds all the individual pages. You can optionally put in a filter of *.xhtml. Open up some of the pages in Firefox and you should notice that the arrow to go backwards is gone. Do the same thing for the zoom in and zoom out. You may also want to change the background color to white. To do that, replace

Code:

<body onLoad="setsize();" style="background-color:#777;

with

Code:

<body onLoad="setsize();" style="background-color:#FFF;

Since there are two npage() scripts (one for clicking the forward arrow and one for clicking the page itself), we need to remove the right one. It should be the same in page0000 as it is in page0001, page0024 and so on. Recall that the last page has its own unique one as well. For reference, mine looks like

Code:

<a href="javascript:npage();"><svg id="nextsvg" viewBox="0 0 100 300" xmlns="http://www.w3.org/2000/svg" version="1.1" style="background-color:#777"><polygon points="5,5,5,295,95,150" fill="#AAAAAA" /></svg></a>

The id="nextsvg" is a dead giveaway that this is the link that goes to the next page and isn't this page's main content.
If you choose the wrong one to remove, your page will not be displayed at all!

Check in Firefox that your book looks like you want it to.
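
For anyone who prefers scripting to Notepad++'s Find in Files, the same literal search and replace from Step 2 could be sketched in Python. Treat this as an illustration only. The snippets are copied from my book and yours may differ, the function name strip_snippets is made up, and I am assuming the pages are UTF-8 text. The ppage() link is not included because its exact text depends on your book, so paste yours into the list yourself. Work on a copy of the pages.

```python
from pathlib import Path

# Literal snippets as they appeared in my book; yours may differ.
ZOOM_DIV = ('<div><a href="javascript:zoomin();">zoom in</a> - '
            '<a href="javascript:zoomout();">zoom out</a></div>')
NEXT_ARROW = ('<a href="javascript:npage();"><svg id="nextsvg" '
              'viewBox="0 0 100 300" xmlns="http://www.w3.org/2000/svg" '
              'version="1.1" style="background-color:#777">'
              '<polygon points="5,5,5,295,95,150" fill="#AAAAAA" /></svg></a>')
BG_GREY = '<body onLoad="setsize();" style="background-color:#777;'
BG_WHITE = '<body onLoad="setsize();" style="background-color:#FFF;'


def strip_snippets(pages_dir, snippets, replacements=None, pattern="*.xhtml"):
    """Delete each literal snippet and apply literal replacements in place.

    Returns the names of the files that were actually changed.
    Assumes the pages are UTF-8 encoded (an assumption, not checked).
    """
    changed = []
    for page in sorted(Path(pages_dir).glob(pattern)):
        original = page.read_text(encoding="utf-8")
        text = original
        for snippet in snippets:
            text = text.replace(snippet, "")  # same as a blank Replace field
        for old, new in (replacements or {}).items():
            text = text.replace(old, new)
        if text != original:
            page.write_text(text, encoding="utf-8")
            changed.append(page.name)
    return changed
```

A call such as strip_snippets("svg", [ZOOM_DIV, NEXT_ARROW], {BG_GREY: BG_WHITE}) would mirror the manual steps, where "svg" stands in for whichever folder holds your pages. Because it matches literal text only, a snippet that differs by even one character is left alone, so still check the result in Firefox.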

Step 3: After that doozy of a step 2, it gets easier from here. Open up Prince and add all of your pages to the queue and convert them. Check that the output rendered correctly (Don't worry about the margins if you have them, we'll banish those later). This may take a little bit. You should now have n separate pdfs, where n is the number of pages in your book.
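
Prince also has a command-line form (prince input -o output.pdf), so Step 3 could in principle be batched from a script instead of the GUI queue. The sketch below assumes the prince executable is on your PATH and that your pages are named page*.xhtml. The function names are hypothetical and I have not tried this on every platform.

```python
import subprocess
from pathlib import Path


def prince_command(page, out_dir, prince_exe="prince"):
    """Build the command line that converts one xhtml page to one PDF."""
    page = Path(page)
    out_pdf = Path(out_dir) / (page.stem + ".pdf")
    return [prince_exe, str(page), "-o", str(out_pdf)]


def convert_pages(pages_dir, out_dir, prince_exe="prince"):
    """Run Prince once per page; stops at the first page that fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for page in sorted(Path(pages_dir).glob("page*.xhtml")):
        subprocess.run(prince_command(page, out_dir, prince_exe), check=True)
```

Running one conversion per page keeps the output as n separate PDFs, which is what the merge in Step 4 expects.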

Step 4: Let's merge all the single page pdfs into one big one! Open up PDFSAM and select the merge/extract option and load all of the single pdfs into it, type in a file name and click Run. This shouldn't take too long.

Step 5: Adjust the Margins. Prince tends to print your xhtml into a letter sized 8.5 x 11 piece of "paper" and that tends to leave us with big margins. Open up Briss and load your single pdf from step 4 into it. You will see all of your even pages and your odd pages overlaid on top of each other and you can adjust the cropping margins to whatever you want by dragging the upper left and lower right hand corners of the shaded "1" area. The preview function here is your friend. Tell it to crop and you are done!

Other things to do:
Change the PDF metadata with becypdfmetaedit.
Take your output from step 2, load it into Sigil and make an epub! Note: this does not translate well into other formats, but it appears that epubs don't mind SVGs.
