Guide for converting Kindle Topaz (xhtml with svg) to PDF

Fschumaur · 09-27-2011, 07:31 PM

So, recently I was converting a book I purchased for my Kindle into PDF so I could read it on my computer without Kindle for PC. I used an automagic tool to strip the DRM and convert it to HTMLZ, but the OCR looked like crap, even though the file looked alright on my Kindle.

Fortunately, the unDRM tool also produced an output made of many xhtml pages (one for each page of the book), and each xhtml page used SVGs (Scalable Vector Graphics) to display the letters and images. I could view the book just fine using Firefox, but that was clumsy and required me to carry the 300+ files around. Trying to use Calibre failed miserably (I got an error a quarter of the way through, and even by then the output was 180MB in size).

So, I messed around with many different tools and finally arrived at a solution. It's not the most elegant solution in the world, but it is (roughly) cross platform.

Note: This guide assumes that you have already (and ethically) stripped the DRM from the Topaz book in question. This thread is not meant to discuss said stripping; there are other forums for that.

Say it with me now: I will NOT strip DRM from a book I do not own, nor will I "liberate" books with the intent of piracy. Got it? Good.

Requirements and things you need:
-The book. My tool outputs the following message and leaves three formats.

Code:

Book Successfully generated
Creating NoDRM HTMLZ Archive
Creating SVG HTMLZ Archive
Creating XML ZIP Archive

We need the SVG (HTMLZ) archive. Extract it. It's just a renamed zip file.
-An operating system that is either Windows XP, Ubuntu 10.04, or Mac OS X 10.6 or older. One of the programs (Prince) does not like Windows 7, so if that's all you've got, pray it works in compatibility mode or use a VM. Mac people, I have not tested your system, so cross your fingers.
-Notepad++ (Open Source)
-Prince (Free for non-commercial use). This is the most finicky program of the bunch.
-PDF Split and Merge, a.k.a. PDFSAM (Open Source)
-Java JRE (an unattended Ninite installer/updater is one way to get it)
-Briss (Open Source, requires the JRE)

Step 0: Install all of the above programs and verify that they run.

Step 1: Open the index_svg.xhtml page in the root of the unzipped book in Firefox or Safari. The book should look like it does on your e-reader. Verify that the directory containing index_svg.xhtml also holds at least two folders, svg and img. You may see more.
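
If you would rather check the layout from a terminal, here is a minimal sketch, assuming a POSIX shell. The function name check_book_layout is my own invention and not part of any of the tools above.

```shell
#!/bin/sh
# Hypothetical helper: confirm the unzipped book has the layout Step 1 expects.
check_book_layout() {
    book_dir=$1
    status=0
    # index_svg.xhtml must sit in the root of the unzipped book.
    if [ ! -f "$book_dir/index_svg.xhtml" ]; then
        echo "missing file: $book_dir/index_svg.xhtml"
        status=1
    fi
    # The svg and img folders must sit next to it.
    for sub in svg img; do
        if [ ! -d "$book_dir/$sub" ]; then
            echo "missing folder: $book_dir/$sub"
            status=1
        fi
    done
    return $status
}

# Example: check_book_layout ./mybook && echo "layout looks fine"
```

It only checks that the file and folders exist; you still want to eyeball the book in a browser.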

Step 2: Looking at your book in Firefox, you may notice some artifacts that are not part of your book, mainly the go forward and go back buttons and a zoom in/zoom out control. Let's get rid of those using Notepad++ (inspired by this site).
Open up page0000, page0001 and whatever your last page is in Notepad++. Hopefully, your page0000 and page0001 are not very busy (maybe a title page or something), because that makes identifying the artifacts easier. In between your document's body tags (i.e. <body>[lots and lots of stuff]</body>) you should see at least 4 "a" tags, one that says

Code:

<a href="javascript:ppage();"><[more stuff]</a>

Two that say

Code:

<a href="javascript:npage();"><[more stuff]</a>

and one that says

Code:

<div><a href="javascript:zoomin();">zoom in</a> - <a href="javascript:zoomout();">zoom out</a></div>

For me the zoomin/zoomout one was very near the bottom of the file.
We want to remove all of the ppage() scripts and all of the zoom scripts, so let's do that now. You may notice that page0000 has a different ppage() than page0001, page0002 and all the rest of the pages. That's because at the beginning of the book you can't go any further back. Remove the entire ppage() script (the whole <a>...</a> tag) from page0000 manually. Check in Firefox that you didn't delete something important, and then copy the ppage() script from page0001.

Go to Search > Find in Files, paste the script into the find box, leave the replace box blank, and choose the directory holding all the individual pages. You can optionally set a filter of *.xhtml. Run Replace in Files, then open up some of the pages in Firefox and you should notice that the arrow to go backwards is gone. Do the same thing for the zoom in/zoom out div. You may also want to change the background color to white. To do that, replace

Code:

<body onLoad="setsize();" style="background-color:#777;

with

Code:

<body onLoad="setsize();" style="background-color:#FFF;
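
The zoom removal and the background change can also be scripted instead of clicked through in Notepad++. This is a rough sketch, assuming a POSIX shell with sed, and assuming the zoom div sits on a single line exactly as quoted above. The function name clean_pages is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: strip the zoom links and whiten the background
# in every page file inside the given folder (normally <book>/svg).
clean_pages() {
    svg_dir=$1
    for page in "$svg_dir"/page*.xhtml; do
        [ -f "$page" ] || continue
        # Write to a temp file and move it back; this avoids "sed -i",
        # whose syntax differs between GNU and BSD/macOS sed.
        sed -e 's|<div><a href="javascript:zoomin();">zoom in</a> - <a href="javascript:zoomout();">zoom out</a></div>||' \
            -e 's|<body onLoad="setsize();" style="background-color:#777;|<body onLoad="setsize();" style="background-color:#FFF;|' \
            "$page" > "$page.tmp" && mv "$page.tmp" "$page"
    done
}

# Example: clean_pages ./mybook/svg
```

Work on a copy of the book folder, since the files are rewritten in place.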

Since there are two npage() scripts (one triggered by clicking the forward arrow and one by clicking the page itself), we need to be careful to remove the right one: the forward arrow. It should be the same in page0000 as it is in page0001 as it is in page0024 and so on. Recall that the last page has its own unique one as well. For reference, mine looks like

Code:

<a href="javascript:npage();"><svg id="nextsvg" viewBox="0 0 100 300" xmlns="http://www.w3.org/2000/svg" version="1.1" style="background-color:#777"><polygon points="5,5,5,295,95,150" fill="#AAAAAA" /></svg></a>

The id="nextsvg" is a dead giveaway that this is the next-page arrow and isn't this page's main content.
If you choose the wrong one to remove, your page will not be displayed at all!

Check in Firefox that your book looks like you want it to.
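
For the forward arrow, a sed pattern keyed on id="nextsvg" avoids touching the other npage() link. This is a sketch under the assumption that the arrow is on one line and has the same svg/polygon shape as my example above; your files may differ, so test on a copy first. The name remove_next_arrow is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: delete only the npage() link that wraps the
# "nextsvg" arrow, leaving the link around the page content alone.
remove_next_arrow() {
    svg_dir=$1
    for page in "$svg_dir"/page*.xhtml; do
        [ -f "$page" ] || continue
        # [^>]* skips over the attributes of the <svg> and <polygon> tags.
        sed 's|<a href="javascript:npage();"><svg id="nextsvg"[^>]*><polygon[^>]*></svg></a>||' \
            "$page" > "$page.tmp" && mv "$page.tmp" "$page"
    done
}

# Example: remove_next_arrow ./mybook/svg
```

The last page has its own variant of the arrow, so it may still need a manual edit.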

Step 3: After that doozy of a Step 2, it gets easier from here. Open up Prince, add all of your pages to the queue, and convert them. Check that the output rendered correctly (don't worry about the margins if you have them; we'll banish those later). This may take a little while. You should now have n separate PDFs, where n is the number of pages in your book.
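
Prince also has a command-line mode, so the queue can be replaced by a loop. A minimal sketch, assuming the prince executable is on your PATH (override the PRINCE variable if not) and that the pages live in the svg folder. The function name convert_pages is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert every page file to its own PDF with Prince.
# PRINCE can be overridden if the executable lives elsewhere.
PRINCE=${PRINCE:-prince}

convert_pages() {
    svg_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for page in "$svg_dir"/page*.xhtml; do
        [ -f "$page" ] || continue
        name=$(basename "$page" .xhtml)
        echo "converting $name"
        # "prince INPUT -o OUTPUT" writes one PDF for the input document.
        "$PRINCE" "$page" -o "$out_dir/$name.pdf" || return 1
    done
}

# Example: convert_pages ./mybook/svg ./mybook/pdf
```

The loop stops at the first page Prince fails on, which makes a broken page easier to spot.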

Step 4: Let's merge all the single page pdfs into one big one! Open up PDFSAM and select the merge/extract option and load all of the single pdfs into it, type in a file name and click Run. This shouldn't take too long.
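
On Linux, the merge can also be done without PDFSAM by using pdfunite from the Poppler utilities. This is only a sketch, assuming pdfunite is installed and the single-page PDFs are named page0000.pdf, page0001.pdf and so on. The function name merge_pages is my own.

```shell
#!/bin/sh
# Hypothetical helper: join the single-page PDFs into one file.
# PDFUNITE can be overridden if the executable lives elsewhere.
PDFUNITE=${PDFUNITE:-pdfunite}

merge_pages() {
    pdf_dir=$1
    out_file=$2
    # The shell expands the glob in sorted order, so zero-padded names
    # (page0000.pdf, page0001.pdf, ...) stay in reading order.
    set -- "$pdf_dir"/page*.pdf
    [ -f "$1" ] || { echo "no page PDFs in $pdf_dir"; return 1; }
    # pdfunite takes the input files first and the output file last.
    "$PDFUNITE" "$@" "$out_file"
}

# Example: merge_pages ./mybook/pdf ./mybook/book.pdf
```

Keep the output file name out of the page*.pdf pattern so a second run does not swallow the previous result.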

Step 5: Adjust the margins. Prince tends to print your xhtml onto a letter-sized (8.5 x 11 inch) piece of "paper", and that tends to leave us with big margins. Open up Briss and load your single PDF from Step 4 into it. You will see all of your even pages and all of your odd pages overlaid on top of each other, and you can adjust the cropping margins to whatever you want by dragging the upper left and lower right corners of the shaded "1" area. The preview function here is your friend. Tell it to crop and you are done!

Other things to do:
Change the PDF metadata with BeCyPDFMetaEdit.
Take your output from Step 2 and load it into Sigil and make an EPUB! Note: this does not translate well into other formats, but it appears that EPUBs don't mind SVGs.

sherman · 09-28-2011, 05:54 PM

Note that PrinceXML can create PDFs of a custom size and with custom margins. You would need to create a custom CSS file to do this.
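
As a rough illustration of such a CSS file, here is a sketch that writes an @page rule and shows how it could be passed to Prince with -s. The function name write_page_css, the file name page.css and the "6in 9in" size are placeholders; pick values that match your book.

```shell
#!/bin/sh
# Hypothetical helper: write a small stylesheet that sets the PDF page
# size and removes the page margins.
write_page_css() {
    css_file=$1
    page_size=$2   # e.g. "6in 9in" or "A4"
    printf '@page {\n    size: %s;\n    margin: 0;\n}\n' "$page_size" > "$css_file"
}

# Example:
#   write_page_css page.css "6in 9in"
#   prince -s page.css svg/page0000.xhtml -o page0000.pdf
```

With a page size close to the book's own, there should be much less to crop afterwards.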

charles5410 · 01-02-2012, 12:06 PM

Thanks for the instructions.
I tried one Topaz ebook and the images were missing from the final PDF file.
The Prince log says "\svg\????.img can't open input file:No such file or directory".

The solution is to copy the images into the same folder as the .xhtml files and change the code in the .xhtml files: replace (xlink:href="../img/) with (xlink:href=").
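
The same fix can be scripted. A minimal sketch, assuming a POSIX shell with sed and that the page files live in the svg folder next to img. The function name flatten_images is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy the images next to the page files and point
# the xlink:href attributes at the local copies.
flatten_images() {
    book_dir=$1
    # Copy every image from <book>/img into <book>/svg.
    cp "$book_dir"/img/* "$book_dir/svg/" || return 1
    for page in "$book_dir"/svg/page*.xhtml; do
        [ -f "$page" ] || continue
        # Drop the "../img/" prefix from every image reference on the page.
        sed 's|xlink:href="\.\./img/|xlink:href="|g' "$page" > "$page.tmp" \
            && mv "$page.tmp" "$page"
    done
}

# Example: flatten_images ./mybook
```

Run it on a copy of the book, because the page files are rewritten in place.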

Instead of Notepad++, I recommend ultrareplace.

kid1412_net · 02-17-2012, 12:35 PM

Here's my workflow for creating a PDF ebook:

OS: Fedora 14
Tools: Calibre, Foxit Phantom

1. Install the latest version of Calibre (0.8.40)
2. Install Foxit Phantom through Wine
3. Unzip the SVG HTMLZ Archive (rename .HTMLZ -> .ZIP)
4. Add the ebook to Calibre by opening index_svg.xhtml
5. Convert the ebook to PDF (the final PDF output may be 100MB)
6. Split the final PDF into one file per page (Open Foxit Phantom > Tools > Split)
7. Delete the blank pages and the "zoom in-zoom out" pages (about half of the total number of PDF files)
8. Use Foxit Phantom to merge all the split PDF files into one final PDF file
9. Crop the next and previous buttons out of the PDF file.

And now you have your PDF file from the SVG files. Compare its size with the original Topaz file!

BCotton · 06-12-2012, 04:32 AM

Thank you Fschumaur and Sherman. It works!

If you use Fschumaur's method above and find that the output from PrinceXML in step 3 does not render correctly, you can change the page size of the output PDFs: http://www.princexml.com/doc/8.0/page-size. Instead of creating a separate CSS file, I added @page { size: A3 } to C:\Program Files\Prince\Engine\style\common.css. Now the contents of the resulting PDF files do not spill over onto the next page.

gmer · 03-27-2015, 01:35 AM

Here is a bash script that converts the pages directly, without having to 'print' them. It relies on hxselect, inkscape, and pdfunite, which are found in most Linux repos. Use it by passing the directory containing index_svg.xhtml as the first argument and it will generate a pdf subdirectory.

Code:

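#!/bin/bash
# Usage: pass the directory containing index_svg.xhtml as the first argument.
book="$1"
mkdir -p "$book/pdf"

for f in "$book"/svg/page*.xhtml
do
	file=${f##*/}
	shortname=${file%.xhtml}
	echo "processing $shortname"
	# Pull out the page's main SVG (id="svgimg") and let Inkscape render it to PDF.
	# -f and -A are Inkscape 0.9x options; newer releases changed the command line.
	(cd "$book/pdf" && hxselect '#svgimg' < "../svg/$file" | inkscape -f /dev/stdin -A "$shortname.pdf" 2>/dev/null)
done

# Join the single-page PDFs in filename order.
pdfunite "$book"/pdf/page*.pdf "$book/pdf/joined.pdf"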
