Adventures in discovering the K4PC PID. - Page 24

Coconut · 01-25-2010, 08:48 AM

This method has worked beautifully for me. I was able to generate an HTML version of one of my books. Given that I am an academic, I was happy to see that the format actually maintains the original structure of the book (pages, I mean). One of my main sources of unhappiness with Kindle-formatted books has been that reading them on my DX does not maintain the original page layout, which means I cannot cite from them -- no way to know which page I'm on.

I noticed an svg folder filled with xhtmls, which are rendered fantastically using firefox (I'm on ubuntu). How can I combine those xhtml's to a single PDF? There must be a way to do it, and filesize does not interest me for now. Any suggestions on how to convert the individual files containing SVG to a single PDF?

orwell2k · 01-25-2010, 10:07 AM

Quote:

Originally Posted by Coconut

This method has worked beautifully for me. I was able to generate an HTML version of one of my books. Given that I am an academic, I was happy to see that the format actually maintains the original structure of the book (pages, I mean). One of my main sources of unhappiness with Kindle-formatted books has been that reading them on my DX does not maintain the original page layout, which means I cannot cite from them -- no way to know which page I'm on.

I noticed an svg folder filled with xhtmls, which are rendered fantastically using firefox (I'm on ubuntu). How can I combine those xhtml's to a single PDF? There must be a way to do it, and filesize does not interest me for now. Any suggestions on how to convert the individual files containing SVG to a single PDF?

I use a nice tool called "Multi-html converter" to join all the little html or xhtml files from an ePub extraction together to make a neat single HTML file, which can then be opened in BookDesigner to make FB2s, etc. I assume you can easily generate a PDF from this file as well. Acrobat Pro can take an HTML input and create a PDF.

Acrobat can also take multiple files and create a single PDF, so you can select all the little xhtml files, then select the correct order (if they are not logically named to already be in order) and then generate the PDF. As you have (I think) a collection of what are essentially pages (one per xhtml file) from the Topaz, the PDF created should match the pagination of the original. If I were to create a PDF from a set of ePub xhtml files, the pagination within chapters (or whatever defines the xhtml file splits) would probably not match.

Coconut · 01-25-2010, 10:43 AM

Quote:

Originally Posted by orwell2k

I use a nice tool called "Multi-html converter" to join all the little html or xhtml files from an ePub extraction together to make a neat single HTML file, which can then be opened in BookDesigner to make FB2s, etc. I assume you can easily generate a PDF from this file as well. Acrobat Pro can take an HTML input and create a PDF.

Acrobat can also take multiple files and create a single PDF, so you can select all the little xhtml files, then select the correct order (if they are not logically named to already be in order) and then generate the PDF. As you have (I think) a collection of what are essentially pages (one per xhtml file) from the Topaz, the PDF created should match the pagination of the original. If I were to create a PDF from a set of ePub xhtml files, the pagination within chapters (or whatever defines the xhtml file splits) would probably not match.

I appreciate that. However, Acrobat does not seem to know what to do with the xhtml files that were output in the SVG stage of the conversion, and those are really the ones I want. Acrobat simply states that it cannot open the files because they are an unknown filetype.

clarknova · 01-25-2010, 11:31 AM

Quote:

Originally Posted by Coconut

I appreciate that. However, Acrobat does not seem to know what to do with the xhtml files that were output in the SVG stage of the conversion, and those are really the ones I want. Acrobat simply states that it cannot open the files because they are an unknown filetype.

Yeah, since these are compound documents (mixed xhtml and SVG), only the smartest of page renderers can figure them out. Acrobat Pro doesn't even support SVG (which is stupid since Adobe helped design the SVG format).

This is the only way I know how:
1) Use the "-r" flag on gensvg.py to generate the raw SVG images (without the xhtml/javascript wrapper).
2) Use Illustrator to batch convert the SVG files into PDF files.
3) Use Acrobat Pro to combine the PDF files into one.

There are major drawbacks to this, however:
1) This requires a Mac or PC and very expensive copies of Illustrator and Acrobat Pro.
2) Illustrator sucks at rendering SVG correctly, and many of the pages are poor looking.
3) The filesize (even though you stated that you didn't care) is outrageous. I converted 65 pages into a 38Meg PDF file.

An open-source alternative would be to use Inkscape to render the SVG files into PDF. I don't have Inkscape installed on any of my machines, so I don't know how good the output is. However, I do know that SVG is Inkscape's default file format, so it ought to be reasonably good.

This page has a tutorial on using Inkscape and pdftk to create a pdf from multiple SVG images (and since it's command-line-based instead of GUI, this would be much quicker than the above).
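
A minimal sketch of that command-line route, written in Python so the page ordering can be handled explicitly. It only assumes that `inkscape` and `pdftk` are on the PATH and that the raw pages match `page*.svg`. The `--export-filename` option is the Inkscape 1.x spelling (older 0.4x releases used `--export-pdf=FILE` instead), and the function names and the `book.pdf` output name are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def natural_key(path):
    """Sort key that orders page2.svg before page10.svg."""
    parts = re.split(r"(\d+)", Path(path).name)
    return [int(p) if p.isdigit() else p for p in parts]


def build_commands(svg_files, output_pdf="book.pdf"):
    """Return (per-page Inkscape commands, final pdftk command)."""
    pages = sorted(svg_files, key=natural_key)
    convert_cmds = []
    pdfs = []
    for svg in pages:
        pdf = str(Path(svg).with_suffix(".pdf"))
        pdfs.append(pdf)
        # Inkscape 1.x syntax; 0.4x releases used --export-pdf=FILE instead.
        convert_cmds.append(["inkscape", svg, "--export-filename=" + pdf])
    # pdftk joins its inputs in the order given on the command line.
    merge_cmd = ["pdftk"] + pdfs + ["cat", "output", output_pdf]
    return convert_cmds, merge_cmd


def run_all(svg_dir="."):
    """Convert every page*.svg in svg_dir, then join the PDFs into book.pdf."""
    svgs = [str(p) for p in Path(svg_dir).glob("page*.svg")]
    convert_cmds, merge_cmd = build_commands(svgs, str(Path(svg_dir) / "book.pdf"))
    for cmd in convert_cmds + [merge_cmd]:
        subprocess.run(cmd, check=True)
```

The numeric sort matters if the page files are not zero-padded: a plain alphabetical sort would put page10 before page2 and scramble the pagination that makes the result citeable.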

Coconut · 01-25-2010, 11:37 AM

Quote:

Originally Posted by clarknova

Yeah, since these are compound documents (mixed xhtml and SVG), only the smartest of page renderers can figure them out. Acrobat Pro doesn't even support SVG (which is stupid since Adobe helped design the SVG format).

Thanks! I will give that a shot. Now that I know how to create the 'clean' svg's, I'll also see if I can convert those to djvu. If that works out, I can convert that to a reasonably sized pdf. I have a number of scanned documents, so I am familiar with bizarrely huge PDF's and that does not in and of itself bother me, since there are ways to reduce the filesize quite significantly. I'll also let you know about Inkscape, when I have a chance to try that out.

Perfect. I'm so pleased. This actually allows me to get around a serious issue (for academics, at least) with textbooks, since I can now get to a point where I can transform bought books so that they become 'citeable'.

clarknova · 01-25-2010, 11:51 AM

Quote:

Originally Posted by Coconut

Thanks! I will give that a shot. Now that I know how to create the 'clean' svg's, I'll also see if I can convert those to djvu. If that works out, I can convert that to a reasonably sized pdf. I have a number of scanned documents, so I am familiar with bizarrely huge PDF's and that does not in and of itself bother me, since there are ways to reduce the filesize quite significantly. I'll also let you know about Inkscape, when I have a chance to try that out.

Perfect. I'm so pleased. This actually allows me to get around a serious issue (for academics, at least) with textbooks, since I can now get to a point where I can transform bought books so that they become 'citeable'.

Wow, I just tried the Inkscape and pdftk solution, and that worked really well. Much better rendering than Illustrator, and the same book (using all 307 pages) was only 91 Meg.
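
For comparison, a quick per-page calculation using only the figures quoted in this thread: 65 pages at 38 Meg from the Illustrator route and 307 pages at 91 Meg from Inkscape and pdftk. Treating "Meg" as megabytes is an assumption.

```python
def mb_per_page(total_mb, pages):
    """Average size of one page in megabytes."""
    return total_mb / pages


illustrator = mb_per_page(38, 65)   # Illustrator + Acrobat Pro run
inkscape = mb_per_page(91, 307)     # Inkscape + pdftk run

print(f"Illustrator: {illustrator:.2f} MB/page")
print(f"Inkscape:    {inkscape:.2f} MB/page")
print(f"Ratio:       {illustrator / inkscape:.1f}x")
```

That works out to roughly 0.58 MB per page against 0.30 MB per page, so the Inkscape output is about half the size per page.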

KevinH · 01-25-2010, 12:30 PM

Hi Clarknova,

Would it be any help (PDF file size-wise) to start with the html version of the book, with only critical areas converted to SVGs and the main part of the book left as straight html?

For example, a new version of flatxml2html using the code used for the ornate letter A issue can automatically create svg images for just the "fixed" regions on the page and put img src style links to them right into the html while letting the bulk of the document remain html. This did wonders for the need to hand edit anything in my book but at the expense of more svg images and less ability to search for things (since they might be in images).

The question is: would this result in a significantly smaller PDF once converted? Or would this buy us nothing?

Thanks,

KevinH

Coconut · 01-25-2010, 12:48 PM

Quote:

Originally Posted by KevinH

Hi Clarknova,

Would it be any help (PDF file size-wise) to start with the html version of the book, with only critical areas converted to SVGs and the main part of the book left as straight html?

For example, a new version of flatxml2html using the code used for the ornate letter A issue can automatically create svg images for just the "fixed" regions on the page and put img src style links to them right into the html while letting the bulk of the document remain html. This did wonders for the need to hand edit anything in my book but at the expense of more svg images and less ability to search for things (since they might be in images).

The question is: would this result in a significantly smaller PDF once converted? Or would this buy us nothing?

Thanks,

KevinH

It's probably even simpler than that. The html produced at the end of the conversion process keeps the pagination information. It should be fairly straightforward to transform the html so that actual page breaks, as well as images, occur in the right places.

First I'm going to take a look at the PDFs we can produce, and then move on from there. There's nothing to stop me from feeding that PDF back through OCR and outputting a text-based PDF with images.

Edit:
Using Ubuntu, I installed the librsvg2-bin package and used it for the conversion. The command line I used, run from inside the svg directory, was "for i in page*.svg; do rsvg-convert -a -f pdf $i -o `echo $i | sed -e ' s/svg$/pdf/'`; done"

This created individual pdf's for each page. A total of 305 pages, at 197 megabytes. I combined those using Acrobat, and then ran 'optimize for OCR'. The resulting file is beautiful, with all images, and smooth, and weighs in at 3407K. Awesome.
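
The same loop can be sketched in Python, which sidesteps shell quoting problems if a filename ever contains spaces. This only assumes that `rsvg-convert` (from librsvg2-bin) is on the PATH. The flags are the ones from the command above; the function names and the `page0001.svg` example name are made up for illustration.

```python
import subprocess
from pathlib import Path


def rsvg_command(svg_path):
    """Build the rsvg-convert call for one page, e.g. page0001.svg -> page0001.pdf."""
    svg_path = Path(svg_path)
    pdf_path = svg_path.with_suffix(".pdf")
    # -a keeps the aspect ratio, -f selects the output format, -o names the output file
    return ["rsvg-convert", "-a", "-f", "pdf", str(svg_path), "-o", str(pdf_path)]


def convert_pages(svg_dir="."):
    """Run rsvg-convert on every page*.svg in svg_dir; returns the PDFs written."""
    written = []
    for svg in sorted(Path(svg_dir).glob("page*.svg")):
        cmd = rsvg_command(svg)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        written.append(Path(cmd[-1]))
    return written
```

Calling `convert_pages("svg")` would leave one PDF next to each SVG, ready to be combined in Acrobat or any other PDF joiner.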

KevinH · 01-25-2010, 03:18 PM

Hi Coconut,

Exactly what is "optimize for OCR"? Is this an Acrobat Pro function? Is there opensource that can do the same thing?

KevinH

clarknova · 01-25-2010, 03:37 PM

Quote:

Originally Posted by Coconut

Using Ubuntu, I installed the librsvg2-bin package and used it for the conversion. The command line I used, run from inside the svg directory, was "for i in page*.svg; do rsvg-convert -a -f pdf $i -o `echo $i | sed -e ' s/svg$/pdf/'`; done"

This created individual pdf's for each page. A total of 305 pages, at 197 megabytes. I combined those using Acrobat, and then ran 'optimize for OCR'. The resulting file is beautiful, with all images, and smooth, and weighs in at 3407K. Awesome.

Yeah, I ended up doing something similar using Inkscape to render each page to a huge PNG (1200dpi!) then importing them in Acrobat, using Acrobat to OCR the pages and then optimizing it. The result isn't as pretty as the SVG (because the glyphs have been rasterized) but it's totally usable as a searchable PDF that retains the original book formatting at an acceptable filesize.

Unfortunately, I have no use for PDFs (since PDF isn't an ebook format). But for people that do, this is certainly an option, providing they have Acrobat Pro.

(Acrobat's OCR is neat. There are a few more OCR errors than in the original Topaz file, but it attempts to preserve style -- though not very well...)

Kevin: All of the open source OCR stuff is pretty obsolete and useless. The errors tend to be way more than in the Topaz file or even Adobe's OCR.

Personally, I find the genhtml to be the most usable. I just have to convert (using imagemagick or illustrator or inkscape or whatever) the Monogram and Table svgs that get generated into PNG/JPEG so I can create an ePub out of the data.

Coconut · 01-25-2010, 03:48 PM

Quote:

Originally Posted by clarknova

Yeah, I ended up doing something similar using Inkscape to render each page to a huge PNG (1200dpi!) then importing them in Acrobat, using Acrobat to OCR the pages and then optimizing it. The result isn't as pretty as the SVG (because the glyphs have been rasterized) but it's totally usable as a searchable PDF that retains the original book formatting at an acceptable filesize.

Unfortunately, I have no use for PDFs (since PDF isn't an ebook format). But for people that do, this is certainly an option, providing they have Acrobat Pro.

(Acrobat's OCR is neat. There are a few more OCR errors than in the original Topaz file, but it attempts to preserve style -- though not very well...)

Kevin: All of the open source OCR stuff is pretty obsolete and useless. The errors tend to be way more than in the Topaz file or even Adobe's OCR.

Personally, I find the genhtml to be the most usable. I just have to convert (using imagemagick or illustrator or inkscape or whatever) the Monogram and Table svgs that get generated into PNG/JPEG so I can create an ePub out of the data.

The optimize for OCR (not OCR itself, just image adjustment) is a function in Acrobat. It neatly shades the rasterized images, which reduces the filesize. The size I used was much smaller than yours, since there is really no point in that kind of resolution unless you plan on publishing the thing. I go for a size that's easily readable on my Kindle -- again, the interest is in proper pagination -- and other screens. I'm really very happy with how it came out.

For OCR I actually used FineReader, which does a great job. The PDF I end up with is essentially error free. FineReader can also export to a variety of formats (paged and non-paged). I would not be surprised if the html it outputs surpasses what we've been able to produce, since it retains formatting. I'll try that later. Do we have a standard text to use for conversion and comparison of different methods? It's really the only way to determine what works best.
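
To put rough numbers on the resolution point, here is a small sketch of how pixel counts grow with dpi. The 6 x 9 inch page is a hypothetical trim size, not a measurement of the book in question, and 150 dpi stands in for a screen-reading resolution purely as an example.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a page rendered at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


PAGE_IN = (6, 9)  # hypothetical page size in inches

for dpi in (150, 1200):
    w, h = pixel_dimensions(*PAGE_IN, dpi)
    megapixels = w * h / 1_000_000
    print(f"{dpi:>4} dpi: {w} x {h} px, {megapixels:.1f} megapixels")
```

Under those assumptions a 1200 dpi page has 64 times as many pixels as a 150 dpi one, since the count scales with the square of the resolution.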

clarknova · 01-25-2010, 03:53 PM

Quote:

Originally Posted by Coconut

The optimize for OCR (not OCR itself, just image adjustment) is a function in Acrobat.

You must have a different version than me. CS4 has OCR functions, and then just the "Optimize Scanned PDF" which shrinks down my giant raster images (after I've done OCR) into something more manageable.

KevinH · 01-25-2010, 06:07 PM

> I just have to convert (using imagemagick or illustrator or inkscape or whatever) the Monogram and Table svgs that get generated into PNG/JPEG so I can create an ePub out of the data.

I thought that svg was part of the epub spec? As long as they are not animations, I thought svg graphics did not need to be converted to png or jpeg when used in epub?

At least that is what the Mobileread Wiki says. I will make one and see if it works on my Sony reader.

Thanks,

Kevin

Coconut · 01-25-2010, 09:20 PM

Quote:

Originally Posted by clarknova

You must have a different version than me. CS4 has OCR functions, and then just the "Optimize Scanned PDF" which shrinks down my giant raster images (after I've done OCR) into something more manageable.

I misspoke. That's the one.

bookwurm70 · 03-19-2010, 02:33 PM

Quote:

Originally Posted by labba

from DarkRevers Blog:

So, I have some experience stripping DRM. I've done it from PDB, MOBI, and EPUB. I'm working now on Kindle Topaz. I do not understand the directions that I have found. Specifically what to do with this line:

cmbtc_dump.py -d -o TARGETDIR [-p pid] YOURBOOKNAMEHERE

Do you run this in a command prompt, just like for pdb or mobi books? I can't get it to work at all. There isn't even an error message. It just takes me back to my command prompt:

c:\python26

Any help would be appreciated
