Old 01-25-2010, 08:48 AM   #346
Coconut
Junior Member
Coconut began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Kindle DX
This method has worked beautifully for me. I was able to generate an HTML version of one of my books. Given that I am an academic, I was happy to see that the format actually maintains the original structure of the book (pages, I mean). One of my main sources of unhappiness with Kindle-formatted books has been that reading them on my DX does not maintain the original page layout, which means I cannot cite from them -- no way to know which page I'm on.

I noticed an svg folder filled with xhtml files, which render fantastically in Firefox (I'm on Ubuntu). How can I combine those xhtml files into a single PDF? There must be a way to do it, and filesize does not interest me for now. Any suggestions on how to convert the individual files containing SVG to a single PDF?
Old 01-25-2010, 10:07 AM   #347
orwell2k
Addict
orwell2k can extract oil from cheese
 
Posts: 357
Karma: 1112
Join Date: Oct 2008
Location: Euroland
Device: PocketBook 360°, BeBook (Hanlin V3), iRex DR1000S, iPad
Quote:
Originally Posted by Coconut View Post
This method has worked beautifully for me. I was able to generate an HTML version of one of my books. Given that I am an academic, I was happy to see that the format actually maintains the original structure of the book (pages, I mean). One of my main sources of unhappiness with Kindle-formatted books has been that reading them on my DX does not maintain the original page layout, which means I cannot cite from them -- no way to know which page I'm on.

I noticed an svg folder filled with xhtml files, which render fantastically in Firefox (I'm on Ubuntu). How can I combine those xhtml files into a single PDF? There must be a way to do it, and filesize does not interest me for now. Any suggestions on how to convert the individual files containing SVG to a single PDF?
I use a nice tool called "Multi-html converter" to join all the little html or xhtml files from an ePub extraction into a neat single HTML file, which can then be opened in BookDesigner to make FB2s, etc. I assume you can easily generate a PDF from this file as well: Acrobat Pro can take an HTML input and create a PDF.

Acrobat can also take multiple files and create a single PDF, so you can select all the little xhtml files, put them in the correct order (if they are not already named so that they sort in order) and then generate the PDF. As you have (I think) a collection of what are essentially pages (one per xhtml file) from the Topaz, the PDF created should match the pagination of the original. If I were to create a PDF from a set of ePub xhtml files, the pagination within chapters (or whatever defines the xhtml file splits) would probably not match.
Old 01-25-2010, 10:43 AM   #348
Coconut
Junior Member
Coconut began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Kindle DX
Quote:
Originally Posted by orwell2k View Post
I use a nice tool called "Multi-html converter" to join all the little html or xhtml files from an ePub extraction into a neat single HTML file, which can then be opened in BookDesigner to make FB2s, etc. I assume you can easily generate a PDF from this file as well: Acrobat Pro can take an HTML input and create a PDF.

Acrobat can also take multiple files and create a single PDF, so you can select all the little xhtml files, put them in the correct order (if they are not already named so that they sort in order) and then generate the PDF. As you have (I think) a collection of what are essentially pages (one per xhtml file) from the Topaz, the PDF created should match the pagination of the original. If I were to create a PDF from a set of ePub xhtml files, the pagination within chapters (or whatever defines the xhtml file splits) would probably not match.
I appreciate that; however, Acrobat does not seem to know what to do with the xhtml files that were output in the SVG stage of the conversion, and those are really the ones I want. Acrobat simply states that it cannot open the files because they are an unknown filetype.
Old 01-25-2010, 11:31 AM   #349
clarknova
Addict
clarknova plays well with others
 
Posts: 241
Karma: 2617
Join Date: Mar 2009
Location: Greenwood, SC
Device: Kindle 2
Quote:
Originally Posted by Coconut View Post
I appreciate that; however, Acrobat does not seem to know what to do with the xhtml files that were output in the SVG stage of the conversion, and those are really the ones I want. Acrobat simply states that it cannot open the files because they are an unknown filetype.
Yeah, since these are compound documents (mixed xhtml and SVG), only the smartest of page renderers can figure them out. Acrobat Pro doesn't even support SVG (which is stupid since Adobe helped design the SVG format).

This is the only way I know how:
1) Use the "-r" flag on gensvg.py to generate the raw SVG images (without the xhtml/javascript wrapper).
2) Use Illustrator to batch convert the SVG files into PDF files.
3) Use Acrobat Pro to combine the PDF files into one.

There are major drawbacks to this, however:
1) This requires a Mac or PC and very expensive copies of Illustrator and Acrobat Pro.
2) Illustrator sucks at rendering SVG correctly, and many of the pages are poor looking.
3) The filesize (even though you stated that you didn't care) is outrageous. I converted 65 pages into a 38Meg PDF file.

An open-source alternative would be to use Inkscape to render the SVG files into PDF. I don't have Inkscape installed on any of my machines, so I don't know how good the output is. However, I do know that SVG is Inkscape's default file format so it ought to be reasonably good.

This page has a tutorial on using Inkscape and pdftk to create a pdf from multiple SVG images (and since it's command-line-based instead of GUI, this would be much quicker than the above).
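
For anyone who wants to script that route, here is a minimal Python sketch of the Inkscape plus pdftk idea. It is not taken from the tutorial. It assumes Inkscape 1.x, where `--export-filename` names the output file (older releases used `--export-pdf` instead), and it assumes the raw pages are named like page0001.svg. The helper names (`natural_key`, `inkscape_cmd`, `pdftk_cmd`, `build_book`) are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def natural_key(name):
    """Sort key so that page2.svg sorts before page10.svg."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", str(name))]


def inkscape_cmd(svg, pdf):
    """Command to render one SVG page to a one-page PDF (Inkscape 1.x syntax)."""
    return ["inkscape", str(svg), f"--export-filename={pdf}"]


def pdftk_cmd(pdfs, output):
    """Command to concatenate the per-page PDFs, in the given order."""
    return ["pdftk", *map(str, pdfs), "cat", "output", str(output)]


def build_book(svg_dir, output="book.pdf", dry_run=True):
    """Build (and optionally run) the commands for every page*.svg in svg_dir.

    With dry_run=True the commands are only returned, nothing is executed.
    """
    pages = sorted(Path(svg_dir).glob("page*.svg"), key=natural_key)
    pdfs = [page.with_suffix(".pdf") for page in pages]
    cmds = [inkscape_cmd(svg, pdf) for svg, pdf in zip(pages, pdfs)]
    cmds.append(pdftk_cmd(pdfs, output))
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds
```

Calling `build_book("svg")` just returns the command lists so you can check the page order first; pass `dry_run=False` to actually run them. The numeric sort only matters if the page files are not zero-padded.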
Old 01-25-2010, 11:37 AM   #350
Coconut
Junior Member
Coconut began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Kindle DX
Quote:
Originally Posted by clarknova View Post
Yeah, since these are compound documents (mixed xhtml and SVG), only the smartest of page renderers can figure them out. Acrobat Pro doesn't even support SVG (which is stupid since Adobe helped design the SVG format).
Thanks! I will give that a shot. Now that I know how to create the 'clean' svg's, I'll also see if I can convert those to djvu. If that works out, I can convert that to a reasonably sized pdf. I have a number of scanned documents, so I am familiar with bizarrely huge PDF's and that does not in and of itself bother me, since there are ways to reduce the filesize quite significantly. I'll also let you know about Inkscape, when I have a chance to try that out.

Perfect. I'm so pleased. This actually allows me to get around a serious issue (for academics, at least) with textbooks, since I can now get to a point where I can transform bought books so that they become 'citeable'.
Old 01-25-2010, 11:51 AM   #351
clarknova
Addict
clarknova plays well with others
 
Posts: 241
Karma: 2617
Join Date: Mar 2009
Location: Greenwood, SC
Device: Kindle 2
Quote:
Originally Posted by Coconut View Post
Thanks! I will give that a shot. Now that I know how to create the 'clean' svg's, I'll also see if I can convert those to djvu. If that works out, I can convert that to a reasonably sized pdf. I have a number of scanned documents, so I am familiar with bizarrely huge PDF's and that does not in and of itself bother me, since there are ways to reduce the filesize quite significantly. I'll also let you know about Inkscape, when I have a chance to try that out.

Perfect. I'm so pleased. This actually allows me to get around a serious issue (for academics, at least) with textbooks, since I can now get to a point where I can transform bought books so that they become 'citeable'.
Wow, I just tried the Inkscape and pdftk solution, and that worked really well. Much better rendering than Illustrator, and the same book (using all 307 pages) was only 91 Meg.
Old 01-25-2010, 12:30 PM   #352
KevinH
Sigil Developer
KevinH ought to be getting tired of karma fortunes by now.
 
Posts: 7,645
Karma: 5433388
Join Date: Nov 2009
Device: many
Hi Clarknova,

Would it be any help (PDF file size-wise) to start with the html version of the book, with only the critical areas converted to svg's and the main part of the book left as straight html?

For example, a new version of flatxml2html, using the code written for the ornate letter A issue, can automatically create svg images for just the "fixed" regions on the page and put img src style links to them right into the html while letting the bulk of the document remain html. This did wonders for reducing the need to hand edit anything in my book, but at the expense of more svg images and less ability to search for things (since they might be in images).

The question is: would this result in a significantly smaller PDF (once converted)? Or would this buy us nothing?

Thanks,

KevinH
Old 01-25-2010, 12:48 PM   #353
Coconut
Junior Member
Coconut began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Kindle DX
Quote:
Originally Posted by KevinH View Post
Hi Clarknova,

Would it be any help (PDF file size-wise) to start with the html version of the book, with only the critical areas converted to svg's and the main part of the book left as straight html?

For example, a new version of flatxml2html, using the code written for the ornate letter A issue, can automatically create svg images for just the "fixed" regions on the page and put img src style links to them right into the html while letting the bulk of the document remain html. This did wonders for reducing the need to hand edit anything in my book, but at the expense of more svg images and less ability to search for things (since they might be in images).

The question is: would this result in a significantly smaller PDF (once converted)? Or would this buy us nothing?

Thanks,

KevinH
It's probably even simpler than that. The html output at the end of the conversion process maintains information on pagination. It should be fairly straightforward to transform the html so that actual page breaks, as well as images, occur in the right places.

First I'm going to take a look at the PDFs we can produce, and then move on from there. There's nothing to stop me from feeding that PDF back through OCR and outputting a text-based PDF with images.

Edit:
Using Ubuntu, I installed the librsvg2-bin package, which I used for the conversion. The command line I used -- in the svg directory -- was "for i in page*.svg; do rsvg-convert -a -f pdf $i -o `echo $i | sed -e ' s/svg$/pdf/'`; done"

This created individual pdf's for each page. A total of 305 pages, at 197 megabytes. I combined those using Acrobat, and then ran 'optimize for OCR'. The resulting file is beautiful, with all images, and smooth, and weighs in at 3407K. Awesome.
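
For anyone who would rather not fight with shell quoting, the same loop can be sketched in Python. This is only a sketch: it uses the same rsvg-convert flags as the one-liner above (`-a`, `-f pdf`, `-o`), and the function names `rsvg_cmd` and `convert_pages` are made up for illustration.

```python
import subprocess
from pathlib import Path


def rsvg_cmd(svg):
    """Build the rsvg-convert call for one page, e.g. page0001.svg -> page0001.pdf."""
    svg = Path(svg)
    # with_suffix swaps only the final extension, like the sed s/svg$/pdf/ step
    pdf = svg.with_suffix(".pdf")
    return ["rsvg-convert", "-a", "-f", "pdf", str(svg), "-o", str(pdf)]


def convert_pages(svg_dir=".", dry_run=True):
    """Build (and optionally run) one rsvg-convert call per page*.svg file."""
    cmds = [rsvg_cmd(page) for page in sorted(Path(svg_dir).glob("page*.svg"))]
    if not dry_run:
        for cmd in cmds:
            # Arguments are passed as a list, so spaces in filenames are safe
            subprocess.run(cmd, check=True)
    return cmds
```

`convert_pages()` only returns the commands; `convert_pages(dry_run=False)` runs them in the current directory.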

Last edited by Coconut; 01-25-2010 at 02:02 PM.
Old 01-25-2010, 03:18 PM   #354
KevinH
Sigil Developer
KevinH ought to be getting tired of karma fortunes by now.
 
Posts: 7,645
Karma: 5433388
Join Date: Nov 2009
Device: many
Hi Coconut,

Exactly what is "optimize for OCR"? Is this an Acrobat Pro function? Is there opensource that can do the same thing?

KevinH
Old 01-25-2010, 03:37 PM   #355
clarknova
Addict
clarknova plays well with others
 
Posts: 241
Karma: 2617
Join Date: Mar 2009
Location: Greenwood, SC
Device: Kindle 2
Quote:
Originally Posted by Coconut View Post
Using Ubuntu, I installed the librsvg2-bin package, which I used for the conversion. The command line I used -- in the svg directory -- was "for i in page*.svg; do rsvg-convert -a -f pdf $i -o `echo $i | sed -e ' s/svg$/pdf/'`; done"

This created individual pdf's for each page. A total of 305 pages, at 197 megabytes. I combined those using Acrobat, and then ran 'optimize for OCR'. The resulting file is beautiful, with all images, and smooth, and weighs in at 3407K. Awesome.
Yeah, I ended up doing something similar using Inkscape to render each page to a huge PNG (1200dpi!) then importing them in Acrobat, using Acrobat to OCR the pages and then optimizing it. The result isn't as pretty as the SVG (because the glyphs have been rasterized) but it's totally usable as a searchable PDF that retains the original book formatting at an acceptable filesize.

Unfortunately, I have no use for PDFs (since PDF isn't an ebook format). But for people that do, this is certainly an option, providing they have Acrobat Pro.

(Acrobat's OCR is neat. There are a few more OCR errors than in the original Topaz file, but it attempts to preserve style -- though not very well...)

Kevin: All of the open source OCR stuff is pretty obsolete and useless. The errors tend to be way more than in the Topaz file or even Adobe's OCR.

Personally, I find the genhtml to be the most usable. I just have to convert (using imagemagick or illustrator or inkscape or whatever) the Monogram and Table svgs that get generated into PNG/JPEG so I can create an ePub out of the data.
Old 01-25-2010, 03:48 PM   #356
Coconut
Junior Member
Coconut began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Kindle DX
Quote:
Originally Posted by clarknova View Post
Yeah, I ended up doing something similar using Inkscape to render each page to a huge PNG (1200dpi!) then importing them in Acrobat, using Acrobat to OCR the pages and then optimizing it. The result isn't as pretty as the SVG (because the glyphs have been rasterized) but it's totally usable as a searchable PDF that retains the original book formatting at an acceptable filesize.

Unfortunately, I have no use for PDFs (since PDF isn't an ebook format). But for people that do, this is certainly an option, providing they have Acrobat Pro.

(Acrobat's OCR is neat. There are a few more OCR errors than in the original Topaz file, but it attempts to preserve style -- though not very well...)

Kevin: All of the open source OCR stuff is pretty obsolete and useless. The errors tend to be way more than in the Topaz file or even Adobe's OCR.

Personally, I find the genhtml to be the most usable. I just have to convert (using imagemagick or illustrator or inkscape or whatever) the Monogram and Table svgs that get generated into PNG/JPEG so I can create an ePub out of the data.
The optimize for OCR (not OCR itself, just image adjustment) is a function in Acrobat. It neatly shades the rasterized images, which reduces filesize. The size I used was much smaller than yours, since there is really no point in that kind of resolution unless you plan on publishing the thing. I go for a size that's easily readable on my Kindle -- again, the interest is in proper pagination -- and on other screens. I'm really very happy with how it came out.

For OCR I actually used FineReader, which does a great job. The PDF I end up with is essentially error free. FineReader can also export to a variety of formats (paged and non-paged). I would not be surprised if the html it outputs surpasses what we've been able to produce, since it retains formatting. I'll try that later. Do we have a standard text to use for conversion and comparison of different methods? It's really the only way to determine what works best.

Last edited by Coconut; 01-25-2010 at 03:51 PM.
Old 01-25-2010, 03:53 PM   #357
clarknova
Addict
clarknova plays well with others
 
Posts: 241
Karma: 2617
Join Date: Mar 2009
Location: Greenwood, SC
Device: Kindle 2
Quote:
Originally Posted by Coconut View Post
The optimize for OCR (not OCR itself, just image adjustment) is a function in Acrobat.
You must have a different version than me. CS4 has OCR functions, and then just the "Optimize Scanned PDF" which shrinks down my giant raster images (after I've done OCR) into something more manageable.
Old 01-25-2010, 06:07 PM   #358
KevinH
Sigil Developer
KevinH ought to be getting tired of karma fortunes by now.
 
Posts: 7,645
Karma: 5433388
Join Date: Nov 2009
Device: many
> I just have to convert (using imagemagick or illustrator or inkscape or whatever) the Monogram and Table svgs that get generated into PNG/JPEG so I can create an ePub out of the data.

I thought that svg was part of the epub spec? As long as they are not animations, I thought svg graphics did not need to be converted to png or jpeg when used in epub?

At least that is what the Mobileread Wiki says. I will make one and see if it works on my Sony reader.
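
For reference, here is a small Python sketch of what declaring an SVG image in the ePub's OPF manifest could look like. The `image/svg+xml` media type is the standard one for SVG; the helper name `manifest_item` and the id and href values are hypothetical, and whether a given reader actually renders the image is a separate question.

```python
import xml.etree.ElementTree as ET


def manifest_item(item_id, href):
    """Return an OPF manifest <item> element (as text) declaring an SVG image."""
    item = ET.Element("item", {
        "id": item_id,
        "href": href,
        "media-type": "image/svg+xml",  # standard media type for SVG
    })
    return ET.tostring(item, encoding="unicode")


# Hypothetical id and path, just to show the shape of the entry
print(manifest_item("monogram1", "images/monogram1.svg"))
```

The printed line would go inside the manifest element of the .opf file, next to the entries for the xhtml files.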

Thanks,

Kevin
Old 01-25-2010, 09:20 PM   #359
Coconut
Junior Member
Coconut began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Kindle DX
Quote:
Originally Posted by clarknova View Post
You must have a different version than me. CS4 has OCR functions, and then just the "Optimize Scanned PDF" which shrinks down my giant raster images (after I've done OCR) into something more manageable.
I misspoke. That's the one.
Old 03-19-2010, 02:33 PM   #360
bookwurm70
Addict
bookwurm70 considers 'yay' to be a thoroughly cromulent word.
 
Posts: 265
Karma: 89314
Join Date: Nov 2009
Location: Southern Illinois
Device: eSlick, Pocketbook IQ, iPad, Kobo Aura, Kobo Aura ONE
feeling stupid

Quote:
Originally Posted by labba View Post
from DarkRevers Blog:
So, I have some experience stripping DRM. I've done it from PDB, MOBI, and EPUB. I'm working now on Kindle Topaz. I do not understand the directions that I have found. Specifically what to do with this line:

cmbtc_dump.py -d -o TARGETDIR [-p pid] YOURBOOKNAMEHERE

Do you do this at the command prompt, just like for pdb or mobi books? I'm not getting it to work at all. Not even an error message. It just takes me back to my command prompt:

c:\python26

Any help would be appreciated