MobileRead Forums - View Single Post - mobi to epub conversions have spelling errors

KevinH · 09-28-2011, 11:11 AM

Hi,

I tried his approach using prince but simplified it by slightly modifying the tools to *not* output the arrows and zoom info (an easy change, btw) and then simply did the following:

prince *.svg -o mybook.pdf

The prince program will properly merge the pages into one pdf very nicely (so no need for a separate pdf merge program). I also cropped it with BRISS which works very nicely too.
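One thing to watch with the `*.svg` glob: the shell expands it in plain lexicographic order, so if the page files happen to be numbered without zero padding (an assumption, the names below are made up), page10 would land before page2 in the merged pdf. A minimal Python sketch that builds the same prince argument list in natural page order:

```python
import re

def natural_key(name):
    """Sort key that compares digit runs as numbers, so page2 < page10."""
    # re.split with a capturing group alternates text chunks (even index)
    # and digit chunks (odd index), so parity tells us which is which.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

def build_prince_command(svg_names, output="mybook.pdf"):
    """Return the argument list for: prince <pages in order> -o <output>."""
    return ["prince", *sorted(svg_names, key=natural_key), "-o", output]

# Hypothetical file names, only to show the ordering.
print(build_prince_command(["page10.svg", "page2.svg", "page1.svg"]))
```

The returned list can be handed to `subprocess.run(..., check=True)` if prince is on the PATH. If the tools already zero-pad the page numbers, the plain glob is fine as it is.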

The problem is as you guessed ... the resulting file sizes.

1. The original topaz ebook was only 4.1 meg in size.

2. After unpacking, you can see the original xml files (text-based) and image folder and it takes up only 14.9 meg and after zipping just 5.8 meg. This is the raw xml (text) description of the ebook that the svg images are built from.

3. The folder of svg files and images was over 59 meg. If you zipped it up (and .svgz is an allowed format for svg files) you end up with 17.8 meg. Not too bad in comparison to the original 4.1 meg.
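Since a .svgz file is just a gzip-compressed svg, the per-page version of point 3 can be sketched in a few lines of Python. This is only an illustration: the function names are mine, the sample page is made up, and the 17.8 meg figure above came from zipping the whole folder (images included), so the numbers will not match exactly.

```python
import gzip
from pathlib import Path

def to_svgz(svg_bytes):
    """Gzip raw SVG bytes; the result is what a .svgz file contains."""
    return gzip.compress(svg_bytes)

def compress_folder(folder):
    """Write a .svgz next to every .svg in folder.

    Returns (total bytes before, total bytes after).
    """
    before = after = 0
    for svg_path in sorted(Path(folder).glob("*.svg")):
        raw = svg_path.read_bytes()
        packed = to_svgz(raw)
        svg_path.with_suffix(".svgz").write_bytes(packed)
        before += len(raw)
        after += len(packed)
    return before, after

# Tiny in-memory demo with a made-up, repetitive page (not real topaz output).
sample = (b"<svg xmlns='http://www.w3.org/2000/svg'>"
          + b"<path d='M0 0 L10 10'/>" * 500
          + b"</svg>")
packed = to_svgz(sample)
print(len(sample), "->", len(packed), "bytes")
```

Text-based drawing commands are very repetitive, which is why they compress so well.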

The problem is that in pdf form (after using prince and briss) the book required over 101 meg!
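To put the numbers above side by side, here is a quick sketch that expresses each size as a multiple of the original 4.1 meg file. The sizes are the ones quoted in this post; the labels and the function name are just for illustration. Since the svg folder and the pdf were "over" 59 and 101 meg, those two factors are lower bounds.

```python
# Sizes in "meg" exactly as quoted in the post; "over 59" and "over 101"
# are lower bounds, so those two ratios are lower bounds as well.
sizes = {
    "original topaz": 4.1,
    "unpacked xml + images": 14.9,
    "unpacked, zipped": 5.8,
    "svg folder": 59.0,
    "svg folder, zipped": 17.8,
    "pdf (prince + briss)": 101.0,
}

def growth_factors(sizes, baseline="original topaz"):
    """Return each size divided by the baseline entry's size."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}

for name, factor in growth_factors(sizes).items():
    print(f"{name:24s} {factor:5.1f}x")
```

So the pdf comes out at roughly 25 times the size of the original topaz file, while the zipped svg folder is only a bit over 4 times.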

So converting simple text-based drawing commands into images and storing the images actually takes up much, much more space than the text which describes how to draw the page images themselves!! In addition, you lose all of the OCR information, which means you can't search it, and of course as a set of images, it can not be reflowed.

Too bad other ebook readers do not simply draw each page on the fly (pretty much what the Amazon e-reader does) from text-based svg info. Or even better, if we could get the Calibre program to grok the text-based xml files, then no growth in file sizes would be necessary and the output (svg, versus ocr, versus pdf) could be generated directly from the true xml files that describe the ebook.

It also gives you an appreciation of just how well designed the topaz format really is in comparison to the pdf format for e-book applications.

KevinH

Post #24. Last edited by KevinH; 09-28-2011 at 11:55 AM. Reason: updated with more info and fixed typos