mobi to epub conversions have spelling errors - Page 2

DoctorOhh · 09-17-2011, 12:55 AM

Quote:

Originally Posted by ldolse

Unless they changed the drm plugin to do something radically different it doesn't create a .mobi file from topaz books, it creates a .zip or .htmlz file, depending on the plugin version. That's why I suggested the OP check the edit metadata screen. So to view it in the Calibre viewer it would be converted from one of these to ePub to view it.

You are absolutely correct.

The only thing I know for sure is that converting a file in calibre will not induce random spelling errors. True, the encoding can cause problems, but I've never mistaken bad encoding for spelling errors.

I have had older versions of the DeDRM tools fail to remove the DRM fully or correctly, and the result was a book with what looked like garbled text scattered intermittently through it. Updating the tools corrected this problem.

cybmole · 09-17-2011, 02:11 AM

so a topaz book has 2 levels - the visible text - which is actually images - and a hidden version, for indexing/searches, which is the result of an OCR process applied to the images.
calibre viewer converts and displays the latter; Kindle for PC presumably displays the former ?

and amazon doesn't tell us what format we're buying ?

DoctorOhh · 09-17-2011, 02:38 AM

Quote:

Originally Posted by cybmole

so a topaz book has 2 levels - the visible text - which is actually images - and a hidden version, for indexing/searches, which is the result of an OCR process applied to the images.

Correct. I link to the history of Topaz in this post.

Quote:

Originally Posted by cybmole

calibre viewer converts and displays the latter;

Yes, but only after a DeDRM tool/plugin converts the OCR portion to htmlz first.

Quote:

Originally Posted by cybmole

Kindle for PC presumably displays the former ?

Yes, the azw you open in Kindle for PC displays the glyphs.

Quote:

Originally Posted by cybmole

and amazon doesn't tell us what format we're buying ?

Amazon tells you that you're buying a DRM book that will work on your Kindle or Kindle for ... application.

If you download the sample of the book you can open it up in a text editor to see what it is under the covers. Below are the first lines of two purchased books viewed in a text editor.

TPZ0 cdictšV¤FßPcdkey3
Picking_Cotton KÓnoKÓnp BOOKMOBI

The pertinent part of each line is the format marker (TPZ0 at the start of the first, BOOKMOBI at the end of the second), which tells you which format the underlying book was created in.

TPZ0 = Topaz
BOOKMOBI = Mobi
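The same check can be done without a text editor. Below is a minimal Python sketch, not part of the original post. It assumes the Topaz marker sits at the very start of the file and that BOOKMOBI sits at byte offset 60, where the Palm database header keeps its type/creator field. The function name sniff_kindle_format is made up for illustration.

```python
def sniff_kindle_format(path):
    """Guess the container format from the first 68 bytes of a file."""
    with open(path, "rb") as fh:
        head = fh.read(68)
    # Topaz files begin with the four bytes "TPZ0".
    if head.startswith(b"TPZ0"):
        return "Topaz"
    # Mobi files are Palm databases: a 32-byte name, other header fields,
    # then the 8-byte type/creator pair "BOOKMOBI" at offset 60.
    if head[60:68] == b"BOOKMOBI":
        return "Mobi"
    return "unknown"
```

Calling sniff_kindle_format on a downloaded sample should return "Topaz" or "Mobi"; anything else comes back as "unknown" rather than a guess.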

dawnybros · 09-23-2011, 07:04 AM

Okay, I'm not at home at the moment to check the file headers, but why would a brand new just published book be in topaz format? It's unlikely. I'll check it out when I'm home.

user_none · 09-23-2011, 07:07 AM

Quote:

Originally Posted by dawnybros

Okay, I'm not at home at the moment to check the file headers, but why would a brand new just published book be in topaz format? It's unlikely.

Topaz is a newer format than MOBI... Why wouldn't a new just published book use the latest ebook format Amazon is pushing?

ldolse · 09-23-2011, 12:20 PM

Quote:

Originally Posted by dawnybros

Okay, I'm not at home at the moment to check the file headers, but why would a brand new just published book be in topaz format? It's unlikely. I'll check it out when I'm home.

Topaz actually provides a publisher that started with a print book a faster avenue to market than Mobi. I see lots of 'new' ebooks on Amazon start as Topaz. I've seen a number of users report that they get converted to a proper mobi ebook months after the initial publishing, but you won't get the newer format unless you complain to Amazon.

If anything Topaz has been getting more popular as Amazon's success increases.

edit: It probably also doesn't bother Amazon or the publisher that attempts to strip the DRM result in a sub-par user experience - e.g. this thread...

Fschumaur · 09-27-2011, 07:35 PM

Here is how I convert Topaz books to pdfs. It's much cleaner, and avoids the OCR which your drm tool is introducing.

https://www.mobileread.com/forums/sho...65#post1759765

DoctorOhh · 09-27-2011, 09:16 PM

Quote:

Originally Posted by Fschumaur

Here is how I convert Topaz books to pdfs. It's much cleaner, and avoids the OCR which your drm tool is introducing.

https://www.mobileread.com/forums/sho...65#post1759765

Thanks very much for the link.

I know you are aware of this, but for the record it's not calibre's DRM tool or Mobileread's DRM tool, so saying "your drm tool" is a little ambiguous. Also, the DRM tool doesn't introduce OCR; the OCR data is part of the original Topaz book created by Amazon.

That said, what is the final size of your converted PDF book using your method? Also, is the Epub created through Sigil still in the 180 meg range?

KevinH · 09-28-2011, 11:11 AM

Hi,

I tried his approach using prince but simplified it by slightly modifying the tools to *not* output the arrows and zoom info (an easy change btw) and then simply did the following:

prince *.svg -o mybook.pdf

The prince program will properly merge the pages into one pdf very nicely (so no need for a separate pdf merge program). I also cropped it with BRISS which works very nicely too.
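One thing to watch with the *.svg glob: the shell expands it in plain alphabetical order, so if the page files are not zero-padded (an assumption here; the actual names produced by the tools are not given in this thread), page10.svg would land before page2.svg. A hedged Python sketch that builds the same prince command with the pages in numeric order follows. The helper names and the "svg" directory are hypothetical; only the `prince <files> -o <output>` form comes from the command above.

```python
import re
import subprocess
from pathlib import Path

def natural_key(name):
    """Split out digit runs so that page10.svg sorts after page9.svg."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

def build_prince_command(svg_dir, output_pdf):
    """Return the argument list for: prince <pages in order> -o <output_pdf>."""
    names = sorted((p.name for p in Path(svg_dir).glob("*.svg")), key=natural_key)
    pages = [str(Path(svg_dir) / name) for name in names]
    return ["prince", *pages, "-o", output_pdf]

def run_prince(svg_dir, output_pdf):
    """Run prince; requires the prince executable to be on the PATH."""
    subprocess.run(build_prince_command(svg_dir, output_pdf), check=True)
```

For example, run_prince("svg", "mybook.pdf") would merge every page in the svg folder into one pdf, the same as the one-liner above but with an explicit page order.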

The problem is as you guessed ... the resulting file sizes.

1. The original topaz ebook was only 4.1 meg in size.

2. After unpacking, you can see the original xml files (text-based) and image folder and it takes up only 14.9 meg and after zipping just 5.8 meg. This is the raw xml (text) description of the ebook that the svg images are built from.

3. The folder of svg files and images was over 59 meg. If you zipped it up (and .svgz is an allowed format for svg files) you end up with 17.8 meg. Not too bad in comparison to the original 4.1 meg.
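For reference, an .svgz file is simply a gzip-compressed .svg, so the per-file version of that compression can be sketched in a few lines of Python. This is an illustration only: compressing each page separately will not give exactly the 17.8 meg figure quoted above, which came from zipping the whole folder, and the function name is made up.

```python
import gzip
import shutil

def svg_to_svgz(svg_path, svgz_path):
    """Write a gzip-compressed copy of an SVG file (the .svgz form)."""
    with open(svg_path, "rb") as src, gzip.open(svgz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```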

The problem is in pdf form (after using prince and briss) the book required over 101 meg!

So converting simple text-based drawing commands into images and storing the images actually takes up much, much more space than the text that describes how to draw the page images themselves!! In addition, you lose all of the OCR information, which means you can't search it, and of course, as a set of images, it cannot be reflowed.
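To put the figures above side by side, here is a small Python sketch that expresses each stage as a multiple of the original 4.1 meg file. The numbers are the ones quoted in this post (59 and 101 are lower bounds, since both were given as "over"); the stage labels and function name are just for illustration.

```python
ORIGINAL_MB = 4.1  # size of the original Topaz ebook

# Sizes in megabytes, as reported in the post.
STAGE_SIZES_MB = {
    "unpacked xml + images": 14.9,
    "unpacked, zipped": 5.8,
    "svg folder": 59.0,           # reported as "over 59"
    "svg folder, zipped": 17.8,
    "pdf (prince + briss)": 101.0,  # reported as "over 101"
}

def growth_factors(sizes_mb, original_mb=ORIGINAL_MB):
    """Return each stage's size as a multiple of the original file size."""
    return {stage: size / original_mb for stage, size in sizes_mb.items()}

if __name__ == "__main__":
    for stage, factor in growth_factors(STAGE_SIZES_MB).items():
        print(f"{stage}: {factor:.1f}x the original")
```

On these numbers the pdf comes out at roughly 25 times the size of the original Topaz file, while the zipped svg folder is only a little over 4 times.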

Too bad other ebook readers do not simply draw each page on the fly (pretty much what the Amazon e-reader does) from text based svg info. Or even better, if we could get the Calibre program to grok the text-based xml files, then no growth in file sizes would be necessary and the output (svg, versus ocr, versus pdf) could be generated directly from the true xml files that describe the ebook.

It also gives you an appreciation of just how well designed the topaz format really is in comparison to the pdf format for e-book applications.

KevinH

ldolse · 09-29-2011, 03:55 AM

There are definitely ways to get the pdf size down, but it probably requires using some other packages. Check this thread on diybookscanner.org:
My workflow for almost djvubind-equivalent PDFs...

I think you could skip the OCR part in that workflow (or possibly leverage the existing ocr text somehow), but convert the SVG to Black & White TIFF, stick that in the pdf, and then run pdfsizeopt.

I'm guessing this process would give you a 10-20 meg pdf. edit: didn't realize pdfsizeopt is linux/mac only...


