02-12-2013, 08:20 PM   #12
user_none
Sigil & calibre developer
Quote:
Originally Posted by jasontaylor7
Negative. All compression algorithms have a "if compression expands then store" feature.
First, this is 100% not true. TCR (your best compression) has no such feature.

You are not taking into account that many "compression" formats add a header or other framing information to the file. Zip (which is technically a container format, not a compression algorithm) and TCR are such formats. Every time you zip something, a new zip header and a list of file entries are created. Zip a zip file 100 times and at a certain point you will start seeing the file size increase.
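
A rough way to see the framing overhead for yourself, as a sketch rather than a proper benchmark: the snippet below zips a block of repetitive text in memory, then zips the result again, and so on, recording the size after each round. It only uses Python's standard `zipfile` module. The helper names (`zip_bytes`, `repeated_zip_sizes`), the sample text, and the round count are made up for the illustration, and the exact sizes will depend on the input.

```python
import io
import zipfile


def zip_bytes(payload: bytes, name: str = "data") -> bytes:
    """Wrap payload in a fresh in-memory zip archive with one DEFLATE entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


def repeated_zip_sizes(payload: bytes, rounds: int) -> list[int]:
    """Return the size before zipping, then the size after each round."""
    sizes = [len(payload)]
    for _ in range(rounds):
        payload = zip_bytes(payload)  # zip the previous archive again
        sizes.append(len(payload))
    return sizes


# Hypothetical sample input: highly repetitive text compresses very well once.
text = b"It was the best of times, it was the worst of times. " * 2000
sizes = repeated_zip_sizes(text, 10)

for round_number, size in enumerate(sizes):
    print(round_number, size)
```

The first round shrinks the data a lot. Once the payload is already compressed there is little left to squeeze out, while every round still adds its own local header, central directory entry, and end-of-archive record, so the later sizes creep back up.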

FYI. I wrote the TCR compression implementation used by calibre.

Another issue I see with your test is the formats you're using. Let's take HTMLZ, PMLZ, TXT and TCR. HTMLZ and PMLZ both contain formatting information, while TXT and TCR are text only (no formatting). So your test doesn't take formatting into account. What it really measures is the "most efficient ereader format for storing only text without formatting."

I would argue that formatting is part of the book and that losing formatting (I wouldn't argue the same for images) is detrimental. For example, removing newlines so you have a stream of characters on a single line will produce a smaller file than anything in your test. However, is a single line of text acceptable?

Some formats do lend themselves to compression more than others. A binary format like RB or MOBI is going to be harder to compress than a TXT file. A TXT file (especially a written work like a book) is going to have a lot of repetition.
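
To illustrate the difference, here is a minimal sketch using Python's standard `zlib` module. It compares how well a repetitive block of text compresses against a same-sized block of pseudo-random bytes. The random bytes are only an assumed stand-in for dense binary data; they are not real RB or MOBI content, and real files will land somewhere in between. The names `compression_ratio`, `text_like`, and `binary_like` are invented for the example.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means better)."""
    return len(zlib.compress(data, 9)) / len(data)


# Stand-in for a plain text book: lots of repeated words and phrases.
text_like = b"the quick brown fox jumps over the lazy dog. " * 1000

# Stand-in (an assumption, not a real ebook) for dense binary data:
# seeded pseudo-random bytes with no repetition for the compressor to exploit.
rng = random.Random(0)
binary_like = bytes(rng.getrandbits(8) for _ in range(len(text_like)))

text_ratio = compression_ratio(text_like)
binary_ratio = compression_ratio(binary_like)

print(f"text-like ratio:   {text_ratio:.3f}")
print(f"binary-like ratio: {binary_ratio:.3f}")
```

The text-like input shrinks to a small fraction of its size, while the random input barely changes or even grows slightly because of the stream framing.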

That said, I'm not saying that figuring out which compression is best for ebooks is a bad idea. I'm just saying your testing methodology needs some work.