Quote:
Originally Posted by HarryT
I find that hard to believe. You can get a factor of 2 or perhaps 3 when you compress a text file, but a factor of 20? Unless the information in the file is massively redundant (lots of repeated strings) I just don't see how it could be done.
It's very possible with text. I won't pretend to know the algorithms in detail, but if you google Huffman encoding, that's one of the techniques that lets these tools compress so heavily.
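Here's a toy sketch of Huffman coding in Python, just to show the idea: frequent bytes get short bit patterns, rare bytes get long ones. This is my own illustration, not exactly what gzip does internally (gzip's DEFLATE pairs Huffman coding with LZ77 string matching on top).

    import heapq
    from collections import Counter

    def huffman_code(data: bytes) -> dict:
        """Build a table mapping each byte to its Huffman bit string."""
        freq = Counter(data)
        # Heap entries: (frequency, tiebreaker, {byte: bitstring})
        heap = [(n, i, {b: ""}) for i, (b, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            n1, _, t1 = heapq.heappop(heap)
            n2, _, t2 = heapq.heappop(heap)
            # Prefix the codes of the two cheapest subtrees with 0 / 1.
            merged = {b: "0" + c for b, c in t1.items()}
            merged.update({b: "1" + c for b, c in t2.items()})
            heapq.heappush(heap, (n1 + n2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    text = b"aaaaaaaabbbc"   # 'a' is frequent, so it gets the shortest code
    table = huffman_code(text)
    print(table)             # 'a' maps to a 1-bit code, 'b' and 'c' to 2-bit codes
    bits = sum(len(table[b]) for b in text)
    print(bits, "bits vs", len(text) * 8, "bits uncompressed")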
The numbers I used are real. I made a fresh copy of my Apache log, 5.8MB in size (yes, I rounded up to 6MB in my OP). After gzip -9 compression it comes out at 180KB, or 121KB if bzip2'd instead.
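If anyone wants to check numbers like these on their own machine, Python's standard library can reproduce the measurement; "access.log" below is just a placeholder for whatever log you point it at.

    import bz2, gzip

    path = "access.log"  # placeholder -- point it at your own log file
    raw = open(path, "rb").read()

    gz = gzip.compress(raw, compresslevel=9)   # same setting as gzip -9
    bz = bz2.compress(raw, compresslevel=9)    # same setting as bzip2 -9

    print(f"original: {len(raw):>10,} bytes")
    print(f"gzip -9 : {len(gz):>10,} bytes ({len(raw)/len(gz):.0f}x)")
    print(f"bzip2 -9: {len(bz):>10,} bytes ({len(raw)/len(bz):.0f}x)")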
Don't forget that logs contain a huge amount of repetition: my private IP occurs numerous times, dates repeat over and over, and the same pages get accessed again and again.
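To see just how much that repetition matters, compare log-shaped text against random bytes of the same length, both run through zlib (the library behind gzip). The log line below is made up, but the shape is typical of an access log.

    import os, zlib

    line = b'192.168.1.10 - - [05/Mar/2010] "GET /index.html HTTP/1.1" 200\n'
    loglike = line * 1000               # highly repetitive, like a real log
    random_ = os.urandom(len(loglike))  # incompressible by construction

    for name, data in (("log-like", loglike), ("random", random_)):
        out = zlib.compress(data, 9)
        print(f"{name:>8}: {len(data):,} -> {len(out):,} bytes "
              f"({len(data)/len(out):.0f}x)")

The log-like input shrinks by orders of magnitude while the random input barely compresses at all, which is exactly why an access log can beat the 2-3x you'd expect from ordinary prose.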