Old 05-17-2012, 07:13 AM   #230
JoeD
Guru
 
Quote:
Originally Posted by HarryT
I find that hard to believe. You can get a factor of 2 or perhaps 3 when you compress a text file, but a factor of 20? Unless the information in the file is massively redundant (lots of repeated strings) I just don't see how it could be done.
It's very possible with text. I won't pretend to know the algorithms in detail, but if you google Huffman coding, that's one of the techniques that lets them compress so heavily.
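
If you want to see the idea, here's a rough Python sketch of building Huffman codes. It's not what gzip literally does internally (DEFLATE combines LZ77 match-finding with Huffman coding), and the sample log line is made up, but it shows how frequent characters end up with short bit codes:

Code:
# Rough sketch of Huffman code construction (illustration only; gzip's DEFLATE
# combines LZ77 matching with Huffman coding, this just shows the Huffman half).
import heapq
from collections import Counter

def huffman_codes(text):
    freq = Counter(text)                       # how often each character appears
    # Heap entries: (frequency, tie-breaker, {char: code_so_far})
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {ch: "0" for ch in heap[0][2]}  # degenerate single-symbol input
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two rarest subtrees; their codes get one bit longer.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

line = '192.168.1.10 - - [17/May/2012:07:13:01] "GET /index.html" 200'
codes = huffman_codes(line)
bits = sum(len(codes[ch]) for ch in line)
print(f"{len(line) * 8} bits as raw bytes vs {bits} bits Huffman-coded")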

The numbers I used are real. I made a fresh copy of my Apache log, 5.8MB in size (yes, I rounded to 6MB in my OP). After gzip -9 compression it was 180KB, or 121KB if bzip2'd instead.
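
Easy enough to reproduce with Python's gzip and bz2 modules if anyone wants to try it on their own log ("access.log" is just a placeholder path):

Code:
# Compare gzip -9 and bzip2 -9 style compression on a log file.
# "access.log" is a placeholder; point it at your own log.
import gzip, bz2

path = "access.log"
data = open(path, "rb").read()

gz = gzip.compress(data, compresslevel=9)   # same algorithm as `gzip -9`
bz = bz2.compress(data, compresslevel=9)    # same algorithm as `bzip2 -9`

print(f"original: {len(data) / 1024:.0f} KB")
print(f"gzip -9 : {len(gz) / 1024:.0f} KB ({len(data) / len(gz):.1f}x)")
print(f"bzip2   : {len(bz) / 1024:.0f} KB ({len(data) / len(bz):.1f}x)")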

Don't forget, when it comes to logs there can be quite a lot of repetition. For example, my private IP occurs numerous times, dates repeat over and over, and the same pages get accessed again and again.
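
As a quick illustration of how much that repetition matters, compare a repetitive log-like string against the same amount of random data (just a sketch, using zlib for brevity):

Code:
# Repetitive log-like text vs random bytes of the same size.
import os, zlib

log_like = b"192.168.1.10 - - [17/May/2012] GET /index.html 200\n" * 10000
random_bytes = os.urandom(len(log_like))

for name, data in (("repetitive log", log_like), ("random bytes", random_bytes)):
    packed = zlib.compress(data, 9)
    print(f"{name}: {len(data)} -> {len(packed)} bytes ({len(data) / len(packed):.0f}x)")

The repetitive input shrinks by well over an order of magnitude, while the random input barely compresses at all, which is exactly why a real access log can beat HarryT's factor of 2 or 3 by so much.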

Last edited by JoeD; 05-17-2012 at 07:17 AM.