Old 05-17-2012, 11:36 AM   #241
JoeD
Guru
Posts: 895
Karma: 4383958
Join Date: Nov 2007
Device: na
Quote:
Originally Posted by HarryT View Post
Yea - if you have 100,000 lines the same, they will essentially be replaced by one marker saying "repeat this 100,000 times". If you made your file 10 million lines long, it would still compress to 8k. Artificial cases like that aren't a terribly good test, because they will compress in a way that "real" data doesn't.
Agreed.
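
For anyone who wants to see the effect for themselves, here is a rough sketch using Python's zlib module (DEFLATE). That compressor is my assumption, not necessarily the format being discussed in the thread, and the email address is just a placeholder. The exact compressed size depends on the compressor and its settings, so treat the numbers it prints as illustrative only.

```python
import zlib


def compressed_size(data: bytes, level: int = 9) -> int:
    """Return the size in bytes of `data` after zlib (DEFLATE) compression."""
    return len(zlib.compress(data, level))


# Placeholder address, repeated to build the artificial "all lines identical" file.
line = b"john.smith@example.com\n"
repeated = line * 100_000

size = compressed_size(repeated)
print(f"raw: {len(repeated)} bytes, compressed: {size} bytes")
```

The compressed output comes out as a tiny fraction of the input, which is exactly why a file like this says very little about how a real client list would behave.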

For a client list to be realistic, every single line must be a unique email address. Domains will repeat wherever you're keeping a note of every employee/contact at a given client's company, but the names in front of the @ are going to vary.

The reason I said "very unlikely" rather than "impossible" is that a 10k file can expand to a gigabyte or more if the data is heavily repetitive enough (whether it's useful data or not is a different matter :P). The list could also compress very well if it only contains a handful of different companies, since the same few domain names would then repeat over and over.

To test that, though, you'd have to at the very least generate a file with realistic first and last names. You could allow a first or last name to repeat within one company or across several, although chances are you'd find a wide range of names at any one company, especially last names. The first.last pair itself, however, would have to be unique within any single company.
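
Something along these lines would do as a starting point. It's only a sketch: the name lists and the `companyN.example` domains are placeholders I made up, far too small to count as realistic, and zlib is again my assumption for the compressor. The one rule it does enforce is the one above: a first.last pair never repeats within a single company, though it may repeat across companies.

```python
import itertools
import random
import zlib

# Placeholder name pools; a real test would need much larger, realistic lists.
FIRST = ["alice", "bob", "carol", "dave", "erin", "frank", "grace", "heidi"]
LAST = ["smith", "jones", "taylor", "brown", "wilson", "evans", "thomas", "roberts"]


def make_client_list(domains, per_company, rng):
    """Build addresses where each first.last pair is unique within a company."""
    pairs = list(itertools.product(FIRST, LAST))
    lines = []
    for domain in domains:
        # sample() draws without replacement, so no pair repeats for this domain.
        for first, last in rng.sample(pairs, per_company):
            lines.append(f"{first}.{last}@{domain}")
    return lines


rng = random.Random(0)  # fixed seed so runs are repeatable
domains = [f"company{i}.example" for i in range(5)]
addresses = make_client_list(domains, 40, rng)

raw = ("\n".join(addresses) + "\n").encode("ascii")
packed = zlib.compress(raw, 9)
print(f"{len(addresses)} addresses, raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

With pools this small the names repeat far more than they would in real data, so the ratio it reports will be flattering.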

If it's a client list of just 55 companies with 10,000 employees at each, there are going to be plenty of domain repeats, but would the names push it over? If it's a client list of more than 55 companies, which would allow name pairs to repeat more, the domains might then push it over.
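
One way to poke at that question is to build both kinds of list and compare the ratios. The sketch below is scaled down so it runs instantly, and everything in it is an assumption on my part: the names are glued together from made-up syllables rather than taken from real name data, the domains are `companyN.example` placeholders, and the compressor is zlib. I'm not claiming which way it comes out for real data.

```python
import random
import zlib

# Made-up syllables used to fabricate placeholder names (not real name data).
SYLLABLES = ["an", "bel", "cor", "dan", "el", "fin", "gar", "hol",
             "is", "jen", "kar", "lo", "mar", "nor", "ol", "per"]


def make_names(count, rng):
    """Return `count` distinct placeholder names of two or three syllables."""
    names = set()
    while len(names) < count:
        names.add("".join(rng.choice(SYLLABLES) for _ in range(rng.randint(2, 3))))
    return sorted(names)


def compression_ratio(num_companies, per_company, seed=0):
    """Raw size divided by zlib-compressed size for a generated client list."""
    rng = random.Random(seed)
    firsts = make_names(200, rng)
    lasts = make_names(200, rng)
    lines = []
    for c in range(num_companies):
        domain = f"company{c}.example"
        seen = set()
        while len(seen) < per_company:  # pairs stay unique within one company
            seen.add((rng.choice(firsts), rng.choice(lasts)))
        pairs = sorted(seen)  # sort first so the shuffle is repeatable
        rng.shuffle(pairs)
        lines.extend(f"{first}.{last}@{domain}" for first, last in pairs)
    raw = ("\n".join(lines) + "\n").encode("ascii")
    return len(raw) / len(zlib.compress(raw, 9))


few_big = compression_ratio(5, 1000)     # few companies, many employees each
many_small = compression_ratio(500, 10)  # many companies, few employees each
print(f"few/big: {few_big:.2f}x, many/small: {many_small:.2f}x")
```

Calling `compression_ratio(55, 10000)` gives the 55-company case from above, just more slowly. Swapping in real first and last name lists is the part that would make the result mean anything.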

Last edited by JoeD; 05-17-2012 at 11:43 AM.