Old 06-20-2005, 09:06 AM   #7
hacker
Technology Mercenary
Quote:
Originally Posted by wwang
I know the site is waiting or doing other processing, so I am not going to overload it by doing an outrageous amount of connections.
How do you KNOW the site is doing processing, and not just queuing up your requests? The more requests you make, the fewer others can make. Webservers don't arbitrarily allow an infinite number of connections, you know. If you hit the server with 50 requests at once, and the maximum allowed requests is set to, say, 500, then you've just eaten a good portion of the requests that other users could have used.
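To put a number on that: on the client side you can cap how many connections you ever hold open at once. Here's a minimal Python sketch of the idea (the URLs and the limit of 2 are placeholders I made up, not anything Sunrise or Plucker actually does):

Code:
import concurrent.futures
import urllib.request

URLS = ["http://example.com/page1", "http://example.com/page2"]  # placeholder URLs
MAX_CONCURRENT = 2  # stay far below the server's own connection limit

def fetch(url):
    # each worker holds at most one open connection at a time
    with urllib.request.urlopen(url, timeout=30) as resp:
        return url, resp.read()

# the pool size is the hard cap on simultaneous connections we open
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    for url, body in pool.map(fetch, URLS):
        print(url, len(body), "bytes")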

Quote:
Just something more than 2 to get around a limitation in the plucker code problem listed above in this same thread.
You mean a limitation in Sunrise, not Plucker.

Quote:
I know before we had sunrise, we had multiple threads for the same file, and there was a WAY I could get more than 4; it may have been a bug, but there still was a way.
Just be careful: you might find your IP or your host locked out if you pound servers with requests too fast. I've personally locked out over 900 unique hosts for slamming domains I host with their mobile spider tools, because they didn't add a delay between requests, or they ignored robots.txt and spidered content they weren't allowed to spider, or for many other reasons.

I'm not alone here, either; lots of system administrators are blocking and locking out misbehaving spiders by IP or netblock. Don't fall into that trap: make sure you add a delay between requests, and make sure your spidering tool parses robots.txt properly.
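If you're unsure how to do either, here's a bare-bones Python sketch of both points (the site, user-agent string, paths, and 2-second delay are all made-up placeholders):

Code:
import time
import urllib.request
import urllib.robotparser

BASE = "http://example.com"          # placeholder site
AGENT = "MyPluckerSpider/1.0"        # hypothetical user-agent string
DELAY = 2.0                          # seconds to wait between requests

# fetch and parse robots.txt before requesting anything else
rp = urllib.robotparser.RobotFileParser()
rp.set_url(BASE + "/robots.txt")
rp.read()

for path in ["/index.html", "/private/report.html"]:   # placeholder paths
    url = BASE + path
    if not rp.can_fetch(AGENT, url):
        print("robots.txt disallows", url, "- skipping")
        continue
    req = urllib.request.Request(url, headers={"User-Agent": AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(url, resp.status, len(resp.read()), "bytes")
    time.sleep(DELAY)                # fixed pause so we never hammer the host

The exact delay doesn't matter nearly as much as the fact that there is one, and that disallowed paths are never fetched.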

Quote:
I know that the server is not completely busy, and eventually I know there are many files I am not going to be requesting because the files are not local to the remote host, so I will not be downloading many of the embedded files.
Might I ask how you know that the server isn't busy? How do you know it's only one server involved? Do you know what processor and RAM it has? Do you run the server?

Seriously, I run a farm of VERY busy servers here doing lots of things (cvs/svn, mail, content generation, dbms, torrent trackers for distributing Free Software projects, mirroring major projects, etc.), and while the webservers may only see 200-300 connections per second, those particular boxes are VERY busy doing other things, which takes RAM, threads, and so on away from the webserver's processing.

Quote:
I know that no matter what you do, if you overload the server too much, you are not going to get the data any faster! :-)
...or at all. Just be careful you don't get yourself, or your provider's netblock, banned for slamming a server too fast or too hard.