Quote:
Originally Posted by nekokami
Ah. I was assuming you'd grab a copy of the file. After all, when you look at a website, you're effectively grabbing a copy of the HTML file (or whatever is generated by the script that creates the page, if we're not talking about static pages).
|
Yes, you are, but the HTML file is not a compressed archive that must be opened and examined. And whether you can get to it at all depends upon the site. Does it require a login/password?
Even if it doesn't, you may not be able to grab the file in a neatly automated manner. Sites use a file called robots.txt, served from the site root, to tell web spiders which paths they may crawl and which they should stay out of. Compliance is voluntary, but spiders that ignore robots.txt may get their originating IP address blocked by the site they spider.
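For what it's worth, here is a rough sketch of how a well-behaved spider might check robots.txt before grabbing anything, using Python's standard urllib.robotparser module. The rules, the "ExampleSpider" user-agent name, and the example.com URLs are all made up for illustration; a real spider would download the rules from the site's own /robots.txt rather than hard-coding them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the sketch runs offline.
# A real spider would fetch this from the site's /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative URLs only; example.com stands in for whatever site is spidered.
    for url in (
        "http://example.com/books/index.html",
        "http://example.com/private/book.zip",
    ):
        verdict = "allowed" if may_fetch(SAMPLE_ROBOTS_TXT, "ExampleSpider", url) else "disallowed"
        print(url, verdict)
```

Note that this only tells the spider what the site has asked for. It does nothing about a login/password, and nothing stops a badly behaved spider from skipping the check altogether.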
______
Dennis