Old 03-14-2005, 10:29 PM   #21
hacker
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
Quote:
Originally Posted by Laurens
Last time I checked, Plucker Desktop came configured with global exclusion filters for well-known ad URLs (windcaster and such). Did no-one complain about their lost ad revenue?
No, because when those filters were created, most of those advertisers were forcing garbage ads, spyware, popups, and other trash on users. Better off without them, in most cases. It's a fine line, to be sure.

Quote:
Plucker and iSilo also ignore robots.txt, don't they? Now why is this a problem all of a sudden? And how does ignoring robots.txt make caching irrelevant?
I think you mean Plucker's Python distiller, not Plucker itself.

Plucker is primarily a viewer, which supports a document format that can be produced by many tools. The two most popular document creators for Plucker are currently the Python distiller (used in Plucker Desktop) and Bill Nalens' C++ distiller. Until recently, the Python distiller did not support robots.txt; now it does.

There are also JPluck, Sunrise, pdaConverter, pler, Bluefish, and my own Perl spider (which, by the way, adheres to the robots exclusion specification, and was the first and, until recently, the only Plucker distiller to do so), and probably other tools we don't know about that can produce documents in the Plucker format. At least a dozen commercial companies are now using the Plucker viewer and document format for their core product suites.
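
To illustrate what adhering to the robots exclusion specification means in practice, here is a rough sketch of the check a spider can do before fetching a page. This is illustrative Python, not the actual code of any distiller mentioned above, and the user-agent string and URLs are made up:

Code:
# Rough sketch of a robots.txt check before fetching a page.
# Illustrative only -- the user-agent string and URLs are invented.
from urllib import robotparser

USER_AGENT = "ExampleSpider"  # hypothetical agent name

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "http://example.com/private/page.html"
if rp.can_fetch(USER_AGENT, url):
    print("allowed by robots.txt, fetching:", url)
else:
    print("disallowed by robots.txt, skipping:", url)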

But the reason ignoring robots.txt makes caching irrelevant is that you are allowing your tool to fetch content it is forbidden from fetching via robots.txt. In many cases, the excluded portions of sites are dynamic, and the Last-Modified, ETag, and similar headers will either not be present or will force a re-fetch. It's wasteful, and it makes caching the top-level pages pointless if you allow someone to fetch dozens, hundreds, or thousands of pages that are forbidden.
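
To illustrate the caching side of that: a well-behaved fetcher sends those validators back on a later visit and hopes for a 304 Not Modified, but a page that never sends Last-Modified or ETag cannot be validated at all, so every visit is a full re-fetch. Again, a rough Python sketch with a made-up URL, not any distiller's actual code:

Code:
# Rough sketch of a conditional re-fetch driven by Last-Modified / ETag.
# Illustrative only -- the URL is invented for the example.
import urllib.request
from urllib.error import HTTPError

url = "http://example.com/article.html"

# First fetch: remember whatever validators the server hands back.
with urllib.request.urlopen(url) as resp:
    etag = resp.headers.get("ETag")
    last_modified = resp.headers.get("Last-Modified")
    cached_body = resp.read()

# Later fetch: send the validators back and hope for 304 Not Modified.
req = urllib.request.Request(url)
if etag:
    req.add_header("If-None-Match", etag)
if last_modified:
    req.add_header("If-Modified-Since", last_modified)

try:
    with urllib.request.urlopen(req) as resp:
        cached_body = resp.read()   # 200: content changed, refresh the cache
except HTTPError as err:
    if err.code != 304:
        raise                       # 304: cached copy is still good, reuse it

On dynamic pages that send neither header, there are no validators to send back, so the 304 never happens and every visit pulls the full page again.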

But it's your tool, and you're free to adhere to the standards or violate them as you see fit.