Old 03-14-2005, 10:29 PM   #21
hacker
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
Quote:
Originally Posted by Laurens
Last time I checked, Plucker Desktop came configured with global exclusion filters for well-known ad URLs (windcaster and such). Did no-one complain about their lost ad revenue?
No, because when those filters were created, most of those advertisers were forcing garbage ads, spyware, popups, and other trash on users. Better off without them, in most cases. It's a fine line, to be sure.

Quote:
Plucker and iSilo also ignore robots.txt, don't they? Now why is this a problem all of a sudden? And how does ignoring robots.txt make caching irrelevant?
I think you mean Plucker's Python distiller, not Plucker itself.

Plucker is primarily a viewer, which supports a document format that can be produced by many tools. The two most popular document creators for Plucker are currently the Python distiller (used in Plucker Desktop) and Bill Nalens' C++ distiller. Until recently, the Python distiller did not support robots.txt; now it does.

There are also JPluck, Sunrise, pdaConverter, pler, Bluefish, and my own Perl spider (which, by the way, adheres to the robots exclusion specification, and was the first and, until recently, the only Plucker distiller to do so), and probably other tools we don't know about that can produce documents in the Plucker format. At least a dozen commercial companies are now using the Plucker viewer and document format for their core product suites.
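
To illustrate what adhering to the robots exclusion specification means in practice, here is a rough sketch of the check a spider can do before fetching a page. This is illustrative Python, not the actual code of any distiller mentioned above, and the user-agent string and URLs are made up:

Code:
# Rough sketch of a robots.txt check before fetching a page.
# Illustrative only -- the user-agent string and URLs are invented.
from urllib import robotparser

USER_AGENT = "ExampleSpider"  # hypothetical agent name

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "http://example.com/private/page.html"
if rp.can_fetch(USER_AGENT, url):
    print("allowed by robots.txt, fetching:", url)
else:
    print("disallowed by robots.txt, skipping:", url)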

But the reason ignoring robots.txt makes caching irrelevant is that you are allowing your tool to fetch content it is forbidden from fetching via robots.txt. In many cases, the excluded portions of sites are dynamic, and the Last-Modified, ETag, and similar headers will either not be present or will force a re-fetch. It's wasteful, and it makes caching the top-level pages pointless if you allow someone to fetch dozens, hundreds, or thousands of pages that are forbidden.
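
To illustrate the caching side of that: a well-behaved fetcher sends those validators back on a later visit and hopes for a 304 Not Modified, but a page that never sends Last-Modified or ETag cannot be validated at all, so every visit is a full re-fetch. Again, a rough Python sketch with a made-up URL, not any distiller's actual code:

Code:
# Rough sketch of a conditional re-fetch driven by Last-Modified / ETag.
# Illustrative only -- the URL is invented for the example.
import urllib.request
from urllib.error import HTTPError

url = "http://example.com/article.html"

# First fetch: remember whatever validators the server hands back.
with urllib.request.urlopen(url) as resp:
    etag = resp.headers.get("ETag")
    last_modified = resp.headers.get("Last-Modified")
    cached_body = resp.read()

# Later fetch: send the validators back and hope for 304 Not Modified.
req = urllib.request.Request(url)
if etag:
    req.add_header("If-None-Match", etag)
if last_modified:
    req.add_header("If-Modified-Since", last_modified)

try:
    with urllib.request.urlopen(req) as resp:
        cached_body = resp.read()   # 200: content changed, refresh the cache
except HTTPError as err:
    if err.code != 304:
        raise                       # 304: cached copy is still good, reuse it

On dynamic pages that send neither header, there are no validators to send back, so the 304 never happens and every visit pulls the full page again.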

But it's your tool, and you're free to adhere to the standards or violate them as you see fit.