08-12-2005, 11:22 AM   #3
hacker
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
Sitescooper has been doing this for at least the last 5 years, probably longer, and it supports quite a few output formats (including my favorite).

This isn't revolutionary at all, and it seems everyone is jumping onto the "per-site template" bandwagon: Sunrise, Mobipocket and now TomeRaider. It's a dead-end approach, since the "templates" are very fragile. Add an inner table on the site and your whole template breaks and has to be debugged and rewritten. Rename a page resource on the server side or change a query string, and all of it topples down into a pile of goo.

It's a dead-end direction for "metapublishing" from the client perspective. Now, if content providers offered their website for download as a neatly packaged ebook-type-of-file, that would be a different story entirely, but hardly any do.

The fear (speaking as a content provider) is that we're getting pounded by thousands of users who are all running these tools, crawlers, spiders and other things against our sites (and our clients' sites) without considering the implications of a script that requests 1,600 pages in 5 minutes (as someone did yesterday on one of our servers, trying to get the entire history of the jpilot mailing list with some Java tool).

I'm blocking dozens per day, and I'll continue to block them until they begin to adhere to the robots.txt specification and learn to respect Crawl-delay and the If-Modified-Since header for feeds and other content.
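
For anyone writing one of these tools, here is a rough sketch of what "adhering to robots.txt" could look like in Python, using the standard library's urllib.robotparser. The robots.txt contents, the "examplebot" agent name, the example.com URLs and the helper names are all made up for illustration; a real crawler would load the file from the target site and sleep for the returned delay between requests.

```python
import urllib.robotparser

# Hypothetical robots.txt contents, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def fetch_plan(parser, agent, urls, default_delay=1.0):
    """Return the URLs the agent may fetch and the pause (seconds) between them.

    default_delay is an assumed fallback for sites that set no Crawl-delay.
    """
    delay = parser.crawl_delay(agent)
    if delay is None:
        delay = default_delay
    allowed = [url for url in urls if parser.can_fetch(agent, url)]
    return allowed, delay


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    allowed, delay = fetch_plan(
        rules,
        "examplebot",
        ["http://example.com/archive/1", "http://example.com/private/x"],
    )
    print(allowed, delay)
```

The point is only that the delay and the disallow rules are consulted before any page is requested, rather than after the server starts refusing connections.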

Most of the RSS readers out there are another perfect example. The whole point of RSS is to syndicate the news, but instead we get 5,000 people fetching the same feed every hour, even though it specifically says not to fetch it more than once a week. Even then, in some cases there are no new items in a couple of weeks, but they still fetch it every hour anyway (ignoring If-Modified-Since).
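
Honoring If-Modified-Since is not much work either. Below is a minimal sketch with Python's urllib, assuming the reader has stored the Last-Modified value from its previous fetch; the function names and the feed URL in the usage note are hypothetical. When nothing has changed, the server answers 304 Not Modified with no body, which urllib surfaces as an HTTPError.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None):
    """Build a GET request that asks only for content newer than last_modified."""
    request = urllib.request.Request(url)
    if last_modified:
        # Value should be the Last-Modified string saved from the previous fetch.
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, last_modified=None):
    """Return (body, last_modified); body is None when the server says 304."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep the cached copy and the old timestamp.
            return None, last_modified
        raise
```

A reader would call something like fetch_if_changed("http://example.com/feed.xml", saved_stamp), save the returned timestamp, and parse the feed only when the body is not None. That still does not excuse polling every hour a feed that asks for weekly fetches, but it turns each needless poll into a tiny 304 response instead of a full download.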

Sorry, now you're blocked.

I wish people who wrote these "tools" would consider the implications of what they're doing. Most do not.