MobileRead Forums - View Single Post

ignatz · 03-12-2004, 04:52 PM

Okay, now I've got a good one. The core of this scoop came from the sitescooper mailing list and was written by Kennis Koldewyn. I've just expanded and tweaked it a bit. The basic idea is great. You have an HTML file on your desktop that contains links to all the text-only menus at the NYT. This local HTML file is the URL that the site file starts from. The site file is 3 levels deep, so you get your local file as the top level, then the links to the headlines, and finally the stories. In preliminary testing it has performed admirably.

However, there are a few outstanding issues. First, I recommend that you severely limit the categories from which you download. There are a lot of stories available and your converted file can easily get big in a hurry. The raw HTML file here has every option commented out except for National and International headlines, but I have included every category that you see on this page. What you must do is delete the open and close comment markers around the sections that you want. (Open comment is "<!--" and close comment is "-->".) I've been using only 10 sections and I can quickly go up to 900KB unconverted. (iSilo then shrinks this back down to around 300KB.) If your raw (unconverted) file size is above 500KB, sitescooper will stop scooping. You have to add a parameter to your scooping command to redefine the limit. For example, if your command is:

perl sitescooper.pl -site NYTimes.site -misilox

and sitescooper is reporting that it's running over the limit, you can add a parameter like the following:

perl sitescooper.pl -site NYTimes.site -limit 1000 -misilox

This raises the limit to 1000KB. If that's still not enough, go back and raise it again.
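Going back to the uncommenting step: if deleting the "<!--" and "-->" markers by hand gets tedious, it could be scripted. Here is a minimal Python sketch, not part of sitescooper. It assumes each disabled category in new_york_times.html is a link wrapped in its own HTML comment, and the function name enable_sections and the sample markup are made up for illustration. Check the real file's layout before trusting it.

```python
import re

# Assumption: every disabled category is an <a> link wrapped in its own
# <!-- ... --> comment. The real new_york_times.html may be laid out differently.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
LINK_TEXT = re.compile(r"<a\b[^>]*>(.*?)</a>", re.DOTALL | re.IGNORECASE)


def enable_sections(html, wanted):
    """Strip the comment markers around sections whose link text is in `wanted`."""
    wanted_lower = {name.lower() for name in wanted}

    def restore(match):
        body = match.group(1)
        labels = {text.strip().lower() for text in LINK_TEXT.findall(body)}
        if labels & wanted_lower:
            return body.strip()   # drop the markers, keep the link
        return match.group(0)     # leave this comment untouched

    return COMMENT.sub(restore, html)


if __name__ == "__main__":
    # Hypothetical sample markup, not copied from the real file.
    sample = (
        '<a href="national.html">National</a>\n'
        '<!-- <a href="sports.html">Sports</a> -->\n'
        '<!-- <a href="movies.html">Movies</a> -->\n'
    )
    print(enable_sections(sample, ["Sports"]))
```

Work on a copy of the file, since every section you enable adds to the raw size that the -limit parameter has to cover.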

Also, some of the categories keep stories that are way out of date. If a story is more than 10 days old, the URL that this site file uses gets redirected (because of the way that NYT archives their old content) and you lose the printer-friendly page. So if that story is split over two pages, you won't get the second page. I have tried a few tricks, such as setting the "StoryFollowLinks" parameter in the site file to 1, but that hasn't worked. I'm also looking at possible ways to filter out the older URLs and just not scoop them at all, but that involves some Perl date manipulation, and I haven't got that knack yet.
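For anyone who wants to experiment with the date filtering idea, here is a rough sketch of the logic in Python rather than Perl, and outside sitescooper itself. It assumes the story URLs carry the publication date as /YYYY/MM/DD/ in the path, which you should verify against the real links. The function names and the example URLs are hypothetical, and wiring this into a site file is left open.

```python
import re
from datetime import date, timedelta

# Assumption: story URLs embed the publication date as /YYYY/MM/DD/.
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def story_date(url):
    """Return the date embedded in the URL, or None if none is found."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # digits matched but do not form a real date


def is_fresh(url, today, max_age_days=10):
    """True if the story is at most max_age_days old, or has no date in its URL."""
    published = story_date(url)
    if published is None:
        return True  # cannot tell the age, so keep the URL
    return today - published <= timedelta(days=max_age_days)


def filter_fresh(urls, today=None, max_age_days=10):
    """Keep only the URLs that pass is_fresh, preserving their order."""
    today = today or date.today()
    return [url for url in urls if is_fresh(url, today, max_age_days)]


if __name__ == "__main__":
    # Made-up URLs for illustration only.
    urls = [
        "http://www.nytimes.com/2004/03/12/national/story-a.html",
        "http://www.nytimes.com/2004/02/20/national/story-b.html",
    ]
    print(filter_fresh(urls, today=date(2004, 3, 12)))
```

URLs without a recognizable date are kept rather than dropped, on the theory that scooping one stale story is better than silently losing a current one.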

Also, I've sometimes seen story pages left blank on one run that work fine on the next. This may be a network issue of some sort. If a page doesn't come through the first time, try running the scoop again and see if it picks up what it missed.

Regardless, in my testing it has worked fabulously. There are no cookie issues. The printer-friendly pages make for nice reading. If you've been waiting for a non-Avantgo NYTimes, here's your chance. If this works for you, please let me know! If you encounter any weird behavior, please let me know that too. I haven't checked even a quarter of the possible pages, so anything could happen. The movies section had slightly different formatting than the other pages and required a little tweaking. Some other sections might need the same.

To summarize, download the new_york_times.html file below (actually it shows up as new_york_time.txt, because the html extension is not allowed - once you download it, change the extension back to html). Download the NYTimes.site file. Put them both in your sitescooper folder. You will have to edit the URL portion of the site file to reflect exactly where the new_york_times.html file is. Then either create a batch file that runs this site exclusively, like in the examples at the top of the page, or add the NYTimes.site file to your sites directory and let it run when the rest of your sites run.

Sitescooper is more complicated than the other guys, but well worth the effort. Any questions or comments? Let me hear it...