03-12-2004, 12:03 AM   #1
ignatz
mechanoholic
ignatz ought to be getting tired of karma fortunes by now.
Posts: 582
Karma: 1000217
Join Date: Mar 2004
Location: Sarasota, FL
Device: Nook STR/iPhone 4S/EVO 4G
New York Times Scoop!

Okay, I have worked out a rudimentary site file that reads the New York Times front page from the RadioUserland rss feed. It works well, but the format of the output contents page is really ugly and needs some major help. (On the other hand, the story pages look great.) I'm struggling to figure out how to make this change, but in the meantime, feel free to give it a whirl. Any comments are welcome. Any guidance on perl would be great.

You can either copy the following text and save it in a file with the .site extension (e.g. NYT_Front.site) or just download the attachment.
#NYTimes Front Page
#sitescooper .site file by Ignatz Sol
URL: http://partners.userland.com/nytrss/nytHomepage.xml
Name: New York Times: Front Page
Description: The latest New York Times front page headlines.
ContentsFormat: rss

Levels: 2
StoryURL: http://www.nytimes.com/.+USERLAND.*
StoryToPrintableSub: s/USERLAND/USERLAND&pagewanted=print&position=/

StoryStart: </head>
StoryEnd: /NYT_TEXT
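
For anyone unsure what the StoryURL and StoryToPrintableSub lines do, here is a minimal Python sketch of the same two steps. Sitescooper itself is Perl and applies these patterns internally, possibly with different anchoring, so treat this only as an illustration. The example link is made up.

```python
import re

# Pattern copied from the StoryURL line of the .site file above.
STORY_URL = re.compile(r"http://www.nytimes.com/.+USERLAND.*")


def is_story(url):
    """True if the link looks like a story URL per the StoryURL pattern."""
    return STORY_URL.fullmatch(url) is not None


def to_printable(url):
    """Mimic s/USERLAND/USERLAND&pagewanted=print&position=/ (first match only)."""
    return re.sub("USERLAND", "USERLAND&pagewanted=print&position=", url, count=1)


# Hypothetical link, only for illustration (not taken from the real feed).
example = "http://www.nytimes.com/2004/03/12/national/12EXAMPLE.html?partner=USERLAND"
if is_story(example):
    print(to_printable(example))
```

Links that do not match the StoryURL pattern are simply not treated as stories; matching links get the print parameters appended right after USERLAND.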
Attached Files
File Type: site NYTimes_Frontpage.site (413 Bytes, 1061 views)
03-12-2004, 11:00 AM   #2
ignatz
Never mind this one: I've got a much better one coming. Admittedly, it's not mine. I've found a great scoop by Kennis Koldewyn on the sitescooper mailing list and I'm modifying it to improve it and make it easier for everyone to get what they want from it. Stay tuned, the New York Times is almost within reach.
03-12-2004, 04:52 PM   #3
ignatz
mechanoholic
ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.ignatz ought to be getting tired of karma fortunes by now.
 
ignatz's Avatar
 
Posts: 582
Karma: 1000217
Join Date: Mar 2004
Location: Sarasota, FL
Device: Nook STR/iPhone 4S/EVO 4G
Okay, now I've got a good one. The core of this scoop came from the sitescooper mailing list and was written by Kennis Koldewyn. I've just expanded and tweaked it a bit. The basic idea is great. You have an html file on your desktop that contains links to all the text only menus at the NYT. This local html file is your URL. The site file is 3 levels deep, so you get your local file as the top level, then the link to headlines, and finally the stories. In preliminary testing it has performed admirably.

However, there are a few outstanding issues. First, I recommend that you severely limit the categories from which you download. There are a lot of stories available and your converted file can easily get big in a hurry. The raw html file here has every option commented out except for National and International headlines. But I have included every category that you see on this page. What you must do is delete the open and close comment markers on the sections that you want. (Open comment is "<!--" and close comment is "-->".) I've been using only 10 sections and I can quickly go up to 900KB unconverted. (iSilo then shrinks this back down to around 300KB.) If your raw converted filesize is above 500KB, sitescooper will stop scooping. You have to add a parameter into your scooping command to redefine the limit. For example, if your command is:
perl sitescooper.pl -site NYTimes.site -misilox
and sitescooper is reporting that it's running over the limit, you can add a parameter like the following:
perl sitescooper.pl -site NYTimes.site -limit 1000 -misilox
This will raise the limit to 1000KB. If that's still not enough, go back and raise it again.
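
If you launch sitescooper from a script rather than typing the command, the same two command lines can be assembled like this. It is only a sketch: it uses just the flags shown above (-site, -limit, -misilox), and the function name scoop_command is made up for the example.

```python
def scoop_command(site_file, limit_kb=None):
    """Build the sitescooper command line as a list of arguments.

    limit_kb is the size limit in KB; leave it as None to use the default.
    """
    cmd = ["perl", "sitescooper.pl", "-site", site_file]
    if limit_kb is not None:
        cmd += ["-limit", str(limit_kb)]
    cmd.append("-misilox")
    return cmd


# The two commands from the post, rebuilt from their parts.
print(" ".join(scoop_command("NYTimes.site")))
print(" ".join(scoop_command("NYTimes.site", limit_kb=1000)))
```

The resulting list can be handed to subprocess.run() from the sitescooper folder if you want the script to start the scoop itself.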

Also, some of the categories keep stories that are way out of date. If a story is more than 10 days old, the URL that this site file uses gets redirected (because of the way that NYT archives their old content) and you lose the printer-friendly page. So if that story is split over two pages, you won't get the second page. I have tried a few tricks, such as setting the "StoryFollowLinks" parameter in the site file to 1, but that hasn't worked. I'm also looking at possible ways to filter out the older URLs and just not scoop them at all, but that involves some perl date manipulation, and I haven't got that knack yet.
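
The date check itself is not complicated. Below is a rough Python sketch of the logic, not something sitescooper does today, and it would still need to be ported to perl and hooked into the site file. It assumes the story URLs carry their date as /YYYY/MM/DD/ in the path, which is an assumption about the NYT links rather than something confirmed in this thread. The function names are made up for the example.

```python
import re
from datetime import date

# Assumed URL layout: .../2004/03/12/... (year/month/day in the path).
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def story_date(url):
    """Return the date embedded in the URL, or None if there is none."""
    m = DATE_IN_URL.search(url)
    if m is None:
        return None
    year, month, day = map(int, m.groups())
    try:
        return date(year, month, day)
    except ValueError:  # digits that do not form a real calendar date
        return None


def is_fresh(url, today, max_age_days=10):
    """True if the story is at most max_age_days old (undated URLs are kept)."""
    d = story_date(url)
    if d is None:
        return True
    return (today - d).days <= max_age_days
```

Calling is_fresh(url, date.today()) on each story link and skipping the ones that return False would leave only stories that should still have a printer-friendly page.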

Also, sometimes I've seen story pages left blank on one run that work fine on the next run. This may be some sort of network issue or something. But if it doesn't work the first time through, try running it again and see if it picks up what it missed the first time around.

Regardless, in my testing it has worked fabulously. There are no cookie issues. The printer-friendly pages make for nice reading. If you've been waiting for a non-Avantgo NYTimes, here's your chance. If this works for you, please let me know! If you encounter any weird behavior, please let me know. I haven't checked even a quarter of the possible pages, so anything could happen. The movies section had slightly different formatting than the other pages and required a little tweaking. Some other sections might also.

To summarize, download the new_york_times.html file below (actually it shows up as new_york_times.txt, because the html extension is not allowed; once you download it, change the extension back to html). Download the NYTimes.site file. Put them in your sitescooper folder. You will have to edit the URL portion of the site file to reflect exactly where the new_york_times.html file is. Then create a batch to run this one exclusively, like in the examples at the top of the page, or add the NYTimes.site file to your sites directory and let it run when the rest of your sites run.
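
If you would rather not edit the URL line by hand, something like the following Python sketch can rewrite it. It assumes the site file has a line beginning with "URL:" (as in the first site file in this thread) and that sitescooper accepts a file:// URL for a local page; check both against the attached NYTimes.site before relying on it. The function name point_site_at is made up for the example.

```python
from pathlib import Path


def point_site_at(site_text, html_path):
    """Return site_text with its URL: line pointing at the local html file."""
    # absolute() makes the path absolute without requiring the file to exist.
    url = Path(html_path).absolute().as_uri()
    lines = []
    for line in site_text.splitlines():
        if line.startswith("URL:"):
            line = "URL: " + url
        lines.append(line)
    return "\n".join(lines) + "\n"


# Typical use: read the site file, repoint it, write it back.
# site = Path("NYTimes.site")
# site.write_text(point_site_at(site.read_text(), "new_york_times.html"))
```

Every other line of the site file is passed through unchanged, so the same helper works for the individual section scoops too.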

Sitescooper is more complicated than the other guys, but well worth the effort. Any questions or comments? Let me hear it...
Attached Files
File Type: txt new_york_times.txt (10.9 KB, 1122 views)
File Type: site NYTimes.site (1.7 KB, 988 views)

Last edited by ignatz; 03-12-2004 at 05:25 PM.
03-15-2004, 10:47 PM   #4
ignatz
I'm now using this scoop daily and it works great. Another advantage of grabbing this much info with Sitescooper is that it can compare against its cache and not spend time reconverting stories that it has already done. With several days to weeks of stories available, this could help speed conversion.

My iSilo converts at 5:30 am, Sitescooper at 5:45, and then it all syncs at 6. When I walk in and grab my Palm I'm ready to go...

If anyone has concerns about configuring these files, I'd be happy to help, and will customize them for you. Let me know.
03-22-2004, 09:34 AM   #5
melvynadam
Connoisseur
melvynadam began at the beginning.
Posts: 69
Karma: 10
Join Date: Jan 2004
Location: Israel
Device: Kindle Paperwhite, iPhone, iPad
Okay, I've never used Sitescooper but would love to get the NY Times into my iSilo every day. Is there a way of doing this without Sitescooper?
04-08-2004, 10:16 AM   #6
ignatz
Here are a few "premade" individual section scoops for International, National, and Technology. More are available upon request. Alex, you can add these to the "scoops" section here.

They do require one edit. You must open the .site file and change the path to the appropriate html file (included here).
Attached Files
File Type: zip NYTimes_scoops.zip (6.4 KB, 979 views)

Last edited by ignatz; 04-08-2004 at 01:33 PM.
04-08-2004, 11:37 AM   #7
Alexander Turcic
Fully Converged
Alexander Turcic ought to be getting tired of karma fortunes by now.
Posts: 18,163
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
Thanks, Ignatz. I will do this as soon as I am back at work (next week, after Easter). My machine at work does all the scooping work :P
04-08-2004, 01:45 PM   #8
Zire
Fanatic
Zire can read faster than his screen refreshes
Posts: 522
Karma: 14050
Join Date: May 2003
Location: Astoria, NY
Device: Zire 71
Really would like the NYTimes thing put to rest. Thanks for the update.
All times are GMT -4.