Old 03-03-2005, 07:18 AM   #1
GrAfFiT
Spidering Slashdot

I'm trying to spider the Palm version of Slashdot with Sunrise.
The home page ( http://slashdot.org/palm ) contains the latest 7 stories, plus links to the next page and to itself.
Each index page is dynamically generated by Apache's mod_rewrite (I think?) with URLs of the form: http://slashdot.org/palm/headlines_n.shtml with n beginning at 2.
Each of these index pages links to stories with URLs of the form:
http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_n.shtml
aa is the section's ID, bb/cc/dd is a date in YY/MM/DD format (not the actual publishing time, though), and eeeeee is a timestamp in HHMMSS format. They are not very useful. The n is the page number, usually 1, since most stories take only one page. Otherwise, the pages of a story are chained together by previous/next links.
Each of these stories also links to a comments page:
http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_comments.shtml
It is an index of the 5 highest-moderated comments, linked as:
http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_comments_n-m.shtml
where n is the number of the comment (1 to 5) and m is the page number of the comment using the same previous/next chaining system.
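
To make the URL scheme above concrete, here is a minimal sketch of a classifier for the four kinds of pages. The regexps are only inferred from the URL shapes described in this post (they are assumptions, not an official Slashdot scheme), and the names `classify`, `PATTERNS` and `STORY_PATH` are made up for illustration.

```python
import re

# Patterns inferred from the URL shapes described above (assumptions only).
ROOT = r"^http://slashdot\.org/palm"
# aa / bb / cc / dd / eeeeee  ->  section / YY / MM / DD / timestamp
STORY_PATH = r"/(?P<section>[^/]+)/(?P<yy>\d{2})/(?P<mm>\d{2})/(?P<dd>\d{2})/(?P<stamp>\d+)"

PATTERNS = [
    # Home page, or headlines_n.shtml for n >= 2.
    ("index", re.compile(ROOT + r"(?:/|/headlines_(?P<page>\d+)\.shtml)?$")),
    # One page of one comment: eeeeee_comments_n-m.shtml
    ("comment", re.compile(ROOT + STORY_PATH + r"_comments_(?P<comment>\d+)-(?P<page>\d+)\.shtml$")),
    # Index of the comments of a story: eeeeee_comments.shtml
    ("comments_index", re.compile(ROOT + STORY_PATH + r"_comments\.shtml$")),
    # One page of a story: eeeeee_n.shtml
    ("story", re.compile(ROOT + STORY_PATH + r"_(?P<page>\d+)\.shtml$")),
]


def classify(url):
    """Return (kind, fields) for a Palm-version URL, or (None, {}) if unknown."""
    for kind, pattern in PATTERNS:
        match = pattern.match(url)
        if match:
            return kind, match.groupdict()
    return None, {}
```

For the home page the `page` field is simply `None`, since the URL carries no page number. Anything outside http://slashdot.org/palm (the image host, for instance) comes back as `(None, {})`.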

I can't find a way to efficiently spider from the first page. I'd like to retrieve the first n index pages (so that I would get the first 7*n stories), the contents of the stories even if they span multiple pages, and their comments in the same way.
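
As a small illustration of the "first n index pages" part, the index URLs can be generated directly from the page number. This is only a sketch under the assumption that the home page counts as page 1 and that headlines_n.shtml starts at n = 2, as described above; `index_urls` is a made-up helper name.

```python
HOME = "http://slashdot.org/palm"
STORIES_PER_INDEX_PAGE = 7  # as observed on the home page


def index_urls(n):
    """Return the URLs of the first n index pages (the home page is page 1)."""
    return [
        HOME if i == 1 else f"{HOME}/headlines_{i}.shtml"
        for i in range(1, n + 1)
    ]
```

With n = 3 this yields the home page plus headlines_2.shtml and headlines_3.shtml, that is 3 * 7 = 21 stories if every index page really carries 7 of them.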

My current setting is a link depth of 5, restricted to http://slashdot.org/palm/ plus http://images.slashdot.org/palm/* to get the images. This way I get everything from the first page down to (in most cases) the last page of every comment of every story linked from the first page.
The problem is that, because of the next/previous story links on almost every page, I still get older stories. At first I tried blocking http://slashdot.org/palm/headlines_2.shtml, but with no luck: the last story of the first page links to the first story of the second index page.
The URL of that first story of the second page doesn't follow any pattern a regexp could detect, so I can't simply exclude it.
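
If it does come down to an ad hoc script, one possible way around this is to decide per link rather than per URL: a new story may only be reached from an index page, and every other page may only link within its own story. The sketch below assumes the URL shapes described in this post. `should_follow`, `crawl` and the `get_links` callable (which would fetch a page and return its links) are hypothetical names, and nothing here is a Sunrise feature. Images would need to be allowed separately.

```python
import re
from collections import deque

# Assumed URL shapes, inferred from this post.
INDEX = re.compile(r"^http://slashdot\.org/palm(?:/|/headlines_(\d+)\.shtml)?$")
# Captures aa/bb/cc/dd/eeeeee, the part shared by all pages of one story.
STORY_KEY = re.compile(r"^http://slashdot\.org/palm/([^/]+/\d{2}/\d{2}/\d{2}/\d+)_")


def index_page(url):
    """Return the index page number (1 for the home page), or None."""
    match = INDEX.match(url)
    if not match:
        return None
    return int(match.group(1)) if match.group(1) else 1


def story_key(url):
    """Return the story identifier of a story or comment URL, or None."""
    match = STORY_KEY.match(url)
    return match.group(1) if match else None


def should_follow(source, target, max_index=1):
    """Decide whether the link source -> target belongs to the wanted crawl."""
    target_index = index_page(target)
    if target_index is not None:
        # Index pages are followed only up to the requested count.
        return target_index <= max_index
    target_story = story_key(target)
    if target_story is None:
        return False  # neither an index, a story nor a comment page
    if index_page(source) is not None:
        return True  # only index pages may introduce a new story
    # Story and comment pages may only link within their own story.
    return story_key(source) == target_story


def crawl(start, get_links, max_index=1):
    """Breadth-first crawl; get_links(url) is a caller-supplied callable."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link not in seen and should_follow(page, link, max_index):
                seen.add(link)
                queue.append(link)
    return seen
```

With this rule the previous/next story links are dropped because their story identifier differs from the page they were found on, so no regexp has to recognise the older story's URL by itself.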

Any ideas out there? (Other than writing an ad hoc Perl/PHP/Python script, which it seems I'll be forced to do.)