MobileRead Forums - View Single Post

GrAfFiT · 03-03-2005, 08:18 AM

I'm trying to spider the Palm version of Slashdot with Sunrise.
The home page ( http://slashdot.org/palm ) contains the last 7 stories, a link to the next page, and a link to itself.
Each index page is dynamically generated by Apache's mod_rewrite (I think?), with URLs that look like http://slashdot.org/palm/headlines_n.shtml with n beginning at 2.
Each of these index pages links to stories with URLs that look like:
http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_n.shtml
aa is the section's ID, bb/cc/dd is a date in YY/MM/DD format (not the actual publishing time anyway), and eeeeee is a timestamp in HHMMSS format. They are not very useful. The n is the page number, usually 1, as most stories take only one page. Otherwise the pages of a story are chained together by previous/next links.
Each of these stories also links to a comments page:
http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_comments.shtml
It is an index of the 5 best-moderated comments, linked as:
http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_comments_n-m.shtml
where n is the number of the comment (1 to 5) and m is the page number within that comment, using the same previous/next chaining system.
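
To make the URL scheme above concrete, here is a minimal Python sketch that sorts a URL into one of the page types described. It is based only on the patterns listed in this post, not on anything Slashdot documents. The function name classify, the type labels, and the sample URLs are made up for illustration, and the sketch assumes the path segments never contain a slash.

```python
import re

# Patterns derived from the URL shapes described in the post (an assumption,
# not an official Slashdot specification).
BASE = r"^http://slashdot\.org/palm/"
# aa/bb/cc/dd/eeeeee: section ID, date parts, timestamp
STORY = r"(?P<story>[^/]+/[^/]+/[^/]+/[^/]+/[^/_]+)"

PATTERNS = [
    ("home", re.compile(BASE + r"?$")),
    ("index", re.compile(BASE + r"headlines_(?P<page>\d+)\.shtml$")),
    ("story", re.compile(BASE + STORY + r"_(?P<page>\d+)\.shtml$")),
    ("comments_index", re.compile(BASE + STORY + r"_comments\.shtml$")),
    ("comment", re.compile(BASE + STORY + r"_comments_(?P<n>\d+)-(?P<m>\d+)\.shtml$")),
]


def classify(url):
    """Return (page_type, captured_fields) for a Palm-version URL."""
    for kind, pattern in PATTERNS:
        match = pattern.match(url)
        if match:
            return kind, match.groupdict()
    return "other", {}


if __name__ == "__main__":
    # Hypothetical URLs that follow the scheme; not real stories.
    samples = [
        "http://slashdot.org/palm",
        "http://slashdot.org/palm/headlines_2.shtml",
        "http://slashdot.org/palm/05/03/03/02/123456_1.shtml",
        "http://slashdot.org/palm/05/03/03/02/123456_comments.shtml",
        "http://slashdot.org/palm/05/03/03/02/123456_comments_3-1.shtml",
    ]
    for url in samples:
        print(classify(url))
```

The captured story field (aa/bb/cc/dd/eeeeee) is the same for every page of a story and for all of its comment pages, which is what makes it usable as a grouping key.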

I can't find a way to efficiently spider from the first page. I'd like to retrieve the first n index pages (so that I would get the first 7*n stories), the contents of the stories even when they span multiple pages, and their comments in the same way.

My current setting is a link depth of 5, restricted to http://slashdot.org/palm/ plus http://images.slashdot.org/palm/* to get the images. This way I manage to get everything from the first page to (in most cases) the last page of every comment of every story linked from the first page.
The problem is that because of the next/previous story links on almost every page, I still get older stories. At first I tried blocking http://slashdot.org/palm/headlines_2.shtml, but with no luck: the last story of the first page has a link to the first story of the second index page.

The URL of the first story of the second page has nothing in it that a regexp could detect, so I can't simply exclude it.
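
Since no URL pattern separates wanted stories from older ones, one possible way out is to decide whether to follow a link based on the page it was found on, not only on the link itself. The Python sketch below illustrates that rule for an ad hoc script; it is not a Sunrise feature. All names (story_key, index_page, should_follow, max_index_pages) are hypothetical, the URL patterns are assumed from the description in this post, and image URLs are not handled.

```python
import re

PALM = "http://slashdot.org/palm/"
INDEX_RE = re.compile(r"^headlines_(\d+)\.shtml$")
# aa/bb/cc/dd/eeeeee followed by _n, _comments or _comments_n-m
ITEM_RE = re.compile(r"^((?:[^/]+/){4}[^/_]+)_(?:\d+|comments(?:_\d+-\d+)?)\.shtml$")


def story_key(url):
    """Return the aa/bb/cc/dd/eeeeee part of a story or comment URL, else None."""
    if not url.startswith(PALM):
        return None
    match = ITEM_RE.match(url[len(PALM):])
    return match.group(1) if match else None


def index_page(url):
    """Return the index page number (the home page counts as 1), else None."""
    if url.rstrip("/") == PALM.rstrip("/"):
        return 1
    if not url.startswith(PALM):
        return None
    match = INDEX_RE.match(url[len(PALM):])
    return int(match.group(1)) if match else None


def should_follow(source, target, max_index_pages):
    """Decide whether a link found on page `source` should be fetched."""
    source_key = story_key(source)
    if source_key is not None:
        # On a story or comment page: stay inside the same story, so the
        # previous/next story links are never followed.
        return story_key(target) == source_key
    if index_page(source) is not None:
        target_index = index_page(target)
        if target_index is not None:
            # Index to index: only the first max_index_pages pages.
            return target_index <= max_index_pages
        # Index to story: new stories are only ever discovered here.
        return story_key(target) is not None
    return False
```

With this rule, stories are only discovered from index pages, and the multi-page and comment chains are followed because they share the story's aa/bb/cc/dd/eeeeee prefix. Link depth no longer matters, at the cost of having to write the crawl loop by hand.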

Any ideas out there? (Other than writing an ad hoc Perl/PHP/Python script, which it seems I'll be forced to do.)
