03-03-2005, 07:18 AM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2005
Device: Google Nexus One
|
Spidering Slashdot
I'm trying to spider the Palm version of Slashdot with Sunrise.
The home page ( http://slashdot.org/palm ) contains the last 7 stories, a link to the next page and a link to itself. Each index page is dynamically generated by Apache's mod_rewrite (I think?), with URLs looking like http://slashdot.org/palm/headlines_n.shtml , with n beginning at 2.

Each of these index pages links to stories with URLs looking like http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_n.shtml , where aa is the section's ID, bb/cc/dd is a date in YY/MM/DD format (not the actual publishing time anyway) and eeeeee is a timestamp in HHMMSS format. They are not very useful. The n, the page number, is usually 1, as most stories take only one page. Otherwise, the pages of a story are chained together by previous/next links.

Each story also links to a comments page, http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_comments.shtml . It is an index of the 5 best-moderated comments, linked as http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_comments_n-m.shtml , where n is the number of the comment (1 to 5) and m is the page number of the comment, using the same previous/next chaining system.

I can't find a way to spider efficiently from the first page. I'd like to retrieve the first n index pages (so that I would get the first 7*n stories), the contents of the stories even if they span multiple pages, and their comments in the same way.

My current setting is a link depth of 5, restricted to http://slashdot.org/palm/ plus http://images.slashdot.org/palm/* to get the images. This way I manage to get everything from the first page to (in most cases) the last page of every comment of every story linked from the first page. The problem is that, because of the next/previous story links on almost every page, I still get older stories. I tried at first to block http://slashdot.org/palm/headlines_2.shtml , but with no luck. The last story of the first page had a link to the first story of the second index page. The URL of the first story of the second page doesn't have a regexp-detectable look, so I can't simply exclude it.

Any ideas out there? (Other than writing an ad hoc perl/php/python script, which it seems I'll be forced to do..) |
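For what it's worth, the URL scheme described in the post above can be written down as a handful of regular expressions, which is roughly where an ad hoc script would start. The sketch below is only an illustration in Python. The names classify, story_key and allowed are made up here, and it assumes that the date parts and the timestamp are plain digits and that the section ID contains no slash, none of which the thread confirms.

```python
import re

# Assumed URL shapes, taken from the description in the post above.
BASE = r"^http://slashdot\.org/palm/"
STORY = (r"(?P<section>[^/]+)/(?P<yy>\d{2})/(?P<mm>\d{2})/(?P<dd>\d{2})/"
         r"(?P<stamp>\d+)")

# Hypothetical page kinds; each URL matches at most one pattern.
PATTERNS = [
    ("index", re.compile(BASE + r"headlines_(?P<page>\d+)\.shtml$")),
    ("comment", re.compile(BASE + STORY + r"_comments_(?P<n>\d+)-(?P<page>\d+)\.shtml$")),
    ("comments_index", re.compile(BASE + STORY + r"_comments\.shtml$")),
    ("story", re.compile(BASE + STORY + r"_(?P<page>\d+)\.shtml$")),
]


def classify(url):
    """Return (kind, parts) for a Palm-version URL, or (None, {}) if unknown."""
    for kind, pattern in PATTERNS:
        match = pattern.match(url)
        if match:
            return kind, match.groupdict()
    return None, {}


def story_key(url):
    """Identify the story a story/comment page belongs to, ignoring page numbers."""
    kind, parts = classify(url)
    if kind in ("story", "comments_index", "comment"):
        return (parts["section"], parts["yy"], parts["mm"], parts["dd"], parts["stamp"])
    return None


def allowed(url, wanted_keys):
    """Follow a link only if its story was listed on a wanted index page."""
    key = story_key(url)
    return key is not None and key in wanted_keys
```

The idea is to collect story_key values from the links found on the first n index pages and then follow only links for which allowed returns True. That keeps the multi-page stories and their comments, while the next/previous story links that lead to older stories are dropped, since their keys were never collected.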
03-03-2005, 11:48 AM | #2 |
Member
Posts: 14
Karma: 12
Join Date: Jan 2005
Device: Sprint/Audiovox PPC-6700
|
Skip slashdot, use avantslash
|
03-03-2005, 01:13 PM | #3 |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2005
Device: Google Nexus One
|
Thanks a million! Looks like I'm not the only one whining about Slashcode..
|
09-21-2005, 05:26 PM | #4 |
jPlucker
Posts: 27
Karma: 10
Join Date: Sep 2005
Device: pilot 1000 , IIIxe , tapwave zodiac , Nokia 770
|
Spidering Slashdot with a local starting Page
Spider Slashdot with a local starting page, using JPluck or Sunrise.

Sometimes they use more pages, sometimes fewer (e.g. http://slashdot.org/palm/headlines_23.shtml ?).

A spider depth of 3 gets the externally referenced pages (depth 1 is the local home page). Without pictures and without comments, the Plucker .PDB is ~1200 KB.

slashdot.html:
__________________________
<html>
<head>
<title>slashdot News PDA</title>
</head>
<body>
<div style="text-align: center;">
<a href="http://slashdot.org/palm/headlines_1.shtml">Slashdot News 1</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_2.shtml">Slashdot News 2</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_3.shtml">Slashdot News 3</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_4.shtml">Slashdot News 4</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_5.shtml">Slashdot News 5</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_6.shtml">Slashdot News 6</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_7.shtml">Slashdot News 7</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_8.shtml">Slashdot News 8</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_9.shtml">Slashdot News 9</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_10.shtml">Slashdot News 10</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_11.shtml">Slashdot News 11</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_12.shtml">Slashdot News 12</a><br>
</div>
</body>
</html>
__________________________ |
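Since the number of index pages seems to vary, the start page from the post above could also be generated instead of edited by hand. The following is a minimal sketch in Python, not something JPluck or Sunrise provides. The function names build_start_page and write_start_page are invented for this example, and the default of 12 pages simply mirrors the file shown above.

```python
def build_start_page(pages=12, title="slashdot News PDA"):
    """Return the HTML of a local start page linking to the first `pages` index pages."""
    links = [
        f'<a href="http://slashdot.org/palm/headlines_{n}.shtml">Slashdot News {n}</a><br>'
        for n in range(1, pages + 1)
    ]
    # A bare <br> between consecutive links, as in the hand-written file.
    body = "\n<br>\n".join(links)
    return (
        "<html>\n<head>\n"
        f"<title>{title}</title>\n"
        "</head>\n<body>\n"
        '<div style="text-align: center;">\n'
        f"{body}\n"
        "</div>\n</body>\n</html>\n"
    )


def write_start_page(path="slashdot.html", pages=12):
    """Write the generated start page to a local file for the spider to load."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(build_start_page(pages))
```

Calling write_start_page("slashdot.html", 5) would produce a start page covering only the first five index pages. The spider settings (depth 3, starting from the local file) stay as described in the post.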