Old 03-03-2005, 07:18 AM   #1
GrAfFiT
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2005
Device: Google Nexus One
Spidering Slashdot

I'm trying to spider the Palm version of Slashdot with Sunrise.
The home page ( http://slashdot.org/palm ) contains the last 7 stories, a link to the next page, and a link to itself.
Each index page is dynamically generated by Apache's mod_rewrite (I think?), with URLs looking like: http://slashdot.org/palm/headlines_n.shtml with n beginning at 2.
Each of these index pages links to stories with URLs looking like:
http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_n.shtml
aa is the section's ID, bb/cc/dd is a date in YY/MM/DD format (not the actual publishing time anyway), and eeeeee is a timestamp in HHMMSS format. They are not very useful. The n, the page number, is usually 1, as most stories take only one page. Otherwise the pages of a story are chained together by previous/next links.
Each of these stories also links to a comments page:
http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_comments.shtml
It is an index of the 5 best-moderated comments, linked as:
http://slashdot.org/palm/aa/bb/cc/dd/eeeeee_comments_n-m.shtml
where n is the number of the comment (1 to 5) and m is the page number within the comment, using the same previous/next chaining system.
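
For anyone who does end up scripting this, the URL shapes described above could be told apart with a few regular expressions. This is only a sketch: the patterns are inferred from the description in this post (the character classes for aa and eeeeee are guesses), and `classify` and `PATTERNS` are made-up names, not part of Sunrise or Slashcode.

```python
import re

# Patterns assumed from the URL shapes described in the post; the real site may differ.
BASE = r"^http://slashdot\.org/palm/"
# aa / YY/MM/DD / eeeeee (section ID, date, timestamp)
STORY_KEY = r"(?P<section>[^/]+)/(?P<date>\d{2}/\d{2}/\d{2})/(?P<stamp>\d+)"

PATTERNS = [
    ("index", re.compile(BASE + r"headlines_(?P<page>\d+)\.shtml$")),
    ("comment", re.compile(BASE + STORY_KEY + r"_comments_(?P<n>\d+)-(?P<m>\d+)\.shtml$")),
    ("comments_index", re.compile(BASE + STORY_KEY + r"_comments\.shtml$")),
    ("story", re.compile(BASE + STORY_KEY + r"_(?P<page>\d+)\.shtml$")),
]


def classify(url):
    """Return (kind, fields) for a Palm-edition URL, or ("other", {}) if nothing matches."""
    for kind, pattern in PATTERNS:
        match = pattern.match(url)
        if match:
            return kind, match.groupdict()
    return "other", {}
```

The home page itself ( http://slashdot.org/palm ) falls through to "other" here, so a script would have to treat it as a special case.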

I can't find a way to spider efficiently from the first page. I'd like to retrieve the first n index pages (so that I would get the first 7*n stories), the contents of the stories even if they span multiple pages, and their comments in the same way.

My current setting is a link depth of 5, restricted to http://slashdot.org/palm/ plus http://images.slashdot.org/palm/* to get the images. This way I manage to get everything from the first page to (in most cases) the last page of every comment of every story linked from the first page.
The problem is that, because of the next/previous story links on almost every page, I still get older stories. At first I tried to block http://slashdot.org/palm/headlines_2.shtml, but with no luck: the last story of the first page has a link to the first story of the second index page.
The URL of the first story of the second page doesn't have a regexp-detectable look, so I can't simply exclude it.
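
One way a custom script could sidestep the next/previous story links: instead of excluding URLs, only follow links that stay inside the same story, meaning the same aa/bb/cc/dd/eeeeee prefix. This is a sketch of that rule, not something Sunrise offers as far as this thread shows; `story_key` and `should_follow` are hypothetical names and the URL pattern is an assumption based on the shapes described above.

```python
import re

# Assumed shape: http://slashdot.org/palm/<aa>/<YY>/<MM>/<DD>/<HHMMSS>_<suffix>.shtml
# where <suffix> is a page number, "comments", or "comments_n-m".
STORY_URL = re.compile(
    r"^http://slashdot\.org/palm/([^/]+/\d{2}/\d{2}/\d{2}/\d+)_[\w-]+\.shtml$"
)


def story_key(url):
    """Return the aa/bb/cc/dd/eeeeee part of a story or comment URL, else None."""
    match = STORY_URL.match(url)
    return match.group(1) if match else None


def should_follow(current_url, link_url):
    """Decide whether a link found on current_url should be fetched."""
    link_key = story_key(link_url)
    if link_key is None:
        return False  # index pages and off-site links are never followed from here
    current_key = story_key(current_url)
    if current_key is None:
        return True  # we are on an index page: every story it lists is wanted
    # On a story or comment page: stay within the same story, so the
    # next/previous *story* links are dropped.
    return link_key == current_key
```

With this rule the index pages have to be seeded explicitly (the first n of them), since links to other index pages are never followed.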

Any ideas out there? (Other than writing an ad hoc perl/php/python script, which it seems I'll be forced to do...)
Old 03-03-2005, 11:48 AM   #2
halr9000
Member
Posts: 14
Karma: 12
Join Date: Jan 2005
Device: Sprint/Audiovox PPC-6700
Skip slashdot, use avantslash

It's exactly what you need.

http://www.fourteenminutes.com/code/avantslash/
Old 03-03-2005, 01:13 PM   #3
GrAfFiT
Thanks a million! Looks like I'm not alone in whining about Slashcode...
Old 09-21-2005, 05:26 PM   #4
37lIUx7Yx4Y
jPlucker
Posts: 27
Karma: 10
Join Date: Sep 2005
Device: pilot 1000 , IIIxe , tapwave zodiac , Nokia 770
Spidering Slashdot with a local starting Page

You can spider Slashdot from a local starting page, using JPluck or Sunrise.

The number of index pages varies: sometimes they use more pages, sometimes fewer (does http://slashdot.org/palm/headlines_23.shtml exist?).
A spider depth of 3 gets the externally referenced pages (depth 1 is the local home page).
Without pictures and without comments, the Plucker .PDB is ~1200 KB.
slashdot.html:
__________________________
<html>
<head>
<title>slashdot News PDA</title>
</head>
<body>
<div style="text-align: center;">
<a href="http://slashdot.org/palm/headlines_1.shtml">Slashdot News 1</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_2.shtml">Slashdot News 2</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_3.shtml">Slashdot News 3</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_4.shtml">Slashdot News 4</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_5.shtml">Slashdot News 5</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_6.shtml">Slashdot News 6</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_7.shtml">Slashdot News 7</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_8.shtml">Slashdot News 8</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_9.shtml">Slashdot News 9</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_10.shtml">Slashdot News 10</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_11.shtml">Slashdot News 11</a><br>
<br>
<a href="http://slashdot.org/palm/headlines_12.shtml">Slashdot News 12</a><br>
</div>
</body>
</html>

__________________________
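
Since the number of index pages changes, the start page above could also be generated rather than edited by hand. A minimal sketch, assuming the same headlines_N.shtml URLs as in the file above; `build_start_page` is a made-up helper name and not part of JPluck or Sunrise.

```python
from html import escape


def build_start_page(n_pages, base="http://slashdot.org/palm"):
    """Build a local start page linking to index pages 1..n_pages."""
    links = []
    for i in range(1, n_pages + 1):
        # URL pattern copied from the hand-written slashdot.html above.
        url = f"{base}/headlines_{i}.shtml"
        links.append(f'<a href="{escape(url)}">Slashdot News {i}</a><br>')
    body = "\n<br>\n".join(links)
    return (
        "<html>\n<head>\n<title>slashdot News PDA</title>\n</head>\n<body>\n"
        '<div style="text-align: center;">\n'
        f"{body}\n"
        "</div>\n</body>\n</html>\n"
    )


# 12 pages reproduces the link list of the hand-written file.
page = build_start_page(12)
```

Writing `page` to slashdot.html and pointing the spider at that file should then behave like the hand-written version, with the page count as the only thing to change.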
All times are GMT -4.

