Old 08-14-2012, 09:45 AM   #2
lrui
Enthusiast
Quote:
Originally Posted by Steven630
I've read previous threads about multi-page fetching, but they didn't solve the problem I have right now.
I've been trying to fetch articles from a website. If an article has only one page, all is well. If, however, an article has more than one page:

1. There is a clickable "previous page" button on every page, even on the first page (in this case, clicking the button takes you to the page you are already on)

2. Likewise, there is a clickable "next page" button even on the last page (when you click "next page" when you are already on the last page, it simply returns you to the last page)

3. There's no option for "single page".

Points 1 and 2 make it difficult to fetch multi-page articles using append_page.

Here's what the page buttons look like (on the first page of a four-page article):
Code:
<div id="pages" class="text-c">
<a class="a1" href="original link">previous page</a> <span>1</span>
<a href="original link + &page=2">2</a>
<a href="original link + &page=3">3</a>
<a href="original link + &page=4">4</a>
<a class="a1" href="original link + &page=2">next page</a></div>


(To make it clearer, I replaced the actual article link with "original link". original link + &page=2 is actually something like http://.......&id=2352&page=2)


Therefore it's something like:
previous page 1 2 3 4 next page

6 buttons on every page

(The article in question is: http://www.ittime.com.cn/index.php?m...tid=29&id=2352) It's in Chinese; I've translated "上一页" and "下一页" as "previous page" and "next page" in the code above.

Can anyone tell me how I should revise the recipe to fetch all pages?

Here is a good example I found in the AdventureGamers recipe, which you can use as a reference.

Code:
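    def append_page(self, soup, appendtag, position):
        # Only articles that show a pager have further pages to append.
        pager = soup.find('div', attrs={'class':'pagination_big'})
        if pager:
            nextpage = soup.find('a', attrs={'class':'next-page'})
            if nextpage:
                # Fetch the next page and pick out its article body.
                nexturl = nextpage['href']
                soup2 = self.index_to_soup(nexturl)
                texttag = soup2.find('div', attrs={'class':'bodytext'})
                # Drop inline styles from the fetched body.
                for it in texttag.findAll(style=True):
                    del it['style']
                newpos = len(texttag.contents)
                # Recurse first, so any later pages are appended to this body.
                self.append_page(soup2, texttag, newpos)
                texttag.extract()
                pager.extract()
                # Insert the combined body into the page we started from.
                appendtag.insert(position, texttag)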
You could use Firebug in Firefox to find the corresponding tags on your site and substitute them for the ones above.
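
One thing to watch on the site in question: the "next page" button on the last page links back to the last page, so a recursion that follows "next page" blindly would probably never stop. A possible stopping rule is to remember the URLs already fetched and quit as soon as "next page" points at one of them. The sketch below shows only that rule, in plain Python with the standard library so it can be tried outside calibre. The names PagerLinks, next_page_url, collect_pages and the example.com URLs are made up for illustration, and fetch is a stand-in for whatever downloads a page (in a recipe that would be self.index_to_soup). It assumes the buttons carry class="a1" as in the quoted markup and that the hrefs are absolute and spelled the same way on every page.

```python
from html.parser import HTMLParser


class PagerLinks(HTMLParser):
    """Collect (href, text) for every <a class="a1"> link, i.e. the
    previous/next buttons in the pager markup quoted above."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False  # True while inside an <a class="a1">
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('class') == 'a1':
            self._in_link = True
            self._href = attrs.get('href')
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_link:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._in_link = False


def next_page_url(html, seen, next_label='next page'):
    """Return the href of the "next page" button, or None when it is
    missing or points at a page that was already fetched."""
    parser = PagerLinks()
    parser.feed(html)
    for href, text in parser.links:
        if text == next_label and href and href not in seen:
            return href
    return None


def collect_pages(first_url, fetch, next_label='next page'):
    """Follow "next page" from first_url until it stops leading anywhere new.
    fetch(url) must return that page's HTML as a string."""
    urls = [first_url]
    while True:
        nxt = next_page_url(fetch(urls[-1]), urls, next_label)
        if nxt is None:
            return urls
        urls.append(nxt)


# Tiny fake three-page article that mimics the behaviour described above:
# "previous" on page 1 and "next" on page 3 both point back at themselves.
def pager(prev, nxt):
    return ('<div id="pages" class="text-c">'
            '<a class="a1" href="%s">previous page</a>'
            '<a class="a1" href="%s">next page</a></div>' % (prev, nxt))


base = 'http://example.com/a'
site = {
    base: pager(base, base + '?page=2'),
    base + '?page=2': pager(base, base + '?page=3'),
    base + '?page=3': pager(base + '?page=2', base + '?page=3'),
}
print(collect_pages(base, site.get))
```

For the real site you would pass next_label='下一页'. Inside append_page the same idea amounts to keeping a set of visited URLs on the recipe and returning early when the next URL is already in it.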

Last edited by lrui; 08-14-2012 at 09:27 PM.