Quote:
Originally Posted by Steven630
I've read previous threads about multi-page fetching, but they didn't solve the problem I have right now.
I've been trying to fetch articles from a website. If an article has only one page, all is well. If, however, an article has more than one page:
1. There is a clickable "previous page" button on every page, even on the first page (in this case, clicking this button takes you to the same link you are browsing)
2. Likewise, there is a clickable "next page" button even on the last page (when you click "next page" when you are already on the last page, it simply returns you to the last page)
3. There's no option for "single page".
Points 1 and 2 make it difficult to fetch multi-page articles using append_page.
Here's what the page buttons look like (on the first page of a four-page article):
Code:
<div id="pages" class="text-c">
<a class="a1" href="original link">previous page</a> <span>1</span>
<a href="original link + &page=2">2</a>
<a href="original link + &page=3">3</a>
<a href="original link + &page=4">4</a>
<a class="a1" href="original link + &page=2">next page</a></div>
(To make it clearer, I replaced the actual article link with "original link". original link + &page=2 is actually something like http://.......&id=2352&page=2)
Therefore it's something like:
previous page 1 2 3 4 next page
6 buttons on every page
(The article in question is: http://www.ittime.com.cn/index.php?m...tid=29&id=2352) It's in Chinese; I've translated "上一页" and "下一页" as "previous page" and "next page" in the code above.
Can anyone tell me how I should revise the recipe to fetch all pages?
|
Here is a good example I found in the AdventureGamers recipe, which you can use as a reference.
Code:
def append_page(self, soup, appendtag, position):
    # Look for the pager block on the current page
    pager = soup.find('div', attrs={'class':'pagination_big'})
    if pager:
        nextpage = soup.find('a', attrs={'class':'next-page'})
        if nextpage:
            # Fetch the next page and pull out its article body
            nexturl = nextpage['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'bodytext'})
            for it in texttag.findAll(style=True):
                del it['style']
            # Recurse so that any further pages are appended to this one first
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            pager.extract()
            # Attach the collected text to the page we started from
            appendtag.insert(position,texttag)
You can use Firebug in Firefox to find the corresponding tags on your site and substitute them for the ones above.
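For the pager in the quoted question, where "next page" on the last page links back to the page itself, following the "next page" link recursively would never stop. One way around that is to ignore the "previous page" / "next page" buttons and instead collect every distinct link inside the div with id "pages", skipping the URL you are already on. Below is a minimal sketch of that idea using only the Python standard library. It is not taken from the AdventureGamers recipe; the names PagerLinks and extra_page_urls are made up for illustration, and the example.com URL is a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PagerLinks(HTMLParser):
    """Collect the href of every <a> found inside <div id="pages">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # <div> nesting depth inside the pager; 0 means outside
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            # Enter the pager, or track a nested <div> once inside it
            if self.depth or attrs.get('id') == 'pages':
                self.depth += 1
        elif tag == 'a' and self.depth and attrs.get('href'):
            self.hrefs.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


def extra_page_urls(html, current_url):
    """Return the distinct pager URLs, in document order, minus current_url."""
    parser = PagerLinks()
    parser.feed(html)
    seen = {current_url}   # drops "previous page" links that point back here
    urls = []
    for href in parser.hrefs:
        url = urljoin(current_url, href)
        if url not in seen:   # drops the duplicate "next page" link
            seen.add(url)
            urls.append(url)
    return urls
```

Called on the first page of an article, this should give the URLs of pages 2 to 4 exactly once each. Inside a recipe you could then loop over that list, load each URL with self.index_to_soup, and insert the article body into the first page, in place of the recursive call. The tag and class that hold the article body on your site are something you would still need to look up yourself.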