Old 11-09-2011, 10:01 AM   #17
Starson17
Wizard
Multiple Page Sites

This is not my code, but there have been many requests for code to handle sites that split each article across multiple pages, with a button at the bottom of each page that links to the next one. The code at the end of this post, from Darko Miletic's built-in recipe for Adventure Gamers, is typical of what is used in this situation.

You may want to look at the source of an article at Adventure Gamers with Firebug or an equivalent tool. The append_page code finds the "next page" button, follows the link it points to ("nexturl"), finds the article text on that next page, and inserts that text into the first page beneath the first page's own article text. It repeats this process recursively until it reaches the last page, which is identified by the absence of a "next page" button.

The append_page code is then called from preprocess_html, which also removes the leftover pager div before returning the soup.
Code:
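    INDEX                 = u'http://www.adventuregamers.com'

    def append_page(self, soup, appendtag, position):
        # Look for the "next page" button on the current page
        pager = soup.find('div', attrs={'class': 'toolbar_fat_next'})
        if pager:
            # Build the full URL from the site root and the button's link
            nexturl = self.INDEX + pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            # Article text of the next page
            texttag = soup2.find('div', attrs={'class': 'bodytext'})
            newpos = len(texttag.contents)
            # Recurse first, so any later pages are appended to the end of texttag
            self.append_page(soup2, texttag, newpos)
            # Detach the text from its own page and insert it into the previous page
            texttag.extract()
            appendtag.insert(position, texttag)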

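    def preprocess_html(self, soup):
        # Pull in the text of all following pages, inserted at position 3 of soup.body
        self.append_page(soup, soup.body, 3)
        # Remove the pager div left over on the first page
        pager = soup.find('div', attrs={'class': 'toolbar_fat'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)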
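
If you want to experiment with the "follow the next-page link until there isn't one" idea outside of calibre, here is a minimal sketch in plain Python. It is not from the Adventure Gamers recipe: PAGES, fetch and collect_pages are hypothetical names, the "site" is just a dictionary, and the loop is an iterative variant of the recursive append_page above. I have also assumed you might want a guard against a site whose "next" links loop back on themselves, which the recipe code does not have.

```python
# Hypothetical in-memory "site": each URL maps to that page's article text and
# the href of its "next page" button (None on the last page).
PAGES = {
    '/article/1': {'text': ['Page one text.'], 'next': '/article/2'},
    '/article/2': {'text': ['Page two text.'], 'next': '/article/3'},
    '/article/3': {'text': ['Page three text.'], 'next': None},
}


def fetch(url):
    """Stand-in for index_to_soup: return the page stored under url."""
    return PAGES[url]


def collect_pages(first_url, max_pages=50):
    """Follow "next" links from first_url and return all text in page order.

    Iterative rather than recursive. Stops on the last page, on a link that
    was already visited, or once max_pages pages have been read.
    """
    collected = []
    seen = set()
    url = first_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        collected.extend(page['text'])
        url = page['next']
    return collected


if __name__ == '__main__':
    print(collect_pages('/article/1'))
```

In a real recipe the text lives in soup tags rather than lists, so you would still insert each page's 'bodytext' div into the first page as append_page does. The sketch only shows the traversal and the stopping conditions.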