Old 06-21-2010, 12:11 PM   #2178
rty
Quote:
Originally Posted by Starson17
you still haven't used append_page. Add preprocess_html the way that it's used in AG.

Help, Starson, please. Another multipage issue: I've run into another website with multipage articles, where the next page is linked via a button image, as follows:

Code:
<a href="/GB/1027/11928295.html">
<img src="/img/next_b.gif" border="0">
</a>
Please look at the code below, which I modified from AG to combine the pages.

I am trying to find the image with src='/img/next_b.gif' and then grab the href of its link to build the next-page URL, but it doesn't seem to work. What did I do wrong?

Spoiler:
Code:
    def append_page(self, soup, appendtag, position):
        pager = soup.find('img', attrs={'src': '/img/next_b.gif'})
        if pager:
            nexturl = self.INDEX + pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class': 'left_content'})
            #for it in texttag.findAll(style=True):
            #    del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="content-type" content="text/html;charset=GB2312" />\n<meta http-equiv="content-language" content="utf-8" />'
        soup.head.insert(0, mtag)
        for item in soup.findAll(style=True):
            del item['form']
        self.append_page(soup, soup.body, 3)
        #pager = soup.find('a', attrs={'class': 'ab12'})
        #if pager:
        #    pager.extract()
        return soup
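
For reference, here is a minimal standalone sketch of the lookup described above: find the button image, then take the href from the link that wraps it. In the HTML snippet the <img> sits inside the <a>, so the link is the image's parent and not a child of it. In BeautifulSoup terms that would presumably mean reading pager.parent['href'] and not pager.a['href']. The sketch uses only the Python standard library, not calibre or BeautifulSoup. The names NextPageFinder and find_next_url and the example.com base URL are hypothetical, since the real value of INDEX is not shown in the post.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextPageFinder(HTMLParser):
    """Record the href of the <a> that wraps a given button image."""

    def __init__(self, button_src):
        super().__init__()
        self.button_src = button_src
        self.open_hrefs = []   # hrefs of the <a> tags currently open
        self.next_href = None  # href of the link wrapping the button image

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a':
            self.open_hrefs.append(attrs.get('href'))
        elif tag == 'img' and attrs.get('src') == self.button_src:
            # The link is the image's enclosing <a>, not a tag inside the <img>.
            if self.open_hrefs and self.next_href is None:
                self.next_href = self.open_hrefs[-1]

    def handle_endtag(self, tag):
        if tag == 'a' and self.open_hrefs:
            self.open_hrefs.pop()


def find_next_url(html, base_url, button_src='/img/next_b.gif'):
    """Return the absolute next-page URL, or None if there is no pager."""
    finder = NextPageFinder(button_src)
    finder.feed(html)
    if finder.next_href is None:
        return None
    # urljoin resolves a site-relative href against the page URL.
    return urljoin(base_url, finder.next_href)


snippet = ('<a href="/GB/1027/11928295.html">\n'
           '<img src="/img/next_b.gif" border="0">\n'
           '</a>')
# Hypothetical base URL, used only for illustration.
print(find_next_url(snippet, 'http://example.com/GB/1027/index.html'))
```

With the snippet from the post this prints http://example.com/GB/1027/11928295.html, and it returns None on a page without the button, which is what stops the recursion on the last page.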

Last edited by rty; 06-21-2010 at 12:21 PM.