Quote:
Originally Posted by Starson17
You still haven't used append_page. Add preprocess_html the way that it's used in AG.
Starson, could you help, please? I have another multipage issue. I've run into another website with multipage articles where the next page is linked via a button image, as follows:
Code:
<a href="/GB/1027/11928295.html">
<img src="/img/next_b.gif" border="0">
</a>
Please look at the code below (click on the Show button), which I modified from AG to combine the pages.
Here I was trying to find the image with src='/img/next_b.gif' and then grab the href for the next-page URL, but it doesn't seem to work. What did I do wrong?
Spoiler:
Code:
def append_page(self, soup, appendtag, position):
    pager = soup.find('img', attrs={'src': '/img/next_b.gif'})
    if pager:
        nexturl = self.INDEX + pager.a['href']
        soup2 = self.index_to_soup(nexturl)
        texttag = soup2.find('div', attrs={'class': 'left_content'})
        #for it in texttag.findAll(style=True):
        #    del it['style']
        newpos = len(texttag.contents)
        self.append_page(soup2, texttag, newpos)
        texttag.extract()
        appendtag.insert(position, texttag)

def preprocess_html(self, soup):
    mtag = '<meta http-equiv="content-type" content="text/html;charset=GB2312" />\n<meta http-equiv="content-language" content="utf-8" />'
    soup.head.insert(0, mtag)
    # Strip inline styles from every tag that carries a style attribute
    for item in soup.findAll(style=True):
        del item['style']
    self.append_page(soup, soup.body, 3)
    #pager = soup.find('a',attrs={'class':'ab12'})
    #if pager:
    #    pager.extract()
    return soup
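
A note on the lookup described above: in the posted markup the <img> sits inside the <a>, so the link is the image's parent rather than its child. That may be why pager.a['href'] finds nothing. In BeautifulSoup, the href would presumably have to be read from pager.parent instead. The sketch below shows the same idea using only the Python standard library, so it can be tried outside calibre. NextPageFinder, find_next_url and the http://example.com base URL are hypothetical names for illustration, not part of the recipe API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextPageFinder(HTMLParser):
    """Record the href of the <a> that wraps the 'next page' button image."""

    def __init__(self, button_src):
        super().__init__()
        self.button_src = button_src
        self.open_hrefs = []   # hrefs of the <a> tags that are currently open
        self.next_href = None  # href of the link wrapping the button, if found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a':
            self.open_hrefs.append(attrs.get('href'))
        elif tag == 'img' and attrs.get('src') == self.button_src:
            # The link is the innermost <a> enclosing the image (its parent)
            if self.open_hrefs and self.next_href is None:
                self.next_href = self.open_hrefs[-1]

    def handle_endtag(self, tag):
        if tag == 'a' and self.open_hrefs:
            self.open_hrefs.pop()


def find_next_url(html, base_url, button_src='/img/next_b.gif'):
    """Return the absolute next-page URL, or None if there is no button."""
    finder = NextPageFinder(button_src)
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(base_url, finder.next_href)


# Markup copied from the post; the base URL is a placeholder assumption
snippet = '''<a href="/GB/1027/11928295.html">
<img src="/img/next_b.gif" border="0">
</a>'''
print(find_next_url(snippet, 'http://example.com'))
```

On the last page of an article there is no button image, so find_next_url returns None. That is the same condition the `if pager:` check relies on to stop the recursion in append_page.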