I have to use skip_ad_pages() for the article
http://nol.hu.feedsportal.com/c/3324...2F/story01.htm, which is an ad page; the actual article follows, usually after a timeout. However, my recipe gets stuck on this article:
http://altnyil.nolblog.hu/archives/2...adt_honositas/ . My code is
Code:
def skip_ad_pages(self, soup):
    if 'advertisement' in soup.find('title').string.lower():
        href = soup.find('a').get('href')
        return self.index_to_soup(href, raw=True)
    else:
        return None
After this, the program hangs endlessly (well, for far too long) while parsing the page in
Code:
soup = BeautifulSoup(unic[0], markupMassage=nmassage)
in simple.py.
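To narrow down where the time goes, one could time each massage step separately. In BeautifulSoup 3, markupMassage is a list of (compiled regex, replacement function) pairs that are applied one after another. The sketch below does the same thing by hand and records the time per pattern. The names time_massage and sample_massage are made up for illustration, and the two patterns are stand-ins, not the real contents of nmassage.

```python
import re
import time


def time_massage(markup, massage):
    """Apply each (pattern, replacement) pair in order and time it.

    Returns the massaged markup and a list of (pattern source, seconds).
    """
    timings = []
    for pattern, repl in massage:
        start = time.perf_counter()
        markup = pattern.sub(repl, markup)
        timings.append((pattern.pattern, time.perf_counter() - start))
    return markup, timings


# Hypothetical stand-in for nmassage; the real list lives in simple.py.
sample_massage = [
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda m: ''),
    (re.compile(r'<br\s*/?>'), lambda m: '<br />'),
]

cleaned, timings = time_massage('<p>a<!-- ad --><br>b</p>', sample_massage)

# Slowest pattern first.
for source, seconds in sorted(timings, key=lambda t: t[1], reverse=True):
    print('%-20s %.6f s' % (source, seconds))
```

Run against the raw source of the problem page with the real nmassage list, this should show whether a single pattern is responsible for the slowdown.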
It is quite unfortunate that skip_ad_pages() is required to return the HTML source, while the only way to download it is index_to_soup(), which returns the soup. So the source is read, parsed, converted back to a string and parsed again, this time using 'nmassage', which makes parsing this particular file very slow. So two issues:
- Can the return type of skip_ad_pages() be tested, so that a result that is already a soup is left alone?
- The nmassage patterns are probably suboptimal; can they be fixed?
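For the first point, this is roughly the check I have in mind. It is only a sketch: split_skip_result is a made-up helper, FakeSoup is a placeholder for a parsed soup, and nothing here is existing calibre code. The assumption is that anything that is not None, str or bytes is already parsed and can skip the second parse.

```python
def split_skip_result(result):
    """Classify the value returned by skip_ad_pages().

    Returns (raw, soup): raw is set when the result still needs parsing,
    soup is set when the result is assumed to be parsed already.
    """
    if result is None:
        return None, None          # not an ad page, nothing to do
    if isinstance(result, (bytes, str)):
        return result, None        # raw HTML: parse as usual
    return None, result            # assumed to be a soup: leave alone


class FakeSoup(object):
    """Placeholder for a parsed document, for illustration only."""


soup = FakeSoup()
print(split_skip_result(None))
print(split_skip_result('<html></html>'))
print(split_skip_result(soup) == (None, soup))
```

With something like this in the caller, the massage and the second parse would run only for the raw case.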