#1
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2010
Device: kindle voyage
skip_ad_pages & nmassage
I have to use skip_ad_pages() for articles like http://nol.hu.feedsportal.com/c/3324...2F/story01.htm , which is an ad page; the actual article follows, usually after a timeout. However, my recipe gets stuck at this article: http://altnyil.nolblog.hu/archives/2...adt_honositas/ . My code is:
Code:
def skip_ad_pages(self, soup):
    if ('advertisement' in soup.find('title').string.lower()):
        href = soup.find('a').get('href')
        return self.index_to_soup(href, raw=True)
    else:
        return None
Code:
soup = BeautifulSoup(unic[0], markupMassage=nmassage)
It is quite unfortunate that the required return value of skip_ad_pages() is the HTML source, yet the only way to download it is index_to_soup(), which returns a soup. So the source is read, parsed, converted back to a string and parsed again, this time with nmassage, and for this very file that second parse is very slow. So, two issues: (1) skip_ad_pages() requires raw HTML, but index_to_soup() only returns a soup, forcing a redundant parse/serialise round trip; (2) the nmassage re-parse is extremely slow for this particular file.
#2
creator of calibre
Posts: 45,334
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
There's no reason to use index_to_soup; it's only a convenience. Use
Code:
html = self.browser.open(url).read().decode('utf-8', 'ignore')
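Putting the two together, the skip_ad_pages() from post #1 could be reworked roughly like this (a sketch only; the recipe class name and title are made up, and the 'advertisement' title check and link selection are carried over unchanged from the original code):
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class NolHuRecipe(BasicNewsRecipe):  # illustrative name, not the actual recipe
    title = 'nol.hu'

    def skip_ad_pages(self, soup):
        # Interstitial ad pages are recognised by their <title>, as in post #1
        t = soup.find('title')
        if t is not None and t.string and 'advertisement' in t.string.lower():
            href = soup.find('a').get('href')
            # Download the real article with the recipe's browser and return
            # its raw HTML, skipping the soup round trip of index_to_soup()
            return self.browser.open(href).read().decode('utf-8', 'ignore')
        return None
This hands back the raw HTML string that skip_ad_pages() expects, without building and re-serialising a soup first.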