#1
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2010
Device: kindle voyage
skip_ad_pages & nmassage
I have to use skip_ad_pages() for articles like http://nol.hu.feedsportal.com/c/3324...2F/story01.htm , which is an ad page; the actual article follows, usually after a timeout. However, my recipe gets stuck at this article: http://altnyil.nolblog.hu/archives/2...adt_honositas/ . My code is:
Code:
def skip_ad_pages(self, soup):
    if ('advertisement' in soup.find('title').string.lower()):
        href = soup.find('a').get('href')
        return self.index_to_soup(href, raw=True)
    else:
        return None
Code:
soup = BeautifulSoup(unic[0], markupMassage=nmassage)
It is quite unfortunate that the required return value of skip_ad_pages() is the HTML source, yet the only way to download it is index_to_soup(), which returns a soup. So the source is read, parsed, converted back to a string and parsed again, this time with nmassage, and for this very file that second parse is very slow. So, two issues: (1) skip_ad_pages() requires raw HTML, but index_to_soup() only returns a soup, forcing a redundant parse/serialise round trip; (2) the nmassage re-parse is extremely slow for this particular file.
#2
creator of calibre
Posts: 45,334
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
There's no reason to use index_to_soup; it's only a convenience. Use
Code:
html = self.browser.open(url).read().decode('utf-8', 'ignore')
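Putting the two together, the skip_ad_pages() from post #1 could be reworked roughly like this (a sketch only; the recipe class name and title are made up, and the 'advertisement' title check and link selection are carried over unchanged from the original code):
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class NolHuRecipe(BasicNewsRecipe):  # illustrative name, not the actual recipe
    title = 'nol.hu'

    def skip_ad_pages(self, soup):
        # Interstitial ad pages are recognised by their <title>, as in post #1
        t = soup.find('title')
        if t is not None and t.string and 'advertisement' in t.string.lower():
            href = soup.find('a').get('href')
            # Download the real article with the recipe's browser and return
            # its raw HTML, skipping the soup round trip of index_to_soup()
            return self.browser.open(href).read().decode('utf-8', 'ignore')
        return None
This hands back the raw HTML string that skip_ad_pages() expects, without building and re-serialising a soup first.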