Old 04-13-2011, 04:55 PM   #1
bubak
skip_ad_pages & nmassage

I have to use skip_ad_pages() for http://nol.hu.feedsportal.com/c/3324...2F/story01.htm , which is an ad page; the actual article follows, usually after a timeout. However, my recipe gets stuck on this article: http://altnyil.nolblog.hu/archives/2...adt_honositas/ . My code is:

Code:
    
    def skip_ad_pages(self, soup):
        if ('advertisement' in soup.find('title').string.lower()):
            href = soup.find('a').get('href')
            return self.index_to_soup(href, raw=True)
        else:
            return None
After this, the program hangs (well, takes far too long) parsing the page in
Code:
soup = BeautifulSoup(unic[0], markupMassage=nmassage)
in simple.py.
Unfortunately, skip_ad_pages() is required to return the HTML source, but the only way to download it is index_to_soup(), which returns the soup. So the source is read, parsed, converted to a string, and parsed again, this time using 'nmassage', which makes the parsing very slow for this particular file. So, two questions:
  • Can the return type of skip_ad_pages() be checked and, if it is already a soup, be left alone?
  • The nmassage string is probably suboptimal; can it be fixed?
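
To make the first question concrete, here is a rough sketch of the kind of check being asked for. This is not calibre's actual code: `soup_from_skip_result` and its `parse` argument are hypothetical names used only for illustration, with `parse` standing in for whatever parser call the downloader makes.

```python
def soup_from_skip_result(result, parse):
    """Hypothetical helper: turn a skip_ad_pages() return value into a soup.

    `parse` is assumed to be a callable that builds a soup from HTML text.
    """
    if result is None:
        # Not an ad page: the caller keeps using the original soup.
        return None
    if isinstance(result, (str, bytes)):
        # HTML source was returned, so it has to be parsed once.
        return parse(result)
    # Anything else is assumed to be an already-parsed soup; leave it alone.
    return result
```

With a check like this, a recipe returning a soup would avoid the second parse, while recipes returning HTML source would behave as before.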
Old 04-13-2011, 05:00 PM   #2
kovidgoyal
There's no reason to use index_to_soup; it's only there for convenience. Use:

html = self.browser.open(url).read().decode('utf-8', 'ignore')
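
Put together with the original recipe method, that suggestion might look roughly like the sketch below. It assumes the method lives in a BasicNewsRecipe subclass (so `self.browser` is available) and that the target page is UTF-8, as in the line above. `fetch_article_html` is a hypothetical helper name, and the guards for a missing title or link are an addition that the original code does not have.

```python
def fetch_article_html(browser, url, encoding='utf-8'):
    """Hypothetical helper: download url and return the decoded HTML text."""
    return browser.open(url).read().decode(encoding, 'ignore')


def skip_ad_pages(self, soup):
    # Intended to be pasted into a BasicNewsRecipe subclass as a method.
    title = soup.find('title')
    title_text = (title.string or '') if title is not None else ''
    if 'advertisement' in title_text.lower():
        # As in the original code, the first link on the ad page is
        # assumed to point at the real article.
        link = soup.find('a')
        href = link.get('href') if link is not None else None
        if href:
            # Return the HTML source directly, without building a soup first.
            return fetch_article_html(self.browser, href)
    # Not an ad page (or no usable link): keep the page as it is.
    return None
```

This way the real article is downloaded and handed back as text, so it is parsed only once by the downloader rather than once in the recipe and again afterwards.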