11-07-2010, 04:49 PM   #1
malfi
European Voice (by The Economist) - Problem with skipping

Hi, I fiddled around with Python today, trying to understand the mechanisms of the calibre framework.

The problem with European Voice is that several articles linked from their feed can only be read by subscribers. I want to skip these articles once it becomes clear that they don't contain any content. I thought the preprocess_html method could be used for this: if the word "Subscribers" is in the soup, I return None instead of the soup object.
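
To make the idea concrete, here is a minimal standalone sketch of that check, outside calibre. It uses only the Python standard library instead of the soup object, and the names looks_subscriber_only and SUBSCRIBER_RE are made up for illustration. The exact wording of the paywall notice is an assumption as well.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document (attributes are ignored)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Hypothetical marker: the real paywall wording on the site may differ.
SUBSCRIBER_RE = re.compile(r'\bsubscribers?\b', re.IGNORECASE)


def looks_subscriber_only(html):
    """Return True if the page text mentions the subscriber marker."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return bool(SUBSCRIBER_RE.search(' '.join(parser.chunks)))
```

A plain word match like this would also flag a normal article that happens to mention subscribers, so a real recipe would probably want to look at a more specific element of the page.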

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class EuropeanVoice(BasicNewsRecipe):
    title          = u'European Voice'
    oldest_article = 14
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.europeanvoice.com/Css/images/logo.gif'
    language = 'en'
    keep_only_tags    = [dict(name='div', attrs={'id':'articleLeftColumn'})]
    remove_tags    = [dict(name='div', attrs={'id':'BreadCrump'})]
    feeds          = [
        (u'Whole site ', u'http://www.europeanvoice.com/Rss/2.xml'),
        (u'News and analysis', u'http://www.europeanvoice.com/Rss/6.xml'),
        (u'Comment', u'http://www.europeanvoice.com/Rss/7.xml'),
        (u'Special reports', u'http://www.europeanvoice.com/Rss/5.xml'),
        (u'People', u'http://www.europeanvoice.com/Rss/8.xml'),
        (u'Career', u'http://www.europeanvoice.com/Rss/11.xml'),
        (u'Policies', u'http://www.europeanvoice.com/Rss/4.xml'),
        (u'EVents', u'http://www.europeanvoice.com/Rss/10.xml'),
        (u'Policies - Economics', u'http://www.europeanvoice.com/Rss/31.xml'),
        (u'Policies - Business', u'http://www.europeanvoice.com/Rss/19.xml'),
        (u'Policies - Trade', u'http://www.europeanvoice.com/Rss/25.xml'),
        (u'Policies - Information society', u'http://www.europeanvoice.com/Rss/20.xml'),
        (u'Policies - Energy', u'http://www.europeanvoice.com/Rss/15.xml'),
        (u'Policies - Transport', u'http://www.europeanvoice.com/Rss/18.xml'),
        (u'Policies - Climate change', u'http://www.europeanvoice.com/Rss/16.xml'),
        (u'Policies - Environment', u'http://www.europeanvoice.com/Rss/17.xml'),
        (u'Policies - Farming & food', u'http://www.europeanvoice.com/Rss/23.xml'),
        (u'Policies - Health & society', u'http://www.europeanvoice.com/Rss/24.xml'),
        (u'Policies - Justice', u'http://www.europeanvoice.com/Rss/29.xml'),
        (u'Policies - Foreign affairs', u'http://www.europeanvoice.com/Rss/27.xml')
    ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''
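    def print_version(self, url):
        # Append the print-view query parameter to the article URL
        return url + '?bPrint=1'

    def preprocess_html(self, soup):
        self.log('\t checking for subscriber only content')
        # Subscriber-only pages carry the text 'Subscribers'
        denied = soup.findAll(True, text='Subscribers')
        if denied:
            self.log('\t skipped, because content can only be seen with subscription')
            # Returning None drops the page content, but the index entry remains
            return None
        return soup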
That doesn't really work: these articles are no longer downloaded, but they still appear in the index. How can I skip such articles completely?