Old 04-28-2012, 05:48 PM   #1
Spindoctor
Device: PRS-T1
Improving derStandard-recipe - how to get cover image?


I am currently improving the recipe for the Austrian newspaper "Der Standard".

I have already added some news feeds.
Now I am trying to replace the default cover image with the image shown on this page:

This should work with something like:
def get_cover_url(self):
The problem is that I cannot find a pattern in the image's file name, which changes every day.

Today the image can be found at
but tomorrow it will be a different URL.

How can this be achieved?
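
One idea I am considering, as a rough sketch only: since the file name changes daily, fetch the page that shows the cover and pull the image address out of its HTML instead of guessing the name. The page address `COVER_PAGE`, the pattern `IMG_SRC_RE` and the helper `find_cover_url` below are placeholders I made up; the real page markup would decide what the pattern has to look like.

```python
import re
from urllib.parse import urljoin

# Placeholder address: the real cover page URL is not reproduced here.
COVER_PAGE = 'http://example.com/cover-page'

# Assumption: the cover is the first <img> on the page with a common image extension.
IMG_SRC_RE = re.compile(
    r'<img[^>]+src=["\']([^"\']+\.(?:jpg|jpeg|png|gif))["\']',
    re.IGNORECASE)


def find_cover_url(html, base_url):
    """Return the absolute URL of the first matching image in html, or None."""
    match = IMG_SRC_RE.search(html)
    if match is None:
        return None
    # Resolve relative paths such as '/img/front.jpg' against the page address.
    return urljoin(base_url, match.group(1))


# Inside the recipe class this could be wired up roughly like so (calibre only):
#
#     def get_cover_url(self):
#         raw = self.index_to_soup(COVER_PAGE, raw=True)
#         if isinstance(raw, bytes):
#             raw = raw.decode('utf-8', 'replace')
#         return find_cover_url(raw, COVER_PAGE)
```

If the page contains other images before the cover, the pattern would need to be narrowed, for example by matching a class or id near the cover image.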

Thank you in advance!

By the way, here is the improved .recipe:
#!/usr/bin/env  python
# -*- coding: utf-8 -*-

__license__   = 'GPL v3'
__copyright__ = '2009, Gerhard Aigner <gerhard.aigner at>'

''' - Austrian Newspaper '''
import re
from calibre.web.feeds.news import BasicNewsRecipe

class DerStandardRecipe(BasicNewsRecipe):
    title = u'derStandardComplete'
    __author__ = 'Gerhard Aigner and Sujata Raman and Marcel Jira'
    description = u'Nachrichten aus Österreich'
    publisher =''
    category = 'news, politics, nachrichten, Austria'
    use_embedded_content = False
    remove_empty_feeds = True
    lang = 'de-AT'
    no_stylesheets = True
    encoding = 'utf-8'
    language = 'de'

    oldest_article = 1
    max_articles_per_feed = 100

    extra_css = '''
                h6{color:#404450; font-size:x-small;}
                '''

    feeds          = [
        (u'Newsroom', u''),
        (u'Inland', u''),
        (u'International', u''),
        (u'Wirtschaft', u''),
        (u'Web', u''),
        (u'Sport', u''),
        (u'Panorama', u''),
        (u'Etat', u''),
        (u'Kultur', u''),
        (u'Wissenschaft', u''),
        (u'Gesundheit', u''),
        (u'Bildung', u''),
        (u'Meinung', u''),
        (u'Lifestyle', u''),
        (u'Reisen', u''),
        (u'Karriere', u''),
        (u'Immobilien', u''),
        (u'dieStandard', u''),
        (u'daStandard', u'')
    ]

    keep_only_tags = [
                        dict(name='div', attrs={'class':["artikel","artikelLeft","artikelBody"]}),
    ]

    remove_tags = [
                    dict(name='link'), dict(name='meta'),dict(name='iframe'),dict(name='style'),
                    dict(name='form',attrs={'name':'sitesearch'}), dict(name='hr'),
                    dict(name='div', attrs={'class':["diashow"]})]
    preprocess_regexps = [
        (re.compile(r'\[[\d]*\]', re.DOTALL|re.IGNORECASE), lambda match: ''),
        (re.compile(r'bgcolor="#\w{3,6}"', re.DOTALL|re.IGNORECASE), lambda match: '')
    ]

    filter_regexps = [r'/r[1-9]*']

    def get_article_url(self, article):
        '''if the article links to a index page (ressort) or a picture gallery
           (ansichtssache), don't add it'''
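        if (article.link.count('ressort') > 0 or article.title.lower().count('ansichtssache') > 0):
            return None
        # Skip links that match the ressort pattern. Note that '[1-9]*' also
        # matches zero digits, so any link containing '/r' is skipped.
        matchObj = re.search(r'/r[1-9]*', article.link)

        if matchObj:
            return None

        return article.link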

    def preprocess_html(self, soup):
        soup.html['xml:lang'] = self.lang
        soup.html['lang']     = self.lang
        mtag = '<meta http-equiv="Content-Type" content="text/html; charset=' + self.encoding + '">'
        soup.head.insert(0, mtag)

        # Turn list markup into plain divs
        for t in soup.findAll(['ul', 'li']):
            t.name = 'div'
        return soup
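
A side note on the link filter in get_article_url: it can be tried outside calibre with a small stand-alone function. This is only a sketch; `keep_article` and the example.com links are made up for illustration, and the pattern is the same `/r[1-9]*` used in the recipe.

```python
import re

# Same pattern as in the recipe's get_article_url / filter_regexps.
RESSORT_RE = re.compile(r'/r[1-9]*')


def keep_article(link, title):
    """Return True if the article would be kept, False if it would be skipped."""
    # Index pages (ressort) and picture galleries (ansichtssache) are dropped.
    if 'ressort' in link or 'ansichtssache' in title.lower():
        return False
    # '[1-9]*' may match zero digits, so a bare '/r' is enough to skip a link.
    return RESSORT_RE.search(link) is None


print(keep_article('http://example.com/1234/story', 'News'))
print(keep_article('http://example.com/r652', 'Overview'))
print(keep_article('http://example.com/reisen/1234', 'Travel'))
```

The third call shows that a path which merely starts with "r" is skipped as well, so the pattern may need at least one required digit (for example `/r[0-9]+`) if that is not intended.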