Improving derStandard-recipe - how to get cover image?

Spindoctor · 04-28-2012, 05:48 PM

Hi!

I am currently improving the recipe for the Austrian newspaper "Der Standard" (http://derstandard.at).

I already added some news feeds.
Now I am trying to replace the default cover image with the image shown on this page:

http://epaper.derstandarddigital.at/

This should work with something like:

Code:

def get_cover_url(self):
...

The problem is that I cannot find a pattern in the image's name (which changes every day).

Today the image can be found at
http://epaper.derstandarddigital.at/...5362970255.png
but tomorrow it will be a different URL.

How can this be achieved?

Thank you in advance!

By the way, here is the improved .recipe:

Code:

#!/usr/bin/env  python
# -*- coding: utf-8 -*-

__license__   = 'GPL v3'
__copyright__ = '2009, Gerhard Aigner <gerhard.aigner at gmail.com>'

''' http://www.derstandard.at - Austrian Newspaper '''
import re
from calibre.web.feeds.news import BasicNewsRecipe

class DerStandardRecipe(BasicNewsRecipe):
    title = u'derStandardComplete'
    __author__ = 'Gerhard Aigner and Sujata Raman and Marcel Jira'
    description = u'Nachrichten aus Österreich'
    publisher ='derStandard.at'
    category = 'news, politics, nachrichten, Austria'
    use_embedded_content = False
    remove_empty_feeds = True
    lang = 'de-AT'
    no_stylesheets = True
    encoding = 'utf-8'
    language = 'de'

    oldest_article = 1
    max_articles_per_feed = 100

    extra_css = '''
                .artikelBody{font-family:Arial,Helvetica,sans-serif;}
                .artikelLeft{font-family:Arial,Helvetica,sans-serif;font-size:x-small;}
                h4{color:#404450;font-size:x-small;}
                h6{color:#404450; font-size:x-small;}
                '''
    feeds          = [
        (u'Newsroom', u'http://derStandard.at/?page=rss&ressort=Seite1'),
        (u'Inland', u'http://derstandard.at/?page=rss&ressort=InnenPolitik'),
        (u'International', u'http://derstandard.at/?page=rss&ressort=InternationalPolitik'),
        (u'Wirtschaft', u'http://derStandard.at/?page=rss&ressort=Wirtschaft'),
        (u'Web', u'http://derStandard.at/?page=rss&ressort=Web'),
        (u'Sport', u'http://derStandard.at/?page=rss&ressort=Sport'),
        (u'Panorama', u'http://derStandard.at/?page=rss&ressort=Panorama'),
        (u'Etat', u'http://derStandard.at/?page=rss&ressort=Etat'),
        (u'Kultur', u'http://derStandard.at/?page=rss&ressort=Kultur'),
        (u'Wissenschaft', u'http://derStandard.at/?page=rss&ressort=Wissenschaft'),
        (u'Gesundheit', u'http://derStandard.at/?page=rss&ressort=Gesundheit'),
        (u'Bildung', u'http://derStandard.at/?page=rss&ressort=Bildung'),
        (u'Meinung', u'http://derStandard.at/?page=rss&ressort=Meinung'),
        (u'Lifestyle', u'http://derStandard.at/?page=rss&ressort=Lifestyle'),
        (u'Reisen', u'http://derStandard.at/?page=rss&ressort=Reisen'),
        (u'Karriere', u'http://derStandard.at/?page=rss&ressort=Karriere'),
        (u'Immobilien', u'http://derstandard.at/?page=rss&ressort=Immobilien'),
        (u'dieStandard', u'http://dieStandard.at/?page=rss&ressort=diestandard'),
        (u'daStandard', u'http://daStandard.at/?page=rss&ressort=dastandard')
                      ]

    keep_only_tags = [
                        dict(name='div', attrs={'class':["artikel","artikelLeft","artikelBody"]}) ,
                         ]

    remove_tags = [
                    dict(name='link'), dict(name='meta'),dict(name='iframe'),dict(name='style'),
                    dict(name='form',attrs={'name':'sitesearch'}), dict(name='hr'),
                    dict(name='div', attrs={'class':["diashow"]})]
    preprocess_regexps = [
        (re.compile(r'\[[\d]*\]', re.DOTALL|re.IGNORECASE), lambda match: ''),
        (re.compile(r'bgcolor="#\w{3,6}"', re.DOTALL|re.IGNORECASE), lambda match: '')
    ]

    filter_regexps = [r'/r[1-9]*']

    def get_article_url(self, article):
        '''if the article links to a index page (ressort) or a picture gallery
           (ansichtssache), don't add it'''
        if ( article.link.count('ressort') > 0 or article.title.lower().count('ansichtssache') > 0 ):
            return None
        matchObj = re.search(r'/r[1-9]*', article.link)

        if matchObj:
            return None

        return article.link

    def preprocess_html(self, soup):
        soup.html['xml:lang'] = self.lang
        soup.html['lang']     = self.lang
        mtag = '<meta http-equiv="Content-Type" content="text/html; charset=' + self.encoding + '">'
        soup.head.insert(0,mtag)

        for t in soup.findAll(['ul', 'li']):
            t.name = 'div'
        return soup

kovidgoyal · 04-29-2012, 12:01 AM

You need to find a link to the cover in the website and extract that.

Spindoctor · 04-29-2012, 06:28 AM

@kovidgoyal:
Thank you for your answer. I searched for a link for a while, but didn't find one.

I hoped there would be a way like this (in pseudo-code):

Code:

OPEN URL "http://epaper.derstandarddigital.at/";
coverElement = (SEARCH HTML-ELEMENT "<img>" WITH ID "imgPage2" AND CLASS "page");
coverUrl = (GET HTML-ATTRIBUTE "src" FROM coverElement);
RETURN coverUrl;

Wouldn't that be a way?

kovidgoyal · 04-29-2012, 08:10 AM

Yes, that is how you would do it.
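
The pseudo-code above could be sketched in Python roughly as follows. This is only an illustration, assuming the `<img>` tag and its `src` are present in the HTML the server sends (if the page fills them in with JavaScript, this finds nothing). It is written for Python 3 with the standard library only; `CoverImageFinder` and `find_cover_src` are made-up names, and the sample markup is hypothetical. Inside a recipe, `self.index_to_soup(url)` followed by a `find` on the returned soup would be the more usual route.

```python
from html.parser import HTMLParser


class CoverImageFinder(HTMLParser):
    """Collects the src of the first <img> matching a given id and class."""

    def __init__(self, img_id, img_class):
        super().__init__()
        self.img_id = img_id
        self.img_class = img_class
        self.cover_url = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything once a match has been found.
        if tag != 'img' or self.cover_url is not None:
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attr_map.get('class') or '').split()
        if attr_map.get('id') == self.img_id and self.img_class in classes:
            self.cover_url = attr_map.get('src')


def find_cover_src(html, img_id='imgPage2', img_class='page'):
    """Return the src attribute of the matching <img>, or None if absent."""
    finder = CoverImageFinder(img_id, img_class)
    finder.feed(html)
    return finder.cover_url


# Hypothetical markup, for illustration only.
sample = '<div><img id="imgPage2" class="page" src="cover.png"></div>'
print(find_cover_src(sample))
```

If the `src` turns out to be relative, it would still need to be joined with the page URL (for example with `urllib.parse.urljoin`) before being returned from `get_cover_url`.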

Spindoctor · 05-05-2012, 03:08 PM

A friend of mine helped me with the title page.

Here's the new improved .recipe-file for the Austrian newspaper "Der Standard" (http://www.derstandard.at)

Code:

#!/usr/bin/env  python
# -*- coding: utf-8 -*-

__license__   = 'GPL v3'
__copyright__ = '2009, Gerhard Aigner <gerhard.aigner at gmail.com>'

''' http://www.derstandard.at - Austrian Newspaper '''
import re, urllib
from calibre.web.feeds.news import BasicNewsRecipe
from time import strftime

class DerStandardRecipe(BasicNewsRecipe):
    title = u'derStandard'
    __author__ = 'Gerhard Aigner and Sujata Raman and Marcel Jira and Peter Reschenhofer'
    description = u'Nachrichten aus Österreich'
    publisher ='derStandard.at'
    category = 'news, politics, nachrichten, Austria'
    use_embedded_content = False
    remove_empty_feeds = True
    lang = 'de-AT'
    no_stylesheets = True
    encoding = 'utf-8'
    language = 'de'

    oldest_article = 1
    max_articles_per_feed = 100

    extra_css = '''
                .artikelBody{font-family:Arial,Helvetica,sans-serif;}
                .artikelLeft{font-family:Arial,Helvetica,sans-serif;font-size:x-small;}
                h4{color:#404450;font-size:x-small;}
                h6{color:#404450; font-size:x-small;}
                '''
    feeds          = [
        (u'Newsroom', u'http://derStandard.at/?page=rss&ressort=Seite1'),
        (u'Inland', u'http://derstandard.at/?page=rss&ressort=InnenPolitik'),
        (u'International', u'http://derstandard.at/?page=rss&ressort=InternationalPolitik'),
        (u'Wirtschaft', u'http://derStandard.at/?page=rss&ressort=Wirtschaft'),
        (u'Web', u'http://derStandard.at/?page=rss&ressort=Web'),
        (u'Sport', u'http://derStandard.at/?page=rss&ressort=Sport'),
        (u'Panorama', u'http://derStandard.at/?page=rss&ressort=Panorama'),
        (u'Etat', u'http://derStandard.at/?page=rss&ressort=Etat'),
        (u'Kultur', u'http://derStandard.at/?page=rss&ressort=Kultur'),
        (u'Wissenschaft', u'http://derStandard.at/?page=rss&ressort=Wissenschaft'),
        (u'Gesundheit', u'http://derStandard.at/?page=rss&ressort=Gesundheit'),
        (u'Bildung', u'http://derStandard.at/?page=rss&ressort=Bildung'),
        (u'Meinung', u'http://derStandard.at/?page=rss&ressort=Meinung'),
        (u'Lifestyle', u'http://derStandard.at/?page=rss&ressort=Lifestyle'),
        (u'Reisen', u'http://derStandard.at/?page=rss&ressort=Reisen'),
        (u'Karriere', u'http://derStandard.at/?page=rss&ressort=Karriere'),
        (u'Immobilien', u'http://derstandard.at/?page=rss&ressort=Immobilien'),
        (u'dieStandard', u'http://dieStandard.at/?page=rss&ressort=diestandard'),
        (u'daStandard', u'http://daStandard.at/?page=rss&ressort=dastandard')
                      ]

    keep_only_tags = [
                        dict(name='div', attrs={'class':["artikel","artikelLeft","artikelBody"]}) ,
                         ]

    remove_tags = [
                    dict(name='link'), dict(name='meta'),dict(name='iframe'),dict(name='style'),
                    dict(name='form',attrs={'name':'sitesearch'}), dict(name='hr'),
                    dict(name='div', attrs={'class':["diashow"]})]
    preprocess_regexps = [
        (re.compile(r'\[[\d]*\]', re.DOTALL|re.IGNORECASE), lambda match: ''),
        (re.compile(r'bgcolor="#\w{3,6}"', re.DOTALL|re.IGNORECASE), lambda match: '')
    ]

    filter_regexps = [r'/r[1-9]*']

    def get_article_url(self, article):
        '''if the article links to a index page (ressort) or a picture gallery
           (ansichtssache), don't add it'''
        if ( article.link.count('ressort') > 0 or article.title.lower().count('ansichtssache') > 0 ):
            return None
        matchObj = re.search(r'/r[1-9]*', article.link)

        if matchObj:
            return None

        return article.link

    def preprocess_html(self, soup):
        soup.html['xml:lang'] = self.lang
        soup.html['lang']     = self.lang
        mtag = '<meta http-equiv="Content-Type" content="text/html; charset=' + self.encoding + '">'
        soup.head.insert(0,mtag)

        for t in soup.findAll(['ul', 'li']):
            t.name = 'div'
        return soup

    def get_cover_url(self):
        highResolution = True
        
        date    = strftime("%Y/%Y%m%d")
        # a past date also works, for example:
        #date    = '2012/20120503'
        
        urlP1   = 'http://epaper.derstandarddigital.at/'
        urlP2   = 'data_ep/STAN/' + date
        urlP3   = '/V.B1/'
        urlP4   = 'paper.htm'
        urlHTML = urlP1 + urlP2 + urlP3 + urlP4
        
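        # use the recipe's built-in browser; decoding keeps the string search below working on Python 3 as well
        htmlC  = self.browser.open(urlHTML).read().decode('utf-8', 'replace')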
        
        
        # URL EXAMPLE: data_ep/STAN/2012/20120504/V.B1/pages/A3B6798F-2751-4D8D-A103-C5EF22F7ACBE.htm
        # consists of part2 + part3 + 'pages/' + code
        # 'pages/' has length 6, code has length 36
        
        index   = htmlC.find(urlP2) + len(urlP2 + urlP3) + 6 
        code    = htmlC[index:index + 36]
        
        
        # URL EXAMPLE HIGH RESOLUTION: http://epaper.derstandarddigital.at/data_ep/STAN/2012/20120504/pagejpg/A3B6798F-2751-4D8D-A103-C5EF22F7ACBE_b.png
        # URL EXAMPLE LOW RESOLUTION: http://epaper.derstandarddigital.at/data_ep/STAN/2012/20120504/pagejpg/2AB52F71-11C1-4859-9114-CDCD79BEFDCB.png
        
        urlPic  = urlP1 + urlP2 + '/pagejpg/' + code
        
        if highResolution:
            urlPic  = urlPic + '_b'
            
        urlPic  = urlPic + '.png'
        
        return urlPic
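
One caveat with the fixed-offset slicing in `get_cover_url`: if the expected path is not in the page, `find` returns -1 and the slice picks up an unrelated piece of text. A minimal sketch of the same URL construction as a standalone Python 3 function, using a regular expression and returning `None` when no page code is found. The names `build_cover_url`, `BASE_URL` and `PAGE_CODE` are made up for illustration; the URL layout is taken from the comments in the recipe, and the assumption that the code is always a 36-character hexadecimal identifier is based only on the two examples given there.

```python
import re

BASE_URL = 'http://epaper.derstandarddigital.at/'

# 36-character identifier in 8-4-4-4-12 hexadecimal form, as in the URL examples.
PAGE_CODE = r'[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}'


def build_cover_url(page_html, date_path, high_resolution=True):
    """Return the cover image URL, or None if no page code is found.

    date_path has the form 'YYYY/YYYYMMDD', e.g. '2012/20120504'.
    """
    prefix = 'data_ep/STAN/' + date_path
    # Look for the first link of the form <prefix>/V.B1/pages/<code>
    pattern = re.escape(prefix + '/V.B1/pages/') + '(' + PAGE_CODE + ')'
    match = re.search(pattern, page_html)
    if match is None:
        return None
    suffix = '_b.png' if high_resolution else '.png'
    return BASE_URL + prefix + '/pagejpg/' + match.group(1) + suffix


# Sample input built from the URL example in the recipe comments.
html = ('<a href="data_ep/STAN/2012/20120504/V.B1/pages/'
        'A3B6798F-2751-4D8D-A103-C5EF22F7ACBE.htm">page 1</a>')
print(build_cover_url(html, '2012/20120504'))
```

With the sample input, this prints the high-resolution URL given as an example in the recipe comments.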

Is there another place to upload this recipe, so that it can be added to the next release of Calibre?

Thank you for your help and for Calibre

kovidgoyal · 05-05-2012, 03:37 PM

posting it here is fine.

Spindoctor · 05-08-2012, 03:14 PM

Please note that I made a minor change to the recipe (just the title).

kovidgoyal · 05-09-2012, 01:57 PM

Yeah, I saw that.
