MobileRead Forums > E-Book Software > Calibre > Recipes

Old 03-05-2019, 04:58 PM   #1
Flugschwein
Junior Member
Flugschwein began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Mar 2019
Device: PocketBook Touch HD 3
derStandard default recipe not working

The default recipe for one of the best-known quality newspapers in Austria, der Standard (or derStandard), is badly outdated and no longer works at all.
This is what the current default looks like:
Spoiler:
Code:
#!/usr/bin/env  python2
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, division, absolute_import, print_function

__license__ = 'GPL v3'
__copyright__ = '2009, Gerhard Aigner <gerhard.aigner at gmail.com>'

''' http://www.derstandard.at - Austrian Newspaper '''

import re
import random
from calibre.web.feeds.news import BasicNewsRecipe


class DerStandardRecipe(BasicNewsRecipe):
    title = u'derStandard'
    __author__ = 'Gerhard Aigner and Sujata Raman and Marcel Jira and Peter Reschenhofer'
    description = u'Nachrichten aus Österreich'
    publisher = 'derStandard.at'
    category = 'news, politics, nachrichten, Austria'
    use_embedded_content = False
    remove_empty_feeds = True
    no_stylesheets = True
    encoding = 'utf-8'
    language = 'de_AT'

    oldest_article = 1
    max_articles_per_feed = 100
    ignore_duplicate_articles = {'title', 'url'}

    masthead_url = 'http://images.derstandard.at/2012/06/19/derStandardat_1417x274.gif'

    feeds = [
        (u'Newsroom', u'http://derStandard.at/?page=rss&ressort=Seite1'),
        (u'Inland', u'http://derstandard.at/?page=rss&ressort=InnenPolitik'),
        (u'International', u'http://derstandard.at/?page=rss&ressort=InternationalPolitik'),
        (u'Wirtschaft', u'http://derStandard.at/?page=rss&ressort=Wirtschaft'),
        (u'Web', u'http://derStandard.at/?page=rss&ressort=Web'),
        (u'Sport', u'http://derStandard.at/?page=rss&ressort=Sport'),
        (u'Panorama', u'http://derStandard.at/?page=rss&ressort=Panorama'),
        (u'Etat', u'http://derStandard.at/?page=rss&ressort=Etat'),
        (u'Kultur', u'http://derStandard.at/?page=rss&ressort=Kultur'),
        (u'Wissenschaft', u'http://derStandard.at/?page=rss&ressort=Wissenschaft'),
        (u'Gesundheit', u'http://derStandard.at/?page=rss&ressort=Gesundheit'),
        (u'Bildung', u'http://derStandard.at/?page=rss&ressort=Bildung'),
        (u'Meinung', u'http://derStandard.at/?page=rss&ressort=Meinung'),
        (u'Lifestyle', u'http://derStandard.at/?page=rss&ressort=Lifestyle'),
        (u'Reisen', u'http://derStandard.at/?page=rss&ressort=Reisen'),
        (u'Familie', u'http://derstandard.at/?page=rss&ressort=Familie'),
        (u'Greenlife', u'http://derStandard.at/?page=rss&ressort=Greenlife'),
        (u'Karriere', u'http://derStandard.at/?page=rss&ressort=Karriere'),
        (u'Immobilien', u'http://derstandard.at/?page=rss&ressort=Immobilien'),
        (u'Automobil', u'http://derstandard.at/?page=rss&ressort=Automobil'),
        (u'dieStandard', u'http://dieStandard.at/?page=rss&ressort=diestandard'),
        (u'daStandard', u'http://daStandard.at/?page=rss&ressort=dastandard')
    ]

    keep_only_tags = [
        dict(name='div', attrs={'class': re.compile('^artikel')})
    ]

    remove_tags = [
        dict(name=['link', 'iframe', 'style', 'hr']),
        dict(attrs={'class': ['lookup-links', 'media-list']}),
        dict(name='form', attrs={'name': 'sitesearch'}),
        dict(name='div', attrs={'class': ['socialsharing', 'block video',
                                          'blog-browsing section',
                                          'diashow', 'supplemental']}),
        dict(name='div', attrs={'id': 'highlighted'})
    ]

    remove_attributes = ['width', 'height']

    preprocess_regexps = [
        (re.compile(r'\[[\d]*\]', re.DOTALL |
                    re.IGNORECASE), lambda match: ''),
        (re.compile(r'bgcolor="#\w{3,6}"',
                    re.DOTALL | re.IGNORECASE), lambda match: '')
    ]

    filter_regexps = [r'/r[1-9]*']

    def get_article_url(self, article):
        matchObj = re.search(re.compile(
            r'/r' + '[1-9]*', flags=0), article.link, flags=0)

        if matchObj:
            return None

        return article.link

    def preprocess_html(self, soup):
        if soup.find('div', {'class': re.compile('^artikel')}) is None:
            self.abort_article()
        for t in soup.findAll(['ul', 'li']):
            t.name = 'div'
        return soup

    def get_cover_url(self):
        base_url = 'https://epaper.derstandard.at/'
        url = base_url + 'shelf.act?s=' + str(random.random() * 10000)
        soup = self.index_to_soup(url)
        img = soup.find(
            'img', {'class': re.compile('^thumbnailBig'), 'src': True})
        if img and img['src']:
            cover_url = base_url + img['src']
            return cover_url


While this might have worked a few years ago, it seems der Standard has since changed its feeds and other aspects of its online presence.
As you can see here (note: the page is in German; use a translator if you need to, but it should be fairly self-explanatory), the feeds list is no longer up to date. If I understood that page correctly, it should look like this instead:
Code:
feeds = [
        (u'Newsroom', u'http://derStandard.at/?page=rss&ressort=Seite1'),
        (u'International', u'http://derstandard.at/?page=rss&ressort=International'),
        (u'Inland', u'http://derstandard.at/?page=rss&ressort=Inland'),
        (u'Wirtschaft', u'http://derStandard.at/?page=rss&ressort=Wirtschaft'),
        (u'Web', u'http://derStandard.at/?page=rss&ressort=Web'),
        (u'Sport', u'http://derStandard.at/?page=rss&ressort=Sport'),
        (u'Panorama', u'http://derStandard.at/?page=rss&ressort=Panorama'),
        (u'Etat', u'http://derStandard.at/?page=rss&ressort=Etat'),
        (u'Kultur', u'http://derStandard.at/?page=rss&ressort=Kultur'),
        (u'Wissenschaft', u'http://derStandard.at/?page=rss&ressort=Wissenschaft'),
        (u'Gesundheit', u'http://derStandard.at/?page=rss&ressort=Gesundheit'),
        (u'Bildung', u'http://derStandard.at/?page=rss&ressort=Bildung'),
        (u'Meinung', u'http://derStandard.at/?page=rss&ressort=Meinung'),
        (u'Lifestyle', u'http://derStandard.at/?page=rss&ressort=Lifestyle'),
        (u'Reisen', u'http://derStandard.at/?page=rss&ressort=Reisen'),
        (u'Familie', u'http://derstandard.at/?page=rss&ressort=Familie'),
        (u'User', u'http://derStandard.at/?page=rss&ressort=User'),
        (u'Karriere', u'http://derStandard.at/?page=rss&ressort=Karriere'),
        (u'Immobilien', u'http://derstandard.at/?page=rss&ressort=Immobilien'),
        (u'Automobil', u'http://derstandard.at/?page=rss&ressort=Automobil'),
        (u'dieStandard', u'http://derStandard.at/?page=rss&ressort=diestandard'),
    ]
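A hand-edited feeds list like this can easily end up listing the same section twice. The recipe's `ignore_duplicate_articles` setting should already filter repeated articles, but it can still be useful to check the list itself. Here is a minimal sketch, assuming a hypothetical helper called `dedupe_feeds` (it is not part of calibre), that drops repeated URLs while keeping the original order:

```python
def dedupe_feeds(feeds):
    """Drop entries whose URL already appeared, keeping first occurrence and order.

    `feeds` is a list of (title, url) tuples, as used by a recipe's `feeds` attribute.
    This helper is hypothetical and only illustrates the check.
    """
    seen = set()
    unique = []
    for title, url in feeds:
        if url not in seen:
            seen.add(url)
            unique.append((title, url))
    return unique


# A shortened example list with one repeated section.
feeds = [
    ('Meinung', 'http://derStandard.at/?page=rss&ressort=Meinung'),
    ('Familie', 'http://derstandard.at/?page=rss&ressort=Familie'),
    ('Meinung', 'http://derStandard.at/?page=rss&ressort=Meinung'),
]

for title, url in dedupe_feeds(feeds):
    print(title, url)
```

The comparison is on the exact URL string, so two URLs that differ only in the capitalisation of the host name would be treated as different entries.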
But even with the correct feeds, the output you get isn't right. Sadly, I currently lack the resources to get it working myself, so I would really appreciate it if a more experienced user could provide a working version of the recipe (and maybe even open a pull request against the Git repository ;-) ).
Thanks in advance,
Flugschwein

PS: According to https://calibre-ebook.com/dynamic/recipe-usage, derStandard is the 4th most downloaded German-language newspaper (by language, not nationality) in calibre.
Old 03-07-2019, 10:43 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,742
Karma: 22446736
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
https://github.com/kovidgoyal/calibr...0d2a61d36cf347
Old 03-09-2019, 02:20 PM   #3
Flugschwein

Quote:
Originally Posted by kovidgoyal
Sorry to say, but that script still does not work correctly for me. I attached the EPUB I got from my download; if you look at the source of one of the articles, you can see that only the beginning of each article has been parsed correctly. I hope you can get it working.
Thanks a lot!
Old 03-10-2019, 08:03 PM   #4
lui1
Enthusiast
lui1 began at the beginning.
 
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
update to derStandard

I think this fixes the problem.

Recipe for derStandard:
Code:
#!/usr/bin/env  python2
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, division, absolute_import, print_function

__license__ = 'GPL v3'
__copyright__ = '2009, Gerhard Aigner <gerhard.aigner at gmail.com>'

''' http://www.derstandard.at - Austrian Newspaper '''

from calibre.web.feeds.news import BasicNewsRecipe

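def classes(classes):
    # Build a matcher for tags whose class attribute contains at least one
    # of the space-separated class names given.
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})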


class DerStandardRecipe(BasicNewsRecipe):
    title = u'derStandard'
    __author__ = 'Gerhard Aigner and Sujata Raman and Marcel Jira and Peter Reschenhofer'
    description = u'Nachrichten aus Österreich'
    publisher = 'derStandard.at'
    category = 'news, politics, nachrichten, Austria'
    use_embedded_content = False
    remove_empty_feeds = True
    no_stylesheets = True
    encoding = 'utf-8'
    language = 'de_AT'

    oldest_article = 1
    max_articles_per_feed = 100
    ignore_duplicate_articles = {'title', 'url'}

    masthead_url = 'http://images.derstandard.at/2012/06/19/derStandardat_1417x274.gif'

    feeds = [
        (u'Newsroom', u'https://derStandard.at/?page=rss&ressort=Seite1'),
        (u'International', u'https://derstandard.at/?page=rss&ressort=International'),
        (u'Inland', u'https://derstandard.at/?page=rss&ressort=Inland'),
        (u'Wirtschaft', u'https://derStandard.at/?page=rss&ressort=Wirtschaft'),
        (u'Web', u'https://derStandard.at/?page=rss&ressort=Web'),
        (u'Sport', u'https://derStandard.at/?page=rss&ressort=Sport'),
        (u'Panorama', u'https://derStandard.at/?page=rss&ressort=Panorama'),
        (u'Etat', u'https://derStandard.at/?page=rss&ressort=Etat'),
        (u'Kultur', u'https://derStandard.at/?page=rss&ressort=Kultur'),
        (u'Wissenschaft', u'https://derStandard.at/?page=rss&ressort=Wissenschaft'),
        (u'Gesundheit', u'https://derStandard.at/?page=rss&ressort=Gesundheit'),
        (u'Bildung', u'https://derStandard.at/?page=rss&ressort=Bildung'),
        (u'Meinung', u'https://derStandard.at/?page=rss&ressort=Meinung'),
        (u'Lifestyle', u'https://derStandard.at/?page=rss&ressort=Lifestyle'),
        (u'Reisen', u'https://derStandard.at/?page=rss&ressort=Reisen'),
        (u'Familie', u'https://derstandard.at/?page=rss&ressort=Familie'),
        (u'User', u'https://derStandard.at/?page=rss&ressort=User'),
        (u'Karriere', u'https://derStandard.at/?page=rss&ressort=Karriere'),
        (u'Immobilien', u'https://derstandard.at/?page=rss&ressort=Immobilien'),
        (u'Automobil', u'https://derstandard.at/?page=rss&ressort=Automobil'),
        (u'dieStandard', u'https://derStandard.at/?page=rss&ressort=diestandard'),
    ]

    def get_browser(self):
        import mechanize
        br = BasicNewsRecipe.get_browser(self)
        headers = {
            'X-Requested-With': 'XMLHttpRequest',
            'Content-Type': 'application/json; charset=UTF-8',
            'DNT': '1',
            'Pragma': 'no-cache',
            'Cache-Control': 'no-cache'
        }
        # POST to the site's privacy agreement endpoint before any article is fetched.
        req = mechanize.Request(
            url='https://derstandard.at/privacyprotection/api/agree',
            data=None, headers=headers, method='POST')
        br.open(req)
        return br

    keep_only_tags = [
        classes('artikel'),
    ]

    remove_tags = [
        dict(name=['link', 'iframe', 'style', 'hr']),
        dict(attrs={'class': ['lookup-links', 'media-list']}),
        dict(name='form', attrs={'name': 'sitesearch'}),
        dict(name='div', attrs={'class': ['socialsharing', 'block video',
                                          'blog-browsing section',
                                          'diashow', 'supplemental']}),
        dict(name='div', attrs={'id': 'highlighted'})
    ]

    remove_attributes = ['width', 'height']
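To see what the `classes('artikel')` matcher in `keep_only_tags` accepts, the helper can be tried outside calibre. This is only a sketch: it copies the `classes` function from the recipe above and calls the resulting callable directly on a few made-up class attribute strings (`'artikel clearfix'` and `'artikelBody'` are illustrative, not taken from the site's markup).

```python
def classes(classes):
    # Same helper as in the recipe: match any tag whose class attribute
    # shares at least one space-separated name with the given names.
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})


# Pull out the callable that would be applied to a tag's class attribute.
matcher = classes('artikel')['attrs']['class']

# The class strings below are made-up examples.
print(bool(matcher('artikel clearfix')))  # True: 'artikel' is one of the names
print(bool(matcher('artikelBody')))       # False: only whole names count
print(bool(matcher(None)))                # False: tag has no class attribute
```

Unlike the `re.compile('^artikel')` pattern in the old recipe, this matches whole class names only, so a class that merely starts with `artikel` is not kept.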
Old 03-12-2019, 01:00 PM   #5
Flugschwein

Quote:
Originally Posted by lui1
I think this fixes the problem.
It really does! Thanks a lot!
I can finally read that newspaper comfortably on my PocketBook.

Tags
austria, calibre, derstandard, german, recipe
