Old 04-03-2016, 04:56 PM   #1
Aimylios
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Handelsblatt recipe

Hi,

The current Handelsblatt recipe has been broken for quite some time. When I tried to fix it, I found that the article structure has changed fundamentally, so I had to write a completely new recipe.

It would of course be great if the old handelsblatt.recipe could be replaced by this new code.

Code:
#!/usr/bin/env python2

__license__   = 'GPL v3'
__copyright__ = '2016, Aimylios'

'''
handelsblatt.com
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title            = u'Handelsblatt'
    __author__       = 'Aimylios' # based on the work of malfi and Hegi
    description      = u'RSS-Feeds von Handelsblatt.com'
    publisher        = 'Verlagsgruppe Handelsblatt GmbH'
    category         = 'news, politics, business, economy, Germany'
    publication_type = 'newspaper'
    language         = 'de'

    encoding                  = 'utf8'
    oldest_article            = 4
    max_articles_per_feed     = 30
    simultaneous_downloads    = 20
    no_stylesheets            = True
    remove_javascript         = True
    remove_empty_feeds        = True
    ignore_duplicate_articles = {'title', 'url'}

    conversion_options = {'smarten_punctuation' : True,
                          'publisher'           : publisher}

    # uncomment this to reduce file size
    #  compress_news_images = True

    cover_source = 'https://kaufhaus.handelsblatt.com/downloads/handelsblatt-epaper-p1951.html'
    masthead_url = 'http://www.handelsblatt.com/images/logo_handelsblatt/11002806/7-formatOriginal.png'
    #masthead_url = 'http://www.handelsblatt.com/images/hb_logo/6543086/1-format3.jpg'
    #masthead_url = 'http://www.handelsblatt-chemie.de/wp-content/uploads/2012/01/hb-logo.gif'

    feeds = [
              (u'Top-Themen', u'http://www.handelsblatt.com/contentexport/feed/top-themen'),
              (u'Politik', u'http://www.handelsblatt.com/contentexport/feed/politik'),
              (u'Unternehmen', u'http://www.handelsblatt.com/contentexport/feed/unternehmen'),
              (u'Finanzen', u'http://www.handelsblatt.com/contentexport/feed/finanzen'),
              (u'Technologie', u'http://www.handelsblatt.com/contentexport/feed/technologie'),
              (u'Panorama', u'http://www.handelsblatt.com/contentexport/feed/panorama'),
              (u'Sport', u'http://www.handelsblatt.com/contentexport/feed/sport')
            ]

    keep_only_tags = [ dict(name='div', attrs={'class':['vhb-article-container']}) ]

    remove_tags = [
                    dict(name='span', attrs={'class':['vhb-media', 'vhb-colon']}),
                    dict(name='small', attrs={'class':['vhb-credit']}),
                    dict(name='aside', attrs={'class':['vhb-article-element vhb-left',
                                                       'vhb-article-element vhb-left vhb-teasergallery',
                                                       'vhb-article-element vhb-left vhb-shorttexts']}),
                    dict(name='article', attrs={'class':['vhb-imagegallery vhb-teaser', 'vhb-teaser vhb-type-video']}),
                    dict(name='div', attrs={'class':['fb-post']}),
                    dict(name='blockquote', attrs={'class':['twitter-tweet']}),
                    dict(name='a', attrs={'class':['twitter-follow-button']})
                  ]

    preprocess_regexps = [
                           # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
                           (re.compile(r'(<span class="hcf-location-mark">[^<]+)(</span>)',
                           re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '.&nbsp;' + match.group(2)),
                           # Insert ": " after "Title" in <em itemtype="text" itemprop="name" class="vhb-title">Title</em>
                           (re.compile(r'(<em itemtype="text" itemprop="name" class="vhb-title">[^<]+)(</em>)',
                           re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + ':&nbsp;' + match.group(2))
                         ]

    extra_css = 'h2 {text-align: left} \
                 h3 {font-size: 1em; text-align: left} \
                 h4 {font-size: 1em; text-align: left; margin-bottom: 0em} \
                 em {font-style: normal; font-weight: bold} \
                 .vhb-subline {font-size: 0.6em; text-transform: uppercase} \
                 .vhb-article-caption {float: left; padding-right: 0.2em} \
                 .vhb-article-author-cell ul {list-style-type: none; margin: 0em} \
                 .vhb-teaser-head {margin-top: 1em; margin-bottom: 1em} \
                 .vhb-caption-wrapper {font-size: 0.6em} \
                 .hcf-location-mark {font-weight: bold} \
                 .panel-link {color: black; text-decoration: none} \
                 .panel-body p {margin-top: 0em}'

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source)
        style = soup.find('img', alt='Handelsblatt ePaper', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url

    def print_version(self, url):
        main, sep, article_id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + article_id

    def preprocess_html(self, soup):
        # remove all articles without relevant content (e.g., videos)
        article_container = soup.find('div', {'class':'vhb-article-container'})
        if article_container is None:
            self.abort_article()
        else:
            return soup

    def postprocess_html(self, soup, first_fetch):
        # make sure that all figure captions (including the source) are shown
        # without linebreaks by using the alternative text given within <img/>
        # instead of the original text (which is oddly formatted)
        article_figures = soup.findAll('figure', {'class':'vhb-image'})
        for fig in article_figures:
            fig.find('div', {'class':'vhb-caption'}).replaceWith(fig.find('img')['alt'])
        return soup
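Side note for anyone reading the recipe: get_cover_url() pulls the cover image URL out of the CSS background-image declaration in the img tag's style attribute. Here is the partition trick in isolation (the style string below is a made-up example, not the real markup):

```python
# Extract the URL from a CSS background-image declaration, as done in
# get_cover_url() above. The style string is a hypothetical example.
style = 'background-image: url(https://example.com/epaper-cover.jpg); width: 160px'
cover_url = style.partition('(')[-1].rpartition(')')[0]
print(cover_url)  # https://example.com/epaper-cover.jpg
```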
Old 04-03-2016, 09:31 PM   #2
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
done, thanks.
Old 04-11-2016, 01:07 PM   #3
Aimylios
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Hi Kovid,

Thanks for integrating my recipe into Calibre!
In the meantime I have been working on an improved version; please find the new code below.

Changelog:
  • added subscription support
  • modified formatting of author and date lists instead of relying on CSS magic
  • removed local hyperlinks
  • code cleanup

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals

__license__   = 'GPL v3'
__copyright__ = '2016, Aimylios'

'''
handelsblatt.com
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title              = 'Handelsblatt'
    __author__         = 'Aimylios' # based on the work of malfi and Hegi
    description        = 'RSS-Feeds von Handelsblatt.com'
    publisher          = 'Verlagsgruppe Handelsblatt GmbH'
    publication_type   = 'newspaper'
    needs_subscription = 'optional'
    language           = 'de'
    encoding           = 'utf-8'

    oldest_article            = 4
    max_articles_per_feed     = 30
    simultaneous_downloads    = 20
    no_stylesheets            = True
    remove_javascript         = True
    remove_empty_feeds        = True
    ignore_duplicate_articles = {'title', 'url'}

    conversion_options = {'smarten_punctuation' : True,
                          'publisher'           : publisher}

    # uncomment this to reduce file size
    #compress_news_images          = True
    #compress_news_images_max_size = 16

    cover_source = 'https://kaufhaus.handelsblatt.com/downloads/handelsblatt-epaper-p1951.html'
    masthead_url = 'http://www.handelsblatt.com/images/logo_handelsblatt/11002806/7-formatOriginal.png'

    feeds = [
        ('Top-Themen', 'http://www.handelsblatt.com/contentexport/feed/top-themen'),
        ('Politik', 'http://www.handelsblatt.com/contentexport/feed/politik'),
        ('Unternehmen', 'http://www.handelsblatt.com/contentexport/feed/unternehmen'),
        ('Finanzen', 'http://www.handelsblatt.com/contentexport/feed/finanzen'),
        ('Technologie', 'http://www.handelsblatt.com/contentexport/feed/technologie'),
        ('Panorama', 'http://www.handelsblatt.com/contentexport/feed/panorama'),
        ('Sport', 'http://www.handelsblatt.com/contentexport/feed/sport')
    ]

    keep_only_tags = [dict(name='div', attrs={'class':['vhb-article-container']})]

    remove_tags = [
        dict(name='span', attrs={'class':['vhb-colon', 'vhb-label-premium']}),
        dict(name='aside', attrs={'class':['vhb-article-element vhb-left',
                                           'vhb-article-element vhb-left vhb-teasergallery',
                                           'vhb-article-element vhb-left vhb-shorttexts']}),
        dict(name='article', attrs={'class':['vhb-imagegallery vhb-teaser',
                                             'vhb-teaser vhb-type-video']}),
        dict(name='small', attrs={'class':['vhb-credit']}),
        dict(name='div', attrs={'class':['white_content', 'fb-post']}),
        dict(name='a', attrs={'class':['twitter-follow-button']}),
        dict(name='blockquote')
    ]

    preprocess_regexps = [
        # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
        (re.compile(r'(<span class="hcf-location-mark">[^<]+)(</span>)',
        re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '.&nbsp;' + match.group(2)),
        # Insert ": " after "Title" in <em itemtype="text" itemprop="name" class="vhb-title">Title</em>
        (re.compile(r'(<em itemtype="text" itemprop="name" class="vhb-title">[^<]+)(</em>)',
        re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + ':&nbsp;' + match.group(2))
    ]

    extra_css = 'h2 {text-align: left} \
                 h3 {font-size: 1em; text-align: left} \
                 h4 {font-size: 1em; text-align: left; margin-bottom: 0em} \
                 em {font-style: normal; font-weight: bold} \
                 .vhb-subline {font-size: 0.6em; text-transform: uppercase} \
                 .vhb-teaser-head {margin-top: 1em; margin-bottom: 1em} \
                 .vhb-caption-wrapper {font-size: 0.6em} \
                 .hcf-location-mark {font-weight: bold} \
                 .panel-body p {margin-top: 0em}'

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('https://profil.vhb.de/sso/login?service=http://www.handelsblatt.com')
            br.select_form(nr=0)
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source)
        style = soup.find('img', alt='Handelsblatt ePaper', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url

    def print_version(self, url):
        main, sep, article_id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + article_id

    def preprocess_html(self, soup):
        # remove all articles without relevant content (e.g., videos)
        article_container = soup.find('div', {'class':'vhb-article-container'})
        if article_container is None:
            self.abort_article()
        else:
            # remove all local hyperlinks
            for a in soup.findAll('a', {'href':True}):
                if a['href'] and a['href'][0] in ['/', '#']:
                    a.replaceWith(a.renderContents())
            return soup

    def postprocess_html(self, soup, first_fetch):
        # convert lists of author(s) and date(s) into simple text
        for cap in soup.findAll('div', {'class':re.compile('.*vhb-article-caption')}):
            cap.replaceWith(cap.renderContents())
        for row in soup.findAll('div', {'class':'vhb-article-author-row'}):
            for ul in row.findAll('ul'):
                entry = ''
                for li in ul.findAll(lambda tag: tag.name == 'li' and not tag.attrs):
                    entry = entry + li.renderContents() + ', '
                for li in ul.findAll(lambda tag: tag.name == 'li' and tag.attrs):
                    entry = entry + li.renderContents() + '<br/>'
                ul.parent.replaceWith(entry)
        # make sure that all figure captions (including the source) are shown
        # without linebreaks by using the alternative text given within <img/>
        # instead of the original text (which is oddly formatted)
        for fig in soup.findAll('figure', {'class':'vhb-image'}):
            fig.find('div', {'class':'vhb-caption'}).replaceWith(fig.find('img')['alt'])
        return soup
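In case it helps anyone adapting the recipe: print_version() simply splices Handelsblatt's print-view path segment into the article URL via rpartition. In isolation (the URL below is a made-up example):

```python
# Rewrite an article URL to its print view, as print_version() above does.
# The example URL is hypothetical.
def print_version(url):
    main, sep, article_id = url.rpartition('/')
    return main + '/v_detail_tab_print/' + article_id

url = 'http://www.handelsblatt.com/politik/beispiel-artikel/12345678.html'
print(print_version(url))
# http://www.handelsblatt.com/politik/beispiel-artikel/v_detail_tab_print/12345678.html
```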
Old 12-01-2017, 03:05 AM   #4
FLR
Junior Member
Posts: 1
Karma: 10
Join Date: Dec 2017
Device: Kindle Oasis
Handelsblatt recipe broken for premium subscribers?

Hello,

It seems to me that downloading Handelsblatt premium articles is broken at the moment. I have a valid user account, but the premium articles do not show up in the downloaded MOBI file. The free articles work just fine.

I tried this with Calibre on Windows and via the command-line interface on Linux.

On Linux I see a lot of "Failed to download article:" and "Could not fetch link" messages during the download, which seem to refer to the paid articles. In the browser, those articles are accessible after login and are also part of the ePaper.

On Linux I use

ebook-convert "handelsblatt.recipe" [destination].mobi --output-profile kindle --username ***** --password *****

for this.

Is it possible that something is wrong with the recipe's login at the moment?

Thank you very much for any help!
Old 12-02-2017, 10:36 AM   #5
Aimylios
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Hi FLR,

Unfortunately I no longer have a subscription and only use the recipe for free articles. At least the login URL is still valid.
If you want, you can send me your login credentials via PM and I'll try to fix it.

