Old 04-03-2016, 04:56 PM   #1
Aimylios
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Handelsblatt recipe

Hi,

The current Handelsblatt recipe has been broken for quite some time. When I tried to fix it, I found that the article structure has changed fundamentally, so I had to write a completely new recipe.

It would of course be great if the old handelsblatt.recipe could be replaced by this new code.

Code:
#!/usr/bin/env python2

__license__   = 'GPL v3'
__copyright__ = '2016, Aimylios'

'''
handelsblatt.com
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title            = u'Handelsblatt'
    __author__       = 'Aimylios' # based on the work of malfi and Hegi
    description      = u'RSS-Feeds von Handelsblatt.com'
    publisher        = 'Verlagsgruppe Handelsblatt GmbH'
    category         = 'news, politics, business, economy, Germany'
    publication_type = 'newspaper'
    language         = 'de'

    encoding                  = 'utf8'
    oldest_article            = 4
    max_articles_per_feed     = 30
    simultaneous_downloads    = 20
    no_stylesheets            = True
    remove_javascript         = True
    remove_empty_feeds        = True
    ignore_duplicate_articles = {'title', 'url'}

    conversion_options = {'smarten_punctuation' : True,
                          'publisher'           : publisher}

    # uncomment this to reduce file size
    #  compress_news_images = True

    cover_source = 'https://kaufhaus.handelsblatt.com/downloads/handelsblatt-epaper-p1951.html'
    masthead_url = 'http://www.handelsblatt.com/images/logo_handelsblatt/11002806/7-formatOriginal.png'
    #masthead_url = 'http://www.handelsblatt.com/images/hb_logo/6543086/1-format3.jpg'
    #masthead_url = 'http://www.handelsblatt-chemie.de/wp-content/uploads/2012/01/hb-logo.gif'

    feeds = [
              (u'Top-Themen', u'http://www.handelsblatt.com/contentexport/feed/top-themen'),
              (u'Politik', u'http://www.handelsblatt.com/contentexport/feed/politik'),
              (u'Unternehmen', u'http://www.handelsblatt.com/contentexport/feed/unternehmen'),
              (u'Finanzen', u'http://www.handelsblatt.com/contentexport/feed/finanzen'),
              (u'Technologie', u'http://www.handelsblatt.com/contentexport/feed/technologie'),
              (u'Panorama', u'http://www.handelsblatt.com/contentexport/feed/panorama'),
              (u'Sport', u'http://www.handelsblatt.com/contentexport/feed/sport')
            ]

    keep_only_tags = [ dict(name='div', attrs={'class':['vhb-article-container']}) ]

    remove_tags = [
                    dict(name='span', attrs={'class':['vhb-media', 'vhb-colon']}),
                    dict(name='small', attrs={'class':['vhb-credit']}),
                    dict(name='aside', attrs={'class':['vhb-article-element vhb-left',
                                                       'vhb-article-element vhb-left vhb-teasergallery',
                                                       'vhb-article-element vhb-left vhb-shorttexts']}),
                    dict(name='article', attrs={'class':['vhb-imagegallery vhb-teaser', 'vhb-teaser vhb-type-video']}),
                    dict(name='div', attrs={'class':['fb-post']}),
                    dict(name='blockquote', attrs={'class':['twitter-tweet']}),
                    dict(name='a', attrs={'class':['twitter-follow-button']})
                  ]

    preprocess_regexps = [
                           # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
                           (re.compile(r'(<span class="hcf-location-mark">[^<]+)(</span>)',
                           re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '.&nbsp;' + match.group(2)),
                           # Insert ": " after "Title" in <em itemtype="text" itemprop="name" class="vhb-title">Title</em>
                           (re.compile(r'(<em itemtype="text" itemprop="name" class="vhb-title">[^<]+)(</em>)',
                           re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + ':&nbsp;' + match.group(2))
                         ]

    extra_css = 'h2 {text-align: left} \
                 h3 {font-size: 1em; text-align: left} \
                 h4 {font-size: 1em; text-align: left; margin-bottom: 0em} \
                 em {font-style: normal; font-weight: bold} \
                 .vhb-subline {font-size: 0.6em; text-transform: uppercase} \
                 .vhb-article-caption {float: left; padding-right: 0.2em} \
                 .vhb-article-author-cell ul {list-style-type: none; margin: 0em} \
                 .vhb-teaser-head {margin-top: 1em; margin-bottom: 1em} \
                 .vhb-caption-wrapper {font-size: 0.6em} \
                 .hcf-location-mark {font-weight: bold} \
                 .panel-link {color: black; text-decoration: none} \
                 .panel-body p {margin-top: 0em}'

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source)
        style = soup.find('img', alt='Handelsblatt ePaper', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url

    def print_version(self, url):
        main, sep, article_id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + article_id

    def preprocess_html(self, soup):
        # remove all articles without relevant content (e.g., videos)
        article_container = soup.find('div', {'class':'vhb-article-container'})
        if article_container is None:
            self.abort_article()
        else:
            return soup

    def postprocess_html(self, soup, first_fetch):
        # make sure that all figure captions (including the source) are shown
        # without linebreaks by using the alternative text given within <img/>
        # instead of the original text (which is oddly formatted)
        article_figures = soup.findAll('figure', {'class':'vhb-image'})
        for fig in article_figures:
            fig.find('div', {'class':'vhb-caption'}).replaceWith(fig.find('img')['alt'])
        return soup
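Side note for anyone reading the recipe: get_cover_url() pulls the cover image URL out of the CSS background-image declaration in the img tag's style attribute. Here is the partition trick in isolation (the style string below is a made-up example, not the real markup):

```python
# Extract the URL from a CSS background-image declaration, as done in
# get_cover_url() above. The style string is a hypothetical example.
style = 'background-image: url(https://example.com/epaper-cover.jpg); width: 160px'
cover_url = style.partition('(')[-1].rpartition(')')[0]
print(cover_url)  # https://example.com/epaper-cover.jpg
```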
Old 04-03-2016, 09:31 PM   #2
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
done, thanks.
Old 04-11-2016, 01:07 PM   #3
Aimylios
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Hi Kovid,

Thanks for integrating my recipe into Calibre!
In the meantime I have been working on an improved version; please find the new code below.

Changelog:
  • added subscription support
  • modified formatting of author and date lists instead of relying on CSS magic
  • removed local hyperlinks
  • code cleanup

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals

__license__   = 'GPL v3'
__copyright__ = '2016, Aimylios'

'''
handelsblatt.com
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title              = 'Handelsblatt'
    __author__         = 'Aimylios' # based on the work of malfi and Hegi
    description        = 'RSS-Feeds von Handelsblatt.com'
    publisher          = 'Verlagsgruppe Handelsblatt GmbH'
    publication_type   = 'newspaper'
    needs_subscription = 'optional'
    language           = 'de'
    encoding           = 'utf-8'

    oldest_article            = 4
    max_articles_per_feed     = 30
    simultaneous_downloads    = 20
    no_stylesheets            = True
    remove_javascript         = True
    remove_empty_feeds        = True
    ignore_duplicate_articles = {'title', 'url'}

    conversion_options = {'smarten_punctuation' : True,
                          'publisher'           : publisher}

    # uncomment this to reduce file size
    #compress_news_images          = True
    #compress_news_images_max_size = 16

    cover_source = 'https://kaufhaus.handelsblatt.com/downloads/handelsblatt-epaper-p1951.html'
    masthead_url = 'http://www.handelsblatt.com/images/logo_handelsblatt/11002806/7-formatOriginal.png'

    feeds = [
        ('Top-Themen', 'http://www.handelsblatt.com/contentexport/feed/top-themen'),
        ('Politik', 'http://www.handelsblatt.com/contentexport/feed/politik'),
        ('Unternehmen', 'http://www.handelsblatt.com/contentexport/feed/unternehmen'),
        ('Finanzen', 'http://www.handelsblatt.com/contentexport/feed/finanzen'),
        ('Technologie', 'http://www.handelsblatt.com/contentexport/feed/technologie'),
        ('Panorama', 'http://www.handelsblatt.com/contentexport/feed/panorama'),
        ('Sport', 'http://www.handelsblatt.com/contentexport/feed/sport')
    ]

    keep_only_tags = [dict(name='div', attrs={'class':['vhb-article-container']})]

    remove_tags = [
        dict(name='span', attrs={'class':['vhb-colon', 'vhb-label-premium']}),
        dict(name='aside', attrs={'class':['vhb-article-element vhb-left',
                                           'vhb-article-element vhb-left vhb-teasergallery',
                                           'vhb-article-element vhb-left vhb-shorttexts']}),
        dict(name='article', attrs={'class':['vhb-imagegallery vhb-teaser',
                                             'vhb-teaser vhb-type-video']}),
        dict(name='small', attrs={'class':['vhb-credit']}),
        dict(name='div', attrs={'class':['white_content', 'fb-post']}),
        dict(name='a', attrs={'class':['twitter-follow-button']}),
        dict(name='blockquote')
    ]

    preprocess_regexps = [
        # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
        (re.compile(r'(<span class="hcf-location-mark">[^<]+)(</span>)',
        re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '.&nbsp;' + match.group(2)),
        # Insert ": " after "Title" in <em itemtype="text" itemprop="name" class="vhb-title">Title</em>
        (re.compile(r'(<em itemtype="text" itemprop="name" class="vhb-title">[^<]+)(</em>)',
        re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + ':&nbsp;' + match.group(2))
    ]

    extra_css = 'h2 {text-align: left} \
                 h3 {font-size: 1em; text-align: left} \
                 h4 {font-size: 1em; text-align: left; margin-bottom: 0em} \
                 em {font-style: normal; font-weight: bold} \
                 .vhb-subline {font-size: 0.6em; text-transform: uppercase} \
                 .vhb-teaser-head {margin-top: 1em; margin-bottom: 1em} \
                 .vhb-caption-wrapper {font-size: 0.6em} \
                 .hcf-location-mark {font-weight: bold} \
                 .panel-body p {margin-top: 0em}'

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('https://profil.vhb.de/sso/login?service=http://www.handelsblatt.com')
            br.select_form(nr=0)
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source)
        style = soup.find('img', alt='Handelsblatt ePaper', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url

    def print_version(self, url):
        main, sep, article_id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + article_id

    def preprocess_html(self, soup):
        # remove all articles without relevant content (e.g., videos)
        article_container = soup.find('div', {'class':'vhb-article-container'})
        if article_container is None:
            self.abort_article()
        else:
            # remove all local hyperlinks
            for a in soup.findAll('a', {'href':True}):
                if a['href'] and a['href'][0] in ['/', '#']:
                    a.replaceWith(a.renderContents())
            return soup

    def postprocess_html(self, soup, first_fetch):
        # convert lists of author(s) and date(s) into simple text
        for cap in soup.findAll('div', {'class':re.compile('.*vhb-article-caption')}):
            cap.replaceWith(cap.renderContents())
        for row in soup.findAll('div', {'class':'vhb-article-author-row'}):
            for ul in row.findAll('ul'):
                entry = ''
                for li in ul.findAll(lambda tag: tag.name == 'li' and not tag.attrs):
                    entry = entry + li.renderContents() + ', '
                for li in ul.findAll(lambda tag: tag.name == 'li' and tag.attrs):
                    entry = entry + li.renderContents() + '<br/>'
                ul.parent.replaceWith(entry)
        # make sure that all figure captions (including the source) are shown
        # without linebreaks by using the alternative text given within <img/>
        # instead of the original text (which is oddly formatted)
        for fig in soup.findAll('figure', {'class':'vhb-image'}):
            fig.find('div', {'class':'vhb-caption'}).replaceWith(fig.find('img')['alt'])
        return soup
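In case it helps anyone adapting the recipe: print_version() simply splices Handelsblatt's print-view path segment into the article URL via rpartition. In isolation (the URL below is a made-up example):

```python
# Rewrite an article URL to its print view, as print_version() above does.
# The example URL is hypothetical.
def print_version(url):
    main, sep, article_id = url.rpartition('/')
    return main + '/v_detail_tab_print/' + article_id

url = 'http://www.handelsblatt.com/politik/beispiel-artikel/12345678.html'
print(print_version(url))
# http://www.handelsblatt.com/politik/beispiel-artikel/v_detail_tab_print/12345678.html
```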
Old 12-01-2017, 03:05 AM   #4
FLR
Junior Member
Posts: 1
Karma: 10
Join Date: Dec 2017
Device: Kindle Oasis
Handelsblatt recipe broken for premium subscribers?

Hello,

It seems to me that downloading Handelsblatt premium articles is broken at the moment. I have a valid user account, but the premium articles do not show up in the downloaded MOBI file. The free articles work just fine.

I tried this with Calibre on Windows and via the command-line interface on Linux.

On Linux I see a lot of "Failed to download article:" and "Could not fetch link" messages during the download, which seem to refer to the paid articles. In the browser, those articles are accessible after login and are also part of the ePaper.

On Linux I use

ebook-convert "handelsblatt.recipe" [destination].mobi --output-profile kindle --username ***** --password *****

for this.

Is it possible that something is wrong with the recipe's login at the moment?

Thank you very much for any help!
Old 12-02-2017, 10:36 AM   #5
Aimylios
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Hi FLR,

Unfortunately I no longer have a subscription and only use the recipe for free articles. At least the login URL is still valid.
If you want, you can send me your login credentials via PM and I'll try to fix it.

