04-03-2016, 04:56 PM | #1 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Handelsblatt recipe
Hi,
the current Handelsblatt recipe has been broken for quite some time now. While trying to fix it, I found that the article structure has changed fundamentally, so I had to write a completely new recipe. It would of course be great if the old handelsblatt.recipe could be replaced by the code below. Code:
#!/usr/bin/env python2

__license__ = 'GPL v3'
__copyright__ = '2016, Aimylios'

'''
handelsblatt.com
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe


class Handelsblatt(BasicNewsRecipe):
    title = u'Handelsblatt'
    __author__ = 'Aimylios'  # based on the work of malfi and Hegi
    description = u'RSS-Feeds von Handelsblatt.com'
    publisher = 'Verlagsgruppe Handelsblatt GmbH'
    category = 'news, politics, business, economy, Germany'
    publication_type = 'newspaper'
    language = 'de'
    encoding = 'utf8'

    oldest_article = 4
    max_articles_per_feed = 30
    simultaneous_downloads = 20
    no_stylesheets = True
    remove_javascript = True
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}

    conversion_options = {'smarten_punctuation': True,
                          'publisher': publisher}

    # uncomment this to reduce file size
    # compress_news_images = True

    cover_source = 'https://kaufhaus.handelsblatt.com/downloads/handelsblatt-epaper-p1951.html'
    masthead_url = 'http://www.handelsblatt.com/images/logo_handelsblatt/11002806/7-formatOriginal.png'
    # masthead_url = 'http://www.handelsblatt.com/images/hb_logo/6543086/1-format3.jpg'
    # masthead_url = 'http://www.handelsblatt-chemie.de/wp-content/uploads/2012/01/hb-logo.gif'

    feeds = [
        (u'Top-Themen', u'http://www.handelsblatt.com/contentexport/feed/top-themen'),
        (u'Politik', u'http://www.handelsblatt.com/contentexport/feed/politik'),
        (u'Unternehmen', u'http://www.handelsblatt.com/contentexport/feed/unternehmen'),
        (u'Finanzen', u'http://www.handelsblatt.com/contentexport/feed/finanzen'),
        (u'Technologie', u'http://www.handelsblatt.com/contentexport/feed/technologie'),
        (u'Panorama', u'http://www.handelsblatt.com/contentexport/feed/panorama'),
        (u'Sport', u'http://www.handelsblatt.com/contentexport/feed/sport')
    ]

    keep_only_tags = [
        dict(name='div', attrs={'class': ['vhb-article-container']})
    ]

    remove_tags = [
        dict(name='span', attrs={'class': ['vhb-media', 'vhb-colon']}),
        dict(name='small', attrs={'class': ['vhb-credit']}),
        dict(name='aside', attrs={'class': ['vhb-article-element vhb-left',
                                            'vhb-article-element vhb-left vhb-teasergallery',
                                            'vhb-article-element vhb-left vhb-shorttexts']}),
        dict(name='article', attrs={'class': ['vhb-imagegallery vhb-teaser',
                                              'vhb-teaser vhb-type-video']}),
        dict(name='div', attrs={'class': ['fb-post']}),
        dict(name='blockquote', attrs={'class': ['twitter-tweet']}),
        dict(name='a', attrs={'class': ['twitter-follow-button']})
    ]

    preprocess_regexps = [
        # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
        (re.compile(r'(<span class="hcf-location-mark">[^<]+)(</span>)',
                    re.DOTALL | re.IGNORECASE),
         lambda match: match.group(1) + '. ' + match.group(2)),
        # Insert ": " after "Title" in <em itemtype="text" itemprop="name" class="vhb-title">Title</em>
        (re.compile(r'(<em itemtype="text" itemprop="name" class="vhb-title">[^<]+)(</em>)',
                    re.DOTALL | re.IGNORECASE),
         lambda match: match.group(1) + ': ' + match.group(2))
    ]

    extra_css = 'h2 {text-align: left} \
                 h3 {font-size: 1em; text-align: left} \
                 h4 {font-size: 1em; text-align: left; margin-bottom: 0em} \
                 em {font-style: normal; font-weight: bold} \
                 .vhb-subline {font-size: 0.6em; text-transform: uppercase} \
                 .vhb-article-caption {float: left; padding-right: 0.2em} \
                 .vhb-article-author-cell ul {list-style-type: none; margin: 0em} \
                 .vhb-teaser-head {margin-top: 1em; margin-bottom: 1em} \
                 .vhb-caption-wrapper {font-size: 0.6em} \
                 .hcf-location-mark {font-weight: bold} \
                 .panel-link {color: black; text-decoration: none} \
                 .panel-body p {margin-top: 0em}'

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source)
        style = soup.find('img', alt='Handelsblatt ePaper', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url

    def print_version(self, url):
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id

    def preprocess_html(self, soup):
        # remove all articles without relevant content (e.g., videos)
        article_container = soup.find('div', {'class': 'vhb-article-container'})
        if article_container is None:
            self.abort_article()
        else:
            return soup

    def postprocess_html(self, soup, first_fetch):
        # make sure that all figure captions (including the source) are shown
        # without linebreaks by using the alternative text given within <img/>
        # instead of the original text (which is oddly formatted)
        article_figures = soup.findAll('figure', {'class': 'vhb-image'})
        for fig in article_figures:
            fig.find('div', {'class': 'vhb-caption'}).replaceWith(fig.find('img')['alt'])
        return soup
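For anyone who wants to adapt the recipe: print_version just rewrites the article URL to the print view by inserting an extra path segment in front of the article ID. A quick illustration with a made-up URL and ID: Code:
# Made-up article URL, only to show the rewrite done by print_version()
url = 'http://www.handelsblatt.com/politik/beispiel-artikel/12345678.html'
main, sep, article_id = url.rpartition('/')
print(main + '/v_detail_tab_print/' + article_id)
# -> http://www.handelsblatt.com/politik/beispiel-artikel/v_detail_tab_print/12345678.html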
04-03-2016, 09:31 PM | #2 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
done, thanks.
04-11-2016, 01:07 PM | #3 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Hi Kovid,
thanks for integrating my recipe into calibre! In the meantime I have been working on an improved version; please find the new code below. Changelog:
- optional login for premium subscribers (needs_subscription plus a get_browser login against profil.vhb.de)
- local hyperlinks inside the articles are removed
- author and date lists are converted into plain text in postprocess_html
- a few additional page elements (premium labels, white_content boxes, embedded quotes) are filtered out
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8

from __future__ import unicode_literals

__license__ = 'GPL v3'
__copyright__ = '2016, Aimylios'

'''
handelsblatt.com
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe


class Handelsblatt(BasicNewsRecipe):
    title = 'Handelsblatt'
    __author__ = 'Aimylios'  # based on the work of malfi and Hegi
    description = 'RSS-Feeds von Handelsblatt.com'
    publisher = 'Verlagsgruppe Handelsblatt GmbH'
    publication_type = 'newspaper'
    needs_subscription = 'optional'
    language = 'de'
    encoding = 'utf-8'

    oldest_article = 4
    max_articles_per_feed = 30
    simultaneous_downloads = 20
    no_stylesheets = True
    remove_javascript = True
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}

    conversion_options = {'smarten_punctuation': True,
                          'publisher': publisher}

    # uncomment this to reduce file size
    # compress_news_images = True
    # compress_news_images_max_size = 16

    cover_source = 'https://kaufhaus.handelsblatt.com/downloads/handelsblatt-epaper-p1951.html'
    masthead_url = 'http://www.handelsblatt.com/images/logo_handelsblatt/11002806/7-formatOriginal.png'

    feeds = [
        ('Top-Themen', 'http://www.handelsblatt.com/contentexport/feed/top-themen'),
        ('Politik', 'http://www.handelsblatt.com/contentexport/feed/politik'),
        ('Unternehmen', 'http://www.handelsblatt.com/contentexport/feed/unternehmen'),
        ('Finanzen', 'http://www.handelsblatt.com/contentexport/feed/finanzen'),
        ('Technologie', 'http://www.handelsblatt.com/contentexport/feed/technologie'),
        ('Panorama', 'http://www.handelsblatt.com/contentexport/feed/panorama'),
        ('Sport', 'http://www.handelsblatt.com/contentexport/feed/sport')
    ]

    keep_only_tags = [dict(name='div', attrs={'class': ['vhb-article-container']})]

    remove_tags = [
        dict(name='span', attrs={'class': ['vhb-colon', 'vhb-label-premium']}),
        dict(name='aside', attrs={'class': ['vhb-article-element vhb-left',
                                            'vhb-article-element vhb-left vhb-teasergallery',
                                            'vhb-article-element vhb-left vhb-shorttexts']}),
        dict(name='article', attrs={'class': ['vhb-imagegallery vhb-teaser',
                                              'vhb-teaser vhb-type-video']}),
        dict(name='small', attrs={'class': ['vhb-credit']}),
        dict(name='div', attrs={'class': ['white_content', 'fb-post']}),
        dict(name='a', attrs={'class': ['twitter-follow-button']}),
        dict(name='blockquote')
    ]

    preprocess_regexps = [
        # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
        (re.compile(r'(<span class="hcf-location-mark">[^<]+)(</span>)',
                    re.DOTALL | re.IGNORECASE),
         lambda match: match.group(1) + '. ' + match.group(2)),
        # Insert ": " after "Title" in <em itemtype="text" itemprop="name" class="vhb-title">Title</em>
        (re.compile(r'(<em itemtype="text" itemprop="name" class="vhb-title">[^<]+)(</em>)',
                    re.DOTALL | re.IGNORECASE),
         lambda match: match.group(1) + ': ' + match.group(2))
    ]

    extra_css = 'h2 {text-align: left} \
                 h3 {font-size: 1em; text-align: left} \
                 h4 {font-size: 1em; text-align: left; margin-bottom: 0em} \
                 em {font-style: normal; font-weight: bold} \
                 .vhb-subline {font-size: 0.6em; text-transform: uppercase} \
                 .vhb-teaser-head {margin-top: 1em; margin-bottom: 1em} \
                 .vhb-caption-wrapper {font-size: 0.6em} \
                 .hcf-location-mark {font-weight: bold} \
                 .panel-body p {margin-top: 0em}'

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('https://profil.vhb.de/sso/login?service=http://www.handelsblatt.com')
            br.select_form(nr=0)
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source)
        style = soup.find('img', alt='Handelsblatt ePaper', style=True)['style']
        self.cover_url = style.partition('(')[-1].rpartition(')')[0]
        return self.cover_url

    def print_version(self, url):
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id

    def preprocess_html(self, soup):
        # remove all articles without relevant content (e.g., videos)
        article_container = soup.find('div', {'class': 'vhb-article-container'})
        if article_container is None:
            self.abort_article()
        else:
            # remove all local hyperlinks
            for a in soup.findAll('a', {'href': True}):
                if a['href'] and a['href'][0] in ['/', '#']:
                    a.replaceWith(a.renderContents())
            return soup

    def postprocess_html(self, soup, first_fetch):
        # convert lists of author(s) and date(s) into simple text
        for cap in soup.findAll('div', {'class': re.compile('.*vhb-article-caption')}):
            cap.replaceWith(cap.renderContents())
        for row in soup.findAll('div', {'class': 'vhb-article-author-row'}):
            for ul in row.findAll('ul'):
                entry = ''
                for li in ul.findAll(lambda tag: tag.name == 'li' and not tag.attrs):
                    entry = entry + li.renderContents() + ', '
                for li in ul.findAll(lambda tag: tag.name == 'li' and tag.attrs):
                    entry = entry + li.renderContents() + '<br/>'
                ul.parent.replaceWith(entry)
        # make sure that all figure captions (including the source) are shown
        # without linebreaks by using the alternative text given within <img/>
        # instead of the original text (which is oddly formatted)
        for fig in soup.findAll('figure', {'class': 'vhb-image'}):
            fig.find('div', {'class': 'vhb-caption'}).replaceWith(fig.find('img')['alt'])
        return soup
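Two small notes on the implementation. With needs_subscription set to 'optional', calibre asks for the credentials when the recipe is scheduled; on the command line they can be passed via --username and --password. The cover is taken from the inline style attribute of the ePaper cover image in the shop, so get_cover_url only has to cut the URL out of the style string; a minimal illustration (the style value below is just an assumed example, not the real one): Code:
# Assumed example of the inline style attribute on the ePaper cover image
style = 'background-image: url(https://kaufhaus.handelsblatt.com/media/cover-beispiel.jpg);'
cover_url = style.partition('(')[-1].rpartition(')')[0]
print(cover_url)
# -> https://kaufhaus.handelsblatt.com/media/cover-beispiel.jpg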
12-01-2017, 03:05 AM | #4 |
Junior Member
Posts: 1
Karma: 10
Join Date: Dec 2017
Device: Kindle Oasis
Handelsblatt recipe broken for premium subscribers?
Hello,
it seems that downloading Handelsblatt premium articles is broken at the moment. I have a valid user account, but the premium articles do not show up in the downloaded MOBI file; the free articles work just fine. I have tried this with calibre on Windows and via the command-line interface on Linux. On Linux I see a lot of "Failed to download article:" and "Could not fetch link" messages during the download, which seem to refer to the paid articles. In the browser those articles are accessible after login, and they are also part of the ePaper. On Linux I use ebook-convert "handelsblatt.recipe" [destination].mobi --output-profile kindle --username ***** --password ***** for this. Is it possible that something is wrong with the recipe's login at the moment? Thank you very much for any help!
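In case it helps with debugging, the login step can also be exercised outside calibre with a small standalone script. The URL and form field names below are taken from the recipe; the final check is only a rough guess and may need adjusting: Code:
# Standalone test of the SSO login used by the recipe (needs the mechanize module)
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open('https://profil.vhb.de/sso/login?service=http://www.handelsblatt.com')
br.select_form(nr=0)
br['username'] = 'my-username'   # placeholder
br['password'] = 'my-password'   # placeholder
page = br.submit().read()
# If the returned page still contains a password field, the login most likely
# did not go through (e.g. because the form or the field names have changed)
if b'password' in page.lower():
    print('login form still present - login probably failed')
else:
    print('login looks ok')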
12-02-2017, 10:36 AM | #5 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Hi FLR,
unfortunately I no longer have a subscription and only use the recipe for free articles. At least the login URL is still valid. If you want, you can send me your login credentials via PM and I'll try to fix it.
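Until I get a chance to test with real credentials, a small change to get_browser could at least make a failed login visible in the log. This is only a sketch: the 'abmelden' marker is a guess for a string that only appears once the login has succeeded and would have to be replaced with whatever the logged-in page actually contains. Code:
    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('https://profil.vhb.de/sso/login?service=http://www.handelsblatt.com')
            br.select_form(nr=0)
            br['username'] = self.username
            br['password'] = self.password
            raw = br.submit().read()
            # 'abmelden' is only a guess for a string that shows up after a
            # successful login; replace it with a reliable marker
            if 'abmelden' not in raw.lower():
                self.log.warn('Handelsblatt: login seems to have failed')
        return br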