Re-Write for Handelsblatt Recipe (German Business Newspaper)

hegi · 05-20-2013, 03:45 PM

Hi Kovid,

since Wirtschaftswoche und Handelsblatt are from the same publisher, they use the same CMS. - Here's an improved Handelsblatt Recipe based on my work for the Wirtschaftswoche:

Code:

import re
from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title          = u'Handelsblatt'
    __author__ = 'malfi' ### modified by Hegi, last change 2013-05-20
    description           = u'Handelsblatt - basierend auf den RRS-Feeds von Handelsblatt.de'
    tags 	                = 'Nachrichten, Blog, Wirtschaft'
    publisher             = 'Verlagsgruppe Handelsblatt GmbH'
    category              = 'business, economy, news, Germany' 
    publication_type      = 'daily newspaper'
    language              = 'de_DE'
    oldest_article        = 7
    max_articles_per_feed = 100
    simultaneous_downloads= 20

    auto_cleanup          = False
    no_stylesheets        = True
    remove_javascript     = True
    remove_empty_feeds    = True

    # don't duplicate articles from "Schlagzeilen" / "Exklusiv" to other rubrics 
    ignore_duplicate_articles = {'title', 'url'}

    # if you want to reduce size for an b/w or E-ink device, uncomment this:
    #compress_news_images  = True
    #compress_news_images_auto_size = 16
    #scale_news_images     = (400,300)

    timefmt               = ' [%a, %d %b %Y]'

    conversion_options    = { 'smarten_punctuation' : True,
			'authors'		  : publisher,
			'publisher'  	  : publisher }
    language              = 'de_DE'
    encoding              = 'UTF-8'
    
    cover_source          = 'http://www.handelsblatt-shop.com/epaper/482/'
    #masthead_url          = 'http://www.handelsblatt.com/images/hb_logo/6543086/1-format3.jpg'
    masthead_url          = 'http://www.handelsblatt-chemie.de/wp-content/uploads/2012/01/hb-logo.gif'

    def get_cover_url(self):
       cover_source_soup = self.index_to_soup(self.cover_source)
       preview_image_div = cover_source_soup.find(attrs={'class':'vorschau'})
       return 'http://www.handelsblatt-shop.com'+preview_image_div.a.img['src']

    #remove_tags_before =  dict(attrs={'class':'hcf-overline'})
    #remove_tags_after  =  dict(attrs={'class':'hcf-footer'})
    # Alternatively use this:

    keep_only_tags    = [
                          dict(name='div', attrs={'class':['hcf-column hcf-column1 hcf-teasercontainer hcf-maincol']}),
                          dict(name='div', attrs={'id':['contentMain']})
                        ]

    remove_tags = [
                    dict(name='div', attrs={'class':['hcf-link-block hcf-faq-open', 'hcf-article-related']})
                  ]

    feeds          = [
                        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
                        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
                        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
                        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
                        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
                        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
                        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
                        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
                        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
                        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
                     ]

    # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
    # If you use .epub format you could also do this as extra_css '.hcf-location-mark:after {content: ". "}' 
    preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2))]

    extra_css      =  'h1 {font-size: 1.6em; text-align: left} \
                       h2 {font-size: 1em; font-style: italic; font-weight: normal} \
                       h3 {font-size: 1.3em;text-align: left} \
                       h4, h5, h6, a {font-size: 1em;text-align: left} \
                       .hcf-caption {font-size: 1em;text-align: left; font-style: italic} \
	             .hcf-location-mark {font-style: italic}' 

    def print_version(self, url):
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id

All the best from good ole Germany!

Hegi.

nicolaus · 02-14-2015, 08:40 AM

Dear Hegi

Since a couple of days there seems to be an issue with the Handelsblatt recipe. The download runs through smoothly without errors but when I open them (on a Kindle Paperwhite), I only see the table of contents. When I click an article, it jumps to a table of contents page and if I scroll through the whole file page by page, I only see table of content pages for the different feeds. A download today returned only a 29 page document. It looks the same in the calibre viewer and I have tried downloading on Windows and Linux -- same result on both. Oddly enough, article previews are displayed in the Kindle's Four-Feed-Summary page (sorry, don't know what it's called -- the button right of the light bulb button).

Have you encountered this issue, too? (Or even better: Do you have a solution, yet?

)

Many thanks

Nicolaus

02-14-2015, 08:40 AM	#2
nicolaus Junior Member Posts: 4 Karma: 10 Join Date: Nov 2014 Device: Kindle Paperwhite	Handelsblatt ebook now only contains table of contents Dear Hegi Since a couple of days there seems to be an issue with the Handelsblatt recipe. The download runs through smoothly without errors but when I open them (on a Kindle Paperwhite), I only see the table of contents. When I click an article, it jumps to a table of contents page and if I scroll through the whole file page by page, I only see table of content pages for the different feeds. A download today returned only a 29 page document. It looks the same in the calibre viewer and I have tried downloading on Windows and Linux -- same result on both. Oddly enough, article previews are displayed in the Kindle's Four-Feed-Summary page (sorry, don't know what it's called -- the button right of the light bulb button). Have you encountered this issue, too? (Or even better: Do you have a solution, yet? ) Many thanks Nicolaus

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Recipe for Wirtschaftswoche / Wiwo.de (German Business Weekly)	hegi	Recipes	47	08-08-2022 09:28 AM
Recipe for german newspaper "Berliner Zeitung"	a.peter	Recipes	1	12-13-2011 04:02 PM
Handelsblatt recipe no longer works	Dereks	Recipes	1	03-20-2011 08:22 PM
Recipe Request for Handelsblatt [GER]	Moik	Recipes	6	10-15-2010 08:13 PM
108-year-old Newspaper Journal Out of Business	Moejoe	News	0	12-13-2009 01:24 AM

Advert