Old 04-07-2013, 11:29 AM   #1
hegi
Device: Kindle 4 & Kindle PW 3G
Recipe for Wirtschaftswoche / Wiwo.de (German Business Weekly)

HiHo,

I took the time to build a recipe for the German Wirtschaftswoche, based on malfi's Handelsblatt recipe. It's already very usable, but there are still two things I'd like to improve. I hope you guys can help.

Let's start with the recipe as it stands:
Code:
##
## Title:        Wirtschaftswoche Online - wiwo.de
## Contact:      Hegi - hegi@teleos-web.de
##
## License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
## Copyright:    Hegi - hegi@teleos-web.de
## Based on:     "Handelsblatt" Recipe by malfi with ideas from the "BBC" Recipe by mattst / Thanks for these examples!
##
## Written:      April 2013
## Last Edited:  2013-04-07
##


from calibre.web.feeds.news import BasicNewsRecipe


class Wirtschaftswoche(BasicNewsRecipe):
    title            = u'Wirtschaftswoche - WiWo.de'
    description      = u'Wirtschaftswoche Online - basierend auf den RSS-Feeds von Wiwo.de'
    cover_url        = 'http://upload.wikimedia.org/wikipedia/de/thumb/b/b9/Wirtschaftswoche-Logo.svg/641px-Wirtschaftswoche-Logo.svg.png'
    tags             = 'Nachrichten, Blog, Wirtschaft'
    publisher        = 'Verlagsgruppe Handelsblatt'
    publication_type = 'newspaper'


    __author__       = 'Hegi'
    __license__      = 'GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html'
    __copyright__    = 'Hegi - hegi@teleos-web.de'
 
    simultaneous_downloads = 20
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    language = 'de_DE'
 
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}
    compress_news_images_auto_size = 16
    conversion_options = { 'title'       : title,
                           'comments'    : description,
                           'tags'        : tags,
                           'language'    : language,
                           'publisher'   : publisher,
                           'authors'     : publisher,
                           'smarten_punctuation' : True
                         }


    remove_tags_before =  dict(attrs={'class':'hcf-overline'})
    #remove_tags_after  =  dict(attrs={'class':'hcf-footer'})
    remove_tags_after  =  dict(attrs={'class':'hcf-meta-nav'})


    feeds = [
        (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'),
        (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
        (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'),
        (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'),
        (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'),
        (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'),
        (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'),
        (u'Green-WiWo', u'http://green.wiwo.de/feed/rss/')
    ]


    extra_css = 'h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;} \
                 h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;} \
                 p{font-family:Arial,Helvetica,sans-serif;font-size:small;} \
                 body{font-family:Helvetica,Arial,sans-serif;font-size:small;}'      


    def print_version(self, url):
        url = url.split('/')
        url[-1] = 'v_detail_tab_print,'+url[-1]
        url = '/'.join(url)
        return url
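
To make the print_version rewrite easier to check outside calibre, here is a minimal standalone sketch of the same transformation. The example URL is hypothetical (I made up the path and article number); only the 'v_detail_tab_print,' prefix comes from the recipe.

```python
def print_version(url):
    """Standalone copy of the recipe's URL rewrite: prefix the last
    path segment with 'v_detail_tab_print,'."""
    head, sep, last = url.rpartition('/')
    return head + sep + 'v_detail_tab_print,' + last


# Hypothetical article URL, used for illustration only.
example = 'http://www.wiwo.de/politik/example-article/1234567.html'
print(print_version(example))
```

Using rpartition instead of split/join is just a shorter way to touch only the last segment; the result should be the same as in the recipe method.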
Now for the tricky bits:

1. When an article starts with a place name, the source HTML looks like this:

Code:
<span class="hcf-location-mark">New York</span>
However, on the WiWo.de website this is displayed as "New York. " (with a trailing dot and space). How can I reproduce this in my recipe? Is it possible with extra_css or with a regular expression?
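
One possible route for the regular-expression variant is calibre's preprocess_regexps attribute, which takes a list of (compiled pattern, replacement function) pairs that are applied to the downloaded HTML. The sketch below is only an illustration of that idea: the names LOCATION_RE and add_location_dot are made up, and the pattern assumes the span looks exactly like the snippet above, with no extra attributes or nested tags.

```python
import re

# Matches the location span from the snippet above (assumed markup).
LOCATION_RE = re.compile(
    r'(<span class="hcf-location-mark">)(.*?)(</span>)',
    re.DOTALL | re.IGNORECASE)


def add_location_dot(match):
    """Append '. ' to the place name inside the span."""
    return match.group(1) + match.group(2).strip() + '. ' + match.group(3)


# Inside the recipe class this would presumably be wired up as:
#     preprocess_regexps = [(LOCATION_RE, add_location_dot)]

# Quick check on a sample line, independent of calibre.
html = '<p><span class="hcf-location-mark">New York</span>Some text</p>'
fixed = LOCATION_RE.sub(add_location_dot, html)
print(fixed)
```

A CSS rule that generates the dot via content would be the other option, but I'm not sure every reader device renders generated content, so the regex looks like the safer bet.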

2. The end of the article text looks like this in the HTML:

Code:
[...]<div id="hcf-footer"><div class="hcf-copyright">
 <div class="hcf-copyright-inner">
 &copy; 2011 Handelsblatt GmbH - ein Unternehmen der Verlagsgruppe Handelsblatt GmbH &amp; Co. KG
 </div>
</div>
<div class="hcf-meta-nav"> [...]
No matter whether I use "remove_tags_after = dict(attrs={'id':'hcf-footer'})" or "remove_tags_after = dict(attrs={'class':'hcf-meta-nav'})", I still get the service navigation at the end of each article. Am I misunderstanding the option, or could this be related to the fact that I only have Calibre 0.8.51 installed?
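
A possible explanation, as far as I understand the option: remove_tags_after keeps the matched tag itself and only drops what comes after it, so the navigation div would survive whenever it is the match or sits inside the matched footer. A minimal sketch under that assumption is to list both elements in remove_tags instead. The attribute values are taken from the HTML snippet above; this has not been run against Calibre 0.8.51.

```python
# These assignments would go inside the recipe class body.
# remove_tags deletes the matched elements themselves, unlike
# remove_tags_after, which (by my reading) keeps the matched tag.
remove_tags = [
    dict(attrs={'id': 'hcf-footer'}),       # copyright footer
    dict(attrs={'class': 'hcf-meta-nav'}),  # service navigation
]
```

If that works, remove_tags_after could probably be dropped altogether or pointed at the last element of the article body.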

Thanks a lot for your help. I hope the recipe is useful for others, too.

Hegi.