Recipe for Wirtschaftswoche / Wiwo.de (German Business Weekly)

hegi · 04-07-2013, 11:29 AM

HiHo,

I took the time to build a recipe for the German Wirtschaftswoche, based on malfi's Handelsblatt recipe. It's already very usable, though there are still two things I'd like to optimize. I hope you guys can help.

Let's start with the Recipe "as is" first:

Code:

##
## Title:        Wirtschaftswoche Online - wiwo.de
## Contact:      Hegi - hegi@teleos-web.de
##
## License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
## Copyright:    Hegi - hegi@teleos-web.de
## Based on:     "Handelsblatt" Recipe by malfi with ideas from the "BBC" Recipe by mattst / Thanks for these examples!
##
## Written:      April 2013
## Last Edited:  2013-04-07
##


from calibre.web.feeds.news import BasicNewsRecipe


class Wirtschaftswoche(BasicNewsRecipe):
    title            = u'Wirtschaftswoche - WiWo.de'
    description      = u'Wirtschaftswoche Online - basierend auf den RSS-Feeds von Wiwo.de'
    cover_url        = 'http://upload.wikimedia.org/wikipedia/de/thumb/b/b9/Wirtschaftswoche-Logo.svg/641px-Wirtschaftswoche-Logo.svg.png'
    tags             = 'Nachrichten, Blog, Wirtschaft'
    publisher        = 'Verlagsgruppe Handelsblatt'
    publication_type = 'newspaper'


    __author__       = 'Hegi'
    __license__      = 'GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html'
    __copyright__    = 'Hegi - hegi@teleos-web.de'
 
    simultaneous_downloads = 20
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    language = 'de_DE'
 
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}
    compress_news_images_auto_size = 16
    conversion_options = { 'title'       : title,
                           'comments'    : description,
                           'tags'        : tags,
                           'language'    : language,
                           'publisher'   : publisher,
                           'authors'     : publisher,
                           'smarten_punctuation' : True
                         }


    remove_tags_before =  dict(attrs={'class':'hcf-overline'})
    #remove_tags_after  =  dict(attrs={'class':'hcf-footer'})
    remove_tags_after  =  dict(attrs={'class':'hcf-meta-nav'})


    feeds          = [
        (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'),
        (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
        (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'),
        (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'),
        (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'),
        (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'),
        (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'),
        (u'Green-WiWo', u'http://green.wiwo.de/feed/rss/')
    ]


    extra_css = 'h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;} \
                 h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;} \
                 p{font-family:Arial,Helvetica,sans-serif;font-size:small;} \
                 body{font-family:Helvetica,Arial,sans-serif;font-size:small;}'      


    def print_version(self, url):
        url = url.split('/')
        url[-1] = 'v_detail_tab_print,'+url[-1]
        url = '/'.join(url)
        return url

Now let's come to the tricky bits:

1. When an article starts with a "place", the source html looks as follows:

Code:

<span class="hcf-location-mark">New York</span>

However, on the WiWo.de website, this is shown as "New York. " (with the trailing dot and space). How can I incorporate this in my recipe? Is this possible with extra_css or with a regex?

2. The end of the article text looks like this in HTML:

Code:

[...]<div id="hcf-footer"><div class="hcf-copyright">
 <div class="hcf-copyright-inner">
 &copy; 2011 Handelsblatt GmbH - ein Unternehmen der Verlagsgruppe Handelsblatt GmbH &amp; Co. KG
 </div>
</div>
<div class="hcf-meta-nav"> [...]

No matter whether I use "remove_tags_after = dict(attrs={'id':'hcf-footer'})" or "remove_tags_after = dict(attrs={'class':'hcf-meta-nav'})", I still get the Service-Nav at the end of each article. Do I misunderstand the option, or could this be related to the fact that I only have calibre 0.8.51 installed?

Thanks a lot for your help. - And hope the recipe is useful for others, too.

Hegi.

kovidgoyal · 04-07-2013, 11:52 AM

You need to update your version of calibre first. Then just add the following extra css to the recipe

Code:

extra_css = '''
.hcf-location-mark:after {
    content: ". "
}
.hcf-location-mark {
    font-style: italic
}
'''

Support for :before and :after pseudo selectors in calibre is very recent, so you *must* upgrade.

As for the second, there is likely something you are missing. All the best tracking it down.

hegi · 04-18-2013, 02:12 PM

Hi Kovid,

thanks for your quick reply. My life is a bit crazy these days, so it took me longer to get back to you. AND - I tried quite a few things in the meantime.

Nevertheless I'm still stuck on the :after CSS pseudo-selector.

Currently my extra_css looks like this:

Code:

    extra_css      =  'h1 {font-size: 1.6em; text-align: left} \
                       h2 {font-size: 1em; font-style: italic; font-weight: normal} \
                       h3 {font-size: 1.3em;text-align: left} \
                       h4, h5, h6, a {font-size: 1em;text-align: left} \
                       .hcf-caption {font-size: 1em;text-align: left; font-style: italic} \
                       .hcf-location-mark:after {content: ". "} \
                       .hcf-location-mark {font-style: italic}'

The italics from the last line work. The insertion of the ". " doesn't, despite an upgrade to calibre 0.9.25. According to the changelog, the :before/:after selectors were fixed in 0.9.24. This is strange. Did I mess up anything else here?

It also says in the changelog, that as of 0.9.24 it is possible to "reduce the size of downloaded images by lowering their quality". I assume this refers to the options "compress_news_images_max_size" and "compress_news_images_auto_size". - But it doesn't appear to have a significant effect. Very strange!

I'm running calibre in an ia32 chroot on a Debian amd64 system. But all seems fine:

Code:

$ calibre --version
calibre (calibre 0.9.25)

Can I get more information from the gui when running the recipe, or do I have to run ebook-convert on the command-line for more debugging information?

Last question: once the recipe runs satisfactorily, this is the place to post the final version, correct?

Thanks a lot!

Hegi.
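
A note on the image options mentioned in this post: as far as the calibre recipe documentation describes them, compress_news_images_auto_size and compress_news_images_max_size are only honoured when compress_news_images is set to True, and the first recipe in this thread sets only the auto-size value. A minimal sketch of the relevant attributes follows. The class name is a made-up stand-in for the BasicNewsRecipe subclass, so the snippet runs without calibre installed, and the values are the ones used in this thread, not recommendations.

```python
class ImageOptionsSketch(object):
    # Hypothetical stand-in for the recipe class; only image options are shown.

    # Master switch: without this, the other compression options are ignored.
    compress_news_images = True

    # Auto-compression factor, as used in the recipe in this thread.
    compress_news_images_auto_size = 16

    # Optional: scale images down to at most (width, height) pixels.
    scale_news_images = (400, 300)


print(ImageOptionsSketch.compress_news_images)
```

In a real recipe these three lines would sit directly in the class body next to oldest_article and the other options.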

hegi · 04-18-2013, 02:57 PM

Kovid,
...me again.

This is *really* strange: why do I get completely different behaviour / output when I run the recipe from the CLI with ebook-convert than when I run it from the calibre GUI with "download now"?

On the CLI things work much more neatly than from the GUI. (E.g. from the CLI the CSS works with the :after selector, and the publisher tag is used instead of just saying "calibre".) This is weird.

I think, I'm just going bananas.

Hegi.

kovidgoyal · 04-20-2013, 12:00 AM

Presumably because you are running different versions of calibre.

hegi · 04-21-2013, 01:36 PM

Hi Kovid,

... so, I did a complete clean new install of 0.9.27 using the official binaries (amd64) from your website and the Python installer, and uninstalled the version in the chroot, so now there should be a clean and up-to-date calibre environment.

What I notice is the following:
- Whether the :after CSS works or not depends on the selected output format. In the GUI options I have ".mobi" as the preferred output format (in order to email it automatically to my Kindle PW). Previously I made an .epub from the CLI. Now I changed that to ".mobi" as well. RESULT: if the output format is .mobi, the :after CSS does not work; if it is .epub, it does. Could this possibly be a bug?

- The other differences in output seem to be related to the format as well. When creating .epub I get a header (menu buttons) and footer ("downloaded by calibre ...", menu buttons).

So the real issue seems to be why CSS :after does not work with the .mobi format.

I would be delighted if this hint helps to discover a bug.

Thanks

Hegi.

kovidgoyal · 04-21-2013, 10:59 PM

The MOBI format has no support for CSS. You must use either epub or azw3, but note that Amazon does not support periodicals in the azw3 format.

Divingduck · 04-23-2013, 02:40 AM

@hegi, if you are developing a general recipe for a wide range of readers, you need to be careful with predefined formats. Use as few as possible. You will find these differences between devices and formats.

hegi · 04-23-2013, 03:17 PM

@Divingduck: Stay cool and calm. - I work from two ends: Firstly, I want to make the recipe work with general options. Secondly, I want to optimize for my own device. The options for the latter can then be commented out, and whoever likes them can switch them back on. - All will be well!

However, the deeper I dig into this, the more complicated things seem to get. And I'm really busy these days, so things progress *very* slowly.

Hegi.

hegi · 04-29-2013, 02:39 PM

Hiho,

... still optimizing ... and still going a bit crazy, since I have only superficial programming skills.

Now, if the clever CSS hint from Kovid won't work for the .mobi format, I wonder whether I could achieve the same using preprocess_html. What I get as input from the website is:

Code:

<span class="hcf-location-mark">Place</span>

In order to add a ". " after "Place" can't I do something like:

Code:

    def preprocess_html(self, soup):
        for location in soup.find('span', attrs={'class':'hcf-location-mark'}):
                newloc = location.string +". "
                location.replaceWith(newloc)
        return soup

This is "reverse-engineered" from other recipes. So please don't hit me if the syntax is a bit foolish, OK?

- But I couldn't find this kind of "search and replace" example elsewhere yet.

Thanks.

Hegi.
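
A note on the snippet in this post: find() returns only the first matching tag (or None when nothing matches), so the for loop walks over the children of that one tag rather than over all location spans. Below is a minimal sketch of the same idea using find_all(). It is written against the standalone bs4 package as an assumption; the BeautifulSoup bundled with calibre at the time may only offer the older findAll() spelling. The helper name add_location_dot and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def add_location_dot(soup):
    # find_all() returns every matching span; find() would return only the first.
    for span in soup.find_all('span', attrs={'class': 'hcf-location-mark'}):
        # Append the trailing ". " as extra text inside the span.
        span.append('. ')
    return soup


# Made-up sample markup in the shape quoted in this thread.
sample = '<p><span class="hcf-location-mark">New York</span>Text</p>'
result = add_location_dot(BeautifulSoup(sample, 'html.parser'))
print(result)
```

In a recipe, the loop would go into preprocess_html(self, soup), which has to return the soup.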

hegi · 05-18-2013, 06:11 AM

Hey Folks,

I seem to be getting nowhere with my limited attempts at preprocess_html. The results are strange and I'm having difficulty getting to grips with the Beautiful Soup documentation.

Nevertheless, can't I do the trick possibly more easily with preprocess_regexps?

My current status is as follows:

Code:

preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">.+) (</span>)', re.DOTALL|re.IGNORECASE), lambda match: "\1'. '\2")]

But as a result I don't see any change in the output. Could it be that the bracketing of the regex parts and the referencing with \1 or \2 does not work in this case?

I found some useful examples for preprocess_regexps here, however I haven't found a documented way to include the match from the search in the replace part.

Many thanks in advance for any useful hints in this matter.

Hegi.

hegi · 05-19-2013, 11:16 AM

... me again!

Finally got it working. Here is the regex code that does the trick:

Code:

preprocess_regexps = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2))]

... just to have this documented here.

Hegi.
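
For anyone comparing the two attempts, here is a small sketch using only Python's re module, on the assumption that calibre applies each preprocess_regexps pair roughly like pattern.sub(function, html). It shows two things: the first pattern requires a literal space before </span>, and a string returned from a callable is inserted verbatim, so "\1" there is a control character rather than a group reference. The sample HTML is made up.

```python
import re

sample = '<p><span class="hcf-location-mark">New York</span>Text</p>'

# First attempt: the pattern demands a literal space right before </span>,
# which the sample markup does not contain, so nothing matches.
first_try = re.compile(r'(<span class="hcf-location-mark">.+) (</span>)',
                       re.DOTALL | re.IGNORECASE)
no_match = first_try.search(sample)

# Working pattern from the post above.
working = re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                     re.DOTALL | re.IGNORECASE)

# A callable's return value is inserted verbatim, so the groups have to be
# read from the match object.
with_callable = working.sub(lambda m: m.group(1) + '. ' + m.group(2), sample)

# Plain re.sub also accepts a template string, where \1 and \2 are expanded.
with_template = working.sub(r'\1. \2', sample)

# Returning "\1'. '\2" from the callable yields control characters, not groups.
literal = working.sub(lambda m: "\1'. '\2", sample)

print(with_callable)
```

Whether calibre itself accepts a template string in place of the callable is not something this thread confirms; the callable form is the one shown to work here.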

hegi · 05-19-2013, 11:30 AM

Hi Folks,

... after a couple of weeks of fiddling about, here is my "production quality" recipe for WirtschaftsWoche Online. Enjoy.

The template I began with is from Divingduck, and I got his clearance for posting my modified version here:

Code:

__license__   = 'GPL v3'
__copyright__ = '2013, Armin Geller'

'''
Fetch WirtschaftsWoche Online
'''
import re
#import time
from calibre.web.feeds.news import BasicNewsRecipe
class WirtschaftsWocheOnline(BasicNewsRecipe):
    title                 = u'WirtschaftsWoche Online'
    __author__            = 'Armin Geller' # Update AGE 2013-01-05; Modified by Hegi 2013-04-28
    description           = u'Wirtschaftswoche Online - basierend auf den RSS-Feeds von Wiwo.de'
    tags                  = 'Nachrichten, Blog, Wirtschaft'
    publisher             = 'Verlagsgruppe Handelsblatt GmbH / Redaktion WirtschaftsWoche Online'
    category              = 'business, economy, news, Germany' 
    publication_type      = 'weekly magazine'
    language              = 'de_DE'
    oldest_article        = 7
    max_articles_per_feed = 100
    simultaneous_downloads= 20
    
    auto_cleanup          = False
    no_stylesheets        = True
    remove_javascript     = True
    remove_empty_feeds    = True

    # don't duplicate articles from "Schlagzeilen" / "Exklusiv" to other rubrics 
    ignore_duplicate_articles = {'title', 'url'}

    # if you want to reduce size for an b/w or E-ink device, uncomment this:
    # compress_news_images  = True
    # compress_news_images_auto_size = 16
    # scale_news_images     = (400,300)
    
    timefmt               = ' [%a, %d %b %Y]'

    conversion_options    = { 'smarten_punctuation' : True,
                              'authors'             : publisher,
                              'publisher'           : publisher }
    encoding              = 'UTF-8'
    cover_source          = 'http://www.wiwo-shop.de/wirtschaftswoche/wirtschaftswoche-emagazin-p1952.html'
    masthead_url          = 'http://www.wiwo.de/images/wiwo_logo/5748610/1-formatOriginal.png'

    def get_cover_url(self):
       cover_source_soup = self.index_to_soup(self.cover_source)
       preview_image_div = cover_source_soup.find(attrs={'class':'container vorschau'})
       return 'http://www.wiwo-shop.de'+preview_image_div.a.img['src']

    # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
    # If you use .epub format you could also do this as extra_css '.hcf-location-mark:after {content: ". "}' 
    preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2))]

    extra_css      =  'h1 {font-size: 1.6em; text-align: left} \
                       h2 {font-size: 1em; font-style: italic; font-weight: normal} \
                       h3 {font-size: 1.3em;text-align: left} \
                       h4, h5, h6, a {font-size: 1em;text-align: left} \
                       .hcf-caption {font-size: 1em;text-align: left; font-style: italic} \
                       .hcf-location-mark {font-style: italic}'

    keep_only_tags    = [
                          dict(name='div', attrs={'class':['hcf-column hcf-column1 hcf-teasercontainer hcf-maincol']}),
                          dict(name='div', attrs={'id':['contentMain']})
                        ]

    remove_tags = [
                    dict(name='div', attrs={'class':['hcf-link-block hcf-faq-open', 'hcf-article-related']})
                  ]

    feeds = [
              (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'), 
              (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
              #(u'Themen', u'http://www.wiwo.de/contentexport/feed/rss/themen'), # AGE no print version
              (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'), 
              (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'), 
              (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'), 
              (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'), 
              (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'), 
              #(u'Green-WiWo', u'http://green.wiwo.de/feed/rss/') # AGE no print version
            ]
    def print_version(self, url):
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id

Kovid, it would be great if you could include this in one of the next releases.

Thanks to all who helped me get there!

Hegi.
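
A side note on print_version: the first recipe in this thread and the final one build the print URL differently, and the thread does not say why the scheme changed. The sketch below runs both variants as plain functions on a made-up article URL so the difference is visible. The function names and the example URL are illustrative only, and neither resulting URL has been checked against the live site.

```python
def print_version_first(url):
    # Variant from the first recipe: prefix the last path segment.
    parts = url.split('/')
    parts[-1] = 'v_detail_tab_print,' + parts[-1]
    return '/'.join(parts)


def print_version_final(url):
    # Variant from the final recipe: insert an extra path segment.
    main, sep, article_id = url.rpartition('/')
    return main + '/v_detail_tab_print/' + article_id


# Made-up article URL, for illustration only.
example = 'http://www.wiwo.de/unternehmen/beispiel-artikel/1234567.html'
print(print_version_first(example))
print(print_version_final(example))
```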

kovidgoyal · 05-19-2013, 11:47 AM

http://bazaar.launchpad.net/~kovid/c...revision/15047

hegi · 05-19-2013, 01:37 PM

Thanks Kovid,

that was really quick!

Hegi.
