Old 04-07-2013, 11:29 AM   #1
hegi
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Recipe for Wirtschaftswoche / Wiwo.de (German Business Weekly)

HiHo,

I took the time to build a recipe for the German Wirtschaftswoche based on malfi's Handelsblatt recipe. It's already very usable, though there are still two things I'd like to optimize. I hope you guys can help.

Let's start with the Recipe "as is" first:
Code:
##
## Title:        Wirtschaftswoche Online - wiwo.de
## Contact:      Hegi - hegi@teleos-web.de
##
## License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
## Copyright:    Hegi - hegi@teleos-web.de
## Based on:     "Handelsblatt" Recipe by malfi with ideas from the "BBC" Recipe by mattst / Thanks for these examples!
##
## Written:      April 2013
## Last Edited:  2013-04-07
##


from calibre.web.feeds.news import BasicNewsRecipe


class Wirtschaftswoche(BasicNewsRecipe):
    title            = u'Wirtschaftswoche - WiWo.de'
    description      = u'Wirtschaftswoche Online - basierend auf den RSS-Feeds von Wiwo.de'
    cover_url        = 'http://upload.wikimedia.org/wikipedia/de/thumb/b/b9/Wirtschaftswoche-Logo.svg/641px-Wirtschaftswoche-Logo.svg.png'
    tags             = 'Nachrichten, Blog, Wirtschaft'
    publisher        = 'Verlagsgruppe Handelsblatt'
    publication_type = 'newspaper'


    __author__       = 'Hegi'
    __license__      = 'GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html'
    __copyright__    = 'Hegi - hegi@teleos-web.de'
 
    simultaneous_downloads = 20
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    language = 'de_DE'
 
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}
    compress_news_images_auto_size = 16
    conversion_options = { 'title'       : title,
                           'comments'    : description,
                           'tags'        : tags,
                           'language'    : language,
                           'publisher'   : publisher,
                           'authors'     : publisher,
                           'smarten_punctuation' : True
                         }


    remove_tags_before =  dict(attrs={'class':'hcf-overline'})
    #remove_tags_after  =  dict(attrs={'class':'hcf-footer'})
    remove_tags_after  =  dict(attrs={'class':'hcf-meta-nav'})


    feeds = [
        (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'),
        (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
        (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'),
        (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'),
        (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'),
        (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'),
        (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'),
        (u'Green-WiWo', u'http://green.wiwo.de/feed/rss/')
    ]


    extra_css = 'h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;} \
                 h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;} \
                 p{font-family:Arial,Helvetica,sans-serif;font-size:small;} \
                 body{font-family:Helvetica,Arial,sans-serif;font-size:small;}'      


    def print_version(self, url):
        url = url.split('/')
        url[-1] = 'v_detail_tab_print,'+url[-1]
        url = '/'.join(url)
        return url
Now let's come to the tricky bits:

1. When an article starts with a "place", the source html looks as follows:

Code:
<span class="hcf-location-mark">New York</span>
However, on the WiWo.de website this is shown as "New York. " (with a trailing dot and space). How can I incorporate this into my recipe? Is this possible with extra_css or with a regex?

2. In the HTML, the end of the article text looks like this:

Code:
[...]<div id="hcf-footer"><div class="hcf-copyright">
 <div class="hcf-copyright-inner">
 &copy; 2011 Handelsblatt GmbH - ein Unternehmen der Verlagsgruppe Handelsblatt GmbH &amp; Co. KG
 </div>
</div>
<div class="hcf-meta-nav"> [...]
No matter whether I use "remove_tags_after = dict(attrs={'id':'hcf-footer'})" or "remove_tags_after = dict(attrs={'class':'hcf-meta-nav'})", I still get the service navigation at the end of each article. Am I misunderstanding the option, or could this be related to the fact that I only have calibre 0.8.51 installed?

Thanks a lot for your help. And I hope the recipe is useful for others, too.

Hegi.
Old 04-07-2013, 11:52 AM   #2
kovidgoyal
creator of calibre
Posts: 43,826
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You need to update your version of calibre first. Then just add the following extra css to the recipe

Code:
extra_css = '''
.hcf-location-mark:after {
    content: ". "
}
.hcf-location-mark {
    font-style: italic
}
'''
Support for :before and :after pseudo selectors in calibre is very recent, so you *must* upgrade.

As for the second, there is likely something you are missing. All the best tracking it down.
Old 04-18-2013, 02:12 PM   #3
hegi
Pseudo CSS :after

Hi Kovid,

thanks for your quick reply. My life is a bit crazy these days, so it took me longer to get back to you. AND I tried quite a few things in the meantime.

Nevertheless, I'm still stuck on the :after CSS pseudo-selector.

Currently my extra_css looks like this:

Code:
    extra_css      =  'h1 {font-size: 1.6em; text-align: left} \
                       h2 {font-size: 1em; font-style: italic; font-weight: normal} \
                       h3 {font-size: 1.3em;text-align: left} \
                       h4, h5, h6, a {font-size: 1em;text-align: left} \
                       .hcf-caption {font-size: 1em;text-align: left; font-style: italic} \
                       .hcf-location-mark:after {content: ". "} \
                       .hcf-location-mark {font-style: italic}'
The italics from the last line work. The insertion of the ". " doesn't, despite an upgrade to calibre 0.9.25. According to the changelog, the before/after selectors were fixed in 0.9.24. This is strange. Did I mess up anything else here?

It also says in the changelog, that as of 0.9.24 it is possible to "reduce the size of downloaded images by lowering their quality". I assume this refers to the options "compress_news_images_max_size" and "compress_news_images_auto_size". - But it doesn't appear to have a significant effect. Very strange!
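
A possible reason the compression settings appear to do nothing, going by the calibre recipe API documentation: `compress_news_images_auto_size` and `compress_news_images_max_size` are only honoured when `compress_news_images` is set to True, and the recipe above sets only the auto size. A minimal sketch, assuming that is the cause here (the class name and title are made up, and the fallback base class exists only so the snippet loads outside calibre):

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in used only when calibre is not importable (assumption for
    # illustration); inside calibre the real base class is used.
    class BasicNewsRecipe(object):
        pass


class CompressionSketch(BasicNewsRecipe):
    # Hypothetical title, for illustration only.
    title = u'Image compression sketch'

    # Master switch: while this is False, images are passed through
    # unmodified and the size settings below are ignored.
    compress_news_images = True

    # Factor used for automatic JPEG compression (16 is the value from
    # the recipe in the first post).
    compress_news_images_auto_size = 16
```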

I'm running calibre in an ia32 chroot on a Debian amd64 system. But all seems fine:

Code:
$ calibre --version
calibre (calibre 0.9.25)
Can I get more information from the gui when running the recipe, or do I have to run ebook-convert on the command-line for more debugging information?

Last question: once the recipe runs satisfactorily, this is the place to post the final version, correct?

Thanks a lot!

Hegi.
Old 04-18-2013, 02:57 PM   #4
hegi
Kovid,
...me again.

This is *really* strange: Why do I get completely different behaviour / output when I run the recipe from the cli with ebook-convert than when I run it from calibre gui with "download now"?

On the cli things work much more neatly than from the gui. (E.g. from the cli the CSS works with the :after selector, and the publisher tag is used instead of just saying "calibre".) This is weird.

I think I'm just going bananas.

Hegi.
Old 04-20-2013, 12:00 AM   #5
kovidgoyal
Presumably because you are running different versions of calibre.
Old 04-21-2013, 01:36 PM   #6
hegi
Hi Kovid,

... so, I did a complete clean install of 0.9.27 using the official binaries (amd64) from your website and the Python installer, and uninstalled the version in the chroot, so now there should be a clean and current calibre environment.

What I notice is the following:
- Whether the :after CSS works or not depends on the selected output format. In the gui options I have ".mobi" as preferred output format (in order to email it automatically to my Kindle PW). Previously I made an .epub from the cli. Now I changed that to ".mobi" as well. RESULT: if the output format is .mobi, the :after CSS does not work; if it is .epub, it does. Could this possibly be a bug?

- The other differences in output seem to be related to the format as well. When creating .epub I get a header (menu buttons) and a footer ("downloaded by calibre ...", menu buttons).

So the real issue seems to be, why CSS :after does not work with .mobi format.

I would be delighted, if this hint helps to discover a bug.

Thanks

Hegi.
Old 04-21-2013, 10:59 PM   #7
kovidgoyal
The MOBI format has no support for CSS. You must use either epub or azw3, but note that Amazon does not support periodicals in the azw3 format.
Old 04-23-2013, 02:40 AM   #8
Divingduck
Wizard
Posts: 1,161
Karma: 1404241
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
@hegi, if you are developing a general recipe for a wide range of readers, you need to be careful with predefined formats. Use as few as possible. You will find these differences between devices and formats.
Old 04-23-2013, 03:17 PM   #9
hegi
@Divingduck: Stay cool and calm. I work from two ends: firstly, I want to make the recipe work with general options. Secondly, I want to optimize for my own device. The options for the latter can then be commented out, and whoever likes them can switch them back on. All will be well!

However, the deeper I dig into this, the more complicated things seem to get. And I'm really busy these days, so things progress *very* slowly.

Hegi.
Old 04-29-2013, 02:39 PM   #10
hegi
preprocess_html instead of extra_css

Hiho,

... still optimizing ... and still going a bit crazy, since I have only superficial programming skills.

Now, if the clever CSS hint from Kovid won't work for the .mobi format, I ask myself whether I could achieve the same using preprocess_html. What I get as input from the website is:

Code:
<span class="hcf-location-mark">Place</span>
In order to add a ". " after "Place" can't I do something like:

Code:
    def preprocess_html(self, soup):
        for location in soup.find('span', attrs={'class':'hcf-location-mark'}):
                newloc = location.string +". "
                location.replaceWith(newloc)
        return soup
This is "reverse-engineering" from other recipes. So please don't hit me if the syntax is a bit foolish, OK? But I couldn't find this kind of "search and replace" example elsewhere yet.

Thanks.

Hegi.
Old 05-18-2013, 06:11 AM   #11
hegi
Hey Folks,

I seem to be getting nowhere with my limited attempts at preprocess_html. The results are strange and I'm having difficulty getting to grips with the Beautiful Soup documentation.

Nevertheless, can't I do the trick possibly more easily with preprocess_regexps?

My current status is as follows:

Code:
preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">.+) (</span>)', re.DOTALL|re.IGNORECASE), lambda match: "\1'. '\2")]
But as a result I don't see any change in the output. Could it be that the bracketing of the regex parts and the referencing with \1 or \2 does not work in this case?

I found some useful examples for preprocess_regexps here, however I haven't found a documented way to include the match from the search in the replace part.

Many thanks in advance for any useful hints in this matter.

Hegi.
Old 05-19-2013, 11:16 AM   #12
hegi
preprocess_regexps -- use of variables in the replace string

... me again!

I finally got it working. Here is the regex code that does the trick:

Code:
preprocess_regexps = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2))]
... just to have this documented here.
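
For anyone landing here later, a standalone sketch of the difference, using only Python's `re` module outside calibre (the sample HTML is made up). A callable replacement has its return value inserted literally, so backreferences such as \1 are never expanded in it, while a raw-string template passed to `sub` does expand them. The pattern from the earlier attempt also demands a space directly before `</span>`, so it would not match markup like this at all. Whether `preprocess_regexps` accepts a template string instead of a callable is not assumed here.

```python
import re

# Pattern from the working recipe: group 1 = opening tag plus place name,
# group 2 = closing tag.
working = re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                     re.DOTALL | re.IGNORECASE)

# Pattern from the earlier attempt: note the literal space before (</span>).
earlier = re.compile(r'(<span class="hcf-location-mark">.+) (</span>)',
                     re.DOTALL | re.IGNORECASE)

# Made-up sample markup for illustration.
html = '<p><span class="hcf-location-mark">New York</span>Text</p>'

# Callable replacement: build the result from the match groups explicitly.
via_callable = working.sub(
    lambda match: match.group(1) + '. ' + match.group(2), html)

# Template replacement: a raw string, so \1 and \2 reach the re module intact.
via_template = working.sub(r'\1. \2', html)

# The earlier pattern finds nothing here, because no space precedes </span>.
earlier_match = earlier.search(html)

print(via_callable)
```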

Hegi.
Old 05-19-2013, 11:30 AM   #13
hegi
WirtschaftsWoche Online - working recipe optimized

Hi Folks,

... after a couple of weeks of fiddling about, here is my "production quality" recipe for WirtschaftsWoche Online. Enjoy.

The template I began with is from Divingduck and I got his clearance for posting my modified version here:

Code:
__license__   = 'GPL v3'
__copyright__ = '2013, Armin Geller'

'''
Fetch WirtschaftsWoche Online
'''
import re
#import time
from calibre.web.feeds.news import BasicNewsRecipe
class WirtschaftsWocheOnline(BasicNewsRecipe):
    title                 = u'WirtschaftsWoche Online'
    __author__            = 'Armin Geller' # Update AGE 2013-01-05; Modified by Hegi 2013-04-28
    description           = u'Wirtschaftswoche Online - basierend auf den RSS-Feeds von Wiwo.de'
    tags                  = 'Nachrichten, Blog, Wirtschaft'
    publisher             = 'Verlagsgruppe Handelsblatt GmbH / Redaktion WirtschaftsWoche Online'
    category              = 'business, economy, news, Germany' 
    publication_type      = 'weekly magazine'
    language              = 'de_DE'
    oldest_article        = 7
    max_articles_per_feed = 100
    simultaneous_downloads= 20
    
    auto_cleanup          = False
    no_stylesheets        = True
    remove_javascript     = True
    remove_empty_feeds    = True

    # don't duplicate articles from "Schlagzeilen" / "Exklusiv" to other rubrics 
    ignore_duplicate_articles = {'title', 'url'}

    # if you want to reduce size for a b/w or E-ink device, uncomment this:
    # compress_news_images  = True
    # compress_news_images_auto_size = 16
    # scale_news_images     = (400,300)
    
    timefmt               = ' [%a, %d %b %Y]'

    conversion_options    = { 'smarten_punctuation' : True,
                              'authors'             : publisher,
                              'publisher'           : publisher }
    encoding              = 'UTF-8'
    cover_source          = 'http://www.wiwo-shop.de/wirtschaftswoche/wirtschaftswoche-emagazin-p1952.html'
    masthead_url          = 'http://www.wiwo.de/images/wiwo_logo/5748610/1-formatOriginal.png'

    def get_cover_url(self):
       cover_source_soup = self.index_to_soup(self.cover_source)
       preview_image_div = cover_source_soup.find(attrs={'class':'container vorschau'})
       return 'http://www.wiwo-shop.de'+preview_image_div.a.img['src']

    # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
    # If you use .epub format you could also do this as extra_css '.hcf-location-mark:after {content: ". "}' 
    preprocess_regexps    = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2))]

    extra_css      =  'h1 {font-size: 1.6em; text-align: left} \
                       h2 {font-size: 1em; font-style: italic; font-weight: normal} \
                       h3 {font-size: 1.3em;text-align: left} \
                       h4, h5, h6, a {font-size: 1em;text-align: left} \
                       .hcf-caption {font-size: 1em;text-align: left; font-style: italic} \
                       .hcf-location-mark {font-style: italic}'

    keep_only_tags    = [
                          dict(name='div', attrs={'class':['hcf-column hcf-column1 hcf-teasercontainer hcf-maincol']}),
                          dict(name='div', attrs={'id':['contentMain']})
                        ]

    remove_tags = [
                    dict(name='div', attrs={'class':['hcf-link-block hcf-faq-open', 'hcf-article-related']})
                  ]

    feeds = [
              (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'), 
              (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
              #(u'Themen', u'http://www.wiwo.de/contentexport/feed/rss/themen'), # AGE no print version
              (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'), 
              (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'), 
              (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'), 
              (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'), 
              (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'), 
              #(u'Green-WiWo', u'http://green.wiwo.de/feed/rss/') # AGE no print version
            ]
    def print_version(self, url):
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id
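
To see what print_version does to a feed link, here is the same logic as a standalone function. This is only a sketch: the example URL is made up and not taken from the actual feeds, and the variable is named `article_id` here simply to avoid shadowing Python's built-in `id`.

```python
def print_version(url):
    # Split at the last '/', then put the print-view path segment between
    # the article path and the final component.
    main, sep, article_id = url.rpartition('/')
    return main + '/v_detail_tab_print/' + article_id


# Hypothetical article URL, for illustration only.
example = 'http://www.wiwo.de/politik/beispiel-artikel/1234567.html'
print(print_version(example))
```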
Kovid, it would be great if you could include this in one of the next releases.

Thanks to all who helped me get there!

Hegi.
Old 05-19-2013, 11:47 AM   #14
kovidgoyal
http://bazaar.launchpad.net/~kovid/c...revision/15047
Old 05-19-2013, 01:37 PM   #15
hegi
Thanks Kovid,

that was really quick!

Hegi.