#1
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Recipe for Wirtschaftswoche / Wiwo.de (German Business Weekly)
HiHo,
I took the time to build a recipe for the German Wirtschaftswoche, based on Malfi's Handelsblatt recipe. It is already very usable, though there are still two things I'd like to optimize. I hope you guys can help. Let's start with the recipe "as is" first:
Code:
##
## Title: Wirtschaftswoche Online - wiwo.de
## Contact: Hegi - hegi@teleos-web.de
##
## License: GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
## Copyright: Hegi - hegi@teleos-web.de
## Based on: "Handelsblatt" Recipe by malfi with ideas from the "BBC" Recipe by mattst / Thanks for these examples!
##
## Written: April 2013
## Last Edited: 2013-04-07
##
from calibre.web.feeds.news import BasicNewsRecipe

class Wirtschaftswoche(BasicNewsRecipe):
    title = u'Wirtschaftswoche - WiWo.de'
    description = u'Wirtschaftswoche Online - basierend auf den RSS-Feeds von Wiwo.de'
    cover_url = 'http://upload.wikimedia.org/wikipedia/de/thumb/b/b9/Wirtschaftswoche-Logo.svg/641px-Wirtschaftswoche-Logo.svg.png'
    tags = 'Nachrichten, Blog, Wirtschaft'
    publisher = 'Verlagsgruppe Handelsblatt'
    publication_type = 'newspaper'
    __author__ = 'Hegi'
    __license__ = 'GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html'
    __copyright__ = 'Hegi - hegi@teleos-web.de'
    simultaneous_downloads = 20
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    language = 'de_DE'
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}
    compress_news_images_auto_size = 16
    conversion_options = {
        'title': title,
        'comments': description,
        'tags': tags,
        'language': language,
        'publisher': publisher,
        'authors': publisher,
        'smarten_punctuation': True
    }
    remove_tags_before = dict(attrs={'class': 'hcf-overline'})
    #remove_tags_after = dict(attrs={'class': 'hcf-footer'})
    remove_tags_after = dict(attrs={'class': 'hcf-meta-nav'})
    feeds = [
        (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'),
        (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
        (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'),
        (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'),
        (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'),
        (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'),
        (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'),
        (u'Green-WiWo', u'http://green.wiwo.de/feed/rss/')
    ]
    extra_css = 'h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;} \
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;} \
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;} \
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}'

    def print_version(self, url):
        # Prefix the last path segment of the article URL with the print-view marker
        url = url.split('/')
        url[-1] = 'v_detail_tab_print,'+url[-1]
        url = '/'.join(url)
        return url
1. When an article starts with a "place", the source HTML looks as follows:
Code:
<span class="hcf-location-mark">New York</span>
2. The end of the article text looks like this in the HTML:
Code:
[...]<div id="hcf-footer"><div class="hcf-copyright"> <div class="hcf-copyright-inner"> © 2011 Handelsblatt GmbH - ein Unternehmen der Verlagsgruppe Handelsblatt GmbH & Co. KG </div> </div> <div class="hcf-meta-nav"> [...]
Thanks a lot for your help. I hope the recipe is useful for others, too.
Hegi.
#2
creator of calibre
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You need to update your version of calibre first. Then just add the following extra css to the recipe
Code:
extra_css = '''
.hcf-location-mark:after {
    content: ". "
}
.hcf-location-mark {
    font-style: italic
}
'''
As for the second, there is likely something you are missing; all the best tracking it down.
#3
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Pseudo CSS :after
Hi Kovid,
thanks for your quick reply. My life is a bit crazy these days, so it took me longer to get back to you. And I tried quite a few things in the meantime. Nevertheless, I'm still stuck on the :after CSS pseudo-element. Currently my extra_css looks like this:
Code:
extra_css = 'h1 {font-size: 1.6em; text-align: left} \
h2 {font-size: 1em; font-style: italic; font-weight: normal} \
h3 {font-size: 1.3em;text-align: left} \
h4, h5, h6, a {font-size: 1em;text-align: left} \
.hcf-caption {font-size: 1em;text-align: left; font-style: italic} \
.hcf-location-mark:after {content: ". "} \
.hcf-location-mark {font-style: italic}'
The changelog also says that as of 0.9.24 it is possible to "reduce the size of downloaded images by lowering their quality". I assume this refers to the options "compress_news_images_max_size" and "compress_news_images_auto_size", but they don't appear to have a significant effect. Very strange! I'm running calibre in an ia32 chroot on a Debian amd64 system, but all seems fine:
Code:
$ calibre --version
calibre (calibre 0.9.25)
Last question: once the recipe runs satisfactorily, this is the place to post the final version, correct?
Thanks a lot!
Hegi.
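A possible reason the image options seem to have no effect: as far as calibre's recipe documentation describes it, the compression and scaling parameters are only applied when `compress_news_images` is set to `True`, and the first recipe in this thread sets only `compress_news_images_auto_size`. A minimal sketch of the relevant attributes follows. The class name `ImageOptionsSketch` is made up for illustration, and the `object` fallback exists only so the snippet can also be run outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be read and run outside calibre.
    BasicNewsRecipe = object


class ImageOptionsSketch(BasicNewsRecipe):
    # Master switch: while this is False, the compress_* and scale_*
    # parameters below are ignored and images pass through unmodified.
    compress_news_images = True
    # Aim for JPEGs of about (width * height) / 16 bytes by lowering
    # the JPEG quality.
    compress_news_images_auto_size = 16
    # Optional: scale images down to at most 400x300 pixels.
    scale_news_images = (400, 300)
```

In a real recipe these three lines would simply sit next to the other class attributes such as `oldest_article`.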
#4
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Kovid,
...me again. This is *really* strange: why do I get completely different behaviour and output when I run the recipe from the CLI with ebook-convert than when I run it from the calibre GUI with "download now"? On the CLI things work much more neatly than from the GUI (e.g. from the CLI the CSS works with the :after pseudo-element, and the publisher tag is used instead of just "calibre"). This is weird. I think I'm just going bananas.
Hegi.
#5
creator of calibre
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Presumably because you are running different versions of calibre.
#6
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Hi Kovid,
... so, I did a completely clean new install of 0.9.27 using the official binaries (amd64) from your website and the Python installer, and uninstalled the version in the chroot, so there should now be a clean and current calibre environment. What I notice is the following:
- Whether the :after CSS works or not depends on the selected output format. In the GUI options I have ".mobi" as the preferred output format (in order to email it automatically to my Kindle PW). Previously I made an .epub from the CLI. Now I changed that to ".mobi" as well. RESULT: if the output format is .mobi, the :after CSS does not work; if it is .epub, it does. Could this possibly be a bug?
- The other differences in output seem to be related to the format as well. When creating an .epub I get a header (menu buttons) and a footer ("downloaded by calibre ...", menu buttons).
So the real issue seems to be why CSS :after does not work with the .mobi format. I would be delighted if this hint helps to discover a bug.
Thanks
Hegi.
#7
creator of calibre
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The MOBI format has no support for CSS. You must use either epub or azw3, but note that Amazon does not support periodicals in the azw3 format.
#8
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
@hegi, if you are developing a general recipe for a wide range of readers, you need to be careful with predefined formats. Use as few as possible. You will find these differences between devices and formats.
#9
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
@Divingduck: Stay cool and calm. I work from two ends: firstly, I want to make the recipe work with general options; secondly, I want to optimize for my own device. The options for the latter can then be commented out, and whoever likes them can switch them back on. All will be well!
However, the deeper I dig into this, the more complicated things seem to get. And I'm really busy these days, so things progress *very* slowly.
Hegi.
#10
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Hiho,
... still optimizing ... and still going a bit crazy, since I have only superficial programming skills. Now, if the clever CSS hint from Kovid won't work for the .mobi format, I ask myself whether I could achieve the same using preprocess_html. What I get as input from the website is:
Code:
<span class="hcf-location-mark">Place</span>
Code:
def preprocess_html(self, soup):
    for location in soup.find('span', attrs={'class':'hcf-location-mark'}):
        newloc = location.string +". "
        location.replaceWith(newloc)
    return soup
But I couldn't find this kind of "search and replace" example elsewhere yet.
Thanks.
Hegi.
#11
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Hey Folks,
I seem to be getting nowhere with my limited attempts at preprocess_html. The results are strange, and I'm having difficulty getting to grips with the Beautiful Soup documentation. Couldn't I do the trick more easily with preprocess_regexps? My current status is as follows:
Code:
preprocess_regexps = [(re.compile(r'(<span class="hcf-location-mark">.+) (</span>)', re.DOTALL|re.IGNORECASE), lambda match: "\1'. '\2")]
I found some useful examples for preprocess_regexps here; however, I haven't found a documented way to include the match from the search in the replace part. Many thanks in advance for any useful hints on this matter.
Hegi.
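For readers with the same question, here is a minimal sketch in plain Python (no calibre required) of the ways the `re` module lets a replacement refer to matched groups. It uses the pattern from the working version posted later in the thread; the sample HTML string is made up for illustration. As far as the calibre documentation describes it, the second element of each `preprocess_regexps` tuple should be a callable, so variants 2 and 3 are the ones that carry over to a recipe.

```python
import re

# Group 1 is the opening tag plus the place name, group 2 the closing tag.
pattern = re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                     re.DOTALL | re.IGNORECASE)

html = '<span class="hcf-location-mark">New York</span> Text'

# 1. A plain replacement string: re expands \1 and \2 itself.
with_template = pattern.sub(r'\1. \2', html)

# 2. A callable: its return value is inserted literally, so the groups
#    have to be read from the match object.
with_groups = pattern.sub(lambda m: m.group(1) + '. ' + m.group(2), html)

# 3. A callable that still uses backreference syntax via Match.expand().
with_expand = pattern.sub(lambda m: m.expand(r'\1. \2'), html)

# The attempt above returns the non-raw string "\1'. '\2" from a callable:
# Python reads \1 and \2 as control characters, and re does not expand
# the return value of a callable, so the match is replaced by junk.
broken = pattern.sub(lambda m: "\1'. '\2", html)

print(with_template)
print(repr(broken))
```

The first three variants all yield the span with ". " appended to the place name, while the last one replaces the whole span with control characters.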
#12
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
preprocess_regexps -- use of variables in the replace string
I finally got it working. Here is the regex code that does the trick:
Code:
preprocess_regexps = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2))]
Hegi.
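A rule like this can be checked on a snippet of HTML before running a full download. The sketch below assumes that each (pattern, callable) pair is applied to the page source in order via `pattern.sub`; the helper `apply_rules` and the sample markup are hypothetical and not part of calibre.

```python
import re

preprocess_regexps = [
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)',
                re.DOTALL | re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
]


def apply_rules(html, rules):
    """Hypothetical helper: run each (pattern, callable) pair over the HTML in order."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html


# Made-up sample: one location span and one unrelated span.
sample = ('<p><span class="hcf-location-mark">New York</span> '
          'Article text.</p><span class="other">unchanged</span>')
result = apply_rules(sample, preprocess_regexps)
print(result)
```

Only the `hcf-location-mark` span is touched. Because the pattern uses `[^<]*`, a location span that contains nested tags would not match and would be left as it is.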
#13
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Hi Folks,
... after a couple of weeks of fiddling about, here is my "production quality" recipe for WirtschaftsWoche Online. Enjoy. The template I began with is from Divingduck, and I got his clearance to post my modified version here:
Code:
__license__ = 'GPL v3'
__copyright__ = '2013, Armin Geller'
'''
Fetch WirtschaftsWoche Online
'''
import re
#import time
from calibre.web.feeds.news import BasicNewsRecipe

class WirtschaftsWocheOnline(BasicNewsRecipe):
    title = u'WirtschaftsWoche Online'
    __author__ = 'Armin Geller' # Update AGE 2013-01-05; Modified by Hegi 2013-04-28
    description = u'Wirtschaftswoche Online - basierend auf den RSS-Feeds von Wiwo.de'
    tags = 'Nachrichten, Blog, Wirtschaft'
    publisher = 'Verlagsgruppe Handelsblatt GmbH / Redaktion WirtschaftsWoche Online'
    category = 'business, economy, news, Germany'
    publication_type = 'weekly magazine'
    language = 'de_DE'
    oldest_article = 7
    max_articles_per_feed = 100
    simultaneous_downloads= 20
    auto_cleanup = False
    no_stylesheets = True
    remove_javascript = True
    remove_empty_feeds = True
    # don't duplicate articles from "Schlagzeilen" / "Exklusiv" to other rubrics
    ignore_duplicate_articles = {'title', 'url'}
    # if you want to reduce size for a b/w or E-ink device, uncomment this:
    # compress_news_images = True
    # compress_news_images_auto_size = 16
    # scale_news_images = (400,300)
    timefmt = ' [%a, %d %b %Y]'
    conversion_options = { 'smarten_punctuation' : True,
                           'authors' : publisher,
                           'publisher' : publisher }
    encoding = 'UTF-8'
    cover_source = 'http://www.wiwo-shop.de/wirtschaftswoche/wirtschaftswoche-emagazin-p1952.html'
    masthead_url = 'http://www.wiwo.de/images/wiwo_logo/5748610/1-formatOriginal.png'

    def get_cover_url(self):
        # Take the cover image from the preview box on the shop page
        cover_source_soup = self.index_to_soup(self.cover_source)
        preview_image_div = cover_source_soup.find(attrs={'class':'container vorschau'})
        return 'http://www.wiwo-shop.de'+preview_image_div.a.img['src']

    # Insert ". " after "Place" in <span class="hcf-location-mark">Place</span>
    # If you use .epub format you could also do this as extra_css '.hcf-location-mark:after {content: ". "}'
    preprocess_regexps = [(re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + '. ' + match.group(2))]
    extra_css = 'h1 {font-size: 1.6em; text-align: left} \
        h2 {font-size: 1em; font-style: italic; font-weight: normal} \
        h3 {font-size: 1.3em;text-align: left} \
        h4, h5, h6, a {font-size: 1em;text-align: left} \
        .hcf-caption {font-size: 1em;text-align: left; font-style: italic} \
        .hcf-location-mark {font-style: italic}'
    keep_only_tags = [
        dict(name='div', attrs={'class':['hcf-column hcf-column1 hcf-teasercontainer hcf-maincol']}),
        dict(name='div', attrs={'id':['contentMain']})
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['hcf-link-block hcf-faq-open', 'hcf-article-related']})
    ]
    feeds = [
        (u'Schlagzeilen', u'http://www.wiwo.de/contentexport/feed/rss/schlagzeilen'),
        (u'Exklusiv', u'http://www.wiwo.de/contentexport/feed/rss/exklusiv'),
        #(u'Themen', u'http://www.wiwo.de/contentexport/feed/rss/themen'), # AGE no print version
        (u'Unternehmen', u'http://www.wiwo.de/contentexport/feed/rss/unternehmen'),
        (u'Finanzen', u'http://www.wiwo.de/contentexport/feed/rss/finanzen'),
        (u'Politik', u'http://www.wiwo.de/contentexport/feed/rss/politik'),
        (u'Erfolg', u'http://www.wiwo.de/contentexport/feed/rss/erfolg'),
        (u'Technologie', u'http://www.wiwo.de/contentexport/feed/rss/technologie'),
        #(u'Green-WiWo', u'http://green.wiwo.de/feed/rss/') # AGE no print version
    ]

    def print_version(self, url):
        # Insert the print-view path segment before the last part of the article URL
        main, sep, id = url.rpartition('/')
        return main + '/v_detail_tab_print/' + id
Thanks to all who helped me get there.
Hegi.
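To illustrate what the `print_version` method in the recipe does to a feed link, here is a standalone sketch of the same `rpartition` logic. The article URL is invented for illustration and does not point to a real page, and the local name `article_id` is used here only to avoid shadowing Python's built-in `id`.

```python
def print_version(url):
    """Insert 'v_detail_tab_print' before the last path segment of the URL."""
    main, sep, article_id = url.rpartition('/')
    return main + '/v_detail_tab_print/' + article_id


# Hypothetical article URL, made up for illustration only.
article_url = 'http://www.wiwo.de/politik/beispiel-artikel/1234567.html'
print_url = print_version(article_url)
print(print_url)
```

Running `ebook-convert` on the recipe with the `--test` option is a quick way to check on a couple of articles whether the rewritten URLs actually resolve to the print view.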
#14
creator of calibre
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
#15
Enthusiast
Posts: 44
Karma: 10
Join Date: Dec 2012
Device: Kindle 4 & Kindle PW 3G
Thanks Kovid,
that was really quick!
Hegi.