Okay guys, sorry for the double (technically even triple) post. But I really put some effort into this and hope somebody is willing to help.
I think Handelsblatt changed the way they link the print version of an article. Here is an example:
regular:
http://www.handelsblatt.com/politik/...t/3862170.html
print:
http://www.handelsblatt.com/politik/...t,3862170.html
So I thought all I had to do was change the print_version method at the very bottom of the recipe:
Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title = u'Handelsblatt2'
    __author__ = 'malfi'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.handelsblatt.com/images/logo/logo_handelsblatt.com.png'
    language = 'de'
    keep_only_tags = []
    keep_only_tags.append(dict(name = 'div', attrs = {'class': 'structOneCol'}))
    keep_only_tags.append(dict(name = 'div', attrs = {'id': 'fullText'}))
    remove_tags = [dict(name='img', attrs = {'src': 'http://www.handelsblatt.com/images/icon/loading.gif'})]
    feeds = [
        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
    ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
    '''

    def print_version(self, url):
        m = re.search('[0-9]*(?=\.html)', url)
        n = re.search('.(?=[0-9]*\.html)',url)
        return str(n.group(0)) + 'v_detail_tab_print,' + str(m.group(0)) + '.html'
In theory it should go like this:
Code:
m = re.search('[0-9]*(?=\.html)', url)
Search for any combination of digits that is followed by ".html" and call it "m". I used the "\" to prevent invoking the special meaning of ".".
Code:
n = re.search('.(?=[0-9]*\.html)',url)
Search for anything that precedes "'number'.html" and call it "n".
Code:
return str(n.group(0)) + 'v_detail_tab_print,' + str(m.group(0)) + '.html'
Assemble my print version out of "anything that precedes 'number'.html" + the term "v_detail_tab_print," + "number" + ".html".
Unfortunately it doesn't work, but I have the feeling that I am pretty close to the solution and have just made a small mistake somewhere! Can anybody see it?
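For anyone who wants to reproduce this outside calibre, here is a minimal sketch that runs the two searches on their own. The URL is a made-up placeholder (the real article path is shortened in my example above), and print_version_sketch is only a hypothetical alternative, not something taken from the calibre docs or tested against the live site.
Code:
```python
import re

# Hypothetical article URL; only the overall shape (".../<number>.html") matters.
url = 'http://www.handelsblatt.com/politik/some-article/3862170.html'

# The two searches exactly as in the recipe (raw strings, same patterns).
m = re.search(r'[0-9]*(?=\.html)', url)
n = re.search(r'.(?=[0-9]*\.html)', url)

print(repr(m.group(0)))  # '3862170'
print(repr(n.group(0)))  # '/'

# Possible alternative (an assumption on my side): rewrite the last path
# segment in one step instead of gluing two separate matches together.
def print_version_sketch(url):
    return re.sub(r'/([0-9]+)\.html$', r'/v_detail_tab_print,\1.html', url)

print(print_version_sketch(url))
```
Run like this, "n" is only the single "/" in front of the number, because "." matches exactly one character and not everything before it. So the returned string would start with "/v_detail_tab_print," instead of the full address, which might be why the download fails. The re.sub variant keeps the rest of the URL untouched, assuming the article number is always the last path segment.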
Thank you!