Thread: Handelsblatt
02-19-2011, 03:07 PM   #5
Moik
Okay guys, sorry for the double (technically even triple) post. But I really put some effort into this and hope somebody is willing to help.

I think Handelsblatt basically changed the way they link the print version of an article. Here is an example:

regular:
http://www.handelsblatt.com/politik/...t/3862170.html

print:
http://www.handelsblatt.com/politik/...t,3862170.html

So I thought all I had to do was change the def print_version at the very bottom:
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title          = u'Handelsblatt2'
    __author__ = 'malfi'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.handelsblatt.com/images/logo/logo_handelsblatt.com.png'
    language = 'de'
    keep_only_tags = []
    keep_only_tags.append(dict(name = 'div', attrs = {'class': 'structOneCol'}))
    keep_only_tags.append(dict(name = 'div', attrs = {'id': 'fullText'}))
    remove_tags    = [dict(name='img', attrs = {'src': 'http://www.handelsblatt.com/images/icon/loading.gif'})]

    feeds          = [
                        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
                        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
                        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
                        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
                        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
                        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
                        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
                        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
                        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
                        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
                     ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''

    def print_version(self, url):
        m = re.search('[0-9]*(?=\.html)', url)
        n = re.search('.(?=[0-9]*\.html)',url)
        return str(n.group(0)) + 'v_detail_tab_print,' + str(m.group(0)) + '.html'
In theory it should go like this:
Code:
 m = re.search('[0-9]*(?=\.html)', url)
Search for any run of digits that is followed by ".html" and call it "m". I used the "\" to escape the "." so that it loses its special meaning.

Code:
n = re.search('.(?=[0-9]*\.html)',url)
Search for anything that precedes "'number'.html" and call it "n".
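
A quick way to check both searches is to run them outside calibre in a plain Python shell. The sketch below uses a made-up URL with the same shape as the links above (the real article path is elided here, so "some-article" is only a placeholder). It suggests that "m" does pick up the number, while "n" holds just the single character in front of the number, since a "." without a quantifier matches exactly one character:

```python
import re

# Hypothetical URL; "some-article" is a placeholder, not the real path.
url = 'http://www.handelsblatt.com/politik/some-article/3862170.html'

# Same two patterns as in the recipe, written as raw strings.
m = re.search(r'[0-9]*(?=\.html)', url)
n = re.search(r'.(?=[0-9]*\.html)', url)

print(repr(m.group(0)))  # the run of digits right before ".html"
print(repr(n.group(0)))  # one character only, not the whole prefix
```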

Code:
return str(n.group(0)) + 'v_detail_tab_print,' + str(m.group(0)) + '.html'
Assemble my print version out of "anything that precedes 'number'.html" + the term "v_detail_tab_print," + "number" + ".html".
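
One possible way to do this assembly with a single pattern and two capture groups is sketched below. It is a standalone function for testing, not the recipe method itself, and it assumes the print URL really is "prefix + v_detail_tab_print, + number + .html" as guessed from the example at the top. The URL is again a made-up placeholder:

```python
import re

def print_version(url):
    # Group 1: everything up to and including the last "/".
    # Group 2: the article number right before ".html".
    match = re.match(r'(.*/)([0-9]+)\.html$', url)
    if match is None:
        # Unexpected URL shape: hand it back unchanged.
        return url
    return match.group(1) + 'v_detail_tab_print,' + match.group(2) + '.html'

# Hypothetical URL; "some-article" is a placeholder, not the real path.
url = 'http://www.handelsblatt.com/politik/some-article/3862170.html'
print(print_version(url))
```

Inside the recipe the function would need the extra "self" argument, as in the class above.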


Unfortunately it doesn't work, but I have the feeling that I am pretty close to the solution and have just made a small error somewhere. Can anybody see it?
Thank you!