Handelsblatt

malfi · 11-12-2010, 01:17 PM

hey Kovid, this is ready to be built in ;-)

Code:

import re

from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title          = u'Handelsblatt'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.handelsblatt.com/images/logo/logo_handelsblatt.com.png'
    language = 'de'
    keep_only_tags = []
    keep_only_tags.append(dict(name = 'div', attrs = {'class': 'structOneCol'}))
    keep_only_tags.append(dict(name = 'div', attrs = {'id': 'fullText'}))
    remove_tags    = [dict(name='img', attrs = {'src': 'http://www.handelsblatt.com/images/icon/loading.gif'})]

    feeds          = [
                        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
                        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
                        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
                        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
                        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
                        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
                        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
                        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
                        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
                        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
                     ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''
    def print_version(self, url):
         m = re.search('(?<=;)[0-9]*', url)
         return u'http://www.handelsblatt.com/_b=' + str(m.group(0)) + ',_p=21,_t=ftprint,doc_page=0;printpage'

ganymede · 11-14-2010, 08:03 AM

Cool!

A small suggestion:
Some articles are only available to subscribers. Perhaps the corresponding login could be added as well ...

Moik · 02-02-2011, 01:51 AM

Thank you very much! I tried it and it is awesome!
If possible I would also appreciate the implementation of a subscriber's login.
Is it also possible to include the picture shows on the Kindle?

Moik · 02-19-2011, 03:07 AM

Oh no, Handelsblatt has updated their homepage... it is not working anymore

Moik · 02-19-2011, 03:07 PM

Okay guys, sorry for the double (technically even triple) post. But I really put some effort into this and hope somebody is willing to help.

I think Handelsblatt basically changed the way they link the print version of an article. Here is an example:

regular:
http://www.handelsblatt.com/politik/...t/3862170.html

print:
http://www.handelsblatt.com/politik/...t,3862170.html

So I thought all I had to do was change the print_version method at the very bottom:

Code:

import re

from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title          = u'Handelsblatt2'
    __author__ = 'malfi'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.handelsblatt.com/images/logo/logo_handelsblatt.com.png'
    language = 'de'
    keep_only_tags = []
    keep_only_tags.append(dict(name = 'div', attrs = {'class': 'structOneCol'}))
    keep_only_tags.append(dict(name = 'div', attrs = {'id': 'fullText'}))
    remove_tags    = [dict(name='img', attrs = {'src': 'http://www.handelsblatt.com/images/icon/loading.gif'})]

    feeds          = [
                        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
                        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
                        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
                        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
                        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
                        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
                        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
                        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
                        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
                        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
                     ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''

    def print_version(self, url):
         m = re.search('[0-9]*(?=\.html)', url)
         n = re.search('.(?=[0-9]*\.html)',url)
         return str(n.group(0)) + 'v_detail_tab_print,' + str(m.group(0)) + '.html'

In theory it should go like this:

Code:

 m = re.search('[0-9]*(?=\.html)', url)

Search for any combination of digits that is followed by ".html" and call it "m". I used the "\" to prevent invoking the special meaning of ".".

Code:

n = re.search('.(?=[0-9]*\.html)',url)

Search for anything that precedes "'number'.html" and call it "n".

Code:

return str(n.group(0)) + 'v_detail_tab_print,' + str(m.group(0)) + '.html'

Assemble my print version from "anything that precedes 'number'.html" + the term "v_detail_tab_print," + "number" + ".html".

Unfortunately it doesn't work. I have the feeling that I am pretty close to the solution and have just made a small error somewhere. Can anybody see it?
Thank you!
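
A quick way to see where this goes wrong is to run both patterns on a sample URL and print what they capture. The sketch below assumes article URLs look like the one in the fetch log further down the thread; the variable names article_id, prefix and result are only illustrative. In a regex, "." matches exactly one character, so the second pattern likely captures only the "/" in front of the number rather than the whole leading part of the URL.

```python
import re

# Sample article URL (taken from a fetch log later in this thread).
url = ('http://www.handelsblatt.com/panorama/aus-aller-welt/'
       'lage-in-fukushima-immer-dramatischer/3991192.html')

# Same patterns as in print_version above, written as raw strings.
m = re.search(r'[0-9]*(?=\.html)', url)
n = re.search(r'.(?=[0-9]*\.html)', url)

article_id = m.group(0)  # the digits directly before ".html"
prefix = n.group(0)      # a single character, not everything before the digits

# What print_version above would return for this URL.
result = prefix + 'v_detail_tab_print,' + article_id + '.html'
print(repr(article_id), repr(prefix), result)
```

If that is what happens, the method returns a path fragment instead of a full URL, which would explain why nothing usable is fetched.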

kovidgoyal · 03-20-2011, 08:24 PM

a much easier way is

url = url.split('/')
url[-1] = 'v_detail_tab_print,'+url[-1]
return '/'.join(url)
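
For illustration, here are the three lines above wrapped in a standalone function and applied to a sample URL (the one from a fetch log later in this thread). Inside a recipe this would be the method print_version(self, url); the local name parts is only illustrative. The sketch shows the string manipulation only; whether the site actually serves the resulting URL is a separate question.

```python
def print_version(url):
    # Split the URL into path segments, prefix the last segment
    # (e.g. "3991192.html") and join everything back together.
    parts = url.split('/')
    parts[-1] = 'v_detail_tab_print,' + parts[-1]
    return '/'.join(parts)


url = ('http://www.handelsblatt.com/panorama/aus-aller-welt/'
       'lage-in-fukushima-immer-dramatischer/3991192.html')
print(print_version(url))
```

Because only the last path segment is touched, the scheme, host and section path are carried over unchanged and no regular expression is needed.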

Dereks · 03-23-2011, 11:13 AM

Quote:

Originally Posted by kovidgoyal

a much easier way is

url = url.split('/')
url[-1] = 'v_detail_tab_print,'+url[-1]
return '/'.join(url)

Regrettably, that doesn't work either. It identifies the articles and downloads their descriptions for the contents section, but the body of each article contains only a hyperlink to it.

Starson17 · 03-23-2011, 11:30 AM

Quote:

Originally Posted by Dereks

Regrettably, that doesn't work either. It identifies the articles and downloads their descriptions for the contents section, but the body of each article contains only a hyperlink to it.

Kovid's code produces the print link described above, but it does it more easily. If that link doesn't work, then it isn't the right one or there's something else wrong with the recipe.

marvin_2 · 03-23-2011, 07:08 PM

I didn't get the print version to work, but found the standard site surprisingly manageable with the keep_only_tags option. The attached version should be OK. The style sheet is rather messy, though; it would be nice if the print version could be fixed.

Spoiler:

Code:

import re

from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title          = u'Handelsblatt'
    __author__ = 'malfi'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.handelsblatt.com/images/logo/logo_handelsblatt.com.png'
    language = 'de'
    # keep_only_tags must be a list of tag specifications;
    # keep only the article abstract, body and author blocks.
    keep_only_tags = [dict(name = 'div', attrs = {'class': ['hcf-detail-abstract hcf-teaser ajaxify','hcf-detail','hcf-author-wrapper']})]
    # keep_only_tags.append(dict(name = 'div', attrs = {'id': 'fullText'}))
    remove_tags    = [dict(name='img', attrs = {'src': 'http://www.handelsblatt.com/images/icon/loading.gif'}),
                      dict(name='ul' , attrs={'class':['hcf-detail-tools']})
                     ]

    feeds          = [
                        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
                        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
                        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
                        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
                        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
                        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
                        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
                        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
                        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
                        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
                     ]
    extra_css = '''
        .hcf-headline {font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:x-large;}
        .hcf-overline {font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:x-large;}
        .hcf-exclusive {font-family:Arial,Helvetica,sans-serif; font-style:italic;font-weight:bold; margin-right:5pt;}
        p{font-family:Arial,Helvetica,sans-serif;}
        .hcf-location-mark{font-weight:bold; margin-right:5pt;}
        .MsoNormal{font-family:Helvetica,Arial,sans-serif;}
        .hcf-author-wrapper{font-style:italic;}
        .hcf-article-date{font-size:x-small;}
        .hcf-caption {font-style:italic;font-size:small;}
        img {align:left;}
        '''

Moik · 03-25-2011, 09:59 PM

I tried Kovid's version as well and also found that it only produces hyperlinks. But when I look at the details of fetching the article, the log says it fetches the standard version of the article, not the print version.

Code:

Fetching http://www.handelsblatt.com/panorama/aus-aller-welt/lage-in-fukushima-immer-dramatischer/3991192.html

Moik · 03-26-2011, 04:43 PM

Quote:

I didn't get the print version to work, but found the standard site surprisingly manageable with the keep_only_tags option. The attached version should be OK. The style sheet is rather messy, though; it would be nice if the print version could be fixed.

This is the new code that is implemented in Calibre now. Unfortunately it doesn't work either: it only downloads the first page of a news article if the article is split into several sections!

Moik · 04-19-2011, 12:40 AM

Does anybody have any further ideas on how to solve this? I keep hoping to see "Handelsblatt" among the recipe updates with each calibre update, but I never do. And I am afraid I have reached the limit of my programming skills...

aerodynamik · 04-20-2011, 06:52 PM

Quote:

Originally Posted by Moik

Does anybody have any further ideas on how to solve this? I keep hoping to see "Handelsblatt" among the recipe updates with each calibre update, but I never do. And I am afraid I have reached the limit of my programming skills...

Kovid's print version code works fine for me.
I replaced remove_tags with remove_tags_before and remove_tags_after, and removed the logo, which no longer exists.

Try this:

Spoiler:

I just ran this recipe successfully.

It takes rather long (10 minutes), and a mobi for the Kindle is about 5 MB, I think mostly because of the many images. Reducing the number of feeds might also be enough to speed it up. Let me know whether you are happy with it as it is or whether you want to change something.

Moik · 04-20-2011, 11:08 PM

Thank you so much aerodynamik! It finally seems to work again! Thank you! I really do appreciate it. And on first sight it even seems better than ever before!
I don't know what went wrong before...

Dereks · 04-23-2011, 01:33 PM

Wow! Thank you very much, aerodynamik, it finally works!
