Old 11-12-2010, 01:17 PM   #1
malfi
Member
malfi began at the beginning.
 
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
Handelsblatt

Hey Kovid, this is ready to be built in ;-)

Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title          = u'Handelsblatt'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.handelsblatt.com/images/logo/logo_handelsblatt.com.png'
    language = 'de'
    keep_only_tags = []
    keep_only_tags.append(dict(name = 'div', attrs = {'class': 'structOneCol'}))
    keep_only_tags.append(dict(name = 'div', attrs = {'id': 'fullText'}))
    remove_tags    = [dict(name='img', attrs = {'src': 'http://www.handelsblatt.com/images/icon/loading.gif'})]

    feeds          = [
                        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
                        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
                        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
                        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
                        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
                        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
                        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
                        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
                        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
                        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
                     ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''
    def print_version(self, url):
        # Extract the numeric article ID that follows the ';' in the feed URL
        m = re.search(r'(?<=;)[0-9]*', url)
        return u'http://www.handelsblatt.com/_b=' + str(m.group(0)) + ',_p=21,_t=ftprint,doc_page=0;printpage'
Old 11-14-2010, 08:03 AM   #2
ganymede
Connoisseur
ganymede began at the beginning.
 
Posts: 57
Karma: 10
Join Date: Nov 2009
Device: Kindle 3
Cool!

A small suggestion:
Some articles are available only to subscribers. Perhaps the corresponding login could be added as well ...
Old 02-02-2011, 01:51 AM   #3
Moik
Member Retired
Moik began at the beginning.
 
Posts: 47
Karma: 10
Join Date: Oct 2010
Device: Kindle 3
Thank you very much! I tried it and it is awesome!
If possible, I would also appreciate support for a subscriber login.
Is it also possible to include the picture slideshows on the Kindle?
Old 02-19-2011, 03:07 AM   #4
Moik
Member Retired
Moik began at the beginning.
 
Posts: 47
Karma: 10
Join Date: Oct 2010
Device: Kindle 3
Oh no, Handelsblatt has updated their homepage... the recipe is not working anymore.
Old 02-19-2011, 03:07 PM   #5
Moik
Member Retired
Moik began at the beginning.
 
Posts: 47
Karma: 10
Join Date: Oct 2010
Device: Kindle 3
Okay guys, sorry for the double (technically even triple) post. But I really put some effort into this and hope somebody is willing to help.

I think Handelsblatt basically changed the way they link the print version of an article. Here is an example:

regular:
http://www.handelsblatt.com/politik/...t/3862170.html

print:
http://www.handelsblatt.com/politik/...t,3862170.html

So I thought all I had to do was change the print_version method at the very bottom:
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title          = u'Handelsblatt2'
    __author__ = 'malfi'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.handelsblatt.com/images/logo/logo_handelsblatt.com.png'
    language = 'de'
    keep_only_tags = []
    keep_only_tags.append(dict(name = 'div', attrs = {'class': 'structOneCol'}))
    keep_only_tags.append(dict(name = 'div', attrs = {'id': 'fullText'}))
    remove_tags    = [dict(name='img', attrs = {'src': 'http://www.handelsblatt.com/images/icon/loading.gif'})]

    feeds          = [
                        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
                        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
                        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
                        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
                        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
                        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
                        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
                        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
                        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
                        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
                     ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''

    def print_version(self, url):
         m = re.search('[0-9]*(?=\.html)', url)
         n = re.search('.(?=[0-9]*\.html)',url)
         return str(n.group(0)) + 'v_detail_tab_print,' + str(m.group(0)) + '.html'
In theory it should work like this:
Code:
 m = re.search('[0-9]*(?=\.html)', url)
Search for any run of digits that is followed by ".html" and call it "m". I used the "\" to escape the "." so that it loses its special meaning.

Code:
n = re.search('.(?=[0-9]*\.html)',url)
Search for anything that precedes "'number'.html" and call it "n".

Code:
return str(n.group(0)) + 'v_detail_tab_print,' + str(m.group(0)) + '.html'
Assemble the print-version URL from "anything that precedes 'number'.html" + the term "v_detail_tab_print," + "number" + ".html".


Unfortunately it doesn't work, but I have the feeling that I am pretty close to the solution and have just made a small error somewhere. Can anybody see it?
Thank you!
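
For reference, here is a minimal sketch of what the two patterns above actually match. It is run on a full article URL that appears in a fetch log elsewhere in this thread, since the example URLs above are truncated. The names `n_fixed`, `posted` and `adjusted` are illustrative only, and the `.*/` pattern is one possible adjustment, not something confirmed against the Handelsblatt site.

```python
import re

# Full article URL taken from a fetch log posted elsewhere in this thread.
url = ('http://www.handelsblatt.com/panorama/aus-aller-welt/'
       'lage-in-fukushima-immer-dramatischer/3991192.html')

# Article number: the digits directly in front of ".html".
m = re.search(r'[0-9]*(?=\.html)', url)

# Prefix pattern as posted: "." matches exactly ONE character, so this
# captures only the slash in front of the number, not the whole prefix.
n = re.search(r'.(?=[0-9]*\.html)', url)

# Possible adjustment (an assumption, not verified against the site):
# ".*/" greedily matches everything up to the last slash before the number.
n_fixed = re.search(r'.*/(?=[0-9]+\.html)', url)

posted = n.group(0) + 'v_detail_tab_print,' + m.group(0) + '.html'
adjusted = n_fixed.group(0) + 'v_detail_tab_print,' + m.group(0) + '.html'

print(repr(n.group(0)))  # the single character captured by the posted pattern
print(posted)            # a relative path, which is not a fetchable URL
print(adjusted)          # full URL with the print marker inserted
```

With the posted pattern the returned string starts with a slash instead of `http://`, which would explain why nothing usable is fetched.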
Old 03-20-2011, 08:24 PM   #6
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 45,187
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
a much easier way is

url = url.split('/')
url[-1] = 'v_detail_tab_print,'+url[-1]
return '/'.join(url)
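
As a rough, standalone sketch, the three lines above can be tried outside calibre as a plain function. In a recipe they would form the body of `print_version(self, url)`. The sample URL is a full article URL from a fetch log posted elsewhere in this thread, and the variable names `parts` and `example` are illustrative.

```python
def print_version(url):
    """Prefix the last path segment of an article URL with the print marker."""
    parts = url.split('/')
    # The last segment is the "<number>.html" file name.
    parts[-1] = 'v_detail_tab_print,' + parts[-1]
    return '/'.join(parts)


example = ('http://www.handelsblatt.com/panorama/aus-aller-welt/'
           'lage-in-fukushima-immer-dramatischer/3991192.html')
print(print_version(example))
```

Splitting on "/" avoids regular expressions entirely: everything before the last slash is kept unchanged, so only the file name is modified.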
Old 03-23-2011, 11:13 AM   #7
Dereks
Connoisseur
Dereks began at the beginning.
 
Posts: 57
Karma: 10
Join Date: Feb 2010
Device: Kindle Paperwhite 1
Quote:
Originally Posted by kovidgoyal View Post
a much easier way is

url = url.split('/')
url[-1] = 'v_detail_tab_print,'+url[-1]
return '/'.join(url)
Regrettably, that doesn't work either. It identifies the articles and downloads their descriptions for the contents section, but the body of each article contains only a hyperlink to it.
Old 03-23-2011, 11:30 AM   #8
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Dereks View Post
Regrettably, that doesn't work either. It identifies the articles and downloads their descriptions for the contents section, but the body of each article contains only a hyperlink to it.
Kovid's code produces the print link described above, but it does it more easily. If that link doesn't work, then it isn't the right one or there's something else wrong with the recipe.
Old 03-23-2011, 07:08 PM   #9
marvin_2
Member
marvin_2 has a spectacular aura about
 
Posts: 24
Karma: 4472
Join Date: Jan 2011
Device: Kindle
I didn't get the print version to work, but found the standard site surprisingly manageable with the keep_only_tags option. The attached version should be OK. The style sheet is rather messy, though, so it would be nice if the print version could be fixed.

Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title          = u'Handelsblatt'
    __author__ = 'malfi'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.handelsblatt.com/images/logo/logo_handelsblatt.com.png'
    language = 'de'
    # Keep only the abstract, body and author blocks of the standard article page.
    # A list of dicts is the usual form for keep_only_tags.
    keep_only_tags = [dict(name = 'div', attrs = {'class': ['hcf-detail-abstract hcf-teaser ajaxify','hcf-detail','hcf-author-wrapper']})]
    # Drop the loading spinner image and the article tool bar.
    remove_tags    = [dict(name='img', attrs = {'src': 'http://www.handelsblatt.com/images/icon/loading.gif'}),
                      dict(name='ul' , attrs={'class':['hcf-detail-tools']})
                     ]

    feeds          = [
                        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
                        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
                        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
                        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
                        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
                        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
                        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
                        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
                        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
                        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
                     ]
    extra_css = '''
        .hcf-headline {font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:x-large;}
        .hcf-overline {font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:x-large;}
        .hcf-exclusive {font-family:Arial,Helvetica,sans-serif; font-style:italic;font-weight:bold; margin-right:5pt;}
        p{font-family:Arial,Helvetica,sans-serif;}
        .hcf-location-mark{font-weight:bold; margin-right:5pt;}
        .MsoNormal{font-family:Helvetica,Arial,sans-serif;}
        .hcf-author-wrapper{font-style:italic;}
        .hcf-article-date{font-size:x-small;}
        .hcf-caption {font-style:italic;font-size:small;}
        img {align:left;}
        '''
Attached Files
File Type: zip handelsblatt.zip (887 Bytes, 233 views)
Old 03-25-2011, 09:59 PM   #10
Moik
Member Retired
Moik began at the beginning.
 
Posts: 47
Karma: 10
Join Date: Oct 2010
Device: Kindle 3
I tried Kovid's version as well and also found that it only produces hyperlinks. But when I look at the details of the article fetch, the log says it fetches the standard version of the article, not the print version.
Code:
Fetching http://www.handelsblatt.com/panorama/aus-aller-welt/lage-in-fukushima-immer-dramatischer/3991192.html
Old 03-26-2011, 04:43 PM   #11
Moik
Member Retired
Moik began at the beginning.
 
Posts: 47
Karma: 10
Join Date: Oct 2010
Device: Kindle 3
Quote:
I didn't get the print version to work, but found the standard site surprisingly manageable with the keep_only_tags option. The attached version should be OK. The style sheet is rather messy, though, so it would be nice if the print version could be fixed.
This is the new code that is implemented in calibre now. Unfortunately, it doesn't work either: it downloads only the first page of a news article that is split into several sections!
Old 04-19-2011, 12:40 AM   #12
Moik
Member Retired
Moik began at the beginning.
 
Posts: 47
Karma: 10
Join Date: Oct 2010
Device: Kindle 3
Does anybody have any further ideas to solve this? I keep hoping to see "Handelsblatt" among the recipe updates in each calibre release, but I never see it. And I am afraid my programming skills are exhausted...
Old 04-20-2011, 06:52 PM   #13
aerodynamik
Enthusiast
aerodynamik doesn't litter
 
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
Quote:
Originally Posted by Moik View Post
Does anybody have any further ideas to solve this? I keep hoping to see "Handelsblatt" among the recipe updates in each calibre release, but I never see it. And I am afraid my programming skills are exhausted...
Kovid's print version code works fine for me.
I replaced keep_only_tags and remove_tags with remove_tags_before and remove_tags_after, and removed the non-existent logo.

Try this:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class Handelsblatt(BasicNewsRecipe):
    title          = u'Handelsblatt'
    __author__ = 'malfi'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
#    cover_url = 'http://www.handelsblatt.com/images/logo/logo_handelsblatt.com.png'
    language = 'de'

    remove_tags_before =  dict(attrs={'class':'hcf-overline'})
    remove_tags_after  =  dict(attrs={'class':'hcf-footer'})

    feeds          = [
                        (u'Handelsblatt Exklusiv',u'http://www.handelsblatt.com/rss/exklusiv'),
                        (u'Handelsblatt Top-Themen',u'http://www.handelsblatt.com/rss/top-themen'),
                        (u'Handelsblatt Schlagzeilen',u'http://www.handelsblatt.com/rss/ticker/'),
                        (u'Handelsblatt Finanzen',u'http://www.handelsblatt.com/rss/finanzen/'),
                        (u'Handelsblatt Unternehmen',u'http://www.handelsblatt.com/rss/unternehmen/'),
                        (u'Handelsblatt Politik',u'http://www.handelsblatt.com/rss/politik/'),
                        (u'Handelsblatt Technologie',u'http://www.handelsblatt.com/rss/technologie/'),
                        (u'Handelsblatt Meinung',u'http://www.handelsblatt.com/rss/meinung'),
                        (u'Handelsblatt Magazin',u'http://www.handelsblatt.com/rss/magazin/'),
                        (u'Handelsblatt Weblogs',u'http://www.handelsblatt.com/rss/blogs')
                     ]

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''

    def print_version(self, url):
        url = url.split('/')
        url[-1] = 'v_detail_tab_print,'+url[-1]
        url = '/'.join(url)
        return url


I just ran this recipe successfully.

It takes fairly long (about 10 minutes), and a MOBI for the Kindle is about 5 MB, I think mostly because of the many images. Reducing the number of feeds might also be enough to bring that down. Let me know if you are happy with it as it is or if you want to change something.
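
As a rough sketch of the feed-reduction idea, the `feeds` list could be filtered down to a few sections. The selection in `WANTED` and the lower `max_articles_per_feed` value are arbitrary examples, not tested recommendations. In the recipe, the resulting `feeds` and `max_articles_per_feed` values would replace the corresponding class attributes.

```python
# The ten feeds from the recipe above.
ALL_FEEDS = [
    (u'Handelsblatt Exklusiv', u'http://www.handelsblatt.com/rss/exklusiv'),
    (u'Handelsblatt Top-Themen', u'http://www.handelsblatt.com/rss/top-themen'),
    (u'Handelsblatt Schlagzeilen', u'http://www.handelsblatt.com/rss/ticker/'),
    (u'Handelsblatt Finanzen', u'http://www.handelsblatt.com/rss/finanzen/'),
    (u'Handelsblatt Unternehmen', u'http://www.handelsblatt.com/rss/unternehmen/'),
    (u'Handelsblatt Politik', u'http://www.handelsblatt.com/rss/politik/'),
    (u'Handelsblatt Technologie', u'http://www.handelsblatt.com/rss/technologie/'),
    (u'Handelsblatt Meinung', u'http://www.handelsblatt.com/rss/meinung'),
    (u'Handelsblatt Magazin', u'http://www.handelsblatt.com/rss/magazin/'),
    (u'Handelsblatt Weblogs', u'http://www.handelsblatt.com/rss/blogs'),
]

# Hypothetical selection: keep only the sections you actually read.
WANTED = ('Top-Themen', 'Finanzen', 'Politik')

# str.endswith accepts a tuple, so one test covers all wanted sections.
feeds = [(title, url) for title, url in ALL_FEEDS if title.endswith(WANTED)]

# Hypothetical lower cap; the recipe above uses 100.
max_articles_per_feed = 25

for title, url in feeds:
    print(title, url)
```

Fewer feeds and a lower article cap mean fewer pages and images to fetch, which should shorten the download and shrink the output file. The actual savings depend on the day's content.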
Old 04-20-2011, 11:08 PM   #14
Moik
Member Retired
Moik began at the beginning.
 
Posts: 47
Karma: 10
Join Date: Oct 2010
Device: Kindle 3
Thank you so much aerodynamik! It finally seems to work again! Thank you! I really do appreciate it. And on first sight it even seems better than ever before!
I don't know what went wrong before...
Old 04-23-2011, 01:33 PM   #15
Dereks
Connoisseur
Dereks began at the beginning.
 
Posts: 57
Karma: 10
Join Date: Feb 2010
Device: Kindle Paperwhite 1
Wow! Thank you very much, aerodynamik, it finally works!
All times are GMT -4. The time now is 06:11 AM.

