Recipe for Focus (DE)

xXxXxXxXxXx · 05-21-2011, 12:53 PM

Code:

class AdvancedUserRecipe1305567197(BasicNewsRecipe):
    title          = u'Focus (DE)'
    __author__  = 'xXxXxXxXxXx'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets         = True
    use_embedded_content   = False
    remove_javascript      = True
    
    def print_version(self, url):
        return url + '?drucken=1'
    
    keep_only_tags = [
        dict(name='div', attrs={'id': ['article']})
    ]

    remove_tags = [
        dict(name='div', attrs={'class': 'sidebar'}),
        dict(name='div', attrs={'class': 'commentForm'}),
        dict(name='div', attrs={'class': 'comment clearfix oid-3534591 open'}),
        dict(name='div', attrs={'class': 'similarityBlock'}),
        dict(name='div', attrs={'class': 'footer'}),
        dict(name='div', attrs={'class': 'getMoreComments'}),
        dict(name='div', attrs={'class': 'moreComments'}),
        dict(name='div', attrs={'class': 'ads'}),
        dict(name='div', attrs={'class': 'articleContent'}),
    ]

    remove_tags_after = [
        dict(name='div', attrs={'class': ['commentForm', 'title', 'actions clearfix']})
    ]
   
    feeds = [
        (u'Eilmeldungen', u'http://rss2.focus.de/c/32191/f/533875/index.rss'),
        (u'Auto-News', u'http://rss2.focus.de/c/32191/f/443320/index.rss'),
        (u'Digital-News', u'http://rss2.focus.de/c/32191/f/443315/index.rss'),
        (u'Finanzen-News', u'http://rss2.focus.de/c/32191/f/443317/index.rss'),
        (u'Gesundheit-News', u'http://rss2.focus.de/c/32191/f/443314/index.rss'),
        (u'Immobilien-News', u'http://rss2.focus.de/c/32191/f/443318/index.rss'),
        (u'Kultur-News', u'http://rss2.focus.de/c/32191/f/443321/index.rss'),
        (u'Panorama-News', u'http://rss2.focus.de/c/32191/f/533877/index.rss'),
        (u'Politik-News', u'http://rss2.focus.de/c/32191/f/443313/index.rss'),
        (u'Reisen-News', u'http://rss2.focus.de/c/32191/f/443316/index.rss'),
        (u'Sport-News', u'http://rss2.focus.de/c/32191/f/443319/index.rss'),
        (u'Wissen-News', u'http://rss2.focus.de/c/32191/f/533876/index.rss'),
    ]

schuster · 05-21-2011, 01:50 PM

hi,
sorry, but this needs to be a multipage recipe (example article: Astronomie: Der erdähnlichste Exoplanet).

xXxXxXxXxXx · 05-21-2011, 04:05 PM

Unfortunately, I don't know how to write a recipe for multipage articles; it is too complicated for me. Maybe the author of calibre could change something in the calibre API to make it a lot easier,
or someone could write a (very easy) tutorial.

Starson17 · 05-23-2011, 09:45 AM

Quote:

Originally Posted by xXxXxXxXxXx

Unfortunately, I don't know how to write a recipe for multipage articles; it is too complicated for me. Maybe the author of calibre could change something in the calibre API to make it a lot easier,
or someone could write a (very easy) tutorial.

They are not hard, but you need to grab a sample, read it for what you can understand, then ask questions. Basically, a multipage recipe does this:
1) it finds the link to the "next page"
2) it goes to that page and gets everything on it that the recipe author wants.
3) it pastes that stuff into the first page.
4) it repeats 1-3 until there is no "next page" link.

The recursion is tricky to understand, but the multipage code is already set up and easy to copy. Almost every multipage recipe uses the same bit of multipage code; only two parts differ: the tag used to find the "next page" link, and which part of the next page to keep.
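
A minimal sketch of those four steps, with no calibre involved. The pages are a made-up dictionary, and `fetch`, `append_page` and `build_article` are illustrative names, not calibre API. In a real recipe the same shape appears with `index_to_soup` doing the fetching and a tag being inserted instead of a list being extended.

```python
# Toy stand-in for a website: each "page" has some paragraphs and,
# except for the last one, the URL of the next page.
# All URLs and texts here are made up for illustration.
PAGES = {
    '/article': {'text': ['Page 1, paragraph A', 'Page 1, paragraph B'],
                 'next': '/article-2'},
    '/article-2': {'text': ['Page 2, paragraph A'], 'next': '/article-3'},
    '/article-3': {'text': ['Page 3, paragraph A'], 'next': None},
}


def fetch(url):
    """Hypothetical replacement for downloading and parsing a page."""
    return PAGES[url]


def append_page(page, target):
    """Append the text of all following pages to ``target``."""
    next_url = page['next']            # 1) find the link to the "next page"
    if next_url is None:               # 4) no link: this was the last page
        return
    next_page = fetch(next_url)        # 2) go to that page ...
    target.extend(next_page['text'])   # 3) ... and paste its text into the first page
    append_page(next_page, target)     #    then repeat for the page after it


def build_article(url):
    """Return the full article text, starting from its first page."""
    first = fetch(url)
    article = list(first['text'])
    append_page(first, article)
    return article


if __name__ == '__main__':
    for paragraph in build_article('/article'):
        print(paragraph)
```

The recursion stops only because the last page has no "next page" link, which is why the test for that link matters so much.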

Start with the Adventure Gamer recipe, copy the whole thing here, then ask questions.

schuster · 05-23-2011, 02:53 PM

hi starson, I hope I can ask questions too?

You are right, but I don't understand it.
I'm experimenting without success.

Code:

class AdvancedUserRecipe1305567197(BasicNewsRecipe):
    title          = u'Focus - test'
    __author__  = 'for_test'
    oldest_article = 20
    max_articles_per_feed = 10
    no_stylesheets         = True
    use_embedded_content   = False
    remove_javascript      = True
    

    def get_article_url(self, article):
        return article.get('id', article.get('guid', None))


    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'nextPage greyButton'}) # the pager (next-page link)
        if pager:
            nexturl = self.INDEX + pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'textBlock'}) # the article text
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            appendtag.insert(position,texttag)


    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        for item in soup.findAll('span', attrs={'class':'overhead'}): # element before the text block
            item.extract()
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div',attrs={'class':'pageCounter'}) # the pager on the following pages
        if pager:
            pager.extract()
        return self.adeify_images(soup)


    feeds          = [	(u'Eilmeldungen', u'http://rss2.focus.de/c/32191/f/533875/index.rss'),
                                        (u'Wissen-News', u'http://rss2.focus.de/c/32191/f/533876/index.rss')]

# multipage article in the "Wissen-News" feed:
# Ozonloch-Studie - Zwischen Euphorie und Hysterie

Is this right? I've had no luck grabbing it.
It grabs only the normal pages; the extra pages of multipage articles are lost.

greetings

xXxXxXxXxXx · 05-24-2011, 01:55 PM

I hope that someone finally creates a recipe for this website, because the best way of learning is learning from examples.

So maybe you, Starson17, could create this recipe?

Starson17 · 05-26-2011, 01:34 PM

Quote:

Originally Posted by schuster

hi starson, I hope I can ask questions too?
Is this right?

I do not have much time, but post a link to a multipage article, and I will look at it. I did not see any in my brief look.

Is "pager" ever found? IOW, is this "if" code block ever entered?

Code:

if pager:

schuster · 05-26-2011, 01:51 PM

hi starson,
here is a link to an article that uses multipage.

Code:

http://rss2.focus.de/c/32191/f/533876/s/151d269a/l/0L0Sfocus0Bde0Cwissen0Cwissenschaft0Cmeteorologie0Ctid0E224240Ctornados0Edie0Eden0Esturm0Ejagen0Iaid0I630A10A40Bhtml/story01.htm

Quote:

Is "pager" ever found?

I think so, but the re-insert doesn't seem to work.

Starson17 · 05-26-2011, 03:45 PM

I see 2 problems.
1) You use self.INDEX in your recipe, but it is not defined.
2) I ran the recipe with that removed, and it found instances of pager:

Code:

pager = soup.find('a',attrs={'class':'nextPage greyButton'})

Where there was no <a> element with href attribute.

Code:

pager.a['href']

Until these are fixed, it won't work. The pager is a tag that includes what you need for building a link to the next page. It must only be found on pages that are multipage. You must find or create the link to next page (using INDEX plus href attribute or whatever) from pager. Pager must never be found on the last page of the multipage article (this tells it when it is done building the entire article).

I cannot read German, so can only guess at how to do this.
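
As a rough illustration of the rule "no next-page link means last page", here is a sketch that uses only the Python 3 standard library rather than calibre's soup. The `INDEX` value, the sample markup and the names `NextPageFinder` and `next_page_url` are assumptions made for the example; only the `nextPage` class name comes from the recipe above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

INDEX = 'http://www.focus.de'  # assumed site root; the recipe never defines it


class NextPageFinder(HTMLParser):
    """Remember the href of the first <a> carrying the wanted class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != 'a' or self.href is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        # Only accept the tag if it really has an href to follow.
        if self.wanted_class in classes and attrs.get('href'):
            self.href = attrs['href']


def next_page_url(html, wanted_class='nextPage'):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextPageFinder(wanted_class)
    finder.feed(html)
    if finder.href is None:
        return None
    return urljoin(INDEX, finder.href)


if __name__ == '__main__':
    # Made-up markup: a middle page with a link, and a last page without one.
    middle = '<a class="nextPage greyButton" href="/wissen/artikel-2.html">weiter</a>'
    last = '<a class="nextPage greyButton">weiter</a>'
    print(next_page_url(middle))
    print(next_page_url(last))
```

Checking for the href before using it avoids both problems at once: a pager without a link is treated as "done" instead of raising an error.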

Aimylios · 05-08-2016, 04:24 AM

Hi,

the addresses of the focus.de RSS feeds have been changed. Here's an updated version of the focus_de.recipe.

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

'''
focus.de
'''

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1305567197(BasicNewsRecipe):
    title       = 'Focus (DE)'
    __author__  = 'Anonymous'
    description = 'RSS-Feeds von Focus.de'
    language    = 'de'

    oldest_article            = 7
    max_articles_per_feed     = 100
    no_stylesheets            = True
    remove_javascript         = True
    use_embedded_content      = False
    remove_empty_feeds        = True
    ignore_duplicate_articles = {'title', 'url'}

    feeds = [
        ('Politik', 'http://rss.focus.de/politik/'),
        ('Finanzen', 'http://rss.focus.de/finanzen/'),
        ('Gesundheit', 'http://rss.focus.de/gesundheit/'),
        ('Panorama', 'http://rss.focus.de/panorama/'),
        ('Digital', 'http://rss.focus.de/digital/'),
        ('Reisen', 'http://rss.focus.de/reisen/')
    ]

    keep_only_tags = [
        dict(name='div', attrs={'id':'article'})
    ]

    remove_tags = [
        dict(name='div', attrs={'class':['inimagebuttons',
                                         'kolumneHead clearfix']})
    ]

    remove_attributes = ['width', 'height']

    extra_css = 'h1 {font-size: 1.6em; text-align: left; margin-top: 0em} \
                 h2 {font-size: 1em; text-align: left} \
                 .overhead {margin-bottom: 0em} \
                 .caption {font-size: 0.6em}'

    def print_version(self, url):
        return url + '?drucken=1'

    def preprocess_html(self, soup):
        # remove useless references to videos
        for item in soup.findAll('h2'):
            if item.string:
                txt = item.string.upper()
                if txt.startswith('IM VIDEO:') or txt.startswith('VIDEO:'):
                    item.extract()
        return soup

