Thread: FAZ-Net Update
01-14-2014, 02:36 PM   #8
Divingduck
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
A new update for this recipe.

I made a stupid mistake. The recipe now works again, and for the first time I am using postprocess_html (Kovid, thanks for your hint): append_page stitches the pages of multipage articles together in preprocess_html, and postprocess_html then removes the leftover ArticlePagerBottom navigation divs from the appended pages.

Spoiler:
Code:
__license__   = 'GPL v3'
__copyright__ = '2008-2011, Kovid Goyal <kovid at kovidgoyal.net>, Darko Miletic <darko at gmail.com>'
'''
Profile to download FAZ.NET
'''

from calibre.web.feeds.news import BasicNewsRecipe

class FazNet(BasicNewsRecipe):
    title                 = 'FAZ.NET'
    __author__            = 'Kovid Goyal, Darko Miletic, Armin Geller' # AGe upd. V4 2014-01-14
    description           = 'Frankfurter Allgemeine Zeitung'
    publisher             = 'Frankfurter Allgemeine Zeitung GmbH'
    category              = 'news, politics, Germany'
    use_embedded_content  = False
    language = 'de'
    
    max_articles_per_feed = 30
    no_stylesheets        = True
    encoding              = 'utf-8'
    remove_javascript     = True

    keep_only_tags = [{'class':'FAZArtikelEinleitung'},
            {'id':'ArtikelTabContent_0'}]

    remove_tags_after = [dict(name='div', attrs={'class':['ArtikelFooter']})]
    remove_tags = [dict(name='div', attrs={'class':['ArtikelFooter']})]

#    recursions = 1                        # AGe 2014-01-10
#    match_regexps = [r'-p[2-9].html$']    # AGe 2014-01-10
                  
    feeds = [
              ('FAZ.NET Aktuell', 'http://www.faz.net/aktuell/?rssview=1'),
              ('Politik', 'http://www.faz.net/aktuell/politik/?rssview=1'),
              ('Wirtschaft', 'http://www.faz.net/aktuell/wirtschaft/?rssview=1'),
              ('Feuilleton', 'http://www.faz.net/aktuell/feuilleton/?rssview=1'),
              ('Sport', 'http://www.faz.net/aktuell/sport/?rssview=1'),
              ('Lebensstil', 'http://www.faz.net/aktuell/lebensstil/?rssview=1'),
              ('Gesellschaft', 'http://www.faz.net/aktuell/gesellschaft/?rssview=1'),
              ('Finanzen', 'http://www.faz.net/aktuell/finanzen/?rssview=1'),
              ('Technik & Motor', 'http://www.faz.net/aktuell/technik-motor/?rssview=1'),
              ('Wissen', 'http://www.faz.net/aktuell/wissen/?rssview=1'),
              ('Reise', 'http://www.faz.net/aktuell/reise/?rssview=1'),
              ('Beruf & Chance', 'http://www.faz.net/aktuell/beruf-chance/?rssview=1'),
              ('Rhein-Main', 'http://www.faz.net/aktuell/rhein-main/?rssview=1')
            ]

# AGe 2014-01-10: new code for multipage articles
    INDEX                 = ''
    def append_page(self, soup, appendtag, position):   # AGe upd 2014-01-14
        # Follow the 'Nächste Seite' (next page) link and pull the article
        # text of every following page into the first one.
        pager = soup.find('a', attrs={'title':'Nächste Seite'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'FAZArtikelContent'})
            if texttag is None:
                return
            # Strip footer, teaser, comment and ad blocks before appending;
            # guard against pages where one of them is missing
            for cls in ('ArtikelFooter', 'ArtikelAbbinder',
                        'ArtikelKommentieren Artikelfuss GETS;tk;boxen.top-lesermeinungen;tp;content',
                        'Anzeige GoogleAdsBuehne'):
                div = texttag.find('div', attrs={'class':cls})
                if div is not None:
                    div.extract()
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            pager.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):                    # AGe upd 2014-01-14
        # Stitch multipage articles together before further processing
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)
        
    def postprocess_html(self, soup, first_fetch):      # AGe add 2014-01-14
        # Remove the pager navigation that the appended pages bring along
        for div in soup.findAll(id='ArticlePagerBottom'):
            div.extract()
        return soup


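If you want to give the recipe a quick try from the command line before adding it to calibre, something like this should work (just a sketch; faznet.recipe stands for whatever filename you saved the recipe under):

Code:
ebook-convert faznet.recipe faznet.epub --test -vv
The --test switch fetches only a couple of articles from the first two feeds, so you can check the multipage handling without downloading everything.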
Let me know if there are any issues with this version.
Attached Files
File Type: zip faznet_AGe V5.zip (1.3 KB, 302 views)