Thread: FAZ-Net Update
Old 01-10-2014, 01:35 PM   #5
Divingduck
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
Please find attached a new version (V4) of the recipe with support for multipage articles: follow-up pages are now stitched onto the first page by recursively following the 'Nächste Seite' (next page) link.

Spoiler:
Code:
__license__   = 'GPL v3'
__copyright__ = '2008-2011, Kovid Goyal <kovid at kovidgoyal.net>, Darko Miletic <darko at gmail.com>'
'''
Profile to download FAZ.NET
'''

from calibre.web.feeds.news import BasicNewsRecipe

class FazNet(BasicNewsRecipe):
    title                 = 'FAZ.NET'
    __author__            = 'Kovid Goyal, Darko Miletic, Armin Geller' # AGe upd. V4 2014-01-10
    description           = 'Frankfurter Allgemeine Zeitung'
    publisher             = 'Frankfurter Allgemeine Zeitung GmbH'
    category              = 'news, politics, Germany'
    use_embedded_content  = False
    language              = 'de'
    
    max_articles_per_feed = 30
    no_stylesheets        = True
    encoding              = 'utf-8'
    remove_javascript     = True

    keep_only_tags = [dict(attrs={'class':'FAZArtikelEinleitung'}),
                      dict(attrs={'id':'ArtikelTabContent_0'})]

    remove_tags_after = [dict(name='div', attrs={'class':['ArtikelFooter']})]
    remove_tags = [dict(name='div', attrs={'class':['ArtikelFooter']})]

#    recursions = 1                        # AGe 2014-01-10
#    match_regexps = [r'-p[2-9].html$']    # AGe 2014-01-10
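#    The built-in recursion via recursions/match_regexps above is replaced
#    by the explicit append_page() below, which follows the pager link itself.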
                  
    feeds = [
              ('FAZ.NET Aktuell', 'http://www.faz.net/aktuell/?rssview=1'),
              ('Politik', 'http://www.faz.net/aktuell/politik/?rssview=1'),
              ('Wirtschaft', 'http://www.faz.net/aktuell/wirtschaft/?rssview=1'),
              ('Feuilleton', 'http://www.faz.net/aktuell/feuilleton/?rssview=1'),
              ('Sport', 'http://www.faz.net/aktuell/sport/?rssview=1'),
              ('Lebensstil', 'http://www.faz.net/aktuell/lebensstil/?rssview=1'),
              ('Gesellschaft', 'http://www.faz.net/aktuell/gesellschaft/?rssview=1'),
              ('Finanzen', 'http://www.faz.net/aktuell/finanzen/?rssview=1'),
              ('Technik & Motor', 'http://www.faz.net/aktuell/technik-motor/?rssview=1'),
              ('Wissen', 'http://www.faz.net/aktuell/wissen/?rssview=1'),
              ('Reise', 'http://www.faz.net/aktuell/reise/?rssview=1'),
              ('Beruf & Chance', 'http://www.faz.net/aktuell/beruf-chance/?rssview=1'),
              ('Rhein-Main', 'http://www.faz.net/aktuell/rhein-main/?rssview=1')
            ]

# AGe 2014-01-10 New for multipage articles
    INDEX = ''

    def append_page(self, soup, appendtag, position):
        # Follow the 'Nächste Seite' (next page) link, clean up the
        # follow-up page and append its article body to the first page.
        pager = soup.find('a', attrs={'title':'Nächste Seite'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'FAZArtikelContent'})
            if texttag is None:
                return
            # Remove footer, teaser, comment and ad blocks from the appended page;
            # guard each lookup so a missing block no longer raises AttributeError.
            for attrs in ({'class':'ArtikelFooter'},
                          {'class':'ArtikelAbbinder'},
                          {'class':'ArtikelKommentieren Artikelfuss GETS;tk;boxen.top-lesermeinungen;tp;content'},
                          {'class':'Anzeige GoogleAdsBuehne'},
                          {'id':'ArticlePagerBottom'}):
                tag = texttag.find('div', attrs=attrs)
                if tag is not None:
                    tag.extract()
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)  # recurse for page 3, 4, ...
            texttag.extract()
            pager.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        # Drop the pager navigation that is left over on the first page
        pager = soup.find('div', attrs={'id':'ArticlePagerBottom'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)
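
For a quick test from the command line you can feed the recipe to calibre's ebook-convert with the --test switch, which downloads only a couple of articles per feed. The file name faznet_AGe.recipe is just an example for wherever you saved the attached recipe:

Code:
ebook-convert faznet_AGe.recipe faz.epub --test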

Let me know if there are any issues with this version.
Attached Files
faznet_AGe V4.zip (1.3 KB)