08-18-2011, 04:02 PM   #2
oneillpt
Quote:
Originally Posted by emai7s2
I think Liberation changed something on their website this past weekend. Since then, Calibre only downloads news headlines from Liberation without the accompanying articles.
Here is a quick fix to use until the recipe's author revises it. It probably still contains lines that are now redundant. It also seems to pick up a few headlines that I do not see in the RSS feeds, possibly photo features?

Spoiler:
Code:
#!/usr/bin/env python

__license__   = 'GPL v3'
__copyright__ = '2008, Darko Miletic <darko.miletic at gmail.com>'
'''
liberation.fr
'''

from calibre.web.feeds.news import BasicNewsRecipe

class Liberation(BasicNewsRecipe):
    title                 = u'Liberation'
    __author__            = 'Darko Miletic'
    description           = 'News from France'
    language              = 'fr'

    oldest_article        = 7
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    
    html2lrf_options = ['--base-font-size', '10']

    # Keep only the headline (h1) and the article/entry containers
    keep_only_tags    = [
                           dict(name='h1')
                          #,dict(name='div', attrs={'class':'object-content text text-item'})
                          ,dict(name='div', attrs={'class':'article'})
                          #,dict(name='div', attrs={'class':'articleContent'})
                          ,dict(name='div', attrs={'class':'entry'})
                        ]
    remove_tags_after = [ dict(name='div',attrs={'class':'toolbox extra_toolbox'}) ]
    remove_tags    = [
                        dict(name='p', attrs={'class':'clear'})
                       ,dict(name='ul', attrs={'class':'floatLeft clear'})
                       ,dict(name='div', attrs={'class':'clear floatRight'})
                       ,dict(name='object')
                       ,dict(name='div', attrs={'class':'toolbox'})
                       ,dict(name='div', attrs={'class':'cartridge cartridge-basic-bubble cat-zoneabo'})
                       #,dict(name='div', attrs={'class':'clear block block-call-items'})
                       ,dict(name='div', attrs={'class':'block-content'})
                     ]
    
    feeds          = [
                         (u'La une', u'http://www.liberation.fr/rss/laune')
                        ,(u'Monde' , u'http://www.liberation.fr/rss/monde')
                        ,(u'Sports', u'http://www.liberation.fr/rss/sports')
                     ]
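
To check the point about headlines that do not appear in the RSS feeds, it can help to list what a feed actually contains and compare that with what the recipe downloads. Below is a minimal sketch using only the standard library. It assumes the feed is plain RSS 2.0 with `<item>` elements (an Atom feed would use `<entry>` and namespaces instead). The function name `rss_items` and the `SAMPLE` data are hypothetical, made up for illustration, and are not part of calibre or of the recipe above.

```python
import xml.etree.ElementTree as ET


def rss_items(xml_text):
    """Return a list of (title, link) pairs, one per <item> in an RSS 2.0 document.

    Assumption: the feed uses un-namespaced <item>, <title> and <link> elements.
    """
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter('item'):
        # findtext returns None when the child element is missing
        title = (item.findtext('title') or '').strip()
        link = (item.findtext('link') or '').strip()
        items.append((title, link))
    return items


# Hypothetical sample standing in for the text of a downloaded feed
SAMPLE = """
<rss version="2.0">
  <channel>
    <title>Exemple</title>
    <item>
      <title>Premier titre</title>
      <link>http://example.com/a</link>
    </item>
    <item>
      <title> Second titre </title>
      <link>http://example.com/b</link>
    </item>
  </channel>
</rss>
"""

for title, link in rss_items(SAMPLE):
    print(title, link)
```

Feeding it the downloaded text of one of the feed URLs from the recipe, instead of `SAMPLE`, should give the list of headlines to compare against the generated e-book. Any headline in the e-book that is missing from that list would be one of the extras mentioned above.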