Need help with le_temps.recipe

rk43 · 01-30-2014, 05:47 AM

I cannot get the builtin recipe for "Le Temps" to work properly.

Perhaps I am doing something stupid, as I have never tried the news part of calibre before. But I did get some other recipes to work properly.

I noticed that the structure of the pages has probably changed since the recipe was created (2009). I therefore tried to adjust the recipe but had no success at all.

I also noticed that the RSS feeds from "Le Temps" are redirected to rss.feedsportal.com, a site providing statistical information to online publishers.

I then started to suspect a login problem. Note that the login through get_browser seems to work; at least I receive the proper error message if I provide a wrong password.

I therefore compared the EPUBs produced with and without login (I set needs_subscription to False and removed the get_browser override). I was very surprised to see that the resulting EPUBs are exactly the same. The table of contents is built properly, but the pages themselves contain only the title and the teaser, plus some icons probably added by rss.feedsportal.com (see the attached epub).

I suspect, but have no proof, that the cookies returned by the login process are not stored properly, perhaps with the wrong url.

Now I don't know what to do next or where to look. Any help would be greatly appreciated.

R. Kessi

kovidgoyal · 01-30-2014, 07:35 AM

Remove keep_only_tags and remove_tags and add use_embedded_content = False

Then you should get the full page downloads, which you can clean up by adding the keep_only_tags and remove_tags back one by one.
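
As an illustration of that advice, here is a minimal sketch of a stripped-down recipe. It is not the built-in le_temps.recipe: the class name LeTempsDebug and the single feed are assumptions chosen for the example, and the import fallback is only there so the snippet can be loaded outside calibre.

```python
# Sketch only: a stripped-down debugging recipe, not the built-in le_temps.recipe.
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be imported outside calibre (assumption for illustration).
    BasicNewsRecipe = object


class LeTempsDebug(BasicNewsRecipe):  # hypothetical class name
    title = u'Le Temps (debug)'
    language = 'fr'
    # Download the linked article pages instead of using the text embedded in the feed.
    use_embedded_content = False
    # keep_only_tags and remove_tags are deliberately not set here;
    # add them back one entry at a time once full pages are arriving.
    feeds = [(u'Actualit\xe9', u'http://letemps.ch/rss/site/')]
```

Once the full pages come through, each keep_only_tags or remove_tags entry can be restored and tested separately, so a filter that empties the page is easy to spot.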

rk43 · 01-30-2014, 09:02 AM

Thank you very much Kovid for your fast response.

I had already removed the keep_only_tags and remove_tags. However I completely missed the use_embedded_content = False.

Now I get the pages. Next week, I hope to find some time to work on the page formatting. When I have something I am happy with, I will post the modified recipe here.

rk43 · 02-22-2014, 08:57 AM

I have made some progress in fixing le_temps.recipe. Although it can certainly still be improved, this one works fine for me, which is why I am submitting it here.

If anybody has suggestions to improve it, I would be glad to hear them.

Code:

#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import with_statement

__license__   = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

#-------------------------------
#   Modified by Roland Kessi - February 2014
#-------------------------------

from calibre.web.feeds.news import BasicNewsRecipe

class LeTemps(BasicNewsRecipe):
    title          = u'Le Temps'
    oldest_article = 7
    max_articles_per_feed = 100
    __author__ = 'Kovid Goyal'
    description = 'French news. Needs a subscription from http://www.letemps.ch'
    no_stylesheets = True
    remove_javascript = True
    recursions = 1
    encoding = 'UTF-8'
    match_regexps = [r'http://www.letemps.ch/Page/Uuid/[-0-9a-f]+\|[1-9]']     
    language = 'fr'
    needs_subscription = True
    simultaneous_downloads = 5
    use_embedded_content = False
    remove_empty_feeds = True
    
    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        br.open('http://www.letemps.ch/login')
        br.select_form(nr=1)
        br['username'] = self.username
        br['password'] = self.password
        raw = br.submit().read()
        if '>Login' in raw:
            raise ValueError('Failed to login to letemps.ch. Check '
                'your username and password')
        return br

    def get_article_url(self, article):
        '''
            Override in a subclass to customize extraction of the :term:`URL` that points
            to the content for each article. Return the
            article URL. It is called with `article`, an object representing a parsed article
            from a feed. See `feedparser <http://packages.python.org/feedparser/>`_.
            By default it looks for the original link (for feeds syndicated via a
            service like feedburner or pheedo) and if found,
            returns that or else returns
            `article.link <http://packages.python.org/feedparser/reference-entry-link.html>`_.
        '''
        #=======================================================================
        # Avoid going through http://rss.feedsportal.com/...
        #=======================================================================
        for key in article.keys():
            if key.endswith('_origlink'):
                url = article[key]
                if url and url.startswith('http://'):
                    print ('Url is :', url)
                    return url
        ans = article.get('link', None)
        if not ans and getattr(article, 'links', None):
            for item in article.links:
                if item.get('rel', 'alternate') == 'alternate':
                    ans = item['href']
                    break
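        if not ans:
            # No link found at all; nothing to decode
            return ans
        pos = ans.find('letemps0Bch')
        if pos == -1:
            # Not a feedsportal-encoded letemps.ch link; return it unchanged
            return ans
        ans = 'http://www.' + ans[pos:]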
        ans = ans.replace('0A', '0')
        ans = ans.replace('0B', '.')
        ans = ans.replace('0C', '/')
        ans = ans.replace('0E', '-')
        return ans


    keep_only_tags =    [
                            dict(name='div', attrs={'id':'content'}),
                        ]
    remove_tags    =    [
                            dict(name='div', attrs={'id':'html5_gallery'}),
                            dict(name='ul', attrs={'class':['tabs']}),
                            dict(name='img', attrs={'class':['bigImg']}),
                            dict(name='div', attrs={'class':['box function','contentInserts','box banner','box additional','galleryOverview','position','rightAd','bottomAd','video',]}),
                        ]
     
    extra_css       =   '''
                        h1{font-family:"Georgia","Times New Roman",Times,serif;font-size:large;}
                        .headline{font-family:"Georgia","Times New Roman",Times,serif;font-size:large;color:#990000;}
                        .summary_gal{color:#777777;font-family:"Georgia","Times New Roman",Times,serif;font-size:x-small;}
                        #capt{color:#1B1B1B;font-family:"Georgia","Times New Roman",Times,serif;font-size:x-small;}
                        #content{font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;}
                        .box.article.important{font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;}
                        #h2 {font-size: 24px; line-height: 25px; margin-bottom: 14px; text-transform:uppercase;}
                        .author {font-size:x-small; margin: 0 0 5px 0; color:#797971; font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;}
                        .lead {font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;font-weight: bold; margin: 10px 0;font-size:small;}
                        p {margin: 0 0 10px 0;}
                        h3{font-size:small;font-weight:bold;}
                        .description{font-size:x-small;font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;color:white; }
                        a {color:#1B1B1B; font-size:small;}
                        .linkbox{font-size:x-small;color:#1B1B1B;font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;} 

                        h2{font-size:small;font-weight:bold;font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;}
                        p.clear{clear:both;}
                        .heading{font-size:x-small;}
                        .heading strong{color:#940026;}
                        .box dd { clear:both; }
                        .box dl { position:relative; }
                        dl.caption {float:left;overflow:hidden;position:relative;margin: 0 10px 12px -40px;}
                        .caption dd p,
                        .caption dt img { margin-right: 0;margin-bottom: 0;}
                        .caption dt img {float: left;}
                        .caption dd {width: 100%;bottom: -1px;position: absolute;}
                        .caption dd .description {z-index: 2;margin-left: 0px;padding: 3px 4px;position: relative;}
                        .caption dd .background {top: 0;left: 0;width: 100%;height: 100%;filter: alpha(opacity=70);opacity: 0.7;z-index: 1;position: absolute;background-color: black;}
                        '''
                         
    feeds          =    [
                            (u'Actualité', u'http://letemps.ch/rss/site/'),
                            (u'Actualité - Monde', u'http://letemps.ch/rss/site/actualite/monde'), 
                            (u'Actualité - Suisse & régions', u'http://letemps.ch/rss/site/actualite/suisse_regions'), 
                            (u'Actualité - Sport', u'http://letemps.ch/rss/site/actualite/sports'), 
                            (u'Actualité - Sciences & Environnement', u'http://letemps.ch/rss/site/actualite/sciences_environnement'), 
                            (u'Actualité - Multimédia', u'http://letemps.ch/rss/site/actualite/multimedia'), 
                            (u'Actualité - Société', u'http://letemps.ch/rss/site/actualite/societe'), 
                            (u'Actualité - Société | Quoi de neuf', u'http://letemps.ch/rss/site/actualite/societe/quoi_de_neuf'), 
                            (u'Economie & Finance', u'http://letemps.ch/rss/site/economie_finance'), 
                            (u'Economie & Finance - Finance', u'http://letemps.ch/rss/site/economie_finance/finance'), 
                            (u'Economie & Finance - Fonds de placement', u'http://letemps.ch/rss/site/economie_finance/fonds_placement'), 
                            (u'Economie & Finance - Carrières', u'http://letemps.ch/rss/site/economie_finance/carrieres'), 
                            (u'Culture', u'http://letemps.ch/rss/site/culture'), 
                            (u'Culture - Cinémas', u'http://letemps.ch/rss/site/culture/cinema'), 
                            (u'Culture - Musiques', u'http://letemps.ch/rss/site/culture/musiques'), 
                            (u'Culture - Scènes', u'http://letemps.ch/rss/site/culture/scenes'), 
                            (u'Culture - Arts plastiques', u'http://letemps.ch/rss/site/culture/arts_plastiques'), 
                            (u'Culture - Livres', u'http://letemps.ch/rss/site/culture/livres'), 
                            (u'Lifestyle - Luxe', u'http://letemps.ch/rss/site/lifestyle/luxe'), 
                            (u'Lifestyle - Mode', u'http://letemps.ch/rss/site/lifestyle/mode'), 
                            (u'Lifestyle - Horlogerie & Joaillerie', u'http://letemps.ch/rss/site/lifestyle/horlogerie_joaillerie'), 
                            (u'Lifestyle - Design', u'http://letemps.ch/rss/site/lifestyle/design'), 
                            (u'Lifestyle - Voyages', u'http://letemps.ch/rss/site/lifestyle/voyages'), 
                            (u'Lifestyle - Gastronomie', u'http://letemps.ch/rss/site/lifestyle/gastronomie'), 
                            (u'Lifestyle - Architecture & Immobilier', u'http://letemps.ch/rss/site/lifestyle/architecture_immobilier'), 
                            (u'Lifestyle - Automobile', u'http://letemps.ch/rss/site/lifestyle/automobile'), 
                            (u'Opinions', u'http://letemps.ch/rss/site/opinions'), 
                            (u'Opinions - Editoriaux', u'http://letemps.ch/rss/site/opinions/editoriaux'), 
                            (u'Opinions - Invités', u'http://letemps.ch/rss/site/opinions/invites'), 
                            (u'Opinions - Chroniques', u'http://letemps.ch/rss/site/opinions/chroniques'), 
                            (u'Opinions - Chappatte', u'http://letemps.ch/rss/site/opinions/chappatte')
                        ]

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            del feed.description    # The title says it all and the description has bad characters for "Le Temps"
        return feeds

    def postprocess_html(self, soup, first):
        for tag in soup.findAll('div', attrs = {'class':'box pagination'}):
            tag.extract()
        if not first:
            h = soup.find('h1')
            if h is not None:
                h.extract()
        print(soup.prettify())
        return soup

Also, if you have questions about my code, please feel free to post them below.

I also suggest that, after reviewing my code, Kovid replace the built-in recipe with this one.

Roland Kessi
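
The feedsportal handling in get_article_url can also be tried outside calibre. The following is a minimal sketch, not the code from the recipe: it assumes the four escape codes used in the recipe (0A, 0B, 0C, 0E) are the only ones that matter, and the names decode_feedsportal_url, FEEDSPORTAL_CODES and MARKER, as well as the sample link in the note below, are hypothetical. It decodes in a single regex pass and returns the link unchanged when the letemps0Bch marker is missing.

```python
import re

# Escape codes taken from the recipe; feedsportal may use others not listed here.
FEEDSPORTAL_CODES = {'0A': '0', '0B': '.', '0C': '/', '0E': '-'}
MARKER = 'letemps0Bch'


def decode_feedsportal_url(link):
    """Rebuild a letemps.ch URL from a feedsportal link, or return the link unchanged."""
    pos = link.find(MARKER)
    if pos == -1:
        # Not an encoded letemps.ch link: leave it alone instead of slicing from -1.
        return link
    encoded = link[pos:]
    # Decode in one pass so the output of one substitution is never re-read
    # as the start of another escape code. Unknown codes are left as they are.
    decoded = re.sub(
        r'0[A-Z]',
        lambda m: FEEDSPORTAL_CODES.get(m.group(0), m.group(0)),
        encoded,
    )
    return 'http://www.' + decoded
```

For instance, decode_feedsportal_url('http://rss.feedsportal.com/x/letemps0Bch0CPage0CUuid0Cab0E10A') returns 'http://www.letemps.ch/Page/Uuid/ab-10'. The single pass matters for a sequence such as 0AB: chained str.replace calls would first turn it into 0B and then into a dot, whereas the regex version yields 0B.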

kovidgoyal · 02-22-2014, 12:25 PM

https://github.com/kovidgoyal/calibr...ab165c5afd6e3c



