Old 01-30-2014, 05:47 AM   #1
rk43
Junior Member
rk43 began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Jan 2014
Location: Switzerland
Device: Android tablet - Aldiko
Need help with le_temps.recipe

I cannot get the builtin recipe for "Le Temps" to work properly.

Perhaps I am doing something stupid, as I had never tried the news part of calibre before. However, I did get some other recipes to work properly.

I noticed that the structure of the pages has probably changed since the recipe was created (2009). I therefore tried to adjust the recipe, but had no success at all.

I also noticed that the RSS feeds from "Le Temps" are redirected to rss.feedsportal.com, a site providing statistical information to online publishers.

I then started to suspect a login problem. Note that the login through get_browser seems to work: at least, I receive the proper error message if I provide a wrong password.

I therefore compared the EPUBs produced with and without login (I set needs_subscription to False and removed the get_browser override). I was very surprised to see that the two EPUBs are exactly the same. The table of contents is built properly, but the pages themselves contain only the title and the teaser, plus some icons probably added by rss.feedsportal.com (see the attached EPUB).

I suspect, but have no proof, that the cookies returned by the login process are not stored properly, perhaps with the wrong url.

Now I don't know what to do next or where to look. Any help would be greatly appreciated.

R. Kessi
rk43 is offline   Reply With Quote
Old 01-30-2014, 07:35 AM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 45,296
Karma: 27111240
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Remove keep_only_tags and remove_tags, and add use_embedded_content = False.

Then you should get the full page downloads, which you can clean up by adding keep_only_tags and remove_tags back one by one.
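
As an illustration of that advice, here is a minimal sketch of a stripped-down recipe for debugging. The class name LeTempsDebug and the single feed entry are assumptions chosen for the example, and the fallback base class exists only so the snippet can be loaded outside calibre; inside calibre the real BasicNewsRecipe is used.

```python
# -*- coding: utf-8 -*-
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be imported outside calibre (assumption,
    # not part of calibre itself).
    class BasicNewsRecipe(object):
        pass


class LeTempsDebug(BasicNewsRecipe):  # hypothetical name for a test recipe
    title = u'Le Temps (debug)'
    language = 'fr'
    # Do not use the teaser embedded in the feed; download the linked page.
    use_embedded_content = False
    # keep_only_tags and remove_tags are deliberately not defined here.
    # Once full pages arrive, add them back one at a time and re-run.
    feeds = [(u'Actualité', u'http://letemps.ch/rss/site/')]
```

With full pages downloading, each cleanup rule can then be reintroduced separately, so it is obvious which one removes too much.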
kovidgoyal is offline   Reply With Quote
Old 01-30-2014, 09:02 AM   #3
rk43
Junior Member
rk43 began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Jan 2014
Location: Switzerland
Device: Android tablet - Aldiko
Thank you very much Kovid for your fast response.

I had already removed keep_only_tags and remove_tags. However, I had completely missed use_embedded_content = False.

Now I get the pages. Next week I hope to find some time to work on the page formatting. When I have something I am happy with, I will post the modified recipe here.

Last edited by rk43; 01-30-2014 at 09:11 AM.
rk43 is offline   Reply With Quote
Old 02-22-2014, 08:57 AM   #4
rk43
Junior Member
rk43 began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Jan 2014
Location: Switzerland
Device: Android tablet - Aldiko
A new version of le_temps.recipe

I have made some progress in fixing le_temps.recipe. It can certainly still be improved, but this version works fine for me, so I am submitting it here.

If anybody has suggestions to improve it, I would be glad to hear them.

Spoiler:
Code:
#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from __future__ import with_statement

__license__   = 'GPL v3'
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
__docformat__ = 'restructuredtext en'

#-------------------------------
#   Modified by Roland Kessi - February 2014
#-------------------------------

from calibre.web.feeds.news import BasicNewsRecipe

class LeTemps(BasicNewsRecipe):
    title          = u'Le Temps'
    oldest_article = 7
    max_articles_per_feed = 100
    __author__ = 'Kovid Goyal'
    description = 'French news. Needs a subscription from http://www.letemps.ch'
    no_stylesheets = True
    remove_javascript = True
    recursions = 1
    encoding = 'UTF-8'
    match_regexps = [r'http://www.letemps.ch/Page/Uuid/[-0-9a-f]+\|[1-9]']     
    language = 'fr'
    needs_subscription = True
    simultaneous_downloads = 5
    use_embedded_content = False
    remove_empty_feeds = True
    
    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        br.open('http://www.letemps.ch/login')
        br.select_form(nr=1)
        br['username'] = self.username
        br['password'] = self.password
        raw = br.submit().read()
        if '>Login' in raw:
            raise ValueError('Failed to log in to letemps.ch. Check '
                'your username and password')
        return br

    def get_article_url(self, article):
        '''
            Override in a subclass to customize extraction of the :term:`URL` that points
            to the content for each article. Return the
            article URL. It is called with `article`, an object representing a parsed article
            from a feed. See `feedparser <http://packages.python.org/feedparser/>`_.
            By default it looks for the original link (for feeds syndicated via a
            service like feedburner or pheedo) and if found,
            returns that or else returns
            `article.link <http://packages.python.org/feedparser/reference-entry-link.html>`_.
        '''
        #=======================================================================
        # Avoid going through http://rss.feedsportal.com/...
        #=======================================================================
        for key in article.keys():
            if key.endswith('_origlink'):
                url = article[key]
                if url and url.startswith('http://'):
                    print ('Url is :', url)
                    return url
        ans = article.get('link', None)
        if not ans and getattr(article, 'links', None):
            for item in article.links:
                if item.get('rel', 'alternate') == 'alternate':
                    ans = item['href']
                    break
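        if not ans:
            return None
        # The feed links go through feedsportal, which embeds the real URL in
        # escaped form ('0B' for '.', '0C' for '/', ...). Rebuild the direct URL.
        pos = ans.find('letemps0Bch')
        if pos < 0:
            return ans    # no escaped letemps.ch URL found; use the link as is
        ans = 'http://www.' + ans[pos:]
        ans = ans.replace('0A', '0')
        ans = ans.replace('0B', '.')
        ans = ans.replace('0C', '/')
        ans = ans.replace('0E', '-')
        return ans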


    keep_only_tags =    [
                            dict(name='div', attrs={'id':'content'}),
                        ]
    remove_tags    =    [
                            dict(name='div', attrs={'id':'html5_gallery'}),
                            dict(name='ul', attrs={'class':['tabs']}),
                            dict(name='img', attrs={'class':['bigImg']}),
                            dict(name='div', attrs={'class':['box function','contentInserts','box banner','box additional','galleryOverview','position','rightAd','bottomAd','video',]}),
                        ]
     
    extra_css       =   '''
                        h1{font-family:"Georgia","Times New Roman",Times,serif;font-size:large;}
                        .headline{font-family:"Georgia","Times New Roman",Times,serif;font-size:large;color:#990000;}
                        .summary_gal{color:#777777;font-family:"Georgia","Times New Roman",Times,serif;font-size:x-small;}
                        #capt{color:#1B1B1B;font-family:"Georgia","Times New Roman",Times,serif;font-size:x-small;}
                        #content{font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;}
                        .box.article.important{font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;}
                        #h2 {font-size: 24px; line-height: 25px; margin-bottom: 14px; text-transform:uppercase;}
                        .author {font-size:x-small; margin: 0 0 5px 0; color:#797971; font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;}
                        .lead {font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;font-weight: bold; margin: 10px 0;font-size:small;}
                        p {margin: 0 0 10px 0;}
                        h3{font-size:small;font-weight:bold;}
                        .description{font-size:x-small;font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;color:white; }
                        a {color:#1B1B1B; font-size:small;}
                        .linkbox{font-size:x-small;color:#1B1B1B;font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;} 

                        h2{font-size:small;font-weight:bold;font-family:"Lucida Grande","Lucida Sans Unicode",Arial,Verdana,sans-serif;}
                        p.clear{clear:both;}
                        .heading{font-size:x-small;}
                        .heading strong{color:#940026;}
                        .box dd { clear:both; }
                        .box dl { position:relative; }
                        dl.caption {float:left;overflow:hidden;position:relative;margin: 0 10px 12px -40px;}
                        .caption dd p,
                        .caption dt img { margin-right: 0;margin-bottom: 0;}
                        .caption dt img {float: left;}
                        .caption dd {width: 100%;bottom: -1px;position: absolute;}
                        .caption dd .description {z-index: 2;margin-left: 0px;padding: 3px 4px;position: relative;}
                        .caption dd .background {top: 0;left: 0;width: 100%;height: 100%;filter: alpha(opacity=70);opacity: 0.7;z-index: 1;position: absolute;background-color: black;}
                        '''
                         
    feeds          =    [
                            (u'Actualité', u'http://letemps.ch/rss/site/'),
                            (u'Actualité - Monde', u'http://letemps.ch/rss/site/actualite/monde'), 
                            (u'Actualité - Suisse & régions', u'http://letemps.ch/rss/site/actualite/suisse_regions'), 
                            (u'Actualité - Sport', u'http://letemps.ch/rss/site/actualite/sports'), 
                            (u'Actualité - Sciences & Environnement', u'http://letemps.ch/rss/site/actualite/sciences_environnement'), 
                            (u'Actualité - Multimédia', u'http://letemps.ch/rss/site/actualite/multimedia'), 
                            (u'Actualité - Société', u'http://letemps.ch/rss/site/actualite/societe'), 
                            (u'Actualité - Société | Quoi de neuf', u'http://letemps.ch/rss/site/actualite/societe/quoi_de_neuf'), 
                            (u'Economie & Finance', u'http://letemps.ch/rss/site/economie_finance'), 
                            (u'Economie & Finance - Finance', u'http://letemps.ch/rss/site/economie_finance/finance'), 
                            (u'Economie & Finance - Fonds de placement', u'http://letemps.ch/rss/site/economie_finance/fonds_placement'), 
                            (u'Economie & Finance - Carrières', u'http://letemps.ch/rss/site/economie_finance/carrieres'), 
                            (u'Culture', u'http://letemps.ch/rss/site/culture'), 
                            (u'Culture - Cinémas', u'http://letemps.ch/rss/site/culture/cinema'), 
                            (u'Culture - Musiques', u'http://letemps.ch/rss/site/culture/musiques'), 
                            (u'Culture - Scènes', u'http://letemps.ch/rss/site/culture/scenes'), 
                            (u'Culture - Arts plastiques', u'http://letemps.ch/rss/site/culture/arts_plastiques'), 
                            (u'Culture - Livres', u'http://letemps.ch/rss/site/culture/livres'), 
                            (u'Lifestyle - Luxe', u'http://letemps.ch/rss/site/lifestyle/luxe'), 
                            (u'Lifestyle - Mode', u'http://letemps.ch/rss/site/lifestyle/mode'), 
                            (u'Lifestyle - Horlogerie & Joaillerie', u'http://letemps.ch/rss/site/lifestyle/horlogerie_joaillerie'), 
                            (u'Lifestyle - Design', u'http://letemps.ch/rss/site/lifestyle/design'), 
                            (u'Lifestyle - Voyages', u'http://letemps.ch/rss/site/lifestyle/voyages'), 
                            (u'Lifestyle - Gastronomie', u'http://letemps.ch/rss/site/lifestyle/gastronomie'), 
                            (u'Lifestyle - Architecture & Immobilier', u'http://letemps.ch/rss/site/lifestyle/architecture_immobilier'), 
                            (u'Lifestyle - Automobile', u'http://letemps.ch/rss/site/lifestyle/automobile'), 
                            (u'Opinions', u'http://letemps.ch/rss/site/opinions'), 
                            (u'Opinions - Editoriaux', u'http://letemps.ch/rss/site/opinions/editoriaux'), 
                            (u'Opinions - Invités', u'http://letemps.ch/rss/site/opinions/invites'), 
                            (u'Opinions - Chroniques', u'http://letemps.ch/rss/site/opinions/chroniques'), 
                            (u'Opinions - Chappatte', u'http://letemps.ch/rss/site/opinions/chappatte')
                        ]

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            del feed.description    # The title says it all, and the description has bad characters for "Le Temps"
        return feeds

    def postprocess_html(self, soup, first):
        for tag in soup.findAll('div', attrs = {'class':'box pagination'}):
            tag.extract()
        if not first:
            h = soup.find('h1')
            if h is not None:
                h.extract()
        print(soup.prettify())
        return soup
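
One possible refinement of the URL decoding in get_article_url: chained str.replace calls can misdecode a link, because the '0' produced by decoding '0A' may combine with the next character and be decoded a second time. The sketch below decodes all escapes in a single left-to-right pass. The function name decode_feedsportal and the sample link are hypothetical, and the escape table contains only the four codes used in the recipe; feedsportal may well use others.

```python
import re

# Escape codes taken from the recipe above; this table is probably incomplete.
ESCAPES = {'0A': '0', '0B': '.', '0C': '/', '0E': '-'}
_ESCAPE_RE = re.compile('|'.join(ESCAPES))


def decode_feedsportal(link, marker='letemps0Bch'):
    """Return the direct letemps.ch URL hidden in a feedsportal link.

    Hypothetical helper: if the marker is not present, the link is
    returned unchanged.
    """
    pos = link.find(marker)
    if pos < 0:
        return link
    # Each escape is replaced exactly once, scanning left to right, so a
    # decoded '0' is never re-read as the start of another escape.
    decoded = _ESCAPE_RE.sub(lambda m: ESCAPES[m.group(0)], link[pos:])
    return 'http://www.' + decoded


# Made-up link, for illustration only.
sample = 'http://rss.example/x/letemps0Bch0CPage0CUuid0C10AB0E2'
print(decode_feedsportal(sample))
```

In the sample, '0A' followed by 'B' decodes to '0B' as literal text; the chained replace calls would then turn that '0B' into a dot.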


Also, if you have questions about my code, please feel free to post them below.

I also suggest that, after reviewing my code, Kovid replace the built-in recipe with this one.

Roland Kessi
rk43 is offline   Reply With Quote
Old 02-22-2014, 12:25 PM   #5
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 45,296
Karma: 27111240
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
https://github.com/kovidgoyal/calibr...ab165c5afd6e3c
kovidgoyal is offline   Reply With Quote