Old 03-08-2011, 03:45 PM   #11
oneillpt
Quote:
Originally Posted by miwie
Two suggestions for improvement:
  1. Add masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    (e.g. before the INDEX line). This adds the logo of El Pais at the top
    of the feed overview (which contains only one feed in this case)
  2. Activate 'no_stylesheets = True' (there are articles with '<style ...'
    after the article content which gets included in the EPUB otherwise)
The name of the feed appears as "Unknown feed", which should be renamed somehow.

Good work!
I've added the masthead_url as suggested and activated the 'no_stylesheets = True' option, although the styles do not seem to make any noticeable difference in this case.

I've also addressed the "Unknown feed" name by replacing a missing section title with "Babelia Feed". The revised recipe, with logging for the section title and URL extraction, is now:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ElPaisBabelia(BasicNewsRecipe):

    title      = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'

    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        dict(name='div', attrs={'id':'utilidades'}),
        dict(name='div', attrs={'class':'info_relacionada'}),
        dict(name='div', attrs={'class':'mod_apoyo'}),
        dict(name='div', attrs={'class':'contorno_f'}),
        dict(name='div', attrs={'class':'pestanias'}),
        dict(name='div', attrs={'class':'otros_webs'}),
        dict(name='div', attrs={'id':'pie'})
        ]
    no_stylesheets = True
    remove_javascript     = True

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        feeds = []
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            self.log('section_title(1): ', section_title)
            # A section without a title would otherwise show as "Unknown feed"
            if section_title == "":
                section_title = "Babelia Feed"
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                # Only site-relative links are treated as articles
                if url.startswith('/'):
                    url = 'http://www.elpais.es' + url
                    title = self.tag_to_string(post)
                    # Keep only links that carry a non-empty class attribute;
                    # get() avoids a KeyError when the attribute is absent
                    klass = post.get('class')
                    if klass:
                        self.log()
                        self.log('--> post:  ', post)
                        self.log('--> url:   ', url)
                        self.log('--> title: ', title)
                        self.log('--> class: ', klass)
                        articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
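
For anyone who wants to check the two small rules used in parse_index without running calibre, here is a minimal standalone sketch. The helper names section_title_or_default and absolute_article_url are hypothetical and not part of the recipe or of calibre; they only mirror the title fallback and the relative-link handling shown above, using the same 'http://www.elpais.es' prefix as the recipe.

```python
def section_title_or_default(raw_title, default="Babelia Feed"):
    """Return the section title, or the default when the title is empty.

    Hypothetical helper: mirrors the 'section_title == ""' check in the
    recipe (and also treats None as missing).
    """
    return raw_title if raw_title else default


def absolute_article_url(href, base="http://www.elpais.es"):
    """Return an absolute URL for a site-relative href, otherwise None.

    Hypothetical helper: mirrors the recipe, which only keeps links whose
    href starts with '/' and prefixes them with the site address.
    """
    if href.startswith("/"):
        return base + href
    return None


if __name__ == "__main__":
    print(section_title_or_default(""))
    print(absolute_article_url("/articulo/ejemplo.html"))
```

Links that are already absolute return None here, matching the recipe, which skips any href that does not start with '/'.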