MobileRead Forums - View Single Post - How to convert newspaper which do not have RSS feed?

oneillpt · 03-08-2011, 03:45 PM

Quote:

Originally Posted by miwie

Two suggestions for improvement:

- Add masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif' (e.g. before the INDEX line). This adds the El Pais logo at the top of the feed overview (which contains only one feed in this case).
- Activate 'no_stylesheets = True' (some articles have '<style ...' after the article content, which otherwise gets included in the EPUB).

The name of the feed appears as "Unknown feed", which should be renamed somehow.

Good work!

I've added the masthead_url as suggested and activated the 'no_stylesheets = True' option, although the styles do not seem to make any noticeable difference in this case.

I've also addressed the "Unknown feed" name by substituting "Babelia Feed" whenever a section title is missing. The revised recipe, with logging for the section title and URL extraction, is now:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ElPaisBabelia(BasicNewsRecipe):

    title      = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'

    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        dict(name='div', attrs={'id':'utilidades'}),
        dict(name='div', attrs={'class':'info_relacionada'}),
        dict(name='div', attrs={'class':'mod_apoyo'}),
        dict(name='div', attrs={'class':'contorno_f'}),
        dict(name='div', attrs={'class':'pestanias'}),
        dict(name='div', attrs={'class':'otros_webs'}),
        dict(name='div', attrs={'id':'pie'})
        ]
    no_stylesheets = True
    remove_javascript     = True

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        feeds = []
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            self.log('section_title(1): ', section_title)
            # A section with no <h1> text would otherwise appear as "Unknown feed"
            if not section_title:
                section_title = "Babelia Feed"
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                # Only links starting with '/' are treated as articles
                if not url.startswith('/'):
                    continue
                url = 'http://www.elpais.es' + url
                title = self.tag_to_string(post)
                # Skip links whose <a> tag has no class attribute (or an empty one)
                klass = post.get('class')
                if klass:
                    self.log()
                    self.log('--> post:  ', post)
                    self.log('--> url:   ', url)
                    self.log('--> title: ', title)
                    self.log('--> class: ', klass)
                    articles.append({'title':title, 'url':url})
            # Sections that yielded no articles are left out of the index
            if articles:
                feeds.append((section_title, articles))
        return feeds
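
For anyone who wants to try the title fallback and the link filtering outside calibre, here is a rough sketch in plain Python. The helper names (section_title_or_default, absolute_article_url, build_feed) and the (href, text, class) tuples are made up for illustration and are not part of calibre. It also assumes that links should be resolved against INDEX with urljoin, so it produces www.elpais.com URLs, whereas the recipe above prepends 'http://www.elpais.es'.

```python
from urllib.parse import urljoin

# Same value as INDEX in the recipe; used here as the base for relative links.
BASE_URL = 'http://www.elpais.com/suple/babelia/'
DEFAULT_SECTION_TITLE = 'Babelia Feed'


def section_title_or_default(raw_title, default=DEFAULT_SECTION_TITLE):
    """Return the stripped title, or the default when it is missing or blank."""
    title = (raw_title or '').strip()
    return title or default


def absolute_article_url(href, base=BASE_URL):
    """Resolve an href that begins with '/'; return None for anything else."""
    if not href.startswith('/'):
        return None
    return urljoin(base, href)


def build_feed(raw_section_title, links):
    """Build one (section_title, articles) pair, the shape parse_index returns.

    links is an iterable of hypothetical (href, link_text, css_class) tuples.
    Returns None when no link qualifies, mirroring the `if articles:` check.
    """
    articles = []
    for href, text, css_class in links:
        url = absolute_article_url(href)
        # Keep only links starting with '/' that also carry a non-empty class
        if url and css_class:
            articles.append({'title': text.strip(), 'url': url})
    if not articles:
        return None
    return (section_title_or_default(raw_section_title), articles)
```

Calling build_feed once per section and collecting the non-None results gives the same list-of-tuples structure that parse_index hands back to calibre.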