Quote:
Originally Posted by miwie
Two suggestions for improvement:
- Add masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
(e.g. before the INDEX line). This adds the logo of El Pais at the top
of the feed overview (which contains only one feed in this case)
- Activate 'no_stylesheets = True' (there are articles with '<style ...'
after the article content which gets included in the EPUB otherwise)
The name of the feed appears as "Unknown feed", which should be renamed somehow.
Good work!
I've added the masthead_url as suggested and enabled 'no_stylesheets = True', although the stylesheets do not seem to make any noticeable difference in this case.
I've also addressed the "Unknown feed" name by substituting "Babelia Feed" whenever a section has no title. The revised recipe, with logging for the section title and URL extraction, is now:
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class ElPaisBabelia(BasicNewsRecipe):

    title = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    masthead_url = 'http://www.elpais.com/im/tit_logo_int.gif'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'

    # Keep only the two-column article container and drop the page furniture.
    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
                   dict(name='div', attrs={'id':'utilidades'}),
                   dict(name='div', attrs={'class':'info_relacionada'}),
                   dict(name='div', attrs={'class':'mod_apoyo'}),
                   dict(name='div', attrs={'class':'contorno_f'}),
                   dict(name='div', attrs={'class':'pestanias'}),
                   dict(name='div', attrs={'class':'otros_webs'}),
                   dict(name='div', attrs={'id':'pie'})
                   ]

    no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        feeds = []
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            self.log('section_title(1): ', section_title)
            # Sections without an <h1> would otherwise show up as "Unknown feed".
            if section_title == "":
                section_title = "Babelia Feed"
            self.log('section_title(2): ', section_title)
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                # Only site-relative links that carry a class attribute are kept.
                if url.startswith('/'):
                    url = 'http://www.elpais.es'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
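
For anyone who wants to try the two small pieces of logic outside calibre, here is a minimal sketch of the fallback title and the link resolution as standalone helpers. The function names and the example path are hypothetical and not part of calibre or the recipe. Note one assumption that differs from the recipe: links are resolved against INDEX with urljoin rather than by prepending a hard-coded host.

```python
from urllib.parse import urljoin

# Assumption: the same index page the recipe uses as INDEX.
INDEX = 'http://www.elpais.com/suple/babelia/'


def section_title_or_default(raw_title, default='Babelia Feed'):
    """Return the stripped title, or the default when it is missing or empty."""
    title = (raw_title or '').strip()
    return title or default


def absolute_url(href, base=INDEX):
    """Resolve a site-relative href against the index page URL."""
    return urljoin(base, href)


if __name__ == '__main__':
    # '/articulo/ejemplo.html' is a made-up path used only for illustration.
    print(section_title_or_default(''))
    print(absolute_url('/articulo/ejemplo.html'))
```

Using urljoin also leaves links that are already absolute untouched, so the startswith('/') check would not be needed for the URL itself, only for filtering.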