Guys, I finally managed to produce a working recipe for the printed edition of the famous Brazilian newspaper Folha de São Paulo. I urge everyone to provide feedback and help me tackle the long list of pending issues.
What does the recipe currently do?
1. Logs in using a UOL login.
2. Recognizes sections.
3. Downloads all articles from the current edition and assigns them to the correct section.
To do list:
1. It takes 15 minutes to run the recipe. Can we improve its speed?
2. Section names come from <a name=""> attributes and are sometimes truncated (e.g. Ilustrada is shown as ilustra). Should be easy to fix with a dictionary.
3. Get rid of the copyright footer and the “Texto Anterior” and “Próximo Texto” bits.
4. General beautification/cleanup of the articles.
5. Get the publication date and use it appropriately.
6. Get masthead. DONE
7. Find the current cover and use it as cover object.
8. Fix the name to Folha de São Paulo, with ~. DONE
9. Currently works for UOL subscribers. Ideally, should also work for FOLHA subscribers.
10. Allow users to decide which sections they want to download (e.g. never download Campinas, Ribeirão, Comida).
11. The first three articles are usually “capa” (which is the website cover), “fac-simile da capa” (which is the actual newspaper front-page) and “arquivo”. Decide what to do with those.
12. Error message if login/password is wrong.
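Regarding item 2, a minimal sketch of the dictionary approach. Note that the truncated keys below (other than ilustra) are hypothetical examples; the real anchor values would need to be collected from the index page:

```python
# Map truncated <a name=""> anchor values to full section names.
# All keys except 'ilustra' are hypothetical examples.
section_dict = {'ilustra': u'Ilustrada',
                'cotidian': u'Cotidiano',
                'esport': u'Esporte'}

def fix_section_title(anchor_name):
    # Fall back to the raw anchor name when there is no mapping.
    return section_dict.get(anchor_name, anchor_name)
```

In parse_index, `section_title = post['name']` would then become `section_title = fix_section_title(post['name'])`.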
Having said all that, I am glad it works the way it currently is.
Fellow Brazilians, send your feedback and lend a hand.
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class FSP(BasicNewsRecipe):

    title = u'Folha de S\xE3o Paulo - Printed Edition'
    __author__ = 'fluzao'
    description = u'Folha de S\xE3o Paulo - Printed Edition (UOL subscription required)'
    INDEX = 'http://www1.folha.uol.com.br/fsp/indices/'
    language = 'pt'
    no_stylesheets = True
    max_articles_per_feed = 30
    remove_javascript = True
    needs_subscription = True
    remove_tags_before = dict(name='b')
    remove_tags_after = dict(name='!--/NOTICIA--')
    remove_attributes = ['height', 'width']
    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'

    # this solves the problem with truncated content in Kindle
    conversion_options = {'linearize_tables': True}

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('https://acesso.uol.com.br/login.html')
            br.form = br.forms().next()
            br['user'] = self.username
            br['pass'] = self.password
            raw = br.submit().read()
            ## if 'Please try again' in raw:
            ##     raise Exception('Your username and password are incorrect')
        return br

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        cover = None  # to-do item 7: cover is not handled yet
        feeds = []
        articles = []
        section_title = "Preambulo"
        for post in soup.findAll('a'):
            # an <a name=...> anchor marks the start of a new section
            strpost = str(post)
            if strpost.startswith('<a name'):
                if articles:
                    feeds.append((section_title, articles))
                    self.log()
                    self.log('--> new section found, creating old section feed: ', section_title)
                section_title = post['name']
                articles = []
                self.log('--> new section title: ', section_title)
            elif strpost.startswith('<a href'):
                url = post['href']
                if url.startswith('/fsp'):
                    url = 'http://www1.folha.uol.com.br' + url
                title = self.tag_to_string(post)
                self.log()
                self.log('--> post: ', post)
                self.log('--> url: ', url)
                self.log('--> title: ', title)
                articles.append({'title': title, 'url': url})
        # flush the last section
        if articles:
            feeds.append((section_title, articles))
        return feeds
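For item 12, the commented-out check in get_browser could be revived along these lines. Note that 'Senha incorreta' is only an assumed failure marker; the actual text that acesso.uol.com.br returns on a bad login still needs to be confirmed:

```python
# Sketch for to-do item 12: fail loudly on a wrong login/password.
# FAILURE_MARKER is an assumption; inspect the real response first.
FAILURE_MARKER = 'Senha incorreta'

def check_login(raw):
    # raw is the HTML returned by br.submit().read() in get_browser
    if FAILURE_MARKER in raw:
        raise ValueError('Login failed: check your UOL username and password.')
    return raw
```

In get_browser, `raw = br.submit().read()` would then be followed by `check_login(raw)`.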