MobileRead Forums - View Single Post

Bortolotto · 06-15-2011, 11:49 PM

Hi BRGriff!!

First of all, I want to say a big "Thank you!!".

Considering your first reply, I made a new recipe (below).

The new version now takes around 4 minutes to fetch the feeds and create the MOBI output, which is much better than the first version.

So I believe this new recipe can be useful for everyone who reads Brazilian Portuguese (not Spanish).

The RSS source is a well-known Brazilian news portal called R7.com.
It belongs to a broadcasting corporation called "Rede Record".

Code:

import re

from calibre.web.feeds.news import BasicNewsRecipe

class PortalR7(BasicNewsRecipe):
    title                  = 'Noticias R7'
    __author__             = 'Diniz Bortolotto'
    description            = 'Noticias Portal R7'
    oldest_article         = 2
    max_articles_per_feed  = 20
    encoding               = 'utf8'
    publisher              = 'Rede Record'
    category               = 'news, Brazil'
    language               = 'pt_BR'
    publication_type       = 'newsportal'
    use_embedded_content   = False
    no_stylesheets         = True
    remove_javascript      = True
    remove_attributes      = ['style']

    feeds                  = [
                              (u'Brasil', u'http://www.r7.com/data/rss/brasil.xml'), 
                              (u'Economia', u'http://www.r7.com/data/rss/economia.xml'), 
                              (u'Internacional', u'http://www.r7.com/data/rss/internacional.xml'), 
                              (u'Tecnologia e Ci\xeancia', u'http://www.r7.com/data/rss/tecnologiaCiencia.xml')
                             ]
    reverse_article_order  = True

    keep_only_tags         = [dict(name='div', attrs={'class':'materia'})]
    remove_tags            = [
                              dict(id=['espalhe', 'report-erro']),
                              dict(name='ul', attrs={'class':'controles'}),
                              dict(name='ul', attrs={'class':'relacionados'}),
                              dict(name='div', attrs={'class':'materia_banner'}),
                              dict(name='div', attrs={'class':'materia_controles'})
                             ]

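    # Drop everything between the opening of the article container and its
    # header block; the non-greedy .*? stops at the first header found.
    preprocess_regexps     = [
                              (re.compile(r'<div class="materia">.*?<div class="materia_cabecalho">',re.DOTALL|re.IGNORECASE),
                              lambda match: '<div class="materia"><div class="materia_cabecalho">')
                             ]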
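
For anyone who wants to check what the preprocess_regexps entry does without running a full fetch, here is a minimal sketch in plain Python. It uses a pattern of the same shape as the one in the recipe, written here with a non-greedy `.*?` (an assumption about the intent). The sample HTML, including the "publicidade" div, and the helper name strip_before_header are made up for illustration and are not taken from R7.com or from calibre.

```python
import re

# Same shape as the recipe's pattern; the non-greedy .*? is an assumption.
PATTERN = re.compile(
    r'<div class="materia">.*?<div class="materia_cabecalho">',
    re.DOTALL | re.IGNORECASE,
)
REPLACEMENT = '<div class="materia"><div class="materia_cabecalho">'


def strip_before_header(html):
    """Remove whatever sits between the article container and its header."""
    return PATTERN.sub(lambda match: REPLACEMENT, html)


# Hypothetical page fragment, invented for this example.
SAMPLE = (
    '<div class="materia">\n'
    '  <div class="publicidade">banner</div>\n'
    '  <div class="materia_cabecalho"><h1>Titulo</h1></div>\n'
    '  <p>Texto da noticia.</p>\n'
    '</div>'
)

cleaned = strip_before_header(SAMPLE)
print(cleaned)
```

The printed result should no longer contain the banner div, while the header and the article text stay in place. re.DOTALL matters here because the unwanted markup usually spans several lines.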

What do you think about my new recipe?