Quote:
Originally Posted by oneillpt
The replacement of content by boilerplate editorial disclaimers for some articles seems to be due to use of auto_cleanup. Try the version below where this is disabled. I have used keep_only_tags and remove_tags instead. (The unicode accented characters in the title caused problems for me. Put them back if they work for you)
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1365070687(BasicNewsRecipe):
    title = u'Diario de Noticias'
    oldest_article = 7
    max_articles_per_feed = 100
    # auto_cleanup is disabled: it seemed to replace article text with boilerplate disclaimers
    #auto_cleanup = True
    # Keep only the main article container
    keep_only_tags = [dict(name='div', attrs={'id': 'cln-esqmid'})]
    # Drop the tools table inside that container
    remove_tags = [dict(name='table', attrs={'class': 'TabFerramentasInf'})]
    feeds = [
        (u'Portugal', u'http://feeds.dn.pt/DN-Portugal'),
        (u'Globo', u'http://feeds.dn.pt/DN-Globo'),
        (u'Economia', u'http://feeds.dn.pt/DN-Economia'),
        (u'Ci\xeancia', u'http://feeds.dn.pt/DN-Ciencia'),
        (u'Artes', u'http://feeds.dn.pt/DN-Artes'),
        (u'TV & Media', u'http://feeds.dn.pt/DN-Media'),
        (u'Opini\xe3o', u'http://feeds.dn.pt/DN-Opiniao'),
        (u'Pessoas', u'http://feeds.dn.pt/DN-Pessoas'),
    ]
Thanks,
All text is extracted now.
Several other sections could also be added, though I personally do not use them:
Desporto:
http://feeds.dn.pt/DN-Desporto
Cartaz:
http://feeds.dn.pt/DN-Cartaz
Política:
http://feeds.dn.pt/DN-Politica
Gente:
http://feeds.dn.pt/DN-Gente
Galerias:
http://feeds.dn.pt/DN-Galeria
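For anyone who does want them, here is a rough sketch of how the sections listed above could be appended to the feeds list of the quoted recipe. The section titles are simply the labels from my list (an assumption, rename them as you like) and the URLs are the ones given above. I have written it as plain Python so it can be checked without calibre, and it is untested against the live feeds.

```python
# Feeds already present in the quoted recipe.
feeds = [
    (u'Portugal', u'http://feeds.dn.pt/DN-Portugal'),
    (u'Globo', u'http://feeds.dn.pt/DN-Globo'),
    (u'Economia', u'http://feeds.dn.pt/DN-Economia'),
    (u'Ci\xeancia', u'http://feeds.dn.pt/DN-Ciencia'),
    (u'Artes', u'http://feeds.dn.pt/DN-Artes'),
    (u'TV & Media', u'http://feeds.dn.pt/DN-Media'),
    (u'Opini\xe3o', u'http://feeds.dn.pt/DN-Opiniao'),
    (u'Pessoas', u'http://feeds.dn.pt/DN-Pessoas'),
]

# Optional extra sections. Titles are illustrative labels (assumption);
# URLs are the ones listed in this post.
extra_feeds = [
    (u'Desporto', u'http://feeds.dn.pt/DN-Desporto'),
    (u'Cartaz', u'http://feeds.dn.pt/DN-Cartaz'),
    (u'Pol\xedtica', u'http://feeds.dn.pt/DN-Politica'),
    (u'Gente', u'http://feeds.dn.pt/DN-Gente'),
    (u'Galerias', u'http://feeds.dn.pt/DN-Galeria'),
]

# Each entry is a (title, url) pair; the extras go after the existing sections.
feeds = feeds + extra_feeds
```

In the recipe itself the combined list would be assigned to the feeds attribute inside the class, in place of the shorter list.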
Side note: the terms of use for this newspaper's feeds are here:
http://www.dn.pt/info/termosdeuso.aspx
José Pinto