04-05-2013, 06:26 AM   #3
josepinto
Quote:
Originally Posted by oneillpt
The replacement of content by boilerplate editorial disclaimers for some articles seems to be due to the use of auto_cleanup. Try the version below, where it is disabled; I have used keep_only_tags and remove_tags instead. (The Unicode accented characters in the title caused problems for me. Put them back if they work for you.)

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1365070687(BasicNewsRecipe):
  title = u'Diario de Noticias'
  oldest_article = 7
  max_articles_per_feed = 100
  # auto_cleanup is left disabled: it seemed to replace some articles with boilerplate disclaimers
  #auto_cleanup = True
  # Keep only the div that holds the article content
  keep_only_tags = [dict(name='div', attrs={'id':'cln-esqmid'})]
  # Drop the TabFerramentasInf table from the kept content
  remove_tags = [dict(name='table', attrs={'class':'TabFerramentasInf'})]

  # (section title, feed URL) pairs
  feeds = [
    (u'Portugal', u'http://feeds.dn.pt/DN-Portugal'),
    (u'Globo', u'http://feeds.dn.pt/DN-Globo'),
    (u'Economia', u'http://feeds.dn.pt/DN-Economia'),
    (u'Ci\xeancia', u'http://feeds.dn.pt/DN-Ciencia'),
    (u'Artes', u'http://feeds.dn.pt/DN-Artes'),
    (u'TV & Media', u'http://feeds.dn.pt/DN-Media'),
    (u'Opini\xe3o', u'http://feeds.dn.pt/DN-Opiniao'),
    (u'Pessoas', u'http://feeds.dn.pt/DN-Pessoas'),
  ]
Thanks,

All text is extracted now.

Several more sections could also be added, but I personally do not use them:

Desporto:
http://feeds.dn.pt/DN-Desporto

Cartaz:
http://feeds.dn.pt/DN-Cartaz

Política:
http://feeds.dn.pt/DN-Politica

Gente:
http://feeds.dn.pt/DN-Gente

Galerias:
http://feeds.dn.pt/DN-Galeria
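
For anyone who does want these sections, here is a minimal sketch of how they could be appended to the feeds list of the recipe quoted above. It is untested against the live feeds; the name extra_feeds is only for illustration, and the existing list is abbreviated to two entries to keep the example short.

```python
# Existing feeds from the quoted recipe (abbreviated to two entries here).
feeds = [
    (u'Portugal', u'http://feeds.dn.pt/DN-Portugal'),
    (u'Globo', u'http://feeds.dn.pt/DN-Globo'),
]

# Optional sections listed in this post, as (section title, feed URL) pairs.
# 'extra_feeds' is a hypothetical helper name, not part of calibre.
extra_feeds = [
    (u'Desporto', u'http://feeds.dn.pt/DN-Desporto'),
    (u'Cartaz', u'http://feeds.dn.pt/DN-Cartaz'),
    (u'Pol\xedtica', u'http://feeds.dn.pt/DN-Politica'),
    (u'Gente', u'http://feeds.dn.pt/DN-Gente'),
    (u'Galerias', u'http://feeds.dn.pt/DN-Galeria'),
]

# Inside the recipe class this would simply be: feeds = [...] + extra_feeds
feeds = feeds + extra_feeds

for section, url in feeds:
    print(section, url)
```

In the real recipe the extra tuples can just as well be typed directly into the feeds list; keeping them in a separate list only makes it easier to comment them out again.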

Side note: Terms of use of the feeds of this newspaper: http://www.dn.pt/info/termosdeuso.aspx

José Pinto