MobileRead Forums

josepinto · 04-05-2013, 07:26 AM

Quote:

Originally Posted by oneillpt

For some articles the content was replaced by boilerplate editorial disclaimers, which seems to be caused by auto_cleanup. Try the version below, where it is disabled; I have used keep_only_tags and remove_tags instead. (The Unicode accented characters in the title caused problems for me. Put them back if they work for you.)

Code:

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1365070687(BasicNewsRecipe):
    title = u'Diario de Noticias'
    oldest_article = 7  # days
    max_articles_per_feed = 100
    # auto_cleanup stays disabled: with it, some articles came out as
    # boilerplate editorial disclaimers instead of their content.
    # auto_cleanup = True
    # Keep only the div with id 'cln-esqmid', then strip the
    # 'TabFerramentasInf' table from what is left.
    keep_only_tags = [dict(name='div', attrs={'id': 'cln-esqmid'})]
    remove_tags = [dict(name='table', attrs={'class': 'TabFerramentasInf'})]

    feeds = [
        (u'Portugal', u'http://feeds.dn.pt/DN-Portugal'),
        (u'Globo', u'http://feeds.dn.pt/DN-Globo'),
        (u'Economia', u'http://feeds.dn.pt/DN-Economia'),
        (u'Ci\xeancia', u'http://feeds.dn.pt/DN-Ciencia'),
        (u'Artes', u'http://feeds.dn.pt/DN-Artes'),
        (u'TV & Media', u'http://feeds.dn.pt/DN-Media'),
        (u'Opini\xe3o', u'http://feeds.dn.pt/DN-Opiniao'),
        (u'Pessoas', u'http://feeds.dn.pt/DN-Pessoas'),
    ]

Thanks, all the text is extracted now.

Several sections could also be added but I personally do not use them:

Desporto:
http://feeds.dn.pt/DN-Desporto

Cartaz:
http://feeds.dn.pt/DN-Cartaz

Política:
http://feeds.dn.pt/DN-Politica

Gente:
http://feeds.dn.pt/DN-Gente

Galerias:
http://feeds.dn.pt/DN-Galeria
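
For anyone who does want these sections, here is a rough sketch of how they could be appended to the feeds list of the recipe quoted above. The names BASE_FEEDS and OPTIONAL_FEEDS are my own and only for illustration; in the actual recipe the combined list would be assigned to the feeds attribute inside the class. I have not tested these extra feeds with the keep_only_tags / remove_tags settings, so treat it as a starting point.

```python
# Feeds already present in the recipe quoted above.
BASE_FEEDS = [
    (u'Portugal', u'http://feeds.dn.pt/DN-Portugal'),
    (u'Globo', u'http://feeds.dn.pt/DN-Globo'),
    (u'Economia', u'http://feeds.dn.pt/DN-Economia'),
    (u'Ci\xeancia', u'http://feeds.dn.pt/DN-Ciencia'),
    (u'Artes', u'http://feeds.dn.pt/DN-Artes'),
    (u'TV & Media', u'http://feeds.dn.pt/DN-Media'),
    (u'Opini\xe3o', u'http://feeds.dn.pt/DN-Opiniao'),
    (u'Pessoas', u'http://feeds.dn.pt/DN-Pessoas'),
]

# Optional sections listed in this post (hypothetical grouping, not part
# of the original recipe). Accents are written as escapes, as in the recipe.
OPTIONAL_FEEDS = [
    (u'Desporto', u'http://feeds.dn.pt/DN-Desporto'),
    (u'Cartaz', u'http://feeds.dn.pt/DN-Cartaz'),
    (u'Pol\xedtica', u'http://feeds.dn.pt/DN-Politica'),
    (u'Gente', u'http://feeds.dn.pt/DN-Gente'),
    (u'Galerias', u'http://feeds.dn.pt/DN-Galeria'),
]

# In the recipe class body this would read: feeds = BASE_FEEDS + OPTIONAL_FEEDS
feeds = BASE_FEEDS + OPTIONAL_FEEDS
```

Keeping the optional sections in their own list makes it easy to comment out the ones you do not read.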

Side note: the newspaper's terms of use for these feeds are at http://www.dn.pt/info/termosdeuso.aspx

José Pinto