04-05-2013, 07:09 AM   #4
josepinto
Not all text extracted

Quote:
Originally Posted by oneillpt
The replacement of content by boilerplate editorial disclaimers for some articles seems to be due to use of auto_cleanup. Try the version below where this is disabled. I have used keep_only_tags and remove_tags instead. (The unicode accented characters in the title caused problems for me. Put them back if they work for you)

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1365070687(BasicNewsRecipe):
  title = u'Diario de Noticias'
  oldest_article = 7
  max_articles_per_feed = 100
  #auto_cleanup = True
  keep_only_tags = [dict(name='div', attrs={'id':'cln-esqmid'}) ]
  remove_tags    = [ dict(name='table', attrs={'class':'TabFerramentasInf'}) ]

  feeds = [(u'Portugal', u'http://feeds.dn.pt/DN-Portugal'), 
    (u'Globo', u'http://feeds.dn.pt/DN-Globo'), 
    (u'Economia', u'http://feeds.dn.pt/DN-Economia'), 
    (u'Ci\xeancia', u'http://feeds.dn.pt/DN-Ciencia'), 
    (u'Artes', u'http://feeds.dn.pt/DN-Artes'), 
    (u'TV & Media', u'http://feeds.dn.pt/DN-Media'), 
    (u'Opini\xe3o', u'http://feeds.dn.pt/DN-Opiniao'), 
    (u'Pessoas', u'http://feeds.dn.pt/DN-Pessoas')
    ]

Hi again,

In several articles, only the title and the first paragraph of the text, which is in bold, are extracted; the rest of the article is missing.

I tried adding use_embedded_content = False to the recipe, but it doesn't change anything.
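
One possible cause, which is an assumption and not confirmed against the site's actual HTML, is that the article body sits outside the div selected by keep_only_tags, so it is discarded along with the rest of the page. The sketch below is a rough standard-library stand-in for that kind of filter, not calibre's implementation. The sample page and the id 'other-body' are hypothetical; only 'cln-esqmid' comes from the recipe above.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class KeptTextParser(HTMLParser):
    """Collects text found inside <div id=keep_id> and ignores the rest."""

    def __init__(self, keep_id):
        super().__init__()
        self.keep_id = keep_id
        self.depth = 0   # nesting depth inside the kept div (0 = outside)
        self.kept = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.keep_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.kept.append(data.strip())


def kept_text(html, keep_id):
    """Return the text fragments that survive a keep-only-this-div filter."""
    parser = KeptTextParser(keep_id)
    parser.feed(html)
    return parser.kept


# Hypothetical page layout: the body paragraph lives in a second div.
SAMPLE = """
<div id="cln-esqmid"><h1>Title</h1><p><b>Lead paragraph</b></p></div>
<div id="other-body"><p>Rest of the article</p></div>
"""

print(kept_text(SAMPLE, "cln-esqmid"))
```

If the real pages turn out to be laid out like this, adding a second dict for the body container to keep_only_tags might bring the missing text back. The page source of an affected article would need to be checked first.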

José Pinto