I'm trying to build a recipe for
http://www.colectivoburbuja.org/?feed=rss2
Things seemed very easy, because the pages are very "clean"...
All the data that I need is under the DIV with ID=main.
I explicitly set auto_cleanup to False and no_stylesheets to True in order to avoid the page not showing up.
I am sure that the HTML code is being retrieved (for each article downloaded) because I added a small "trick" to my recipe that prints it (for debugging purposes)...
... but ALL of the pages retrieved are BLANK.
Please can somebody help me understand why?
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1330197191(BasicNewsRecipe):
    title = u'Colectivo Burbuja'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = False
    no_stylesheets = True
    feeds = [(u'Colectivo Burbuja', u'http://www.colectivoburbuja.org/?feed=rss2')]
    keep_only_tags = [dict(name='div', attrs={'id': 'main'})]
    # keep_only_tags = [dict(attrs={'class': ['entry-header', 'entry-content', 'comments-title', 'comment-content', 'reply']})]

    # Let's see what we are downloading...
    def print_version(self, url):
        # We don't look for a print version; the only purpose is to print debug information.
        print("print_version: %s" % url)
        soupinicial = self.index_to_soup(url)
        a = soupinicial.find('div', attrs={'id': 'main'})
        print("------------------------------------------------------------------------------")
        print(a)
        print("------------------------------------------------------------------------------")
        return url  # return the same parameter we received (do nothing)
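
To rule out the markup itself, the same "is there anything inside DIV ID=main?" check can be sketched outside calibre. This is only a rough illustration using Python 3's standard html.parser module; the MainDivText class, the main_div_text helper and the sample HTML are made up for the example and are not part of calibre or taken from the real site.

```python
from html.parser import HTMLParser


class MainDivText(HTMLParser):
    """Collect the text found inside <div id="main">, including nested divs."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 = outside div#main; >0 = div nesting depth inside it
        self.chunks = []  # non-empty text fragments found inside div#main

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside div#main
        elif dict(attrs).get('id') == 'main':
            self.depth = 1   # entering div#main

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def main_div_text(html):
    """Return the text inside div#main, or '' if that div is missing or empty."""
    parser = MainDivText()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


# Hypothetical page layout, only to exercise the helper.
sample = (
    '<html><body><div id="header">Menu</div>'
    '<div id="main"><h1>Title</h1>'
    '<div class="entry-content"><p>Body text</p></div></div>'
    '<div id="footer">Foot</div></body></html>'
)
print(main_div_text(sample))
```

If a helper like this returns text for the saved HTML of an article, the content is present in the download and the blank pages would have to come from a later processing step.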
Thanks.