I'm using the following test recipe to crawl infoq.com:
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class InfoQ_Test(BasicNewsRecipe):
    title = u'InfoQ Test'
    auto_cleanup = True
    no_stylesheets = True
    # Intended to keep only the element with id="content"
    keep_only_tags = [dict(id=['content'])]

    def parse_index(self):
        # Hard-coded article list for testing
        items = []
        items.append({'title': 'Article1', 'url': 'http://www.infoq.com/news/2013/06/stratos-2'})
        items.append({'title': 'Article2', 'url': 'http://www.infoq.com/news/2013/06/document-messaging-analysis'})
        return [("Default", items)]
I want to keep only the "div" with id="content" from the whole page, but calibre removes every element under "body" instead. Removing the "keep_only_tags" setting gets the article content successfully, but I would like to know why it doesn't work with "keep_only_tags".
Thanks