MobileRead Forums - "keep_only_tags" doesn't work?

JeffreyZhao · 06-19-2013, 11:20 AM

I'm using the following test recipe to crawl infoq.com:

Code:

from calibre.web.feeds.news import BasicNewsRecipe


class InfoQ_Test(BasicNewsRecipe):
    title = u'InfoQ Test'
    auto_cleanup = True
    no_stylesheets = True

    # Intended to keep only the element with id="content"
    keep_only_tags = [dict(id=['content'])]

    def parse_index(self):
        # Two fixed articles, returned as a single feed named "Default"
        items = [
            {'title': 'Article1', 'url': 'http://www.infoq.com/news/2013/06/stratos-2'},
            {'title': 'Article2', 'url': 'http://www.infoq.com/news/2013/06/document-messaging-analysis'},
        ]
        return [('Default', items)]

I want to keep only the "div" with id="content" from the whole page, but calibre removes all the elements under "body" instead. Removing the "keep_only_tags" setting gets the article content successfully, but I would like to know why the recipe doesn't work with "keep_only_tags".
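To make the expected behaviour concrete, here is a minimal standard-library sketch of what I mean by "keep only the element with id="content"". It is only a stand-in for illustration, not calibre's implementation; the names KeepById and keep_only_id and the sample page are made up for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (partial list, enough for this sketch)
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class KeepById(HTMLParser):
    """Collect only the markup of the element whose id matches wanted_id."""

    def __init__(self, wanted_id):
        super().__init__(convert_charrefs=True)
        self.wanted_id = wanted_id
        self.depth = 0   # > 0 while we are inside the wanted element
        self.kept = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0 and dict(attrs).get("id") != self.wanted_id:
            return  # outside the wanted element: drop the tag
        self.kept.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.kept.append("</%s>" % tag)
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.kept.append(data)


def keep_only_id(html, wanted_id):
    """Return the markup of the first element with the given id, or ''."""
    parser = KeepById(wanted_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.kept)


# Hypothetical page layout, not the real infoq.com markup
page = (
    '<html><body>'
    '<div id="header">nav</div>'
    '<div id="content"><p>Article text</p></div>'
    '<div id="footer">foot</div>'
    '</body></html>'
)
result = keep_only_id(page, "content")
```

In this sketch the result is just the "content" div with the article paragraph inside it, which is what I expected from the recipe. The sketch returns an empty string when no element carries the requested id, which looks similar to the empty "body" I am getting.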

Thanks
