MobileRead Forums - View Single Post - What am I missing about 'is_link_wanted' here?

mikebw · 08-03-2015, 01:31 PM

I tried to write a custom one-off script to extract a small subset of articles from a web site based on the 'title' of each page. "The Providence Journal" ran a contest for short stories in the style of H.P. Lovecraft, so I started with the RSS XML feed for the "Books" section and planned to use 'is_link_wanted' to check whether each web page's 'title' contained the word "Lovecraft", as a very coarse filter.
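A minimal sketch of the coarse title check I have in mind, assuming a plain case-insensitive substring match. The helper name `title_mentions` is hypothetical and not part of calibre; it only shows the test itself, not where it would be hooked into the recipe.

```python
def title_mentions(title, keyword='Lovecraft'):
    """Return True if keyword occurs anywhere in title, ignoring case.

    Hypothetical helper for illustration only; it is not a calibre API.
    """
    if not title:
        # Treat a missing or empty title as "no match".
        return False
    return keyword.lower() in title.lower()
```

Called as `title_mentions(page_title)`, this gives a yes/no answer for a single page title.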

Strangely, 'is_link_wanted' seems to be ignored. I eventually simplified my test case to the recipe below, which always returns 'False' from 'is_link_wanted'. I assumed that would leave no articles in the output, but I still get every article that appears in the RSS XML.

I am probably making a very stupid mistake, but I don't know what it is.

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class ProJoLovecraft(BasicNewsRecipe):
    title          = 'ProJo Lovecraft'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = True
    recursion      = 1

    def print_version(self, url):
        return url + '&template=printart'

    # TEST: return 'False' for everything
    def is_link_wanted(self, url, tag):
        return False

    feeds          = [
        ('Lovecraft', 'http://www.providencejournal.com/entertainment/books?template=rss&mime=xml'),
    ]
