I tried to write a custom one-off script to extract a small subset of articles from a website based on the 'title' of each page. "The Providence Journal" ran a contest for short stories in the style of H.P. Lovecraft, so I started with the RSS XML feed for the "Books" section and planned to use 'is_link_wanted' to check whether the web page 'title' contained the word "Lovecraft", as a very coarse filter.
Strangely, 'is_link_wanted' seems to be ignored. Eventually I simplified my test case to what follows, which always returns 'False' from 'is_link_wanted'. I assume that should result in no articles being included in the output, but in fact I still get every article that appears in the RSS XML.
I am probably making a very stupid mistake, but I don't know what it is.
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe


class ProJoLovecraft(BasicNewsRecipe):
    title = 'ProJo Lovecraft'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = True
    recursion = 1

    def print_version(self, url):
        return url + '&template=printart'

    # TEST: return 'False' for everything
    def is_link_wanted(self, url, tag):
        return False

    feeds = [
        ('Lovecraft', 'http://www.providencejournal.com/entertainment/books?template=rss&mime=xml'),
    ]
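
For what it's worth, the title test I eventually want is nothing fancy. Here is a minimal sketch in plain Python, kept separate from the recipe. The helper name 'title_is_wanted' and the sample titles are made up for illustration, and nothing in it is calibre API:

```python
# Hypothetical helper, not part of calibre: a coarse, case-insensitive title test.
def title_is_wanted(title, keyword='Lovecraft'):
    """Return True if the keyword occurs anywhere in the title."""
    # Treat a missing title (None) as an empty string so the test never raises.
    return keyword.lower() in (title or '').lower()


# Made-up sample titles, for illustration only.
sample_titles = [
    "Lovecraft contest: this week's story",
    'New cookbook roundup',
    'In the style of LOVECRAFT',
]

# Keep only the titles that mention the keyword.
wanted = [t for t in sample_titles if title_is_wanted(t)]
```

How to hook a test like this into the recipe so that it actually drops articles is exactly the part I have not figured out.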