08-03-2015, 01:31 PM   #1
mikebw
What am I missing about 'is_link_wanted' here?

I tried to write a custom one-off script to extract a small subset of articles from a web site, based on the 'title' of each page. "The Providence Journal" ran a contest for short stories in the style of H.P. Lovecraft, so I started with the RSS XML feed for the "Books" section and planned to use 'is_link_wanted' to check whether each web page's 'title' contained the word "Lovecraft", as a very coarse filter.

Strangely, 'is_link_wanted' seems to be ignored. I eventually simplified my test case to what follows, which always returns 'False' from 'is_link_wanted'. I assumed that would result in no articles being included in the output, but in fact I still get every article that appears in the RSS XML.

I am probably making a very stupid mistake, but I don't know what it is.

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class ProJoLovecraft(BasicNewsRecipe):
    title          = 'ProJo Lovecraft'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = True
    recursion      = 1

    def print_version(self, url):
        return url + '&template=printart'

    # TEST: return 'False' for everything
    def is_link_wanted(self, url, tag):
        return False

    feeds          = [
        ('Lovecraft', 'http://www.providencejournal.com/entertainment/books?template=rss&mime=xml'),
    ]
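
To be concrete about the coarse title filter described above, here is a minimal standalone sketch of the check itself, independent of calibre. The helper names `title_matches` and `filter_by_title` are hypothetical (they are not part of the calibre API), and the sketch assumes the titles are already available as plain strings paired with their URLs.

```python
def title_matches(title, keyword='Lovecraft'):
    """Return True if keyword occurs in title, ignoring case.

    A missing title (None or empty) never matches.
    """
    return keyword.lower() in (title or '').lower()


def filter_by_title(articles, keyword='Lovecraft'):
    """Keep only the (title, url) pairs whose title mentions keyword."""
    return [(title, url) for (title, url) in articles
            if title_matches(title, keyword)]
```

This only illustrates the string test; how and where such a check would hook into the recipe is exactly what I am unsure about.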