What am I missing about 'is_link_wanted' here?

mikebw · 08-03-2015, 02:31 PM

I tried to write a custom one-off script to extract a small subset of articles from a web site based upon the 'title' of the pages. "The Providence Journal" ran a contest for short stories in the style of H.P. Lovecraft, so I started with the RSS XML feed for the "Books" section and then was going to use 'is_link_wanted' to see whether the web page 'title' contained the word "Lovecraft" as a very coarse filter.

Strangely, what I found is that 'is_link_wanted' seems to be ignored. Eventually I simplified my test case to what follows, always returning 'False' from 'is_link_wanted' which, I assume, should result in no articles being included in the output, but in fact I still get all articles appearing in the RSS XML.

I am probably making a very stupid mistake, but I don't know what it is.

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class ProJoLovecraft(BasicNewsRecipe):
    title          = 'ProJo Lovecraft'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = True
    recursion      = 1

    def print_version(self, url):
        return url + '&template=printart'

    # TEST: return 'False' for everything
    def is_link_wanted(self, url, tag):
        return False

    feeds          = [
        ('Lovecraft', 'http://www.providencejournal.com/entertainment/books?template=rss&mime=xml'),
    ]

kovidgoyal · 08-03-2015, 02:53 PM

is_link_wanted controls links in the HTML files that the RSS feed points to, not links in the RSS feed itself. To filter the feed entries, override get_article_url in your recipe and return None for articles you want skipped.

mikebw · 08-03-2015, 04:44 PM

Thank you, that worked perfectly!

By overriding 'get_article_url' it is easy to inspect the 'title' in the 'article' passed into it, and then to scan that for the presence or absence of a regex (here simply "Lovecraft") and decide whether or not to retrieve.

It probably would have been clearer to write the test in an affirmative sense -- that is, if the regex is present, retrieve the article, else do not -- but I wrote it this way because I had to develop it by trial and error, first seeing how to retrieve all articles before attempting an if-then test that skipped some.

Despite the excellent quality of the source code examples in Calibre, my knowledge of Python is close to non-existent and I have to look up everything in the documentation.

Here is the code as tested and working, which may be useful as an example to someone:

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
import re
from calibre.web.feeds.news import BasicNewsRecipe

class ProJoLovecraft(BasicNewsRecipe):
    title          = 'ProJo Lovecraft'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = True
    recursion      = 1

    def print_version(self, url):
        return url + '&template=printart'

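    def get_article_url(self, article):
        # Returning None tells calibre to skip the article.
        # Fall back to '' so an entry without a title does not break re.search.
        title = article.get('title') or ''
        if re.search(r'Lovecraft', title) is None:
            return None
        return article.get('link', None)
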
    feeds          = [
        ('Lovecraft', 'http://www.providencejournal.com/entertainment/books?template=rss&mime=xml'),
    ]
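For anyone who prefers the affirmative form of the test mentioned above, here is a minimal sketch. It assumes the 'article' passed to get_article_url supports .get() the way it does in the recipe above. The names url_if_title_matches and TITLE_PATTERN are hypothetical helpers made up for this example, not part of calibre, and plain dicts stand in for feed entries so the logic can be tried outside calibre.

```python
import re

# Hypothetical pattern for illustration; any regex would do.
TITLE_PATTERN = re.compile(r'Lovecraft')


def url_if_title_matches(article, pattern=TITLE_PATTERN):
    """Return the article link if its title matches, else None.

    'article' is assumed to be dict-like (supports .get), as in the
    recipe above. This helper is not a calibre API.
    """
    # A missing or empty title is treated as "no match".
    title = article.get('title') or ''
    if pattern.search(title):
        return article.get('link')
    return None


# Inside the recipe, the override would then just delegate:
#
#     def get_article_url(self, article):
#         return url_if_title_matches(article)

if __name__ == '__main__':
    # Plain dicts standing in for feed entries (made-up sample data).
    sample = [
        {'title': 'A Lovecraft-style tale', 'link': 'http://example.com/a'},
        {'title': 'Summer reading list', 'link': 'http://example.com/b'},
    ]
    for entry in sample:
        print(entry['title'], '->', url_if_title_matches(entry))
```

Keeping the test in a small function of its own also makes it easy to swap in a different pattern, for example re.compile(r'lovecraft', re.IGNORECASE), without touching the recipe class.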
