Old 08-03-2015, 01:31 PM   #1
mikebw
Member
mikebw began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Nov 2014
Device: none
What am I missing about 'is_link_wanted' here?

I tried to write a custom one-off script to extract a small subset of articles from a web site based upon the 'title' of the pages. "The Providence Journal" ran a contest for short stories in the style of H.P. Lovecraft, so I started with the RSS XML feed for the "Books" section and then was going to use 'is_link_wanted' to see whether the web page 'title' contained the word "Lovecraft" as a very coarse filter.

Strangely, 'is_link_wanted' seems to be ignored. Eventually I simplified my test case to what follows, which always returns 'False' from 'is_link_wanted'. I assumed this would result in no articles being included in the output, but in fact I still get every article that appears in the RSS XML.

I am probably making a very stupid mistake, but I don't know what it is.

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class ProJoLovecraft(BasicNewsRecipe):
    title          = 'ProJo Lovecraft'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = True
    recursion      = 1

    def print_version(self, url):
        return url + '&template=printart'

    # TEST: return 'False' for everything
    def is_link_wanted(self, url, tag):
        return False

    feeds          = [
        ('Lovecraft', 'http://www.providencejournal.com/entertainment/books?template=rss&mime=xml'),
    ]
Old 08-03-2015, 01:53 PM   #2
kovidgoyal
creator of calibre
Posts: 45,243
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
is_link_wanted controls links in the HTML files that the RSS feed points to, not the links in the RSS feed itself. To filter the feed entries, override get_article_url in your recipe and return None for articles you want skipped.
Old 08-03-2015, 03:44 PM   #3
mikebw
Member
mikebw began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Nov 2014
Device: none
Thank you, that worked perfectly!

By overriding 'get_article_url' it is easy to inspect the 'title' in the 'article' passed into it, and then to scan that for the presence or absence of a regex (here simply "Lovecraft") and decide whether or not to retrieve.

It probably would have been clearer to write the test in the affirmative sense (that is, retrieve the article if the regex is present, otherwise skip it), but I did it this way because I had to develop it by trial and error: first working out how to retrieve all articles, then adding an if-then test that skips some.

Despite the excellent quality of the source code examples in Calibre, my knowledge of Python is close to non-existent and I have to look up everything in the documentation.

Here is the code as tested and working, which may be useful as an example to someone:

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
import re  # used by get_article_url below
from calibre.web.feeds.news import BasicNewsRecipe

class ProJoLovecraft(BasicNewsRecipe):
    title          = 'ProJo Lovecraft'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = True
    recursion      = 1

    def print_version(self, url):
        return url + '&template=printart'

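    def get_article_url(self, article):
        # Skip (return None for) any feed entry whose title lacks "Lovecraft".
        # A missing title is treated as an empty string so re.search cannot fail.
        title = article.get('title') or ''
        if re.search(r'Lovecraft', title) is None:
            return None
        return article.get('link', None)
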
    feeds          = [
        ('Lovecraft', 'http://www.providencejournal.com/entertainment/books?template=rss&mime=xml'),
    ]
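
For anyone who prefers the affirmative form of the test mentioned above, here is a rough sketch of how it might look as a standalone function. The name `wanted_article_url` and the sample entries are made up for illustration, and plain dicts stand in for the feed entries that calibre would pass in; this has not been run inside calibre itself.

```python
import re

# Hypothetical helper, not part of calibre: keep only entries whose
# title mentions "Lovecraft".
LOVECRAFT = re.compile(r'Lovecraft')

def wanted_article_url(article):
    """Return the entry's link if its title matches, otherwise None."""
    title = article.get('title') or ''   # tolerate a missing or None title
    if LOVECRAFT.search(title):
        return article.get('link')
    return None

# Plain dicts used as stand-ins for feed entries (an assumption for this demo).
entries = [
    {'title': 'Lovecraft contest winner', 'link': 'http://example.com/a'},
    {'title': 'Summer reading list', 'link': 'http://example.com/b'},
    {'link': 'http://example.com/c'},    # entry with no title at all
]

kept = [url for url in map(wanted_article_url, entries) if url is not None]
print(kept)
```

Inside the recipe, 'get_article_url' could simply return the result of such a helper for the 'article' it receives.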

All times are GMT -4.