Old 12-16-2013, 04:04 PM   #30
sup
Quote:
Originally Posted by Pahan
Here is a recipe template that keeps track of already downloaded feed items and only downloads items that it hasn't seen before or whose description, content, or URL has changed. It does so by overriding the parse_feeds method.
Some caveats:
  • I recommend setting max_articles_per_feed and oldest_article to very high values. On its first run, the recipe will download every item in every feed, but after that it will "remember" what it has seen and will grab all new articles, no matter how much time has elapsed since the last run or how many entries have been added. In particular, if you set max_articles_per_feed to a small value and the feed lists all its articles in a fixed order, you might never see new articles.
  • The list of items downloaded for each feed will be stored in "Calibre configuration directory/recipes/recipe_storage/Recipe title/Feed title". This is probably suboptimal, and there ought to be a persistent storage API for recipes, but it's the best I could come up with.
  • The list of items downloaded is written to disk before the items are actually downloaded. Thus, if an item fails to download for some reason, the recipe won't know and will not try to download it again. This could probably be fixed by writing the new item lists to temporary files and overriding some method later in the download sequence to "commit" by overwriting the stored item lists with the new ones. (If the recipe fails before that point, the old lists remain intact, and the missed items will be redownloaded the next time the recipe is run.)
  • If there are no new items to download and remove_empty_feeds is set to True, the recipe will return an empty list of feeds, which will cause Calibre to raise an error. As far as I can tell, there is nothing that the recipe can do about that without a lot more coding.
  • I've tried to make this code portable, but I've only tested it on Linux systems, so let me know if it doesn't work on the other platforms. I am particularly unsure about newline handling.
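The temporary-file "commit" idea from the third caveat could be sketched roughly like this. This is a minimal sketch of my own, not part of the recipe below: `commit_seen_lists` and the `.tmp` naming are hypothetical, and it assumes you call it from a hook that Calibre runs only after a successful download (such as an overridden cleanup()):

```python
import os

def commit_seen_lists(feed_dir):
    # Hypothetical helper: parse_feeds() would write each feed's item list
    # to "<name>.tmp" instead of "<name>"; calling this after the download
    # finishes promotes the temporary lists to the real ones, so a run that
    # fails partway leaves the old lists intact and the items are retried.
    for fn in os.listdir(feed_dir):
        if fn.endswith('.tmp'):
            os.rename(os.path.join(feed_dir, fn),
                      os.path.join(feed_dir, fn[:-4]))
```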
Spoiler:
Code:
from calibre.constants import config_dir, CONFIG_DIR_MODE
from calibre.web.feeds.news import BasicNewsRecipe
import os, os.path, urllib
from hashlib import md5

class OnlyLatestRecipe(BasicNewsRecipe):
    title          = u'Unknown News Source'
    oldest_article = 10000
    max_articles_per_feed = 10000
    feeds          = [ ]

    def parse_feeds(self):
        recipe_dir = os.path.join(config_dir,'recipes')
        hash_dir = os.path.join(recipe_dir,'recipe_storage')
        feed_dir = os.path.join(hash_dir,self.title.encode('utf-8').replace('/',':'))
        if not os.path.isdir(feed_dir):
            os.makedirs(feed_dir,mode=CONFIG_DIR_MODE)

        feeds = BasicNewsRecipe.parse_feeds(self)

        for feed in feeds:
            feed_hash = urllib.quote(feed.title.encode('utf-8'),safe='')
            feed_fn = os.path.join(feed_dir,feed_hash)

            past_items = set()
            if os.path.exists(feed_fn):
                with open(feed_fn) as f:
                    for h in f:
                        past_items.add(h.strip())
                       
            cur_items = set()
            for article in feed.articles[:]:
                item_hash = md5()
                if article.content: item_hash.update(article.content.encode('utf-8'))
                if article.summary: item_hash.update(article.summary.encode('utf-8'))
                item_hash = item_hash.hexdigest()
                if article.url:
                    item_hash = article.url + ':' + item_hash
                cur_items.add(item_hash)
                if item_hash in past_items:
                    feed.articles.remove(article)
            with open(feed_fn,'w') as f:
                for h in cur_items:
                    f.write(h+'\n')

        if self.remove_empty_feeds:
            feeds = [f for f in feeds if len(f) > 0]

        return feeds
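The per-article fingerprint computed in the loop above can be isolated as a plain function to make the dedup rule explicit. A sketch of my own (`article_fingerprint` is not a name from the recipe): hash the content and summary, then prefix the URL so that a changed URL also counts as a new item.

```python
from hashlib import md5

def article_fingerprint(url, content=None, summary=None):
    # Mirrors the hashing in parse_feeds() above: the stored key is
    # "<url>:<md5 of content+summary>", so a change to any of the three
    # makes the article look new and it gets downloaded again.
    h = md5()
    if content:
        h.update(content.encode('utf-8'))
    if summary:
        h.update(summary.encode('utf-8'))
    digest = h.hexdigest()
    return (url + ':' + digest) if url else digest
```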
This is a simpler version of the above method that does not keep track of changes; it assumes that what was once put online never changes (which is generally not true, but holds for some feeds). It also uses the parse_index method instead of parse_feeds, since it assumes you are scraping a website. All the same caveats except the first apply. This recipe keeps only the last twenty articles for any given section; if you need more, change the limit.
Spoiler:
Code:
from calibre.constants import config_dir, CONFIG_DIR_MODE
import os

def parse_index(self):
    # Read already downloaded articles
    recipe_dir = os.path.join(config_dir,'recipes')
    old_articles = os.path.join(recipe_dir,self.title.encode('utf-8').replace('/',':'))
    past_items = []
    if os.path.exists(old_articles):
        with open(old_articles) as f:
            for h in f:
                l = h.strip().split(" ")
                past_items.append((l[0]," ".join(l[1:])))
    old_urls = [x[0] for x in past_items]

    count_items = {}
    current_items = []
    # Keep a list of only the 20 latest articles for each section
    past_items.reverse()
    for item in past_items:
        if item[1] in count_items.keys():
            if count_items[item[1]] < 20:
                count_items[item[1]] += 1
                current_items.append(item)
        else:
            count_items[item[1]] = 1
            current_items.append(item)
    current_items.reverse()

    # Do stuff here to get 'list_of_articles' containing dictionaries of
    # the form {'title':title,'url':url} and to set the variable
    # 'feed_name'; see the following link for details:
    # http://manual.calibre-ebook.com/news_recipe.html#calibre.web.feeds.news.BasicNewsRecipe.parse_index

    ans = []
    new_articles = []
    for article in list_of_articles:
        if article['url'] not in old_urls:
            current_items.append((article['url'],feed_name))
            new_articles.append(article)
    ans.append((feed_name,new_articles))

    # Write already downloaded articles
    with open(old_articles,'w') as f:
        f.write('\n'.join('{} {}'.format(*x) for x in current_items))
    return ans
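The twenty-newest-per-section pruning in that recipe can be pulled out into a standalone function. This is a sketch under my own naming (`keep_latest_per_section` is not part of the recipe); it takes the stored (url, section) pairs in oldest-first order, as the recipe reads them from disk, and trims each section the same way the loop above does:

```python
def keep_latest_per_section(items, limit=20):
    # items: list of (url, section) tuples, oldest first.
    # Walk newest-first, keep at most `limit` entries per section,
    # then restore the original oldest-first order.
    counts = {}
    kept = []
    for url, section in reversed(items):
        if counts.get(section, 0) < limit:
            counts[section] = counts.get(section, 0) + 1
            kept.append((url, section))
    kept.reverse()
    return kept
```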

Last edited by sup; 01-14-2014 at 12:50 PM.