Old 12-16-2013, 04:04 PM   #30
sup
Quote:
Originally Posted by Pahan
Here is a recipe template that keeps track of already downloaded feed items and only downloads items that it hasn't seen before or whose description, content, or URL has changed. It does so by overriding the parse_feeds method.
Some caveats:
  • I recommend setting max_articles_per_feed and oldest_article to very high values. On its first run, the recipe will download every item in every feed, but after that it will "remember" what it has seen and will grab all new articles, no matter how much time has elapsed since the last run or how many entries have been added. In particular, if you set max_articles_per_feed to a small value and the feed lists all its articles in a fixed order, you might never see new articles.
  • The list of items downloaded for each feed will be stored in "Calibre configuration directory/recipes/recipe_storage/Recipe title/Feed title". This is probably suboptimal, and there ought to be a persistent storage API for recipes, but it's the best I could come up with.
  • The list of items downloaded is written to disk before the items are actually downloaded. Thus, if an item fails to download for some reason, the recipe won't know and will not try to download it again. This could probably be fixed by writing the new item lists to temporary files and overriding some method later in the download sequence to "commit" by overwriting the stored item lists with the new ones. (If the recipe fails before that point, the old lists remain intact, and the missed items will be redownloaded the next time the recipe is run.)
  • If there are no new items to download and remove_empty_feeds is set to True, the recipe will return an empty list of feeds, which will cause Calibre to raise an error. As far as I can tell, there is nothing that the recipe can do about that without a lot more coding.
  • I've tried to make this code portable, but I've only tested it on Linux systems, so let me know if it doesn't work on the other platforms. I am particularly unsure about newline handling.
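The temporary-file "commit" idea from the third caveat could be sketched roughly like this. This is a minimal sketch of my own, not part of the recipe below: `commit_seen_lists` and the `.tmp` naming are hypothetical, and it assumes you call it from a hook that Calibre runs only after a successful download (such as an overridden cleanup()):

```python
import os

def commit_seen_lists(feed_dir):
    # Hypothetical helper: parse_feeds() would write each feed's item list
    # to "<name>.tmp" instead of "<name>"; calling this after the download
    # finishes promotes the temporary lists to the real ones, so a run that
    # fails partway leaves the old lists intact and the items are retried.
    for fn in os.listdir(feed_dir):
        if fn.endswith('.tmp'):
            os.rename(os.path.join(feed_dir, fn),
                      os.path.join(feed_dir, fn[:-4]))
```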
Spoiler:
Code:
from calibre.constants import config_dir, CONFIG_DIR_MODE
from calibre.web.feeds.news import BasicNewsRecipe
import os, os.path, urllib
from hashlib import md5

class OnlyLatestRecipe(BasicNewsRecipe):
    title          = u'Unknown News Source'
    oldest_article = 10000
    max_articles_per_feed = 10000
    feeds          = [ ]

    def parse_feeds(self):
        recipe_dir = os.path.join(config_dir,'recipes')
        hash_dir = os.path.join(recipe_dir,'recipe_storage')
        feed_dir = os.path.join(hash_dir,self.title.encode('utf-8').replace('/',':'))
        if not os.path.isdir(feed_dir):
            os.makedirs(feed_dir,mode=CONFIG_DIR_MODE)

        feeds = BasicNewsRecipe.parse_feeds(self)

        for feed in feeds:
            feed_hash = urllib.quote(feed.title.encode('utf-8'),safe='')
            feed_fn = os.path.join(feed_dir,feed_hash)

            past_items = set()
            if os.path.exists(feed_fn):
                with open(feed_fn) as f:
                    for h in f:
                        past_items.add(h.strip())
                       
            cur_items = set()
            for article in feed.articles[:]:
                item_hash = md5()
                if article.content: item_hash.update(article.content.encode('utf-8'))
                if article.summary: item_hash.update(article.summary.encode('utf-8'))
                item_hash = item_hash.hexdigest()
                if article.url:
                    item_hash = article.url + ':' + item_hash
                cur_items.add(item_hash)
                if item_hash in past_items:
                    feed.articles.remove(article)
            with open(feed_fn,'w') as f:
                for h in cur_items:
                    f.write(h+'\n')

        if self.remove_empty_feeds:
            feeds = [f for f in feeds if len(f) > 0]

        return feeds
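The per-article fingerprint computed in the loop above can be isolated as a plain function to make the dedup rule explicit. A sketch of my own (`article_fingerprint` is not a name from the recipe): hash the content and summary, then prefix the URL so that a changed URL also counts as a new item.

```python
from hashlib import md5

def article_fingerprint(url, content=None, summary=None):
    # Mirrors the hashing in parse_feeds() above: the stored key is
    # "<url>:<md5 of content+summary>", so a change to any of the three
    # makes the article look new and it gets downloaded again.
    h = md5()
    if content:
        h.update(content.encode('utf-8'))
    if summary:
        h.update(summary.encode('utf-8'))
    digest = h.hexdigest()
    return (url + ':' + digest) if url else digest
```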
This is a simpler version of the above method that does not keep track of changes; it assumes that what was once put online never changes (which is generally not true, but holds for some feeds). It also uses the parse_index method instead of parse_feeds, since it assumes you are scraping a website. All the same caveats except the first apply. This recipe keeps only the last twenty articles for any given section; if you need more, change the limit.
Spoiler:
Code:
from calibre.constants import config_dir, CONFIG_DIR_MODE
import os

def parse_index(self):
    # Read already downloaded articles
    recipe_dir = os.path.join(config_dir,'recipes')
    old_articles = os.path.join(recipe_dir,self.title.encode('utf-8').replace('/',':'))
    past_items = []
    if os.path.exists(old_articles):
        with open(old_articles) as f:
            for h in f:
                l = h.strip().split(" ")
                past_items.append((l[0]," ".join(l[1:])))
    old_urls = [x[0] for x in past_items]

    count_items = {}
    current_items = []
    # Keep a list of only the 20 latest articles for each section
    past_items.reverse()
    for item in past_items:
        if item[1] in count_items.keys():
            if count_items[item[1]] < 20:
                count_items[item[1]] += 1
                current_items.append(item)
        else:
            count_items[item[1]] = 1
            current_items.append(item)
    current_items.reverse()

    # Do stuff here to get 'list_of_articles' containing dictionaries of
    # the form {'title':title,'url':url} and to set the variable
    # 'feed_name'; see the following link for details:
    # http://manual.calibre-ebook.com/news_recipe.html#calibre.web.feeds.news.BasicNewsRecipe.parse_index

    ans = []
    new_articles = []
    for article in list_of_articles:
        if article['url'] not in old_urls:
            current_items.append((article['url'],feed_name))
            new_articles.append(article)
    ans.append((feed_name,new_articles))

    # Write already downloaded articles
    with open(old_articles,'w') as f:
        f.write('\n'.join('{} {}'.format(*x) for x in current_items))
    return ans
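The twenty-newest-per-section pruning in that recipe can be pulled out into a standalone function. This is a sketch under my own naming (`keep_latest_per_section` is not part of the recipe); it takes the stored (url, section) pairs in oldest-first order, as the recipe reads them from disk, and trims each section the same way the loop above does:

```python
def keep_latest_per_section(items, limit=20):
    # items: list of (url, section) tuples, oldest first.
    # Walk newest-first, keep at most `limit` entries per section,
    # then restore the original oldest-first order.
    counts = {}
    kept = []
    for url, section in reversed(items):
        if counts.get(section, 0) < limit:
            counts[section] = counts.get(section, 0) + 1
            kept.append((url, section))
    kept.reverse()
    return kept
```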

Last edited by sup; 01-14-2014 at 12:50 PM.