Download Only New Entries when Fetching News

alessandro_q · 03-03-2011, 06:36 PM

I have calibre download 8 news feeds every morning to read over breakfast. It is not always clear which articles I have already read the previous day, as calibre seems to always download the entire feed (which also takes some time to do).

Is there a way to have calibre only download the new entries in each feed?

Cheers.

DoctorOhh · 03-03-2011, 07:03 PM

Quote:

Originally Posted by alessandro_q

Is there a way to have calibre only download the new entries in each feed?

No, but most recipes have a line like this:

oldest_article = 3 #days

If you download these recipes daily, changing the value to 1 (via the built-in tool under "Add a custom news source") will minimize overlap.

alessandro_q · 03-03-2011, 07:07 PM

How do you get to the code of existing recipes?

Never mind, I've found "customize builtin recipe".

Does the number indicate a difference in date, or an actual 24-hour period? If it's the former, I might be better off leaving it at 2 to avoid missing any articles.

DoctorOhh · 03-03-2011, 07:47 PM

Quote:

Originally Posted by alessandro_q

Does the number indicate a difference in date, or an actual 24-hour period? If it's the former, I might be better off leaving it at 2 to avoid missing any articles.

I'm not sure, try 2 and adjust if needed.

You might learn more here.
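The thread does not settle which reading calibre applies, so the following is only a sketch of why the distinction matters. It compares the two interpretations from the question: an elapsed 24-hour window versus a difference in calendar dates. The function names and timestamps are made up for illustration and are not part of calibre.

```python
from datetime import datetime, timedelta

def within_elapsed_window(published, now, oldest_article):
    """Keep the article if it was published at most oldest_article * 24 hours ago."""
    return now - published <= timedelta(days=oldest_article)

def within_calendar_days(published, now, oldest_article):
    """Keep the article if its calendar date is at most oldest_article days before today."""
    return (now.date() - published.date()).days <= oldest_article

# Hypothetical times: a morning download and an article posted 24.5 hours earlier.
now = datetime(2011, 3, 4, 7, 0)
late_post = datetime(2011, 3, 3, 6, 30)

print(within_elapsed_window(late_post, now, 1))  # False
print(within_calendar_days(late_post, now, 1))   # True
```

Under the elapsed-window reading, a value of 1 can drop an article published just before the previous day's download if today's download runs slightly later, which is the argument for trying 2 first.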

Starson17 · 03-03-2011, 10:54 PM

Quote:

Originally Posted by alessandro_q

Is there a way to have calibre only download the new entries in each feed?

Yes.

See here.
There are other options that are less well developed.

alessandro_q · 03-03-2011, 11:40 PM

Thanks Starson. Can you give me some advice on how to include the code into the existing code for a news source? For example, here is the code for Gizmodo:

Code:

__license__   = 'GPL v3'
__copyright__ = '2010, Darko Miletic <darko.miletic at gmail.com>'
'''
gizmodo.com
'''

from calibre.web.feeds.news import BasicNewsRecipe

class Gizmodo(BasicNewsRecipe):
    title                 = 'Gizmodo'
    __author__            = 'Darko Miletic'
    description           = "Gizmodo, the gadget guide. So much in love with shiny new toys, it's unnatural."
    publisher             = 'gizmodo.com'
    category              = 'news, IT, Internet, gadgets'
    oldest_article        = 2
    max_articles_per_feed = 100
    no_stylesheets        = True
    encoding              = 'utf-8'
    use_embedded_content  = True
    language              = 'en'
    masthead_url          = 'http://cache.gawkerassets.com/assets/gizmodo.com/img/logo.png'

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }

    feeds = [(u'Articles', u'http://feeds.gawker.com/gizmodo/vip?format=xml')]

    remove_tags = [
            {'class': 'feedflare'},
    ]


    def preprocess_html(self, soup):
        return self.adeify_images(soup)

Starson17 · 03-04-2011, 09:45 AM

Quote:

Originally Posted by alessandro_q

Thanks Starson. Can you give me some advice on how to include the code into the existing code for a news source?

My advice would be not to do it. I wrote similar code and wasn't happy with it. Any error in a download and you don't get the articles the next day. You have to get every issue and read them in order. Having the most recent issue isn't enough. Ultimately, I decided I preferred keeping the ebook exactly like the feed, only having to successfully download one issue and just skipping over any articles I'd already read.
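The drawback described above can be reproduced with a small standalone sketch. It assumes the list of seen articles is saved while the feed is being filtered, before any ebook has been built. The `filter_new` helper, the file name, and the example.com URLs are hypothetical and do not come from calibre.

```python
import hashlib
import os
import tempfile

def digest(url):
    """Stable fingerprint for one article URL."""
    return hashlib.md5(url.encode('utf-8')).hexdigest()

def filter_new(urls, seen_path):
    """Return URLs not recorded in seen_path, then record every current URL."""
    seen = set()
    if os.path.exists(seen_path):
        with open(seen_path) as f:
            seen = {line.strip() for line in f}
    fresh = [u for u in urls if digest(u) not in seen]
    # The state is written here, before anything is downloaded or converted.
    with open(seen_path, 'w') as f:
        for u in urls:
            f.write(digest(u) + '\n')
    return fresh

seen_path = os.path.join(tempfile.mkdtemp(), 'gizmodo_seen')
feed = ['http://example.com/a', 'http://example.com/b']

monday = filter_new(feed, seen_path)
# Suppose Monday's ebook fails to build, or is never read.
tuesday = filter_new(feed + ['http://example.com/c'], seen_path)

print(monday)   # ['http://example.com/a', 'http://example.com/b']
print(tuesday)  # ['http://example.com/c']
```

Tuesday's issue contains only the third article, so the first two are gone unless Monday's issue was produced and read.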

Have you tried the code I pointed you to?

Quote:

For example, here is the code for Gizmodo:

There's no need to post a copy of the code for builtin recipes.

alessandro_q · 03-04-2011, 10:09 PM

I have not tried the code you pointed to. I meant to ask how to use the template. Here is my attempt:

Code:

from calibre.constants import config_dir, CONFIG_DIR_MODE
import os, os.path, urllib
from hashlib import md5

class OnlyLatestRecipe(BasicNewsRecipe):
    title          = u'Gizmodo'
	__author__            = 'Darko Miletic'
    description           = "Gizmodo, the gadget guide. So much in love with shiny new toys, it's unnatural."
    publisher             = 'gizmodo.com'
    category              = 'news, IT, Internet, gadgets'
	
    oldest_article = 10000
    max_articles_per_feed = 10000
    no_stylesheets        = True
    encoding              = 'utf-8'
    use_embedded_content  = True
    language              = 'en'
    masthead_url          = 'http://cache.gawkerassets.com/assets/gizmodo.com/img/logo.png'
	
    feeds          = [(u'Articles', u'http://feeds.gawker.com/gizmodo/vip?format=xml')]

    def parse_feeds(self):
        recipe_dir = os.path.join(config_dir,'recipes')
        hash_dir = os.path.join(recipe_dir,'recipe_storage')
        feed_dir = os.path.join(hash_dir,self.title.encode('utf-8').replace('/',':'))
        if not os.path.isdir(feed_dir):
            os.makedirs(feed_dir,mode=CONFIG_DIR_MODE)

        feeds = BasicNewsRecipe.parse_feeds(self)

        for feed in feeds:
            feed_hash = urllib.quote(feed.title.encode('utf-8'),safe='')
            feed_fn = os.path.join(feed_dir,feed_hash)

            past_items = set()
            if os.path.exists(feed_fn):
               with file(feed_fn) as f:
                   for h in f:
                       past_items.add(h.strip())
                       
            cur_items = set()
            for article in feed.articles[:]:
                item_hash = md5()
                if article.content: item_hash.update(article.content.encode('utf-8'))
                if article.summary: item_hash.update(article.summary.encode('utf-8'))
                item_hash = item_hash.hexdigest()
                if article.url:
                    item_hash = article.url + ':' + item_hash
                cur_items.add(item_hash)
                if item_hash in past_items:
                    feed.articles.remove(article)
            with file(feed_fn,'w') as f:
                for h in cur_items:
                    f.write(h+'\n')

        remove = [f for f in feeds if len(f) == 0 and
                self.remove_empty_feeds]
        for f in remove:
            feeds.remove(f)

        return feeds
		
	 conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }

    remove_tags = [
            {'class': 'feedflare'},
    ]


    def preprocess_html(self, soup):
        return self.adeify_images(soup)

Is there anything wrong with this code?

Starson17 · 03-05-2011, 11:03 AM

Quote:

Originally Posted by alessandro_q

Is there anything wrong with this code?

You have several indent errors, starting with the author. Just run it and it will report the errors. Don't use tabs; only use spaces.
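One way to locate the offending lines before running the recipe is to scan its source for tabs in the leading whitespace. This is a minimal sketch. The helper name is made up, and the sample string only imitates the first few lines of the attempt above.

```python
def lines_indented_with_tabs(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            bad.append(number)
    return bad

# Shortened imitation of the recipe: line 3 starts with a tab instead of spaces.
recipe = (
    "class OnlyLatestRecipe(BasicNewsRecipe):\n"
    "    title          = u'Gizmodo'\n"
    "\t__author__            = 'Darko Miletic'\n"
    "    description           = 'Gizmodo'\n"
)

print(lines_indented_with_tabs(recipe))  # [3]
```

Python's standard library also ships a `tabnanny` module (`python -m tabnanny recipe.py`) that reports ambiguous indentation in a saved file.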

