Duplicated news in recipe with multiple feeds

romualdinho · 09-21-2011, 11:44 PM

Hello everybody,

I have a question about the configuration of recipes.

There's a site that has an RSS file for each tag/topic used in the articles.
In my recipe I added some feeds for the topics I'm interested in.
The problem is that an article can have many tags, so it can be in two or more feeds, and the article then appears twice (or three times, or four...).

Is it possible to remove the duplicated articles from the recipes?

This is my code:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1316656601(BasicNewsRecipe):
    title          = u'Mongabay'
    oldest_article = 120
    max_articles_per_feed = 100
    auto_cleanup = True
    remove_tags    = [dict(name='p', attrs={'class':'hide'})]

    feeds          = [
        (u'Amazon', u'http://news.mongabay.com/xml/amazon1.xml'),
        (u'Species discovery', u'http://news.mongabay.com/xml/species_discovery1.xml'),
        (u'Rainforest animals', u'http://news.mongabay.com/xml/rainforest%20animals1.xml'),
        (u'Cats', u'http://news.mongabay.com/xml/cats1.xml'),
        (u'Pantanal', u'http://news.mongabay.com/xml/pantanal1.xml'),
    ]

    def print_version(self, url):
        return url.replace('http://', 'http://print.')

It's still a basic recipe.
For example, the feed titled 'Amazon' has an article that is also in 'Rainforest animals'.
What I want is to keep only one copy of each duplicated article. Is that possible?

Any help will be appreciated.

Starson17 · 09-22-2011, 04:26 PM

Quote:

Originally Posted by romualdinho

Hello everybody,

I have a question about the configuration of recipes.

There's a site that has an RSS file for each tag/topic used in the articles.
In my recipe I added some feeds for the topics i'm interested in.
The problem is an article has many tags, so it can be in two or more feeds and the article will be twice (or three times, or four...)

Is it possible to remove the duplicated articles from the recipes?

What I want is to have only one of those duplicated articles. Is that possible?

Any help will be appreciated.

Yes, it's possible, but you will need to do some work. Read the re-usable code sticky to understand the structure of feeds and how articles are stored in them. You will want to process the feed(s) to remove duplicate article URLs. You might also want to look at the code that removes duplicate articles between multiple runs of the same recipe. It's doing nearly the same thing, except it has to store the URLs between recipe runs.
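A minimal sketch of the "remove duplicate article URLs" step described above, not a tested calibre recipe. It assumes each feed object has an `articles` list and each article has a `url` attribute, as in the recipe code later in this thread. `Article`, `Feed` and `remove_duplicate_urls` are hypothetical names used only for illustration, and the example.com URLs are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    # Hypothetical stand-in for an article object (only title and url are modeled).
    title: str
    url: str


@dataclass
class Feed:
    # Hypothetical stand-in for a feed object holding a list of articles.
    title: str
    articles: List[Article] = field(default_factory=list)


def remove_duplicate_urls(feeds):
    """Keep only the first occurrence of each article URL across all feeds."""
    seen = set()
    for feed in feeds:
        unique = []
        for article in feed.articles:
            if article.url and article.url in seen:
                continue  # this URL already appeared in an earlier feed
            if article.url:
                seen.add(article.url)
            unique.append(article)
        # Replace the list contents in place so the feed object itself is reused.
        feed.articles[:] = unique
    return feeds


# Made-up data: article B is tagged in both feeds.
amazon = Feed('Amazon', [Article('A', 'http://example.com/a'),
                         Article('B', 'http://example.com/b')])
animals = Feed('Rainforest animals', [Article('B', 'http://example.com/b'),
                                      Article('C', 'http://example.com/c')])
remove_duplicate_urls([amazon, animals])
print([a.title for a in animals.articles])  # prints ['C']
```

In a recipe, a function like this could presumably be called from an overridden `parse_feeds`, on the list returned by `BasicNewsRecipe.parse_feeds(self)` and before that list is returned. The first feed in the list wins, so the order of `feeds` decides where a shared article ends up.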

romualdinho · 09-22-2011, 08:50 PM

Thank you Starson for your answer.
I'll take a look at the thread. I have a lot to learn.

Thanks again!

romualdinho · 09-26-2011, 11:34 PM

I used Pahan's code to get rid of already downloaded items and also filtered the code, but I couldn't solve the main problem: avoiding repeated articles from different feeds in the same run. I've spent some time on this without success.

Though I've done some things in PHP for websites, I wouldn't call myself a programmer, so I'll try a little more, and if I fail again, I'll just have to skip the duplicated articles on the Kindle while reading.

Regards.

PS: that's my code now:

Spoiler:

Code:

from calibre.constants import config_dir, CONFIG_DIR_MODE
from calibre.web.feeds.news import BasicNewsRecipe
import re, os, os.path, urllib
from hashlib import md5

class OnlyLatestRecipe(BasicNewsRecipe):
    title                 = u'Mongabay'
    oldest_article        = 30
    max_articles_per_feed = 50
    auto_cleanup          = True
    description           = 'Mongabay.com seeks to raise interest in and appreciation of wild lands and wildlife, while examining the impact of emerging trends in climate, technology, economics, and finance on conservation and development'
    category              = 'Ecología'
    language              = 'en'
    remove_tags           = [dict(name='p', attrs={'class':'hide'})]
    auto_cleanup_keep     = '//div[@class="imageWrap"]'
    remove_javascript     = True
    extra_css             = 'font span.italic { display: block; padding-bottom: 12px; }'
    preprocess_regexps    = [
        (re.compile(r'Mongabay\.com seeks.*and development\.', re.DOTALL|re.IGNORECASE),
        lambda match: ''),
        (re.compile(r'Please consider the environment.*PDF version</a>', re.DOTALL|re.IGNORECASE), # I AM considering the environment
        lambda match: ''),
        (re.compile(r'<A HREF="http://www.mongabay.com/copyright.htm">Copyright mongabay 2009', re.DOTALL|re.IGNORECASE),
        lambda match: ''),
        (re.compile(r' - Print', re.DOTALL|re.IGNORECASE),
        lambda match: ''),
        (re.compile(r'(<br\s*\/?>\s*){3,}', re.DOTALL|re.IGNORECASE),
        lambda match: ' <br /><br /> '),
        (re.compile(r'<table', re.DOTALL|re.IGNORECASE),
        lambda match: '<div> <table'),
        (re.compile(r'</table>', re.DOTALL|re.IGNORECASE),
        lambda match: '</table></div> <br />'),
        (re.compile(r'<div> <table align=right>', re.DOTALL|re.IGNORECASE),
        lambda match: '<div class="imageWrap"> <table align="left">'),
        (re.compile(r'<td width=20></td>', re.DOTALL|re.IGNORECASE),
        lambda match: '')
    ]
    conversion_options = {
         'comments'        : description
        ,'language'        : language
        ,'linearize_tables': True
    }
    feeds   = [
                 (u'Amazon', u'http://news.mongabay.com/xml/amazon1.xml')
                ,(u'Species discovery', u'http://news.mongabay.com/xml/species_discovery1.xml')
                ,(u'Rainforest animals', u'http://news.mongabay.com/xml/rainforest%20animals1.xml')
                ,(u'Cats', u'http://news.mongabay.com/xml/cats1.xml')
                ,(u'Pantanal', u'http://news.mongabay.com/xml/pantanal1.xml')
                ,(u'Boreal forests', u'http://news.mongabay.com/xml/boreal_forests1.xml')
                ,(u'Atlantic Forest', 'http://news.mongabay.com/xml/Atlantic%20Forest1.xml')
                ,(u'Panama', 'http://news.mongabay.com/xml/Panama1.xml')
            ]

    def print_version(self, url):
        return url.replace('http://', 'http://print.')

    def parse_feeds(self):
        recipe_dir = os.path.join(config_dir,'recipes')
        hash_dir = os.path.join(recipe_dir,'recipe_storage')
        feed_dir = os.path.join(hash_dir,self.title.encode('utf-8').replace('/',':'))
        if not os.path.isdir(feed_dir):
            os.makedirs(feed_dir,mode=CONFIG_DIR_MODE)

        feeds = BasicNewsRecipe.parse_feeds(self)

        for feed in feeds:
            feed_hash = urllib.quote(feed.title.encode('utf-8'),safe='')
            feed_fn = os.path.join(feed_dir,feed_hash)

            past_items = set()
            if os.path.exists(feed_fn):
               with file(feed_fn) as f:
                   for h in f:
                       past_items.add(h.strip())
                       
            cur_items = set()
            for article in feed.articles[:]:
                item_hash = md5()
                if article.content: item_hash.update(article.content.encode('utf-8'))
                if article.summary: item_hash.update(article.summary.encode('utf-8'))
                item_hash = item_hash.hexdigest()
                if article.url:
                    item_hash = article.url + ':' + item_hash
                cur_items.add(item_hash)
                if item_hash in past_items:
                    feed.articles.remove(article)
            with file(feed_fn,'w') as f:
                for h in cur_items:
                    f.write(h+'\n')

        remove = [f for f in feeds if len(f) == 0 and
                self.remove_empty_feeds]
        for f in remove:
            feeds.remove(f)

        return feeds

adoucette · 03-14-2012, 06:26 PM

Did you ever find a solution for removing duplicates from multiple feeds in the same run? If so, is it re-usable code?
I've tried the code below, adapted from the NewScientist recipe:

Code:

...
    filterDuplicates = True
    url_list = []
...
    def print_version(self, url):
        if self.filterDuplicates:
            if url in self.url_list:
                return
        return url.replace('/article/', '/printarticle/')

but when I use it, I get an epub with just empty feeds. It takes out all the URLs.
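The snippet above does not show where `url_list` gets filled in. Below is a minimal sketch of the bookkeeping this pattern seems to need, where each new URL is recorded before its print version is returned. It assumes that calibre skips an article when `print_version` returns nothing, which is not confirmed in this thread. `DedupPrintVersion` is a hypothetical stand-in class, not a real recipe, and it uses a set named `seen_urls` instead of the `url_list` list.

```python
class DedupPrintVersion:
    """Hypothetical stand-in showing only the URL bookkeeping, not a real recipe."""

    filterDuplicates = True

    def __init__(self):
        # Per-instance set, so the record of seen URLs is not shared between instances.
        self.seen_urls = set()

    def print_version(self, url):
        if self.filterDuplicates:
            if url in self.seen_urls:
                # Assumption: returning None makes the caller skip this article.
                return None
            self.seen_urls.add(url)
        return url.replace('/article/', '/printarticle/')
```

With this bookkeeping, the first call for a given URL returns its print version and any later call for the same URL returns `None`.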

romualdinho · 09-24-2012, 09:27 PM

Sorry for the late response, but unfortunately I couldn't find a solution. I've been too busy over the last few months to do any research.

For now I'm skipping the duplicated articles as I read.
