Old 09-21-2011, 11:44 PM   #1
romualdinho
Junior Member
romualdinho began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Sep 2011
Location: Montevideo, Uruguay
Device: Kindle3
Duplicated news in recipe with multiple feeds

Hello everybody,

I have a question about the configuration of recipes.

There's a site that has an RSS file for each tag/topic used in the articles.
In my recipe I added some feeds for the topics I'm interested in.
The problem is that an article can have many tags, so it can appear in two or more feeds, and the e-book then contains it twice (or three times, or four...).

Is it possible to remove the duplicate articles from the recipe's output?

This is my code:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1316656601(BasicNewsRecipe):
    title          = u'Mongabay'
    oldest_article = 120  # days
    max_articles_per_feed = 100
    auto_cleanup = True
    remove_tags    = [dict(name='p', attrs={'class':'hide'})]

    # One RSS feed per topic; an article with several tags appears in several feeds.
    feeds          = [
        (u'Amazon', u'http://news.mongabay.com/xml/amazon1.xml'),
        (u'Species discovery', u'http://news.mongabay.com/xml/species_discovery1.xml'),
        (u'Rainforest animals', u'http://news.mongabay.com/xml/rainforest%20animals1.xml'),
        (u'Cats', u'http://news.mongabay.com/xml/cats1.xml'),
        (u'Pantanal', u'http://news.mongabay.com/xml/pantanal1.xml'),
    ]

    def print_version(self, url):
        # Fetch the printer-friendly page instead of the regular article page.
        return url.replace('http://', 'http://print.')
It's a basic recipe (for now).
For example, the feed titled 'Amazon' has an article that is also in 'Rainforest animals'.
I want to keep only one copy of each duplicated article. Is that possible?

Any help will be appreciated.
Old 09-22-2011, 04:26 PM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Yes, it's possible, but you will need to do some work. Read the re-usable code sticky to understand the structure of feeds and how articles are stored in them. You will want to process the feed(s) to remove duplicate article URLs. You might also want to look at the code that removes duplicate articles between multiple runs of the same recipe. It's doing nearly the same thing, except it has to store the URLs between recipe runs.
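A minimal sketch of the "remove duplicate article URLs" step described above. It assumes only what the post states: each parsed feed holds a list of articles, and each article has a URL. The helper name `drop_duplicate_articles` and the stand-in objects are hypothetical, used here so the example runs without calibre.

```python
from types import SimpleNamespace


def drop_duplicate_articles(feeds):
    """Keep only the first occurrence of each article URL across all feeds.

    `feeds` is a list of objects with an `articles` list; each article
    is expected to have a `url` attribute.
    """
    seen_urls = set()
    for feed in feeds:
        unique = []
        for article in feed.articles:
            url = article.url
            if url and url in seen_urls:
                continue  # already kept earlier, in this feed or a previous one
            if url:
                seen_urls.add(url)
            unique.append(article)
        feed.articles[:] = unique  # replace the list contents in place
    return feeds


# Stand-in objects for illustration only (hypothetical URLs).
amazon = SimpleNamespace(title='Amazon', articles=[
    SimpleNamespace(url='http://example.com/a'),
    SimpleNamespace(url='http://example.com/b'),
])
animals = SimpleNamespace(title='Rainforest animals', articles=[
    SimpleNamespace(url='http://example.com/b'),
    SimpleNamespace(url='http://example.com/c'),
])

drop_duplicate_articles([amazon, animals])
print([a.url for a in amazon.articles])
print([a.url for a in animals.articles])
```

In a recipe, a `parse_feeds` override could call `BasicNewsRecipe.parse_feeds(self)` and pass the result through a helper like this before returning it. The first feed in the `feeds` list keeps the article, so feed order decides where a shared article ends up.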
Old 09-22-2011, 08:50 PM   #3
romualdinho
Thank you Starson for your answer.
I'll take a look at the thread. I have a lot to learn.

Thanks again!
Old 09-26-2011, 11:34 PM   #4
romualdinho
I used Pahan's code to skip items that were already downloaded, and I also added some filtering, but I couldn't solve the main problem: avoiding repeated articles from different feeds in the same run. I've spent some time on this without success.

Though I've done some things in PHP for websites, I wouldn't call myself a programmer, so I will try a little more, and if that fails again, I'll have to skip the duplicates on the Kindle while reading.

Regards.

PS: that's my code now:
Code:
from calibre.constants import config_dir, CONFIG_DIR_MODE
from calibre.web.feeds.news import BasicNewsRecipe
import re, os, os.path, urllib
from hashlib import md5

class OnlyLatestRecipe(BasicNewsRecipe):
    title                 = u'Mongabay'
    oldest_article        = 30
    max_articles_per_feed = 50
    auto_cleanup          = True
    description           = 'Mongabay.com seeks to raise interest in and appreciation of wild lands and wildlife, while examining the impact of emerging trends in climate, technology, economics, and finance on conservation and development'
    category              = 'Ecología'
    language              = 'en'
    remove_tags           = [dict(name='p', attrs={'class':'hide'})]
    auto_cleanup_keep     = '//div[@class="imageWrap"]'
    remove_javascript     = True
    extra_css             = 'font span.italic { display: block; padding-bottom: 12px; }'
    preprocess_regexps    = [
        (re.compile(r'Mongabay\.com seeks.*and development\.', re.DOTALL|re.IGNORECASE),
        lambda match: ''),
        (re.compile(r'Please consider the environment.*PDF version</a>', re.DOTALL|re.IGNORECASE), # I AM considering the environment
        lambda match: ''),
        (re.compile(r'<A HREF="http://www.mongabay.com/copyright.htm">Copyright mongabay 2009', re.DOTALL|re.IGNORECASE),
        lambda match: ''),
        (re.compile(r' - Print', re.DOTALL|re.IGNORECASE),
        lambda match: ''),
        (re.compile(r'(<br\s*\/?>\s*){3,}', re.DOTALL|re.IGNORECASE),
        lambda match: ' <br /><br /> '),
        (re.compile(r'<table', re.DOTALL|re.IGNORECASE),
        lambda match: '<div> <table'),
        (re.compile(r'</table>', re.DOTALL|re.IGNORECASE),
        lambda match: '</table></div> <br />'),
        (re.compile(r'<div> <table align=right>', re.DOTALL|re.IGNORECASE),
        lambda match: '<div class="imageWrap"> <table align="left">'),
        (re.compile(r'<td width=20></td>', re.DOTALL|re.IGNORECASE),
        lambda match: '')
    ]
    conversion_options = {
         'comments'        : description
        ,'language'        : language
        ,'linearize_tables': True
    }
    feeds   = [
                 (u'Amazon', u'http://news.mongabay.com/xml/amazon1.xml')
                ,(u'Species discovery', u'http://news.mongabay.com/xml/species_discovery1.xml')
                ,(u'Rainforest animals', u'http://news.mongabay.com/xml/rainforest%20animals1.xml')
                ,(u'Cats', u'http://news.mongabay.com/xml/cats1.xml')
                ,(u'Pantanal', u'http://news.mongabay.com/xml/pantanal1.xml')
                ,(u'Boreal forests', u'http://news.mongabay.com/xml/boreal_forests1.xml')
                ,(u'Atlantic Forest', 'http://news.mongabay.com/xml/Atlantic%20Forest1.xml')
                ,(u'Panama', 'http://news.mongabay.com/xml/Panama1.xml')
            ]

    def print_version(self, url):
        return url.replace('http://', 'http://print.')

    def parse_feeds(self):
        # Storage directory for hashes of articles seen in earlier runs of this recipe.
        recipe_dir = os.path.join(config_dir,'recipes')
        hash_dir = os.path.join(recipe_dir,'recipe_storage')
        feed_dir = os.path.join(hash_dir,self.title.encode('utf-8').replace('/',':'))
        if not os.path.isdir(feed_dir):
            os.makedirs(feed_dir,mode=CONFIG_DIR_MODE)

        feeds = BasicNewsRecipe.parse_feeds(self)

        for feed in feeds:
            # One hash file per feed, named after the URL-quoted feed title.
            feed_hash = urllib.quote(feed.title.encode('utf-8'),safe='')
            feed_fn = os.path.join(feed_dir,feed_hash)

            # Hashes saved by the previous run for this feed.
            past_items = set()
            if os.path.exists(feed_fn):
                with file(feed_fn) as f:
                    for h in f:
                        past_items.add(h.strip())

            # past_items is per feed, so an article repeated in another feed
            # during the same run is not caught here.
            cur_items = set()
            for article in feed.articles[:]:
                item_hash = md5()
                if article.content: item_hash.update(article.content.encode('utf-8'))
                if article.summary: item_hash.update(article.summary.encode('utf-8'))
                item_hash = item_hash.hexdigest()
                if article.url:
                    item_hash = article.url + ':' + item_hash
                cur_items.add(item_hash)
                if item_hash in past_items:
                    feed.articles.remove(article)
            # Save this run's hashes for the next run.
            with file(feed_fn,'w') as f:
                for h in cur_items:
                    f.write(h+'\n')

        # Drop feeds left empty after filtering, if the recipe asks for that.
        remove = [f for f in feeds if len(f) == 0 and
                self.remove_empty_feeds]
        for f in remove:
            feeds.remove(f)

        return feeds
Old 03-14-2012, 06:26 PM   #5
adoucette
Member
adoucette doesn't litter
 
Posts: 24
Karma: 140
Join Date: Sep 2011
Device: Nook Color (rooted?)
Did you ever find a solution for removing duplicates from multiple feeds in the same run? If so, is it re-usable code?
I've tried using the code below, adapted from the NewScientist recipe:
Code:
...
    filterDuplicates = True
    url_list = []
...
    def print_version(self, url):
        if self.filterDuplicates:
            if url in self.url_list:
                return
        return url.replace('/article/', '/printarticle/')
but when I use it, I get an EPUB with only empty feeds; it takes out all the URLs.
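For a seen-URL filter of this kind, the check and the recording of the URL need to happen together, so that the first sighting passes and only later sightings are rejected. The sketch below shows that pattern in plain Python. `SeenUrlFilter` and the example URL are hypothetical, and the elided parts of the snippet above are not reconstructed here. Whether calibre drops an article when `print_version` returns nothing is also not confirmed in this thread.

```python
class SeenUrlFilter(object):
    """Hypothetical helper: lets each URL through once, then rejects repeats."""

    def __init__(self):
        self.seen = set()

    def first_time(self, url):
        if url in self.seen:
            return False
        self.seen.add(url)  # record the URL at the moment it is accepted
        return True


def print_version(url, url_filter):
    # Same shape as the snippet above: nothing for a repeat,
    # otherwise the printer-friendly URL.
    if not url_filter.first_time(url):
        return None
    return url.replace('/article/', '/printarticle/')


url_filter = SeenUrlFilter()
first = print_version('http://example.com/article/1', url_filter)
second = print_version('http://example.com/article/1', url_filter)
print(first)
print(second)
```

If the URLs were added to the list somewhere else before the check ran, every URL would already count as seen and every article would be rejected, which would match the empty feeds described above. That is only one possible explanation.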

Last edited by adoucette; 03-14-2012 at 09:27 PM.
Old 09-24-2012, 09:27 PM   #6
romualdinho
Sorry for the late response, but unfortunately I couldn't find a solution. I've been too busy these last months to do any research.

For now I'm skipping the duplicated articles as I read.