Like others, I'm aware that some sites have links to the same article in more than one feed.
Here is an attempt at sorting this out, based on the ideas in the re-usable section.
Please bear in mind I am no programmer and have to google Python examples to put this together, so it's clunky and crude.
The basic idea is:

repeat
    open the txt file
    get the next article title from the feed
    is the article title already in the txt file?
        No - it's unique:
            download it
            append the article title to the txt file
        Yes - it must already be in a previous section:
            don't download it
            don't append it to the file (it's already in there)
until no more articles
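To show just that check on its own, here is a minimal sketch in plain Python. The file name, the is_duplicate helper and the sample titles are only for illustration; they are not part of the recipe below.

Code:
seen_file = 'feeds.txt'  # illustration only - the recipe below keeps this in calibre's config_dir

# start each run with an empty file
open(seen_file, 'w').close()

def is_duplicate(title):
    # True if the title is already in the file; otherwise record it and return False
    with open(seen_file) as f:
        for line in f:
            if line.rstrip('\n') == title:
                return True
    with open(seen_file, 'a') as f:
        f.write(title + '\n')
    return False

# the second 'Otters return to the river' would be skipped
for title in ['Otters return to the river', 'Bees in decline', 'Otters return to the river']:
    if not is_duplicate(title):
        pass  # this is where the article would be downloaded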
Here it is implemented in the BBC Nature recipe (which always has repeats). I've also tried it in the Countryfile recipe and it seems to work there too.
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.constants import config_dir
import os

# temp file used to remember which article titles have already been seen,
# needed for getting rid of repeated articles across feeds
Feeds_File = os.path.join(config_dir, 'feeds.txt')

class AdvancedUserRecipe1339395836(BasicNewsRecipe):
    title = u'BBC Nature 3'
    cover_url = 'http://news.bbcimg.co.uk/img/3_0_0/cream/hi/nature/nature-blocks.gif'
    __author__ = 'Dave Asbury'
    description = 'Author D.Asbury. News From The BBC Nature Website'
    # last updated 7/10/12
    language = 'en_GB'
    oldest_article = 32
    max_articles_per_feed = 25
    remove_empty_feeds = True
    remove_javascript = True
    no_stylesheets = True
    auto_cleanup = True

    remove_tags = [
        dict(attrs={'class' : ['player']}),
    ]

    feeds = [
        (u'BBC Nature', u'http://feeds.bbci.co.uk/nature/rss.xml'),
        (u'BBC Nature Features', u'http://feeds.bbci.co.uk/nature/features/rss.xml'),
        (u'BBC Nature - Whats New', u'http://www.bbc.co.uk/nature/wildlife/by/updated.rss'),
    ]

    # start of code to get rid of duplicates
    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        print 'Feeds file =', Feeds_File

        # create (or empty) the titles file - you can't append to a file that doesn't exist
        open(Feeds_File, 'w').close()

        # repeat for all feeds
        for feed in feeds:
            print 'Feed section is', feed.title

            # for each article in each section, check whether its title is already in the file
            for article in feed.articles[:]:
                article_already_exists = False
                with open(Feeds_File) as read_file:
                    for line in read_file:
                        if line == article.title + '\n':
                            article_already_exists = True
                            print 'repeated article:', article.title
                            break

                if article_already_exists:
                    article.url = ''  # blank the url so calibre won't download it
                else:
                    # couldn't find the article, so record its title for later sections
                    with open(Feeds_File, 'a') as write_file:
                        write_file.write(article.title + '\n')
        return feeds
    # end of code to get rid of duplicates

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:medium;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
    '''
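The temp file is emptied at the start of every run, so it only de-duplicates within a single download anyway. If you wanted to avoid the file altogether, the same check could probably be done with an in-memory set inside parse_feeds. This is only a rough, untested sketch of that variation - the class name is made up and the title/feeds would be filled in as in the recipe above.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class DedupeWithSetRecipe(BasicNewsRecipe):
    # hypothetical minimal recipe, just to show the shape of the idea
    title = u'Dedupe example'
    feeds = []

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        seen_titles = set()
        for feed in feeds:
            for article in feed.articles:
                if article.title in seen_titles:
                    # already seen in an earlier section - blank the url so it isn't downloaded
                    article.url = ''
                else:
                    seen_titles.add(article.title)
        return feeds

Keeping the titles in a set avoids re-reading the file for every article, but the end result should be the same.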