Like others, I'm aware that some sites have links to the same article in more than one feed.
Here is an attempt at sorting this out, based on the ideas in the re-usable section.
Please bear in mind I am no programmer and have to google Python examples to put this together, so it's clunky and crude.
The basic idea is:

repeat
    open the txt file
    get the next article title from the feed
    is the article title already in the txt file?
        No - it's unique:
            download it
            append the article title to the txt file
        Yes - it must already be in a previous section:
            don't download it
            don't append it to the file (it's already in there)
until no more articles
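To show just that check on its own, here is a minimal sketch in plain Python. The file name, the is_duplicate helper and the sample titles are only for illustration; they are not part of the recipe below.

Code:
seen_file = 'feeds.txt'  # illustration only - the recipe below keeps this in calibre's config_dir

# start each run with an empty file
open(seen_file, 'w').close()

def is_duplicate(title):
    # True if the title is already in the file; otherwise record it and return False
    with open(seen_file) as f:
        for line in f:
            if line.rstrip('\n') == title:
                return True
    with open(seen_file, 'a') as f:
        f.write(title + '\n')
    return False

# the second 'Otters return to the river' would be skipped
for title in ['Otters return to the river', 'Bees in decline', 'Otters return to the river']:
    if not is_duplicate(title):
        pass  # this is where the article would be downloaded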
Here it is implemented in the BBC Nature recipe (which always has repeats). I've also tried it in the Countryfile recipe and it seems to work there too.
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.constants import config_dir
import os

# temp file used to remember which article titles have already been seen,
# needed for getting rid of repeated articles across feeds
Feeds_File = os.path.join(config_dir, 'feeds.txt')

class AdvancedUserRecipe1339395836(BasicNewsRecipe):
    title = u'BBC Nature 3'
    cover_url = 'http://news.bbcimg.co.uk/img/3_0_0/cream/hi/nature/nature-blocks.gif'
    __author__ = 'Dave Asbury'
    description = 'Author D.Asbury. News From The BBC Nature Website'
    # last updated 7/10/12
    language = 'en_GB'
    oldest_article = 32
    max_articles_per_feed = 25
    remove_empty_feeds = True
    remove_javascript = True
    no_stylesheets = True
    auto_cleanup = True

    remove_tags = [
        dict(attrs={'class' : ['player']}),
    ]

    feeds = [
        (u'BBC Nature', u'http://feeds.bbci.co.uk/nature/rss.xml'),
        (u'BBC Nature Features', u'http://feeds.bbci.co.uk/nature/features/rss.xml'),
        (u'BBC Nature - Whats New', u'http://www.bbc.co.uk/nature/wildlife/by/updated.rss'),
    ]

    # start of code to get rid of duplicates
    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        print 'Feeds file =', Feeds_File

        # create (or empty) the titles file - you can't append to a file that doesn't exist
        open(Feeds_File, 'w').close()

        # repeat for all feeds
        for feed in feeds:
            print 'Feed section is', feed.title

            # for each article in each section, check whether its title is already in the file
            for article in feed.articles[:]:
                article_already_exists = False
                with open(Feeds_File) as read_file:
                    for line in read_file:
                        if line == article.title + '\n':
                            article_already_exists = True
                            print 'repeated article:', article.title
                            break

                if article_already_exists:
                    article.url = ''  # blank the url so calibre won't download it
                else:
                    # couldn't find the article, so record its title for later sections
                    with open(Feeds_File, 'a') as write_file:
                        write_file.write(article.title + '\n')
        return feeds
    # end of code to get rid of duplicates

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:medium;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
    '''
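The temp file is emptied at the start of every run, so it only de-duplicates within a single download anyway. If you wanted to avoid the file altogether, the same check could probably be done with an in-memory set inside parse_feeds. This is only a rough, untested sketch of that variation - the class name is made up and the title/feeds would be filled in as in the recipe above.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class DedupeWithSetRecipe(BasicNewsRecipe):
    # hypothetical minimal recipe, just to show the shape of the idea
    title = u'Dedupe example'
    feeds = []

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        seen_titles = set()
        for feed in feeds:
            for article in feed.articles:
                if article.title in seen_titles:
                    # already seen in an earlier section - blank the url so it isn't downloaded
                    article.url = ''
                else:
                    seen_titles.add(article.title)
        return feeds

Keeping the titles in a set avoids re-reading the file for every article, but the end result should be the same.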