Articles repeated in different feed sections

10-07-2012, 08:22 AM   #1
scissors

Like others, I'm aware that some sites link to the same article from more than one feed.

Here is an attempt at sorting this out, based on the ideas in the re-usable code section.

Please bear in mind that I'm no programmer and have to Google Python examples to put this together, so it's clunky and crude.

The basic idea is:

repeat
    open a text file
    get the next article from the feed
    is the article title already in the text file?
        no - it's unique:
            download it
            append the article title to the text file
        yes - it must already appear in a previous section:
            don't download it
            don't append it to the file (it's already in there)
until there are no more articles

(A condensed sketch of this check follows.)
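Here is that check boiled down to a small helper, just to show the idea on its own (the helper name is_duplicate and the seen_file argument are my own labels; the full recipe below does the same thing inline):

Code:
import os

def is_duplicate(title, seen_file):
    # Return True if this title was already recorded during this download;
    # otherwise record it in the text file and return False.
    if os.path.exists(seen_file):
        with open(seen_file) as f:
            if title + '\n' in f:
                return True
    with open(seen_file, 'a') as f:
        f.write(title + '\n')
    return False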

Here it is implemented in the BBC Nature recipe (which always has repeats).
I've also tried it in Countryfile, and that also seems to work.

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.constants import config_dir
import os

# temp file used to remember which article titles have already been seen
# (needed for getting rid of repeated articles)
Feeds_File = os.path.join(config_dir, 'feeds.txt')

class AdvancedUserRecipe1339395836(BasicNewsRecipe):
    title = u'BBC Nature 3'
    cover_url = 'http://news.bbcimg.co.uk/img/3_0_0/cream/hi/nature/nature-blocks.gif'
    __author__ = 'Dave Asbury'
    description = 'Author D.Asbury. News From The BBC Nature Website'
    # last updated 7/10/12
    language = 'en_GB'
    oldest_article = 32
    max_articles_per_feed = 25
    remove_empty_feeds = True
    remove_javascript = True
    no_stylesheets = True
    auto_cleanup = True

    remove_tags = [
        dict(attrs={'class': ['player']}),
    ]

    feeds = [
        (u'BBC Nature', u'http://feeds.bbci.co.uk/nature/rss.xml'),
        (u'BBC Nature Features', u'http://feeds.bbci.co.uk/nature/features/rss.xml'),
        (u'BBC Nature - Whats New', u'http://www.bbc.co.uk/nature/wildlife/by/updated.rss'),
    ]

    # start of code to get rid of duplicates
    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        print 'Feeds file =', Feeds_File

        # start each download with an empty file - otherwise appending to a
        # file that doesn't exist would fail
        open(Feeds_File, 'w').close()

        # repeat for all feeds
        for feed in feeds:
            print 'Feed section is', feed.title
            # for each article in each section, check whether its title is
            # already in the feeds file
            for article in feed.articles[:]:
                article_already_exists = False
                read_file = open(Feeds_File)
                for line in read_file:
                    if line == article.title + '\n':
                        article_already_exists = True
                        print 'repeated article:', article.title
                        break
                read_file.close()

                if article_already_exists:
                    # seen in an earlier section - blank the url so it won't be downloaded
                    article.url = ''
                else:
                    # couldn't find the article, so record its title in the file
                    write_file = open(Feeds_File, 'a')
                    write_file.write(article.title + '\n')
                    write_file.close()
        return feeds
    # end of code to get rid of duplicates

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:medium;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
    '''
10-07-2012, 08:30 AM   #2
kovidgoyal
IIRC there's a technique for doing this in the recipes re-usable code sticky thread at the top of this forum.

https://www.mobileread.com/forums/sho...5&postcount=10
10-07-2012, 08:48 AM   #3
scissors
Quote:
Originally Posted by kovidgoyal
IIRC there's a technique for doing this in the recipes re-usable code sticky thread at the top of this forum.

https://www.mobileread.com/forums/sho...5&postcount=10

Hi Kovid,

I tried it.

That technique checks whether an article in a section has been downloaded in the past, and it creates a file for each section.

So if article x is downloaded in section 1, it will never be downloaded into section 1 again.

However, if article x is in section 1 in this month's issue, it can still appear in section 2 in the same month: the same article in two sections.

My way means no article is repeated in any section within a single download.

(At least, that's what happened for me.) I hope that's the case, because it took me all morning (I'm not a programmer).

10-07-2012, 12:44 PM   #4
kovidgoyal
Do you want duplicates to be removed across downloads, or only across feeds in a single download? If the latter, then you don't need to use a file; just store the titles in memory.

And generally speaking, you should use URLs as the key to check for duplicates, not the title. You're less likely to get false duplicates that way.
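For the single-download case, a minimal sketch of what that could look like in a recipe's parse_feeds (the class name and feed URLs are placeholders; keying on article.url and dropping repeats from the feed's article list follows the suggestion above, not a fixed calibre API):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class DedupExample(BasicNewsRecipe):
    title = 'Dedup example'
    feeds = [
        ('Section one', 'http://example.com/one/rss.xml'),  # placeholder feeds
        ('Section two', 'http://example.com/two/rss.xml'),
    ]

    def parse_feeds(self):
        # Remove duplicates across feeds within this single download,
        # keyed on the article URL rather than the title.
        feeds = BasicNewsRecipe.parse_feeds(self)
        seen_urls = set()
        for feed in feeds:
            for article in feed.articles[:]:  # iterate over a copy so removal is safe
                if article.url in seen_urls:
                    feed.articles.remove(article)
                else:
                    seen_urls.add(article.url)
        return feeds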
10-07-2012, 01:06 PM   #5
scissors
Quote:
Originally Posted by kovidgoyal
Do you want duplicates to be removed across downloads, or only across feeds in a single download? If the latter, then you don't need to use a file; just store the titles in memory.

And generally speaking, you should use URLs as the key to check for duplicates, not the title. You're less likely to get false duplicates that way.
Only within a single download.
Yes, I realise there's a risk of false duplicates and that URLs should be used as the key.
I used a file because it was a way I could achieve what I wanted.

The recipes I knock together are the whole of my Python knowledge and experience.
I am not a programmer, so feel free (anyone) to make them better.
10-07-2012, 01:21 PM   #6
kovidgoyal
I added an API to the BasicNewsRecipe class to do this. See http://bazaar.launchpad.net/~kovid/c...revision/13440

I haven't tested it; you can do that yourself after the next calibre release. All you need to do is add

ignore_duplicate_articles = {'title'}

to your recipe.
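In a recipe, that would look something like this (the class name and feed URLs are placeholders; the attribute itself is the one described above):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class IgnoreDupesExample(BasicNewsRecipe):
    title = 'Ignore duplicates example'

    # Drop articles whose title duplicates one already seen earlier in
    # this download; {'title', 'url'} can also be used (see the next post).
    ignore_duplicate_articles = {'title'}

    feeds = [
        ('Section one', 'http://example.com/one/rss.xml'),  # placeholder feeds
        ('Section two', 'http://example.com/two/rss.xml'),
    ]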
10-17-2012, 01:54 PM   #7
_reader
Hi Kovid, Scissors,

I added
Code:
ignore_duplicate_articles = {'title', 'url'}
to the RichmondTimesDispatch recipe, which gets lots of repeated articles in different feeds. WOW! It works great.

Kovid, thanks for giving us this useful addition, and of course THANKS for Calibre.

10-19-2012, 11:02 AM   #8
scissors
Quote:
Originally Posted by _reader
Hi Kovid, Scissors,

I added
Code:
ignore_duplicate_articles = {'title', 'url'}
to the RichmondTimesDispatch recipe, which gets lots of repeated articles in different feeds. WOW! It works great.

Kovid, thanks for giving us this useful addition, and of course THANKS for Calibre.
Yes, it's good. But at the same time it's a bit depressing that code that took me half a day to cobble together was superseded in minutes...

Oh well, that's progress.
10-19-2012, 11:27 AM   #9
kovidgoyal
Writing your own code is the best way to learn. That's how I learned to program.