Old 02-18-2012, 12:29 PM   #1
rjgrigaitis
First Article Repeated for Friday Fax

I'm new to Calibre and Python, and I've only read up to Chapter 6 of the Python Tutorial. I'm a C++ programmer who has done almost nothing but PHP programming for the last seven years, so I hardly know what I'm doing with Calibre recipes.

This is my first complex recipe:

Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class AdvancedUserRecipe1328808344(BasicNewsRecipe):
    title          = u'C-Fam Friday Fax'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = True

    def parse_index(self):
        soup = self.index_to_soup('http://www.c-fam.org/fridayfax/')
        articles = []
        feeds = []

        for div in soup.findAll('div'):
            a = div.find('a', href=True, attrs={'class':'ffArchiveLink'})
            if not a:
                continue

            url = 'http://www.c-fam.org/' + a['href']
            title = ''.join(a.findAll(text=True, recursive=False)).strip()
            i = div.find('i')
            if not i:
                pubdate = strftime('%a, %d %b')
            else:
                pubdate = ''.join(i.findAll(text=True, recursive=False)).strip()

            description = ''
            articles.append({'title': title,
                             'url': url,
                             'date': pubdate,
                             'description': description})

        feeds.append((self.title, articles))

        return feeds
The first article gets repeated 3 times, though. Therefore I added this code inside the loop:

Code:
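            # Helper: collect the URLs of the articles gathered so far
            # (despite the name, it returns a list, not a set)
            def getSetURL(articles):
                ans = []
                for article in articles:
                    ans.append(article['url'])
                return ans

            url = 'http://www.c-fam.org/' + a['href']
            # Skip this link if its URL has already been added
            if url in getSetURL(articles):
                continue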
I'm sure this code shouldn't be necessary, but I can't figure out how to get rid of the repeats of the first article without it. What am I doing wrong in the original code? If nothing is wrong, is the code I added the best way to get rid of the repeated articles?
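
For comparison, here is a minimal standalone sketch of the same URL-based de-duplication, using a set of URLs already seen so that each check doesn't rebuild a list. The function name `dedupe_by_url` and the sample data are hypothetical and not part of Calibre; the sketch only assumes each article is a dict with a 'url' key, as in the recipe above. It works around the repeats rather than explaining their cause.

```python
def dedupe_by_url(articles):
    """Return the articles with repeated URLs dropped, keeping the
    first occurrence of each URL and preserving the original order."""
    seen = set()      # URLs encountered so far
    unique = []       # articles kept, in original order
    for article in articles:
        url = article['url']
        if url in seen:
            continue  # same URL was already kept; skip the repeat
        seen.add(url)
        unique.append(article)
    return unique


# Hypothetical sample data: the first article appears three times.
sample = [
    {'title': 'First', 'url': 'http://example.com/a'},
    {'title': 'First', 'url': 'http://example.com/a'},
    {'title': 'First', 'url': 'http://example.com/a'},
    {'title': 'Second', 'url': 'http://example.com/b'},
]
deduped = dedupe_by_url(sample)
```

In the recipe, something like this could presumably be applied once to `articles` just before `feeds.append((self.title, articles))`, instead of checking inside the loop.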