11-05-2012, 02:38 PM   #1
dncohen
Device: kindle fire
Seeking help with simple recipe for seedmagazine.com

Hi All,

This is my first recipe and my first Python code, which may explain any stupid questions.

I'm trying to emulate existing recipes to get articles from a site that has no RSS feed. In this case, http://www.seedmagazine.com.

I've looked at their source HTML and, so far as I understand it, to parse the index I want every link on the page that goes to an article. That means a URL that starts with http://seedmagazine.com/content/article/... (Actually, I want the print version of those articles, which is a pretty easy substitution.)

I'm attaching my current recipe. It almost works, but instead of getting all the article links on the main page, it gets only the first two. I can't seem to figure out why. Shouldn't soup.findAll('a') return all the anchor tags on the page?

I'd appreciate any advice to get past that problem. And any advice in general because I really don't know how to put the finishing touches on this recipe.

Thanks! -Dave

Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup


class seedmagazine(BasicNewsRecipe):
    title = u'Seed Magazine'
    description = u'seedmagazine.com'
    
    oldest_article = 31
    max_articles_per_feed = 5 # keep this number small until recipe works


    def parse_index(self):
        articles = []
        feeds = []
        seen = set([])
        
        soup = self.index_to_soup('http://www.seedmagazine.com')

        for link in soup.findAll('a'):
            url = link['href']
            title = self.tag_to_string(link)
            
            if title and url.find('/content/article/') > 0:
                articles.append({'title': title,
                                 'url': self.print_version(url),
                                 })

        if articles:
            feeds.append((self.title, articles))

        return feeds

    def print_version(self, url):
        return url.replace('/article/', '/print/')
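
A standalone sketch of the link filter described above, written so it runs without calibre. It is not a confirmed fix for the two-link result. It only illustrates three things worth checking in the loop: `link['href']` raises KeyError on an anchor that has no href, `url.find('/content/article/') > 0` rejects a relative href that begins with `/content/article/` (find returns 0 there), and the `seen` set is never used. The names `collect_articles`, `BASE_URL`, and `ARTICLE_PATH` are hypothetical, and the (href, title) tuples are a stand-in for what `soup.findAll('a')` would yield.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

BASE_URL = 'http://seedmagazine.com'  # site root, taken from the post
ARTICLE_PATH = '/content/article/'    # path fragment that marks an article link


def collect_articles(anchors, base_url=BASE_URL):
    """Filter (href, title) pairs down to unique article entries.

    `anchors` is a hypothetical stand-in for the anchors on the index page:
    an iterable of (href, title) tuples, where href may be None.
    """
    articles = []
    seen = set()
    for href, title in anchors:
        if not href or not title:
            continue  # anchor without an href or without link text
        if ARTICLE_PATH not in href:
            continue  # 'in' also matches when the path starts at index 0
        url = urljoin(base_url, href)  # make relative links absolute
        url = url.replace('/article/', '/print/')  # same swap as print_version
        if url in seen:
            continue  # skip links that repeat on the page
        seen.add(url)
        articles.append({'title': title.strip(), 'url': url})
    return articles
```

Inside parse_index, the pairs could be built as `(link.get('href'), self.tag_to_string(link))` for each anchor, assuming the print URLs really do follow the `/print/` pattern used in print_version.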