MobileRead Forums - View Single Post - Seeking help with simple recipe for seedmagazine.com

dncohen · 11-05-2012, 03:38 PM

Hi All,

This is my first recipe and my first Python code, which may explain any stupid questions.

I'm trying to emulate existing recipes to get articles from a site that has no RSS feed. In this case, http://www.seedmagazine.com.

I've looked at their source HTML and, so far as I understand it, to parse the index I want every link on the page that goes to an article. That means a URL that starts with http://seedmagazine.com/content/article/... (Actually, I want the print version of those articles, which is a pretty easy substitution.)
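
To make the rule concrete, here is a minimal standalone sketch of the filter and the print substitution in plain Python, with no calibre involved. The helper names `is_article_url` and `to_print_url` and the sample URLs are made up for illustration; they are not real article paths. The substring test is also a bit looser than "starts with", which is an assumption on my part so that www and non-www links are both accepted.

```python
ARTICLE_MARKER = '/content/article/'

def is_article_url(url):
    """Return True if the URL contains the article path segment."""
    return ARTICLE_MARKER in url

def to_print_url(url):
    """Swap the first '/article/' segment for '/print/' (same substitution as the recipe)."""
    return url.replace('/article/', '/print/', 1)

# Hypothetical sample links, invented for this sketch.
sample_links = [
    'http://seedmagazine.com/content/article/example_one/',
    'http://seedmagazine.com/about/',
]

# Keep only article links and convert each one to its print form.
print_urls = [to_print_url(u) for u in sample_links if is_article_url(u)]
print(print_urls)
```

Only the first sample link passes the filter, and it comes out with `/content/print/` in place of `/content/article/`.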

I'm attaching my current recipe. It almost works, but instead of getting all the article links on the main page, it gets only the first two. I can't seem to figure out why. Shouldn't soup.findAll('a') return all the anchor tags on the page?
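
One way to check this outside calibre is a rough diagnostic sketch using only the standard library's `html.parser` (Python 3). The `AnchorCollector` class and the sample HTML are invented for illustration; the real page will differ. It collects the href of every anchor and records what `str.find` returns for each one. `find` returns 0 for a match at the very start of the string and -1 for no match, so a `> 0` test would skip a relative link that begins with `/content/article/`. I don't know whether that is what happens on the live page; it is just one thing worth ruling out.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper, not a calibre API)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; href may be missing.
            self.hrefs.append(dict(attrs).get('href'))

# Invented sample markup: one absolute link, one relative link,
# one anchor without href, and one non-article link.
SAMPLE_HTML = """
<a href="http://seedmagazine.com/content/article/absolute_example/">Absolute</a>
<a href="/content/article/relative_example/">Relative</a>
<a name="top">No href</a>
<a href="/about/">About</a>
"""

collector = AnchorCollector()
collector.feed(SAMPLE_HTML)

MARKER = '/content/article/'
report = []
for href in collector.hrefs:
    # None when the anchor has no href; otherwise the index of the marker.
    position = href.find(MARKER) if href else None
    report.append((href, position))
    print(href, position)
```

In this made-up sample, all four anchors are collected, but only the absolute link gets a position greater than zero.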

I'd appreciate any advice to get past that problem. And any advice in general because I really don't know how to put the finishing touches on this recipe.

Thanks! -Dave

Code:

import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup


class seedmagazine(BasicNewsRecipe):
    title = u'Seed Magazine'
    description = u'seedmagazine.com'
    
    oldest_article = 31
    max_articles_per_feed = 5 # keep this number small until recipe works


    def parse_index(self):
        articles = []
        feeds = []
        seen = set([])
        
        soup = self.index_to_soup('http://www.seedmagazine.com')

        for link in soup.findAll('a'):
            url = link['href']
            title = self.tag_to_string(link)
            
            if (title and url.find('/content/article/') > 0) :
                articles.append({'title': title,
                                 'url': self.print_version(url),
                                 })

        if (articles):
            feeds.append((self.title, articles))

        return feeds

    def print_version(self, url):
        return url.replace('/article/', '/print/')
