Hi All,
This is my first recipe and my first Python code, which may explain any stupid questions.
I'm trying to emulate existing recipes to get articles from a site that has no RSS feed. In this case,
http://www.seedmagazine.com.
I've looked at their source HTML and, as far as I understand it, to parse the index I want every link on the page that goes to an article. That means a URL that starts with
http://seedmagazine.com/content/article/... (Actually, I want the print version of those articles, which is a pretty easy substitution.)
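To make that concrete, here is a rough standalone sketch of the filtering and substitution I have in mind. It uses Python 3's standard html.parser instead of calibre's BeautifulSoup, and the class name, helper names, and sample HTML are all made up for illustration; it is not the recipe itself.

```python
from html.parser import HTMLParser

# Path fragment taken from the article URLs described above.
ARTICLE_MARKER = '/content/article/'


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, skipping anchors that have none."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def print_version(url):
    # The substitution mentioned above: /article/ becomes /print/.
    return url.replace('/article/', '/print/')


def article_print_urls(html):
    """Return print-version URLs for every article link found in html."""
    collector = AnchorCollector()
    collector.feed(html)
    return [print_version(h) for h in collector.hrefs if ARTICLE_MARKER in h]


# Made-up sample page, not the real seedmagazine.com markup.
SAMPLE_HTML = '''
<a name="top">Top</a>
<a href="http://seedmagazine.com/content/article/first_story/">First</a>
<a href="http://seedmagazine.com/about/">About</a>
<a href="/content/article/second_story/">Second</a>
'''

print(article_print_urls(SAMPLE_HTML))
```

On the sample page this keeps the two article links, including the relative one, and ignores both the "About" link and the anchor that has no href at all.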
I'm attaching my current recipe. It almost works, but instead of getting all the article links on the main page, it gets only the first two. I can't seem to figure out why. Shouldn't soup.findAll('a') return all the anchor tags on the page?
I'd appreciate any advice on getting past that problem, and any advice in general, because I really don't know how to put the finishing touches on this recipe.
Thanks! -Dave
Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class seedmagazine(BasicNewsRecipe):
    title = u'Seed Magazine'
    description = u'seedmagazine.com'
    oldest_article = 31
    max_articles_per_feed = 5 # keep this number small until recipe works

    def parse_index(self):
        articles = []
        feeds = []
        seen = set([])
        soup = self.index_to_soup('http://www.seedmagazine.com')
        for link in soup.findAll('a'):
            url = link['href']
            title = self.tag_to_string(link)
            if (title and url.find('/content/article/') > 0) :
                articles.append({'title': title,
                                 'url': self.print_version(url),
                                 })
        if (articles):
            feeds.append((self.title, articles))
        return feeds

    def print_version(self, url):
        return url.replace('/article/', '/print/')
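One thing the recipe does not do yet is use the seen set it declares. I assume the point of such a set would be to skip article links that appear more than once on the page; here is a standalone sketch of that idea, with a hypothetical helper name and made-up URLs.

```python
def unique_in_order(urls):
    """Return urls with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        if url in seen:
            continue  # already collected this link, skip the repeat
        seen.add(url)
        unique.append(url)
    return unique


# Made-up URLs for illustration only.
sample = [
    'http://seedmagazine.com/content/print/story_a/',
    'http://seedmagazine.com/content/print/story_b/',
    'http://seedmagazine.com/content/print/story_a/',
]
print(unique_in_order(sample))
```

In the recipe, the same check would presumably go inside the loop, just before articles.append.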