11-09-2009, 06:40 PM   #858
evanmaastrigt
Need some help with custom recipe

Hi,

My Python is, after 8 years, a little rusty. But I like Calibre and its concept of plug-in recipes, so I gave it a try and produced the following recipe:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class FokkeEnSukkeRecipe(BasicNewsRecipe) :
	title          = u'Fokke en Sukke'
	no_stylesheets = True
	INDEX = 'http://foksuk.nl'
	
	keep_only_tags = [dict(name='div', attrs={'class' : 'cartoon'})]
	remove_tags = [dict(name = 'div', attrs = {'class' : 'selectcartoon'})]
	
	def parse_index(self) :
		dayNames = ['maandag', 'dinsdag', 'woensdag', 'donderdag', 'vrijdag', 'zaterdag & zondag']
		soup = self.index_to_soup(self.INDEX)
		
		index = soup.find('div', attrs={'class' : 'selectcartoon'})
		links = index.findAll('a')
		articles = []
		# Skip the first link; keep only the links whose text is a day name.
		for link in links[1:] :
			title = link.renderContents()
			if title in dayNames :
				article = {'title' : title, 'date' : u'', 'url'  : self.INDEX + link['href'], 'description' : ''}
				articles.append(article)
					
		week = index.find('span', attrs={'class' : 'week'}).renderContents()
		
		return [[week, articles]]
					
	def preprocess_html(self, soup) :
		cartoon = soup.find('div', attrs={'class' : 'cartoon'})
		if cartoon :
			return cartoon
		else :
			return soup
This actually seems to work, which is nice, although it is not completely finished yet. Before I continue, though, I would like to know why it works. If I comment out the preprocess_html() override, it can no longer find the cartoons I'm after, which I don't really understand.

What I'm doing here is maybe a little unusual. For the index I parse a web page. The returned list of articles has URLs that point to pages similar to the index page; the only difference is that the div with a CSS class of 'cartoon' contains a different image for every article.
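
To make the shape of that index concrete, here is a stripped-down sketch that runs without Calibre. It only mimics the structure my parse_index() returns. The build_index() helper, the (text, href) pairs, the hrefs and the week label are all made up for illustration; the real values come from the 'selectcartoon' div.

```python
INDEX = 'http://foksuk.nl'
DAY_NAMES = ['maandag', 'dinsdag', 'woensdag', 'donderdag', 'vrijdag', 'zaterdag & zondag']


def build_index(week, links):
    """Return the [[feed_title, articles]] structure that parse_index() produces.

    `links` is a list of (text, href) pairs standing in for the <a> tags
    found in the 'selectcartoon' div (hypothetical stand-in data).
    """
    articles = []
    for text, href in links[1:]:  # the first link is skipped, as in the recipe
        if text in DAY_NAMES:
            articles.append({
                'title': text,
                'date': u'',
                'url': INDEX + href,  # every article URL is a page on the same site
                'description': '',
            })
    return [[week, articles]]


# Hypothetical sample data; the real hrefs on the site may look different.
sample_links = [
    ('vorige week', '/week/1'),
    ('maandag', '/dag/1'),
    ('archief', '/archief'),
    ('dinsdag', '/dag/2'),
]
feeds = build_index('week 46', sample_links)
```

So the result is a single feed (the week) with one article per day, and every article URL points back into the same site as the index page.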

My theory is that Calibre, after receiving my custom index, tries to parse all the URLs and bombs out because that causes a lot of recursion. Implementing preprocess_html() somehow stops that.

But as I said, my Python is rusty. So if anyone could give me some pointers, I would greatly appreciate it.

Edwin