MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

evanmaastrigt · 11-09-2009, 06:40 PM

Hi,

My Python is, after 8 years, a little rusty. But I like Calibre and its concept of plug-in recipes, so I gave it a try and produced the following recipe:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class FokkeEnSukkeRecipe(BasicNewsRecipe):
    title          = u'Fokke en Sukke'
    no_stylesheets = True
    INDEX          = 'http://foksuk.nl'

    # Keep only the cartoon itself and drop the cartoon-selection menu.
    keep_only_tags = [dict(name='div', attrs={'class': 'cartoon'})]
    remove_tags    = [dict(name='div', attrs={'class': 'selectcartoon'})]

    def parse_index(self):
        day_names = ['maandag', 'dinsdag', 'woensdag', 'donderdag', 'vrijdag', 'zaterdag & zondag']
        soup = self.index_to_soup(self.INDEX)

        # The 'selectcartoon' div holds the links to the individual cartoons.
        index = soup.find('div', attrs={'class': 'selectcartoon'})
        links = index.findAll('a')

        articles = []
        for link in links[1:]:  # the first link is always skipped
            title = link.renderContents()
            if title in day_names:
                articles.append({'title': title, 'date': u'', 'url': self.INDEX + link['href'], 'description': ''})

        week = index.find('span', attrs={'class': 'week'}).renderContents()

        # One section (the week) containing one article per matching link.
        return [[week, articles]]

    def preprocess_html(self, soup):
        # Reduce each article page to the cartoon div, if there is one.
        cartoon = soup.find('div', attrs={'class': 'cartoon'})
        if cartoon:
            return cartoon
        return soup

This actually seems to work, which is nice, although it is not completely finished yet. Before I continue, I'd like to know why it works: if I comment out the preprocess_html() override, it can no longer find the cartoons I'm after, and I don't really understand why.

What I'm doing here is maybe a little weird. For the index I parse a web page. The articles in the returned list have URLs that point to pages similar to the index page; the only difference is that the div with the CSS class 'cartoon' contains a different image for every article.
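To make the shape of that index concrete, here is a minimal sketch of what my parse_index() ends up returning. It runs without Calibre. The build_index() helper, the week label, the link texts and the hrefs are all made up for illustration and are not taken from the real site.

```python
# Standalone sketch (no calibre import): mimics the data shape that
# parse_index() returns in the recipe. All sample values are invented.

DAY_NAMES = ['maandag', 'dinsdag', 'woensdag', 'donderdag', 'vrijdag',
             'zaterdag & zondag']
INDEX = 'http://foksuk.nl'


def build_index(week, links):
    """Return [(section_title, [article_dict, ...])] from (text, href) pairs.

    Hypothetical helper: 'links' stands in for the <a> tags found in the
    'selectcartoon' div.
    """
    articles = []
    for text, href in links[1:]:  # skip the first link, as the recipe does
        if text in DAY_NAMES:
            articles.append({'title': text, 'date': '',
                             'url': INDEX + href, 'description': ''})
    return [(week, articles)]


# Hypothetical link texts and hrefs, only here to exercise the function.
sample_links = [
    ('vorige week', '/?ctime=0'),
    ('maandag', '/?ctime=1'),
    ('dinsdag', '/?ctime=2'),
    ('archief', '/archief'),
]

feeds = build_index('week 45', sample_links)
for section, articles in feeds:
    print(section)
    for article in articles:
        print('  %s -> %s' % (article['title'], article['url']))
```

So the result is a single section (the week) holding one article per day link, and every article URL is just the index URL plus the link's href.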

My theory is that Calibre, after receiving my custom index, tries to parse all the URLs and bombs out because that causes a lot of recursion. Implementing preprocess_html() somehow stops that.

But as I said, my Python is rusty, so if anyone could give me some pointers I would greatly appreciate it.

Edwin

Post #858 · Need some help with custom recipe · evanmaastrigt (Connoisseur) · Posts: 78 · Karma: 192 · Join Date: Nov 2009 · Device: Sony PRS-600