Hi,
My Python is, after 8 years, a little rusty. But I like Calibre and its concept of plug-in recipes, so I gave it a try and produced the following recipe:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class FokkeEnSukkeRecipe(BasicNewsRecipe) :
    title = u'Fokke en Sukke'
    no_stylesheets = True
    INDEX = 'http://foksuk.nl'
    keep_only_tags = [dict(name='div', attrs={'class' : 'cartoon'})]
    remove_tags = [dict(name = 'div', attrs = {'class' : 'selectcartoon'})]

    def parse_index(self) :
        dayNames = ['maandag', 'dinsdag', 'woensdag', 'donderdag', 'vrijdag', 'zaterdag & zondag']
        soup = self.index_to_soup(self.INDEX)
        index = soup.find('div', attrs={'class' : 'selectcartoon'})
        links = index.findAll('a')
        maxIndex = len(links) - 1
        articles = []
        for i in range(len(links)) :
            if i == 0 :
                continue
            if links[i].renderContents() in dayNames :
                article = {'title' : links[i].renderContents(), 'date' : u'', 'url' : self.INDEX + links[i]['href'], 'description' : ''}
                articles.append(article)
        week = index.find('span', attrs={'class' : 'week'}).renderContents()
        return [[week, articles]]

    def preprocess_html(self, soup) :
        cartoon = soup.find('div', attrs={'class' : 'cartoon'})
        if cartoon :
            return cartoon
        else :
            return soup
Now this actually seems to work, which is nice. It is not completely finished yet, but before I continue I'd like to know why it works. If I comment out the preprocess_html() override, it can no longer find the cartoons I'm after, which I don't really understand.
What I'm doing here may be a little unusual. For the index I parse a web page. The articles in the returned list have URLs that point to pages similar to the index page; the only difference is that the div with the CSS class 'cartoon' contains a different image for every article.
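To make the shape of what parse_index() hands back concrete, here is a minimal stand-alone sketch that runs without Calibre. The helper name build_index and the sample link texts and hrefs are made up purely for illustration; only the layout (a list of sections, each a section title plus a list of article dicts) mirrors the recipe above.

```python
# Stand-alone sketch, no Calibre needed. build_index and sample_links
# are hypothetical and exist only to illustrate the data structure.
INDEX = 'http://foksuk.nl'
DAY_NAMES = ['maandag', 'dinsdag', 'woensdag', 'donderdag', 'vrijdag',
             'zaterdag & zondag']


def build_index(week, links):
    """Turn (text, href) pairs into the structure parse_index() returns.

    As in the recipe, the first link is skipped and only links whose
    text is one of the day names become articles.
    """
    articles = []
    for text, href in links[1:]:
        if text in DAY_NAMES:
            articles.append({
                'title': text,
                'date': '',
                'url': INDEX + href,
                'description': '',
            })
    # One section (the week) holding all of that week's cartoons.
    return [[week, articles]]


# Made-up links standing in for the <a> tags inside div.selectcartoon.
sample_links = [
    ('vorige week', '/hypothetical/prev'),
    ('maandag', '/hypothetical/1'),
    ('dinsdag', '/hypothetical/2'),
    ('archief', '/hypothetical/archive'),
]
feeds = build_index('week 1', sample_links)
```

So every article URL is just the index URL plus a relative href, which is why each article page looks almost the same as the index page itself.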
My theory is that Calibre, after receiving my custom index, tries to parse all the URLs and bombs out because that causes a lot of recursion. Implementing preprocess_html() somehow stops that.
But as I said, my Python is rusty. So if anyone could give me some pointers, I would greatly appreciate it.
Edwin