My first (non-RSS) recipe is almost working. It is a little odd because it is not pulling up an article per se but IMDB search results.
Running with auto_cleanup set to False made me appreciate what auto_cleanup does. Now that I have it turned on, I find I need auto_cleanup_keep, but for some reason the line
Code:
auto_cleanup_keep = "//div[@id='titleStoryLine']"
seems to also keep the DidYouKnow section, even though DidYouKnow has its own ID (titleDidYouKnow). I tried other DIVs that I thought would work, and I also tried linearize_tables, but it didn't seem to do anything. Any suggestions?
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class IMDBTopMoviesHi9999(BasicNewsRecipe):
    language = 'en'
    __author__ = 'ireadtheinternet'
    oldest_article = 999999
    max_articles_per_feed = 9999
    no_stylesheets = True
    no_javascript = True
    auto_cleanup = True
    auto_cleanup_keep = "//div[@id='titleStoryLine']"
    # For some unknown reason this also keeps //div[@id='titleDidYouKnow']
    conversion_options = {'linearize_tables' : True}

    def parse_index(self):
        toc = self.index_to_soup('http://www.imdb.com/search/title?languages=hi&num_votes=500,&production_status=released&sort=year,desc&title_type=feature&user_rating=7.0,10')
        articles = []
        for row in toc.findAll('td', attrs={'class':'title'}):
            release_year = self.tag_to_string(row.find('span', attrs={'class':'year_type'}))
            link = row.find('a')
            title = self.tag_to_string(link) + " " + release_year
            url = 'http://www.imdb.com' + link['href']
            self.log('Found article:', link)
            self.log('\t', url)
            articles.append({'title':title, 'url':url, 'description':''})
        self.title = self.tag_to_string(toc.find('h1'))
        # only one section, so...
        return [(self.title, articles)]
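
One way to narrow this down is to check, outside calibre, whether the XPath itself selects only the storyline div. The sketch below is just an illustration under assumptions: the HTML fragment is made up (it is not IMDB's real markup), the helper names matched_ids and contains_id are hypothetical, and it uses the standard library's ElementTree, which supports only a subset of XPath, so the expression is written in relative form (".//div[...]"). If the expression matches a single element that does not contain the DidYouKnow div, that would suggest the extra section is being kept by the cleanup step rather than selected by the XPath.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment for illustration only; not IMDB's actual markup.
SAMPLE = """
<html><body>
  <div id="main">
    <div id="titleStoryLine"><p>Storyline text</p></div>
    <div id="titleDidYouKnow"><p>Trivia text</p></div>
  </div>
</body></html>
""".strip()


def matched_ids(markup, xpath):
    """Return the id attribute of every element the XPath selects."""
    root = ET.fromstring(markup)
    return [el.get('id') for el in root.findall(xpath)]


def contains_id(markup, xpath, wanted_id):
    """True if any matched element is, or contains, an element with wanted_id."""
    root = ET.fromstring(markup)
    for el in root.findall(xpath):
        # el.iter() yields el itself followed by all of its descendants
        if any(d.get('id') == wanted_id for d in el.iter()):
            return True
    return False


keep = ".//div[@id='titleStoryLine']"
print(matched_ids(SAMPLE, keep))
print(contains_id(SAMPLE, keep, 'titleDidYouKnow'))
```

To try it against a real page, the same two checks could be run on a saved copy of one title page, though real-world HTML is often not well-formed XML and may need a more forgiving parser than ElementTree.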