10-26-2014, 01:48 PM   #1
ireadtheinternet
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
auto_cleanup_keep keeping too much

My first (non-RSS) recipe is almost working. It is a little unusual because it pulls IMDb search results rather than articles per se.

Running with auto_cleanup set to False made me appreciate what auto_cleanup does. Now that it is enabled, I find I need auto_cleanup_keep, but for some reason the line
Code:
auto_cleanup_keep = "//div[@id='titleStoryLine']"
seems to keep the DidYouKnow section as well, even though that section has its own ID (titleDidYouKnow). I tried other DIVs that I thought would also work, and I also tried linearize_tables, but it didn't seem to do anything. Any suggestions?
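To rule out the XPath itself, here is a minimal sketch that evaluates the same expression outside calibre, using the standard library's xml.etree.ElementTree. The sample markup is a made-up, heavily simplified stand-in for an IMDb title page: only the two IDs come from my recipe, and the nesting is an assumption. As far as I know calibre uses its own HTML parser for this, so it only checks the expression, not calibre's behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an IMDb title page.
# Only the two ids come from the recipe; the nesting is an assumption.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="main">
      <div id="titleStoryLine"><p>Storyline text</p></div>
      <div id="titleDidYouKnow"><p>Trivia text</p></div>
    </div>
  </body>
</html>
"""


def matched_ids(markup, xpath):
    """Return the id attribute of every element that xpath selects."""
    root = ET.fromstring(markup)
    return [element.get("id") for element in root.findall(xpath)]


# ElementTree needs a leading "." to search relative to the root element.
kept = matched_ids(SAMPLE_PAGE, ".//div[@id='titleStoryLine']")
print(kept)
```

On this stand-in the expression selects only the storyline div. If the same holds on the real page, the extra section is probably being kept by the cleanup heuristics rather than by the XPath, though that is only a guess on my part.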

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class IMDBTopMoviesHi9999(BasicNewsRecipe):
    language       = 'en'
    __author__     = 'ireadtheinternet'
    oldest_article = 999999
    max_articles_per_feed = 9999
    no_stylesheets = True
    remove_javascript = True  # BasicNewsRecipe names this option remove_javascript, not no_javascript
    auto_cleanup = True
    auto_cleanup_keep = "//div[@id='titleStoryLine']"
    # For some unknown reason this also keeps //div[@id='titleDidYouKnow']
    conversion_options = {'linearize_tables' : True}


    def parse_index(self):
        toc = self.index_to_soup('http://www.imdb.com/search/title?languages=hi&num_votes=500,&production_status=released&sort=year,desc&title_type=feature&user_rating=7.0,10')

        articles = []
        for row in toc.findAll('td', attrs={'class':'title'}):
            release_year = self.tag_to_string(row.find('span', attrs={'class':'year_type'}))
            link = row.find('a')
            title = self.tag_to_string(link) + " " + release_year
            url = 'http://www.imdb.com' + link['href']
            self.log('Found article:', title)  # log the readable title, not the raw <a> tag
            self.log('\t', url)
            articles.append({'title':title, 'url':url, 'description':''})
        self.title = self.tag_to_string(toc.find('h1'))
            
        # only one section, so...    
        return [(self.title, articles)]