auto_cleanup_keep keeping too much

ireadtheinternet · 10-26-2014, 02:48 PM

My first (non-RSS) recipe is almost working. It is a little odd because it is not pulling up an article per se but IMDB search results.

When I had auto_cleanup set to False, it made me appreciate the value of auto_cleanup. So now when I use it, I find I need auto_cleanup_keep, but for some reason the line

Code:

auto_cleanup_keep = "//div[@id='titleStoryLine']"

seems to grab the DidYouKnow section even though DidYouKnow has its own ID. I tried different DIVs that I thought would also work, and I tried linearize_tables, but it didn't seem to do anything. Any suggestions?

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class IMDBTopMoviesHi9999(BasicNewsRecipe):
    language       = 'en'
    __author__     = 'ireadtheinternet'
    oldest_article = 999999
    max_articles_per_feed = 9999
    no_stylesheets = True
    no_javascript = True
    auto_cleanup = True
    auto_cleanup_keep = "//div[@id='titleStoryLine']" 
    # For some unknown reason this also keeps //div[@id='titleDidYouKnow']
    conversion_options = {'linearize_tables' : True}


    def parse_index(self):
        toc = self.index_to_soup('http://www.imdb.com/search/title?languages=hi&num_votes=500,&production_status=released&sort=year,desc&title_type=feature&user_rating=7.0,10')

        articles = []
        for row in toc.findAll('td', attrs={'class':'title'}):
            release_year = self.tag_to_string(row.find('span', attrs={'class':'year_type'}))
            link = row.find('a')
            title = self.tag_to_string(link) + " " + release_year
            url = 'http://www.imdb.com' + link['href']
            self.log('Found article:', title)
            self.log('\t', url)
            articles.append({'title':title, 'url':url, 'description':''})
        self.title = self.tag_to_string(toc.find('h1'))
            
        # only one section, so...    
        return [(self.title, articles)]

kovidgoyal · 10-26-2014, 03:59 PM

auto_cleanup is a little hard to control. You'll probably find it easier to just use keep_only_tags and remove_tags.
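
For what it's worth, the XPath itself looks fine. As far as I understand the calibre documentation, auto_cleanup_keep only names elements that the cleanup must never remove; it does not limit the output to those elements, so anything else the heuristic judges to be content (here, apparently, the DidYouKnow block) can still survive. A minimal sketch using only the Python standard library, with a made-up two-div page standing in for the real IMDb markup, to check that the expression selects just the one div. Note that ElementTree wants the relative form ".//" where the recipe uses "//".

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an IMDb title page.
PAGE = """
<html><body>
  <div id="titleStoryLine"><p>Storyline text.</p></div>
  <div id="titleDidYouKnow"><p>Trivia text.</p></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Same predicate as the recipe's auto_cleanup_keep, written as a relative path
# because ElementTree does not accept absolute paths on an element.
kept = root.findall(".//div[@id='titleStoryLine']")

kept_ids = [div.get("id") for div in kept]
kept_text = ["".join(div.itertext()).strip() for div in kept]
print(kept_ids, kept_text)
```

This only tests the XPath; it does not reproduce what auto_cleanup does inside calibre.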

ireadtheinternet · 10-31-2014, 08:59 AM

Thanks, Kovid, this helped a lot. I am still working on the overall recipe, but this part is out of the way now.

For no particular reason, I thought I would post what I ended up with. Not the whole recipe, because it is still in progress, just the attributes mentioned.

Code:

    keep_only_tags = [
        dict(name='div', attrs={'id': ['title-overview-widget']}),
    ]

    remove_tags = [
        dict(name='div', attrs={'id': ['meterChangeRow']}),
        dict(name='div', attrs={'id': ['meterHeaderBox']}),
        dict(name='div', attrs={'id': ['meterSeeMoreRow']}),
        dict(name='div', attrs={'class': ['star-box-rating-widget']}),
        dict(name='div', attrs={'class': ['star-box-details']}),
        dict(name='td',  attrs={'id': ['overview-bottom']}),
        dict(name='div', attrs={'class': ['pro-title-link text-center']})
    ]
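
For anyone who wants to see roughly what those two attributes do, here is a rough standalone emulation in plain Python. It is not calibre's code: the page is made up, the matches() and apply_filters() helpers are hypothetical, and the keep-first-then-remove order is my assumption about how the recipe machinery behaves. It only uses two of the remove_tags entries to stay short.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page; real IMDb markup is far larger.
PAGE = """
<html><body>
  <div id="title-overview-widget">
    <h1>Example Title</h1>
    <div id="meterHeaderBox">popularity meter</div>
    <div class="star-box-details">ratings</div>
    <p>Plot summary.</p>
  </div>
  <div id="titleDidYouKnow">Trivia</div>
</body></html>
""".strip()

keep_only_tags = [dict(name='div', attrs={'id': ['title-overview-widget']})]
remove_tags = [
    dict(name='div', attrs={'id': ['meterHeaderBox']}),
    dict(name='div', attrs={'class': ['star-box-details']}),
]

def matches(el, spec):
    """True if the element has the spec's tag name and one of the listed attribute values."""
    if el.tag != spec['name']:
        return False
    return all(el.get(attr) in allowed for attr, allowed in spec['attrs'].items())

def apply_filters(root, keep, remove):
    body = root.find('body')
    # Step 1 (assumed order): keep only the elements matching a keep spec.
    kept = [el for el in body.iter() if any(matches(el, s) for s in keep)]
    for child in list(body):
        body.remove(child)
    body.extend(kept)
    # Step 2: drop every element matching a remove spec.
    for parent in list(body.iter()):
        for child in list(parent):
            if any(matches(child, s) for s in remove):
                parent.remove(child)
    return body

body = apply_filters(ET.fromstring(PAGE), keep_only_tags, remove_tags)
remaining = [el.get('id') or el.get('class') or el.tag for el in body.iter()]
print(remaining)
```

With this toy page, the DidYouKnow div goes away because it sits outside the kept widget, and the meter and star-box divs are stripped from inside it, leaving only the heading and the plot paragraph.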
