Old 10-26-2014, 01:48 PM   #1
ireadtheinternet
Member
ireadtheinternet began at the beginning.
 
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
auto_cleanup_keep keeping too much

My first (non-RSS) recipe is almost working. It is a little odd because it is not pulling up an article per se but IMDB search results.

When I had auto_cleanup set to False, it made me appreciate the value of auto_cleanup. So now when I use it, I find I need auto_cleanup_keep, but for some reason the line
Code:
auto_cleanup_keep = "//div[@id='titleStoryLine']"
seems to keep the DidYouKnow section as well, even though DidYouKnow has its own ID (titleDidYouKnow). I tried other DIVs that I thought would also work, and I tried linearize_tables, but it didn't seem to do anything. Any suggestions?

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class IMDBTopMoviesHi9999(BasicNewsRecipe):
    language       = 'en'
    __author__     = 'ireadtheinternet'
    oldest_article = 999999
    max_articles_per_feed = 9999
    no_stylesheets = True
    no_javascript = True
    auto_cleanup = True
    auto_cleanup_keep = "//div[@id='titleStoryLine']" 
    # For some unknown reason this also keeps //div[@id='titleDidYouKnow']
    conversion_options = {'linearize_tables' : True}


    def parse_index(self):
        toc = self.index_to_soup('http://www.imdb.com/search/title?languages=hi&num_votes=500,&production_status=released&sort=year,desc&title_type=feature&user_rating=7.0,10')

        articles = []
        for row in toc.findAll('td', attrs={'class':'title'}):
            release_year = self.tag_to_string(row.find('span', attrs={'class':'year_type'}))
            link = row.find('a')
            title = self.tag_to_string(link) + " " + release_year
            url = 'http://www.imdb.com' + link['href']
            self.log('Found article:', title)
            self.log('\t', url)
            articles.append({'title':title, 'url':url, 'description':''})
        self.title = self.tag_to_string(toc.find('h1'))
            
        # only one section, so...    
        return [(self.title, articles)]
Old 10-26-2014, 02:59 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
auto_cleanup is a little hard to control. You'll probably find it easier to just use keep_only_tags and remove_tags.
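
As a rough illustration of that suggestion, the sketch below swaps auto_cleanup_keep for explicit filters, using the two div ids mentioned in the first post. The class name is hypothetical, and whether the DidYouKnow block really sits inside the kept div is an assumption, so the remove_tags entry is only a precaution. The try/except exists solely so the snippet can be loaded outside calibre.

```python
# Sketch only: outside calibre the import fails, so a plain object stands in
# for the base class to keep the snippet loadable.
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    BasicNewsRecipe = object


class IMDBStoryLineSketch(BasicNewsRecipe):  # hypothetical class name
    language = 'en'
    no_stylesheets = True
    # auto_cleanup is deliberately not set here; the explicit filters
    # below take over the job of choosing what to keep.

    # Keep only the story line block named in the first post.
    keep_only_tags = [
        dict(name='div', attrs={'id': 'titleStoryLine'}),
    ]

    # Precaution: drop the "Did You Know" block if it is still present
    # after the keep step (that it is nested there is an assumption).
    remove_tags = [
        dict(name='div', attrs={'id': 'titleDidYouKnow'}),
    ]
```

The rest of the recipe (parse_index and so on) would stay as in the first post.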
Old 10-31-2014, 07:59 AM   #3
ireadtheinternet
Member
ireadtheinternet began at the beginning.
 
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android

Thanks, Kovid, this helped a lot. I am still working on the overall recipe, but this part is out of the way now.

For no particular reason, I thought I would post what I ended up with. Not the whole recipe, because it is still in progress, just the attributes mentioned.

Code:
    keep_only_tags = [
        dict(name='div', attrs={'id': ['title-overview-widget']}),
    ]

    remove_tags = [
        dict(name='div', attrs={'id': ['meterChangeRow']}),
        dict(name='div', attrs={'id': ['meterHeaderBox']}),
        dict(name='div', attrs={'id': ['meterSeeMoreRow']}),
        dict(name='div', attrs={'class': ['star-box-rating-widget']}),
        dict(name='div', attrs={'class': ['star-box-details']}),
        dict(name='td',  attrs={'id': ['overview-bottom']}),
        dict(name='div', attrs={'class': ['pro-title-link text-center']})
    ]
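
A possible shorter spelling of the same remove_tags list, offered as a sketch rather than a tested recipe: it assumes the tag matching treats a list value in attrs as "match any of these", which is how BeautifulSoup-style attribute filters generally behave. All ids and classes are copied from the post above; only the grouping is new.

```python
# Ids and classes are copied from the post above; only the grouping differs.
# Assumption: a list value in attrs matches any one of its entries.
remove_tags = [
    dict(name='div', attrs={'id': [
        'meterChangeRow',
        'meterHeaderBox',
        'meterSeeMoreRow',
    ]}),
    dict(name='div', attrs={'class': [
        'star-box-rating-widget',
        'star-box-details',
        'pro-title-link text-center',
    ]}),
    dict(name='td', attrs={'id': ['overview-bottom']}),
]
```

The one-entry-per-line form in the post is easier to comment out piece by piece while a recipe is still being tuned, so this is mainly a matter of taste.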

Last edited by ireadtheinternet; 10-31-2014 at 07:57 PM.

All times are GMT -4.