Register Guidelines E-Books Today's Posts Search

Go Back   MobileRead Forums > E-Book Software > Calibre > Recipes

Notices

Reply
 
Thread Tools Search this Thread
Old 11-28-2014, 01:29 PM   #1
ireadtheinternet
Member
ireadtheinternet began at the beginning.
 
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
New Recipe: IMDB Advanced Title Search

EDIT 2 - Dec. 3, 2014 - Working again, but I am keeping my eye on it. Updated code below.
EDIT - no longer working, I will need to revisit it, possible change on IMDB site


This is a recipe based on IMDB Advanced Title Search (http://www.imdb.com/search/title). The basic idea is you create your favorite advanced title search on the web site, and you look at the key=value parts of the IMDB url and plug them into the custom_imdb_searches list. In this context, sections are based on specific searches and "articles" are the basic movie info and poster. So I consider this a template, because everyone's favorite movie searches I assume will be different. That said there are several suggested searches in the recipe, two active and the others commented out.

Code:
    custom_imdb_searches = [
        # Each item here creates a new movie section based on 
        # IMDB Advanced Search
        dict(),  # use defaults
        dict(sort='user_rating,desc'),  # sort by user rating instead of newest first
        #dict(languages='hi'),                    # Hindi movies
        #dict(languages='hi',has='asin-dvd-us'),  # Hindi movies at amazon.com
        #dict(url='http://www.imdb.com/search/title?production_status=released&title_type=feature'),
    ]

You set up your own searches in the custom_imdb_searches list of the recipe. Check out the default criteria in the imdb_search method also, because you don't need to specify those criteria unless you are overriding them.

Code:
        criteria = {                            # Default criteria:
            'title_type': 'feature',            # movies only, no TV shows
            'production_status': 'released',    # that have been released
            'user_rating': '6.5,10',            # with user rating of 6.5-10
            'num_votes': '500,',                # with at least 500 votes
            'sort': 'year,desc'                 # sort by year, descending
        }
It will also run without changing anything, recommended the first time you run it so you can see what it is about.

Please let me know of any suggestions or criticisms, thanks!

recipe updated 12/3
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class IMDBAdvancedTitleSearch2468(BasicNewsRecipe):
    title = 'IMDB Advanced Title Search'
    language = 'en'
    categories = 'IMDB,template,movies'
    __author__ = 'ireadtheinternet'
    max_articles_per_feed = 50
    no_stylesheets = True
    no_javascript = True
    preprocess_regexps = [
        (re.compile(r'»'), lambda match: '')
    ]
    extra_css = 'img:first-of-type { display : block; margin-left : auto; margin-right: auto }'

    keep_only_tags = [
        dict(name='td', attrs={'id': ['img_primary']}), #poster
        dict(name='h1', attrs={'class': ['header']}), #title
        dict(name='div', attrs={'class': ['infobar']}), #length, genre, release
        dict(name='div', attrs={'itemtype': ['http://schema.org/Person']}), #people
        dict(name='div', attrs={'class': ['inline canwrap']}) #storyline
    ]

    remove_tags = [
        dict(name='div', attrs={'class': ['pro-title-link text-center']}),
    ]

    IMDB_BASE = 'http://www.imdb.com'

    # Make quick customizations of the recipe by changing custom_imdb_searches
    # First go to IMDB Advanced Title Search: http://www.imdb.com/search/title
    # Do your favorite search and figure out which non-defaults args you need
    # from the url (Defaults are in criteria dict in the imdb_search method.)
    # Alternatively, just copy/paste the url into the url arg
    
    custom_imdb_searches = [
        # Each item here creates a new movie section based on 
        # IMDB Advanced Search
        dict(),  # use defaults
        dict(sort='user_rating,desc'),  # sort by user rating instead of newest first
        #dict(languages='hi'),                    # Hindi movies
        #dict(languages='hi',has='asin-dvd-us'),  # Hindi movies at amazon.com
        #dict(url='http://www.imdb.com/search/title?production_status=released&title_type=feature'),
    ]

    def build_section(self, url):
        articles = []
        toc_page_raw = self.index_to_soup(url, raw=True)
        toc_page_raw = re.sub(r'<script\b.+?</script>', '', 
            toc_page_raw, flags=re.DOTALL|re.IGNORECASE)
        toc_page = self.index_to_soup(toc_page_raw)
        toc = toc_page.find(name='div', attrs={'id': 'main'})

        for movie in toc.findAll('a', attrs={'href':re.compile(r'/title/tt.*'),'title':True}):
            title = self.tag_to_string(movie)
            url = self.IMDB_BASE + movie['href']
            #self.log('Found movie:', movie)
            #self.log('\t', url)
            articles.append({'title': title, 'url': url, 'date':'','description': ''})

        name = self.tag_to_string(toc_page.find('h1'))

        return name, articles

    def imdb_search(self, url=None, **kwargs):
        search_url = url
        self.IMDB_BASE = 'http://www.imdb.com'
        if url is not None:
            if url.startswith('http://') or url.startswith('https://'):
                return search_url
            else:
                search_url = self.IMDB_BASE + '/search/title?' + search_url
                return search_url

        search_url = self.IMDB_BASE + '/search/title?'

        criteria = {                            # Default criteria:
            'title_type': 'feature',            # movies only, no TV shows
            'production_status': 'released',    # that have been released
            'user_rating': '6.5,10',            # with user rating of 6.5-10
            'num_votes': '500,',                # with at least 500 votes
            'sort': 'year,desc'                 # sort by year, descending
        }

        # merge args with criteria, possibly overriding original criteria
        criteria.update(kwargs)

        criteria_list = [key + '=' + criteria[key] for key in criteria]

        search_url = search_url + '&'.join(criteria_list)
        return search_url

    def parse_index(self):
        self.log('def parse_index(self)')
        feeds = []
        
        for search in self.custom_imdb_searches: 
            feeds.append((self.build_section(self.imdb_search(**search))))

        return feeds

    def preprocess_html(self, soup):
    
        for alink in soup.findAll('a'):
            alink_text = ''.join(alink.findAll(text=True))
            found_img = alink.find('img') is not None
            if found_img is False:
                alink.name, alink.attrs = 'div', {}
                alink.replaceWith(alink_text)

        for t in soup.findAll(['table', 'td', 'tr', 'tbody']):
            t.name, t.attrs = 'div', {}

        return soup

Last edited by ireadtheinternet; 12-03-2014 at 01:27 PM. Reason: updated code
ireadtheinternet is offline   Reply With Quote
Old 12-23-2014, 11:52 PM   #2
ireadtheinternet
Member
ireadtheinternet began at the beginning.
 
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
Still have some bugs in this, some movies' info is still not downloading, haven't had time to troubleshoot why.
ireadtheinternet is offline   Reply With Quote
Advert
Reply


Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Advanced search for Multiple Authors? Chris_Snow Library Management 7 09-16-2014 09:34 PM
Advanced search within ebook using application or regex Earthlark Calibre 3 02-04-2014 03:33 AM
Advanced search within ebook using application or regex Earthlark General Discussions 2 02-04-2014 12:10 AM
Problem using 'Advanced Search' with 'Search in Forum' Wetdogeared Feedback 6 06-21-2011 09:37 AM
advanced text search and non-ascii characters msz59 General Discussions 0 05-05-2011 09:47 AM


All times are GMT -4. The time now is 06:38 PM.


MobileRead.com is a privately owned, operated and funded community.