For Testing: Roger Ebert (movie reviews) Recipe

spedinfargo · 01-21-2011, 05:34 PM

Felt like a good afternoon to learn Python so I threw together a Roger Ebert recipe. Feel free to pull down and give me some feedback...

A few notes:
1) There was no good RSS feed (there is one but it's terrible) so I had to go the parse_index route.

2) The HTML is kind of a mess so I couldn't figure out a good way to use BeautifulSoup - so the regexes are kind of messy. Hopefully they hold up.

3) I'm getting some strange characters in some of the articles - I don't know whether that's an encoding problem or something else.

4) Roger spends a ton of time on his blog lately. I want to pull that in eventually but there isn't a printer-friendly version of any of his posts. Parts of his web site are pretty much abandoned (esp. Movie Answer Man) and sometimes the main site links to his blog posts - I tried to filter those out but once in a while you'll see a title of "Ebert Journal Post" with only an intro paragraph. When I incorporate his blog posts into the recipe this will hopefully go away...
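
On note 3: one common cause of stray characters (an assumption here, not something confirmed for this site) is that the page bytes get decoded with a different charset than the one the server actually used. Below is a minimal standalone sketch, independent of calibre, of what that mismatch looks like and how a fallback decode might be written. The helper name decode_page and the list of candidate charsets are hypothetical.

```python
# Standalone illustration; not calibre API. Names here are hypothetical.

def decode_page(raw, candidates=('utf-8', 'cp1252')):
    """Try each candidate charset in order and return (text, charset_used)."""
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes visibly.
    return raw.decode('utf-8', 'replace'), 'utf-8 (lossy)'


# UTF-8 bytes read as cp1252 produce the classic "strange characters".
utf8_bytes = u'caf\u00e9'.encode('utf-8')
garbled = utf8_bytes.decode('cp1252')

# Bytes that are really cp1252 are not valid UTF-8, so the fallback kicks in.
cp1252_bytes = u'caf\u00e9'.encode('cp1252')
text, used = decode_page(cp1252_bytes)
```

If the site turns out not to serve UTF-8, the recipe's encoding attribute would be the first thing to try changing. It may also matter that the index pages in the recipe are read with urllib2 and their text is used without being decoded at all, which could explain garbling that shows up only in the TOC descriptions.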

Download on the next message in this thread...

spedinfargo · 01-21-2011, 05:35 PM

Code:



import re
import urllib2
import time
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, SoupStrainer

class Ebert(BasicNewsRecipe):
    title                 = 'Roger Ebert'
    __author__            = 'Shane Erstad'
    description           = 'Roger Ebert Movie Reviews'
    publisher             = 'Chicago Sun Times'
    category              = 'movies'
    oldest_article        = 8
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'utf-8'
    masthead_url          = 'http://rogerebert.suntimes.com/graphics/global/roger.jpg'
    language              = 'en'
    remove_empty_feeds    = False
    PREFIX                  = 'http://rogerebert.suntimes.com'
    patternReviews                = r'<span class="*?movietitle"*?>(.*?)</span>.*?<div class="*?headline"*?>(.*?)</div>(.*?)</div>'
    patternCommentary       = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?COMMENTARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternPeople           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?PEOPLE.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternGlossary           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?GLOSSARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    


    conversion_options = {
                          'comment'          : description
                        , 'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : language
                        , 'linearize_tables' : True
                        }


    feeds          = [
                        (u'Reviews'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=reviews' )
                        ,(u'Commentary'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=COMMENTARY')
                        ,(u'Great Movies'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS08')
                        ,(u'People'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=PEOPLE')
                        ,(u'Glossary'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=GLOSSARY')
                        
                     ]

    preprocess_regexps = [
        (re.compile(r'<font.*?>.*?This is a printer friendly.*?</font>.*?<hr>', re.DOTALL|re.IGNORECASE),
            lambda m: '')
    ]
    


    def print_version(self, url):
        return url + '&template=printart'

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.log('\tFeedurl: ', feedurl)
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            page = urllib2.urlopen(feedurl).read()

            if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                pattern = self.patternReviews
            elif feedtitle == 'Commentary':
                pattern = self.patternCommentary
            elif feedtitle == 'People':
                pattern = self.patternPeople
            elif feedtitle == 'Glossary':
                pattern = self.patternGlossary

            regex = re.compile(pattern, re.IGNORECASE|re.DOTALL)

            for match in regex.finditer(page):
                if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                    movietitle = match.group(1)
                    thislink = match.group(2)
                    description = match.group(3)
                elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary':
                    thislink = match.group(1)
                    description = match.group(2)

                self.log(thislink)
                 
                for link in BeautifulSoup(thislink, parseOnlyThese=SoupStrainer('a')):
                    thisurl = self.PREFIX + link['href']
                    thislinktext = self.tag_to_string(link)

                    if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                        thistitle = movietitle
                    elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary':
                        thistitle = thislinktext

                    if thistitle == '':
                        thistitle = 'Ebert Journal Post'
                    
                    """
                    pattern2 = r'AID=\/(.*?)\/'
                    reg2 = re.compile(pattern2, re.IGNORECASE|re.DOTALL)
                    match2 = reg2.search(thisurl)
                    date = match2.group(1)
                    c = time.strptime(match2.group(1),"%Y%m%d")
                    date=time.strftime("%a, %b %d, %Y", c)
                    self.log(date)
                    """

                    articles.append({
                                      'title'      :thistitle
                                     ,'date'       :''
                                     ,'url'        :thisurl
                                     ,'description':description
                                    })
            totalfeeds.append((feedtitle, articles))

        return totalfeeds
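
For anyone reading along, the value parse_index builds above is a list of (section title, article list) pairs, where each article is a dict with title, date, url and description keys. Here is a small calibre-free sketch of that shape. The helper names and the sample article are made up for illustration.

```python
# Illustrative data only; the article entry below is invented.
PREFIX = 'http://rogerebert.suntimes.com'


def make_article(title, href, description='', date=''):
    """Build one article dict with the same keys the recipe appends."""
    return {
        'title': title,
        'date': date,
        'url': PREFIX + href,
        'description': description,
    }


def build_index(sections):
    """sections: list of (feed title, list of (title, href, description))."""
    index = []
    for feed_title, rows in sections:
        articles = [make_article(t, h, d) for t, h, d in rows]
        index.append((feed_title, articles))
    return index


index = build_index([
    ('Reviews', [('Example Movie',
                  '/apps/pbcs.dll/article?AID=/20110121/REVIEWS/1',
                  'A blurb.')]),
    ('Glossary', []),
])
```

Since the recipe sets remove_empty_feeds to False, a section with no matches still ends up in the list with an empty article list, as Glossary does in this sketch.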

spedinfargo · 01-21-2011, 05:36 PM

By the way, first Python code which means first recipe as well. Any code review, hints, etc. would be appreciated. Any suggestions for more functionality also welcomed...
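
On the code review side, one possible tidy-up (a sketch, not a tested replacement for the recipe): the Commentary, People and Glossary patterns differ only in the category keyword, so they could be generated from a single template. The helper name section_pattern and the sample HTML are hypothetical.

```python
import re

# Same shape as the recipe's section patterns, with the category left as a slot.
TEMPLATE = (
    r'<div class="*?headline"*?>.*?'
    r'(<a href="/apps/pbcs.dll/article\?AID=.*?%s.*?" id="ltred">.*?</a>)'
    r'.*?<div class="blurb clear">(.*?)</div>'
)


def section_pattern(category):
    """Compile the link/blurb pattern for one category keyword."""
    return re.compile(TEMPLATE % re.escape(category), re.IGNORECASE | re.DOTALL)


# Invented sample markup, shaped to match what the patterns expect.
sample = (
    '<div class="headline"><a href="/apps/pbcs.dll/article?AID=/20110121/'
    'COMMENTARY/110129999" id="ltred">Sample headline</a></div>'
    '<div class="blurb clear">Sample blurb.</div>'
)

match = section_pattern('COMMENTARY').search(sample)
link_html, blurb = match.group(1), match.group(2)
```

With something like this, adding a new section would only need a new entry in feeds plus its keyword, rather than another copy of the pattern and another elif branch.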

kovidgoyal · 01-22-2011, 10:30 AM

Looks fine. Do note that you can use regexes in BeautifulSoup tests to match text, attribute, and tag name values.
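
To illustrate the idea of matching attribute values by regex on parsed tags instead of on raw HTML, here is a rough sketch. It deliberately uses the standard library's html.parser as a stand-in rather than BeautifulSoup, so it runs without calibre. The class name LinkCollector and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) for <a> tags whose href matches a regex."""

    def __init__(self, href_pattern):
        HTMLParser.__init__(self)
        self.href_pattern = re.compile(href_pattern, re.IGNORECASE)
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if self.href_pattern.search(href):
            self._current_href = href
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text while inside a link we decided to collect.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._current_href is not None:
            text = ''.join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


# Invented sample markup for illustration.
sample = (
    '<div class="headline">'
    '<a href="/apps/pbcs.dll/article?AID=/20110121/PEOPLE/1" id="ltred">An interview</a>'
    '<a href="/apps/pbcs.dll/section?category=REVIEWS">Reviews</a>'
    '</div>'
)

collector = LinkCollector(r'AID=.*PEOPLE')
collector.feed(sample)
collector.close()
```

With BeautifulSoup itself the equivalent would be along the lines of soup.findAll('a', href=re.compile('PEOPLE')), assuming the BeautifulSoup 3 style API that calibre bundled at the time.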

spedinfargo · 02-19-2011, 07:45 PM

Updated version. Kovid, is there something I should do to check in my changes, or do you just copy and paste from here?

Code:

import re
import urllib2
import time
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, SoupStrainer
from calibre import strftime

'''
      Help Needed:
       Still can't figure out why I'm getting strange characters.  Esp. the Great Movies descriptions in the TOC.
       Anyone help me figure that out?
       
      Change Log:
       2011-02-19:  Version 2:  Added "Oscars" section and fixed date problem
'''

class Ebert(BasicNewsRecipe):
    title                 = 'Roger Ebert'
    __author__            = 'Shane Erstad'
    version               = 2
    description           = 'Roger Ebert Movie Reviews'
    publisher             = 'Chicago Sun Times'
    category              = 'movies'
    oldest_article        = 8
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'UTF-8'
    masthead_url          = 'http://rogerebert.suntimes.com/graphics/global/roger.jpg'
    language              = 'en'
    remove_empty_feeds    = False
    PREFIX                  = 'http://rogerebert.suntimes.com'
    patternReviews                = r'<span class="*?movietitle"*?>(.*?)</span>.*?<div class="*?headline"*?>(.*?)</div>(.*?)</div>'
    patternCommentary       = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?COMMENTARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternPeople           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?PEOPLE.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternOscars           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?OSCARS.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternGlossary           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?GLOSSARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    


    conversion_options = {
                          'comment'          : description
                        , 'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : language
                        , 'linearize_tables' : True
                        }


    feeds          = [
                        (u'Reviews'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=reviews' )
                        ,(u'Commentary'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=COMMENTARY')
                        ,(u'Great Movies'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS08')
                        ,(u'People'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=PEOPLE')
                        ,(u'Oscars'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=OSCARS')
                        ,(u'Glossary'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=GLOSSARY')
                        
                     ]

    preprocess_regexps = [
        (re.compile(r'<font.*?>.*?This is a printer friendly.*?</font>.*?<hr>', re.DOTALL|re.IGNORECASE),
            lambda m: '')
    ]
    


    def print_version(self, url):
        return url + '&template=printart'

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.log('\tFeedurl: ', feedurl)
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            page = urllib2.urlopen(feedurl).read()

            if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                pattern = self.patternReviews
            elif feedtitle == 'Commentary':
                pattern = self.patternCommentary
            elif feedtitle == 'People':
                pattern = self.patternPeople
            elif feedtitle == 'Glossary':
                pattern = self.patternGlossary
            elif feedtitle == 'Oscars':
                pattern = self.patternOscars

            regex = re.compile(pattern, re.IGNORECASE|re.DOTALL)

            for match in regex.finditer(page):
                if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                    movietitle = match.group(1)
                    thislink = match.group(2)
                    description = match.group(3)
                elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary' or feedtitle == 'Oscars':
                    thislink = match.group(1)
                    description = match.group(2)

                self.log(thislink)
                 
                for link in BeautifulSoup(thislink, parseOnlyThese=SoupStrainer('a')):
                    thisurl = self.PREFIX + link['href']
                    thislinktext = self.tag_to_string(link)

                    if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                        thistitle = movietitle
                    elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary' or feedtitle == 'Oscars':
                        thistitle = thislinktext

                    if thistitle == '':
                        continue
                    
                    
                    pattern2 = r'AID=\/(.*?)\/'
                    reg2 = re.compile(pattern2, re.IGNORECASE|re.DOTALL)
                    match2 = reg2.search(thisurl)
                    if match2:
                        c = time.strptime(match2.group(1),"%Y%m%d")
                        mydate=strftime("%A, %B %d, %Y", c)
                    else:
                        mydate = strftime("%A, %B %d, %Y")
                    self.log(mydate)
                    
                    articles.append({
                                      'title'      :thistitle
                                     ,'date'       :'  [' + mydate + ']'
                                     ,'url'        :thisurl
                                     ,'description':description
                                    })
            totalfeeds.append((feedtitle, articles))

        return totalfeeds
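
One thing to watch in the date fix: the recipe hands whatever sits between the slashes after AID= straight to strptime, which raises ValueError if that text is not a valid date. Whether the site ever produces such a link is not known here, so treat this as a defensive sketch only. It uses the standard library's time.strftime rather than calibre's wrapper, and the function name article_date and the sample URL are hypothetical.

```python
import re
import time

# Only accept exactly eight digits between the slashes after AID=.
AID_DATE = re.compile(r'AID=/(\d{8})/')


def article_date(url):
    """Return a struct_time parsed from the AID segment, or None."""
    match = AID_DATE.search(url)
    if match is None:
        return None
    try:
        return time.strptime(match.group(1), '%Y%m%d')
    except ValueError:
        # Eight digits that are not a real calendar date, e.g. 20111340.
        return None


# Hypothetical article URL in the shape the recipe builds.
url = ('http://rogerebert.suntimes.com/apps/pbcs.dll/article'
       '?AID=/20110216/REVIEWS/110219987')
parsed = article_date(url)
label = time.strftime('%Y-%m-%d', parsed) if parsed else ''
```

In the recipe, a None result could fall back to today's date in the same way the existing else branch does.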

kovidgoyal · 02-19-2011, 09:32 PM

I pick them up from here.
