01-21-2011, 06:34 PM | #1 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
For Testing: Roger Ebert (movie reviews) Recipe
Felt like a good afternoon to learn Python so I threw together a Roger Ebert recipe. Feel free to pull down and give me some feedback...
A few notes:
1) There was no good RSS feed (there is one, but it's terrible), so I had to go the parse_index route.
2) The HTML is kind of a mess, so I couldn't figure out a good way to use BeautifulSoup, and the regexes are kind of messy. Hopefully they hold up.
3) I'm getting some strange characters in some of the articles. I don't know whether this has to do with encoding or something else.
4) Roger spends a ton of time on his blog lately. I want to pull that in eventually, but there isn't a printer-friendly version of any of his posts. Some of his web site is pretty much abandoned (esp. Movie Answer Man), and sometimes they link to his blog posts from the main site. I tried to filter those out, but once in a while you'll see a title of "Ebert Journal Post" with only an intro paragraph. When I incorporate his blog posts into the recipe this will hopefully go away...
The download is in the next message in this thread...
|
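For readers unfamiliar with the parse_index route from note 1: instead of reading an RSS feed, the recipe scrapes each section page and hands calibre a list of (section title, list of article dicts) pairs. Here is a minimal standalone sketch of that shape, written for plain Python 3 so it runs outside calibre. The sample HTML, the simplified pattern, and the names SAMPLE_PAGE, ARTICLE and build_index are made up for illustration and are not part of the recipe.

```python
import re

# Hypothetical stand-in for one section page; the real markup may differ.
SAMPLE_PAGE = '''
<div class="headline"><a href="/apps/pbcs.dll/article?AID=/20110119/COMMENTARY/110119981" id="ltred">First column</a></div>
<div class="blurb clear">Intro paragraph one.</div>
<div class="headline"><a href="/apps/pbcs.dll/article?AID=/20110112/COMMENTARY/110119999" id="ltred">Second column</a></div>
<div class="blurb clear">Intro paragraph two.</div>
'''

PREFIX = 'http://rogerebert.suntimes.com'

# Simplified pattern: link target, link text, then the blurb that follows.
ARTICLE = re.compile(
    r'<div class="headline">\s*<a href="([^"]+)"[^>]*>(.*?)</a>.*?'
    r'<div class="blurb clear">(.*?)</div>',
    re.IGNORECASE | re.DOTALL)


def build_index(sections):
    """Return [(section title, [article dict, ...]), ...] like parse_index()."""
    index = []
    for section_title, html in sections:
        articles = []
        for href, text, blurb in ARTICLE.findall(html):
            articles.append({
                'title': text.strip(),
                'url': PREFIX + href,
                'date': '',
                'description': blurb.strip(),
            })
        index.append((section_title, articles))
    return index


index = build_index([('Commentary', SAMPLE_PAGE)])
for section_title, articles in index:
    print(section_title, [a['title'] for a in articles])
```

Each dict carries the same four keys the recipe fills in: title, url, date and description.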
01-21-2011, 06:35 PM | #2 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Code:
import re
import urllib2
import time

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, SoupStrainer


class Ebert(BasicNewsRecipe):
    title = 'Roger Ebert'
    __author__ = 'Shane Erstad'
    description = 'Roger Ebert Movie Reviews'
    publisher = 'Chicago Sun Times'
    category = 'movies'
    oldest_article = 8
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'utf-8'
    masthead_url = 'http://rogerebert.suntimes.com/graphics/global/roger.jpg'
    language = 'en'
    remove_empty_feeds = False
    PREFIX = 'http://rogerebert.suntimes.com'

    # One regex per section layout; each captures the headline link and the blurb.
    patternReviews = r'<span class="*?movietitle"*?>(.*?)</span>.*?<div class="*?headline"*?>(.*?)</div>(.*?)</div>'
    patternCommentary = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?COMMENTARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternPeople = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?PEOPLE.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternGlossary = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?GLOSSARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
        'linearize_tables': True,
    }

    feeds = [
        (u'Reviews', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=reviews'),
        (u'Commentary', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=COMMENTARY'),
        (u'Great Movies', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS08'),
        (u'People', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=PEOPLE'),
        (u'Glossary', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=GLOSSARY'),
    ]

    # Strip the "This is a printer friendly..." banner from each article.
    preprocess_regexps = [
        (re.compile(r'<font.*?>.*?This is a printer friendly.*?</font>.*?<hr>',
                    re.DOTALL|re.IGNORECASE), lambda m: ''),
    ]

    def print_version(self, url):
        return url + '&template=printart'

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.log('\tFeedurl: ', feedurl)
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            page = urllib2.urlopen(feedurl).read()

            # Pick the pattern that matches this section's markup.
            if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                pattern = self.patternReviews
            elif feedtitle == 'Commentary':
                pattern = self.patternCommentary
            elif feedtitle == 'People':
                pattern = self.patternPeople
            elif feedtitle == 'Glossary':
                pattern = self.patternGlossary

            regex = re.compile(pattern, re.IGNORECASE|re.DOTALL)

            for match in regex.finditer(page):
                if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                    movietitle = match.group(1)
                    thislink = match.group(2)
                    description = match.group(3)
                elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary':
                    thislink = match.group(1)
                    description = match.group(2)

                self.log(thislink)

                for link in BeautifulSoup(thislink, parseOnlyThese=SoupStrainer('a')):
                    thisurl = self.PREFIX + link['href']
                    thislinktext = self.tag_to_string(link)

                    if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                        thistitle = movietitle
                    elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary':
                        thistitle = thislinktext
                        # Links to blog posts come back with no link text.
                        if thistitle == '':
                            thistitle = 'Ebert Journal Post'

                    # Date parsing, disabled for now:
                    # pattern2 = r'AID=\/(.*?)\/'
                    # reg2 = re.compile(pattern2, re.IGNORECASE|re.DOTALL)
                    # match2 = reg2.search(thisurl)
                    # date = match2.group(1)
                    # c = time.strptime(match2.group(1),"%Y%m%d")
                    # date=time.strftime("%a, %b %d, %Y", c)
                    # self.log(date)

                    articles.append({
                        'title': thistitle,
                        'date': '',
                        'url': thisurl,
                        'description': description,
                    })
            totalfeeds.append((feedtitle, articles))

        return totalfeeds
|
01-21-2011, 06:36 PM | #3 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
By the way, this is my first Python code, which means it's my first recipe as well. Any code review, hints, etc. would be appreciated. Suggestions for more functionality are also welcome...
|
01-22-2011, 11:30 AM | #4 |
creator of calibre
Posts: 44,601
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Looks fine. Do note that you can use regexes in BeautifulSoup tests to match text, attribute, and tag name values.
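A small sketch of what that looks like, assuming the standalone bs4 package under Python 3 rather than the copy bundled with calibre (the bundled version used in the recipe is older and, as far as I know, spells the method findAll). The sample markup and variable names are invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the section pages.
SAMPLE = '''
<div class="headline"><a href="/apps/pbcs.dll/article?AID=/20110119/COMMENTARY/110119981" id="ltred">A column</a></div>
<div class="headline"><a href="/apps/pbcs.dll/article?AID=/20110120/PEOPLE/110129999" id="ltred">An interview</a></div>
<span class="movietitle">Some Movie</span>
'''

soup = BeautifulSoup(SAMPLE, 'html.parser')

# Regex tested against an attribute value.
commentary_links = soup.find_all('a', href=re.compile(r'COMMENTARY'))

# Regex tested against the tag name.
containers = soup.find_all(re.compile(r'^(div|span)$'))

# Regex tested against text nodes.
interviews = soup.find_all(string=re.compile(r'interview', re.IGNORECASE))

print([a.get_text() for a in commentary_links])
print([tag.name for tag in containers])
print([str(s) for s in interviews])
```

With tests like these, the per-section regexes over raw HTML could in principle be replaced by one attribute match per section, though whether that copes with the real pages is untested here.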
|
02-19-2011, 08:45 PM | #5 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Updated version. Kovid, is there something I should do to check in my changes, or do you just copy and paste from here?
Code:
import re
import urllib2
import time

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, SoupStrainer
from calibre import strftime

'''
Help Needed:
Still can't figure out why I'm getting strange characters. Esp. the Great
Movies descriptions in the TOC. Anyone help me figure that out?

Change Log:
2011-02-19: Version 2: Added "Oscars" section and fixed date problem
'''


class Ebert(BasicNewsRecipe):
    title = 'Roger Ebert'
    __author__ = 'Shane Erstad'
    version = 2
    description = 'Roger Ebert Movie Reviews'
    publisher = 'Chicago Sun Times'
    category = 'movies'
    oldest_article = 8
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'UTF-8'
    masthead_url = 'http://rogerebert.suntimes.com/graphics/global/roger.jpg'
    language = 'en'
    remove_empty_feeds = False
    PREFIX = 'http://rogerebert.suntimes.com'

    # One regex per section layout; each captures the headline link and the blurb.
    patternReviews = r'<span class="*?movietitle"*?>(.*?)</span>.*?<div class="*?headline"*?>(.*?)</div>(.*?)</div>'
    patternCommentary = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?COMMENTARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternPeople = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?PEOPLE.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternOscars = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?OSCARS.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternGlossary = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?GLOSSARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
        'linearize_tables': True,
    }

    feeds = [
        (u'Reviews', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=reviews'),
        (u'Commentary', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=COMMENTARY'),
        (u'Great Movies', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS08'),
        (u'People', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=PEOPLE'),
        (u'Oscars', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=OSCARS'),
        (u'Glossary', u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=GLOSSARY'),
    ]

    # Strip the "This is a printer friendly..." banner from each article.
    preprocess_regexps = [
        (re.compile(r'<font.*?>.*?This is a printer friendly.*?</font>.*?<hr>',
                    re.DOTALL|re.IGNORECASE), lambda m: ''),
    ]

    def print_version(self, url):
        return url + '&template=printart'

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.log('\tFeedurl: ', feedurl)
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            page = urllib2.urlopen(feedurl).read()

            # Pick the pattern that matches this section's markup.
            if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                pattern = self.patternReviews
            elif feedtitle == 'Commentary':
                pattern = self.patternCommentary
            elif feedtitle == 'People':
                pattern = self.patternPeople
            elif feedtitle == 'Glossary':
                pattern = self.patternGlossary
            elif feedtitle == 'Oscars':
                pattern = self.patternOscars

            regex = re.compile(pattern, re.IGNORECASE|re.DOTALL)

            for match in regex.finditer(page):
                if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                    movietitle = match.group(1)
                    thislink = match.group(2)
                    description = match.group(3)
                elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary' or feedtitle == 'Oscars':
                    thislink = match.group(1)
                    description = match.group(2)

                self.log(thislink)

                for link in BeautifulSoup(thislink, parseOnlyThese=SoupStrainer('a')):
                    thisurl = self.PREFIX + link['href']
                    thislinktext = self.tag_to_string(link)

                    if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                        thistitle = movietitle
                    elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary' or feedtitle == 'Oscars':
                        thistitle = thislinktext
                        # Skip links with no text (blog posts linked from the main site).
                        if thistitle == '':
                            continue

                    # The article date is the first path segment after "AID=".
                    pattern2 = r'AID=\/(.*?)\/'
                    reg2 = re.compile(pattern2, re.IGNORECASE|re.DOTALL)
                    match2 = reg2.search(thisurl)
                    if match2:
                        c = time.strptime(match2.group(1),"%Y%m%d")
                        mydate = strftime("%A, %B %d, %Y", c)
                    else:
                        mydate = strftime("%A, %B %d, %Y")
                    self.log(mydate)

                    articles.append({
                        'title': thistitle,
                        'date': ' [' + mydate + ']',
                        'url': thisurl,
                        'description': description,
                    })
            totalfeeds.append((feedtitle, articles))

        return totalfeeds
|
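On the "Help Needed" note about strange characters in the TOC descriptions, one possible explanation, offered as a guess rather than a diagnosis: urllib2.urlopen(feedurl).read() returns raw bytes, and nothing in parse_index decodes them, so the blurbs are passed along undecoded. If the section pages are not actually UTF-8 (an assumption; the real charset is not stated anywhere in this thread), accents and curly quotes would come out garbled. A minimal Python 3 sketch of the effect and of a decode-with-fallback helper follows; decode_page and the Windows-1252 fallback are hypothetical choices, not something calibre or the site prescribes.

```python
def decode_page(raw, declared='utf-8', fallback='cp1252'):
    """Decode raw page bytes, falling back when the declared charset is wrong."""
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # Assumption: a Windows-1252 page mislabelled as UTF-8.
        return raw.decode(fallback, errors='replace')


# Curly quotes and accented letters are the usual casualties.
raw = 'It\u2019s a caf\u00e9'.encode('cp1252')

garbled = raw.decode('utf-8', errors='replace')  # what a wrong guess produces
clean = decode_page(raw)

print(repr(garbled))
print(repr(clean))
```

In the recipe, the equivalent step would be decoding page right after the read() call and before the regexes run, once the real charset of the section pages has been checked.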
02-19-2011, 10:32 PM | #6 |
creator of calibre
Posts: 44,601
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I pick them up from here.
|