Old 01-21-2011, 05:34 PM   #1
spedinfargo
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
For Testing: Roger Ebert (movie reviews) Recipe

It felt like a good afternoon to learn Python, so I threw together a Roger Ebert recipe. Feel free to pull it down and give me some feedback...

A few notes:
1) There was no good RSS feed (there is one, but it's terrible), so I had to go the parse_index route.

2) The HTML is kind of a mess, so I couldn't figure out a good way to use BeautifulSoup. As a result the regexes are kind of messy. Hopefully they hold up.

3) I'm getting some strange characters in some of the articles. I don't know whether this is an encoding problem or something else.

4) Roger has been spending a ton of time on his blog lately. I want to pull that in eventually, but there isn't a printer-friendly version of any of his posts. Some of his web site is pretty much abandoned (esp. Movie Answer Man), and sometimes the main site links to his blog posts. I tried to filter those out, but once in a while you'll see a title of "Ebert Journal Post" with only an intro paragraph. When I incorporate his blog posts into the recipe this will hopefully go away...
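
For anyone who hasn't used the parse_index route from note 1: calibre expects parse_index to return a list of (section title, list of article dicts) pairs, where each dict carries a title, url, date and description. Below is a minimal standalone sketch of that shape in plain Python. The HTML snippet, the article IDs, the simplified pattern and the build_index helper are all made up for illustration; they are not the real site markup and not part of calibre.

```python
import re

PREFIX = 'http://rogerebert.suntimes.com'

# Hypothetical, heavily simplified stand-in for a section page.
SAMPLE_PAGE = '''
<div class="headline"><a href="/apps/pbcs.dll/article?AID=/20110119/COMMENTARY/110119979" id="ltred">First post</a></div>
<div class="blurb clear">Intro paragraph one.</div>
<div class="headline"><a href="/apps/pbcs.dll/article?AID=/20110112/COMMENTARY/110119988" id="ltred">Second post</a></div>
<div class="blurb clear">Intro paragraph two.</div>
'''

# Simplified pattern: group 1 = relative link, group 2 = link text, group 3 = blurb.
ARTICLE = re.compile(
    r'<a href="([^"]+)" id="ltred">(.*?)</a>.*?<div class="blurb clear">(.*?)</div>',
    re.IGNORECASE | re.DOTALL)


def build_index(sections):
    """Turn a list of (title, html) pairs into a list of (title, articles) pairs."""
    index = []
    for title, html in sections:
        articles = []
        for m in ARTICLE.finditer(html):
            articles.append({
                'title': m.group(2).strip(),
                'url': PREFIX + m.group(1),
                'date': '',
                'description': m.group(3).strip(),
            })
        index.append((title, articles))
    return index


index = build_index([('Commentary', SAMPLE_PAGE)])
for section, articles in index:
    print(section, len(articles))
    for a in articles:
        print('  ', a['title'], '->', a['url'])
```

In a real recipe the html would come from the section page and the loop would live inside parse_index; the returned list is what calibre turns into sections and articles.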

Download it from the next message in this thread...
Old 01-21-2011, 05:35 PM   #2
spedinfargo
Code:


import re
import urllib2
import time
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, SoupStrainer

class Ebert(BasicNewsRecipe):
    title                 = 'Roger Ebert'
    __author__            = 'Shane Erstad'
    description           = 'Roger Ebert Movie Reviews'
    publisher             = 'Chicago Sun Times'
    category              = 'movies'
    oldest_article        = 8
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'utf-8'
    masthead_url          = 'http://rogerebert.suntimes.com/graphics/global/roger.jpg'
    language              = 'en'
    remove_empty_feeds    = False
    PREFIX                  = 'http://rogerebert.suntimes.com'
    patternReviews                = r'<span class="*?movietitle"*?>(.*?)</span>.*?<div class="*?headline"*?>(.*?)</div>(.*?)</div>'
    patternCommentary       = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?COMMENTARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternPeople           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?PEOPLE.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternGlossary           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?GLOSSARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    


    conversion_options = {
                          'comment'          : description
                        , 'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : language
                        , 'linearize_tables' : True
                        }


    feeds          = [
                        (u'Reviews'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=reviews' )
                        ,(u'Commentary'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=COMMENTARY')
                        ,(u'Great Movies'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS08')
                        ,(u'People'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=PEOPLE')
                        ,(u'Glossary'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=GLOSSARY')
                        
                     ]

    preprocess_regexps = [
        (re.compile(r'<font.*?>.*?This is a printer friendly.*?</font>.*?<hr>', re.DOTALL|re.IGNORECASE),
            lambda m: '')
    ]
    


    def print_version(self, url):
        return url + '&template=printart'

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.log('\tFeedurl: ', feedurl)
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            page = urllib2.urlopen(feedurl).read()

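            # Pick the regex that matches this section's HTML layout.
            if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                pattern = self.patternReviews
            elif feedtitle == 'Commentary':
                pattern = self.patternCommentary
            elif feedtitle == 'People':
                pattern = self.patternPeople
            elif feedtitle == 'Glossary':
                pattern = self.patternGlossary
            else:
                # No pattern for this feed: skip it instead of silently
                # reusing the pattern left over from the previous feed.
                self.log('\tNo pattern for feed: ', feedtitle)
                continue
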
            regex = re.compile(pattern, re.IGNORECASE|re.DOTALL)

            for match in regex.finditer(page):
                if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                    movietitle = match.group(1)
                    thislink = match.group(2)
                    description = match.group(3)
                elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary':
                    thislink = match.group(1)
                    description = match.group(2)

                self.log(thislink)
                 
                for link in BeautifulSoup(thislink, parseOnlyThese=SoupStrainer('a')):
                    thisurl = self.PREFIX + link['href']
                    thislinktext = self.tag_to_string(link)

                    if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                        thistitle = movietitle
                    elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary':
                        thistitle = thislinktext

                    if thistitle == '':
                        thistitle = 'Ebert Journal Post'
                    
                    """
                    pattern2 = r'AID=\/(.*?)\/'
                    reg2 = re.compile(pattern2, re.IGNORECASE|re.DOTALL)
                    match2 = reg2.search(thisurl)
                    date = match2.group(1)
                    c = time.strptime(match2.group(1),"%Y%m%d")
                    date=time.strftime("%a, %b %d, %Y", c)
                    self.log(date)
                    """

                    articles.append({
                                      'title'      :thistitle
                                     ,'date'       :''
                                     ,'url'        :thisurl
                                     ,'description':description
                                    })
            totalfeeds.append((feedtitle, articles))

        return totalfeeds
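
The date code that is disabled inside the string literal near the end of parse_index above pulls the article date out of the AID= part of the URL. Here is a standalone sketch of that idea, assuming the first path segment after AID= is always a YYYYMMDD date. The sample URLs and the date_from_url helper are made up for illustration. It falls back to a default when the segment is missing or is not a date, which is where the original block would have raised an error.

```python
import re
import time

AID_DATE = re.compile(r'AID=/(.*?)/', re.IGNORECASE)


def date_from_url(url, fallback=''):
    """Return a readable date parsed from the AID segment, or fallback."""
    m = AID_DATE.search(url)
    if not m:
        return fallback
    try:
        parsed = time.strptime(m.group(1), '%Y%m%d')
    except ValueError:
        # The segment was present but was not a YYYYMMDD date.
        return fallback
    return time.strftime('%Y-%m-%d', parsed)


good = date_from_url('http://rogerebert.suntimes.com/apps/pbcs.dll/article?AID=/20110119/REVIEWS/110119979')
bad = date_from_url('http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS', fallback='unknown')
print(good)
print(bad)
```

The numeric output format is used here only because day and month names depend on the locale; any strftime format would work the same way.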
Old 01-21-2011, 05:36 PM   #3
spedinfargo
By the way, this is my first Python code, which means it's my first recipe as well. Any code review, hints, etc. would be appreciated. Suggestions for more functionality are also welcome...
Old 01-22-2011, 10:30 AM   #4
kovidgoyal
creator of calibre
Posts: 43,866
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Looks fine. Do note that you can use regexes in BeautifulSoup tests to match text, attribute, and tag name values.
Old 02-19-2011, 07:45 PM   #5
spedinfargo
Updated version. Kovid, is there something I should do to check in my changes, or do you just copy and paste from here?

Code:
import re
import urllib2
import time
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, SoupStrainer
from calibre import strftime

'''
      Help Needed:
       I still can't figure out why I'm getting strange characters, especially in the Great Movies descriptions in the TOC.
       Can anyone help me figure that out?
       
      Change Log:
       2011-02-19:  Version 2:  Added "Oscars" section and fixed date problem
'''

class Ebert(BasicNewsRecipe):
    title                 = 'Roger Ebert'
    __author__            = 'Shane Erstad'
    version               = 2
    description           = 'Roger Ebert Movie Reviews'
    publisher             = 'Chicago Sun Times'
    category              = 'movies'
    oldest_article        = 8
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'UTF-8'
    masthead_url          = 'http://rogerebert.suntimes.com/graphics/global/roger.jpg'
    language              = 'en'
    remove_empty_feeds    = False
    PREFIX                  = 'http://rogerebert.suntimes.com'
    patternReviews                = r'<span class="*?movietitle"*?>(.*?)</span>.*?<div class="*?headline"*?>(.*?)</div>(.*?)</div>'
    patternCommentary       = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?COMMENTARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternPeople           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?PEOPLE.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternOscars           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?OSCARS.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    patternGlossary           = r'<div class="*?headline"*?>.*?(<a href="/apps/pbcs.dll/article\?AID=.*?GLOSSARY.*?" id="ltred">.*?</a>).*?<div class="blurb clear">(.*?)</div>'
    


    conversion_options = {
                          'comment'          : description
                        , 'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : language
                        , 'linearize_tables' : True
                        }


    feeds          = [
                        (u'Reviews'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=reviews' )
                        ,(u'Commentary'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=COMMENTARY')
                        ,(u'Great Movies'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=REVIEWS08')
                        ,(u'People'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=PEOPLE')
                        ,(u'Oscars'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=OSCARS')
                        ,(u'Glossary'   , u'http://rogerebert.suntimes.com/apps/pbcs.dll/section?category=GLOSSARY')
                        
                     ]

    preprocess_regexps = [
        (re.compile(r'<font.*?>.*?This is a printer friendly.*?</font>.*?<hr>', re.DOTALL|re.IGNORECASE),
            lambda m: '')
    ]
    


    def print_version(self, url):
        return url + '&template=printart'

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.log('\tFeedurl: ', feedurl)
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            page = urllib2.urlopen(feedurl).read()

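            # Pick the regex that matches this section's HTML layout.
            if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                pattern = self.patternReviews
            elif feedtitle == 'Commentary':
                pattern = self.patternCommentary
            elif feedtitle == 'People':
                pattern = self.patternPeople
            elif feedtitle == 'Glossary':
                pattern = self.patternGlossary
            elif feedtitle == 'Oscars':
                pattern = self.patternOscars
            else:
                # No pattern for this feed: skip it instead of silently
                # reusing the pattern left over from the previous feed.
                self.log('\tNo pattern for feed: ', feedtitle)
                continue
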
            regex = re.compile(pattern, re.IGNORECASE|re.DOTALL)

            for match in regex.finditer(page):
                if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                    movietitle = match.group(1)
                    thislink = match.group(2)
                    description = match.group(3)
                elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary' or feedtitle == 'Oscars':
                    thislink = match.group(1)
                    description = match.group(2)

                self.log(thislink)
                 
                for link in BeautifulSoup(thislink, parseOnlyThese=SoupStrainer('a')):
                    thisurl = self.PREFIX + link['href']
                    thislinktext = self.tag_to_string(link)

                    if feedtitle == 'Reviews' or feedtitle == 'Great Movies':
                        thistitle = movietitle
                    elif feedtitle == 'Commentary' or feedtitle == 'People' or feedtitle == 'Glossary' or feedtitle == 'Oscars':
                        thistitle = thislinktext

                    if thistitle == '':
                        continue
                    
                    
                    pattern2 = r'AID=\/(.*?)\/'
                    reg2 = re.compile(pattern2, re.IGNORECASE|re.DOTALL)
                    match2 = reg2.search(thisurl)
                    if match2:
                        c = time.strptime(match2.group(1),"%Y%m%d")
                        mydate=strftime("%A, %B %d, %Y", c)
                    else:
                        mydate = strftime("%A, %B %d, %Y")
                    self.log(mydate)
                    
                    articles.append({
                                      'title'      :thistitle
                                     ,'date'       :'  [' + mydate + ']'
                                     ,'url'        :thisurl
                                     ,'description':description
                                    })
            totalfeeds.append((feedtitle, articles))

        return totalfeeds
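
On the strange characters mentioned in the Help Needed note: one common cause (a guess, not something confirmed for this site) is that the page bytes are in one encoding, such as windows-1252, while the text gets decoded as another. The sketch below uses made-up sample text to show how the same characters come out wrong or right depending on the codec.

```python
# Hypothetical blurb containing typographic quotes and an accented letter.
original = u'\u201cAm\u00e9lie\u201d is a delight'

# Pretend the server sent the text encoded as windows-1252.
raw = original.encode('cp1252')

# Decoding those bytes with the wrong codec yields replacement characters...
wrong = raw.decode('utf-8', 'replace')
# ...while the matching codec recovers the text.
right = raw.decode('cp1252')

# The reverse mistake: UTF-8 bytes read as windows-1252 give the familiar
# multi-character garbage in place of each quote or accent.
mojibake = original.encode('utf-8').decode('cp1252', 'replace')

print(repr(wrong))
print(repr(right))
print(repr(mojibake))
```

If this does turn out to be the cause, note that the page read in parse_index is raw bytes, so decoding it explicitly with whatever encoding the site actually serves before running the regexes might be worth trying for the TOC descriptions.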
Old 02-19-2011, 09:32 PM   #6
kovidgoyal
I pick them up from here.

All times are GMT -4.