Old 10-13-2013, 08:25 PM   #6
Snarkastica
TWoP Recap Recipe

Not sure if you're still interested in this recipe, but I'm a huge fan of TWoP and I've been looking for a way to capture their recaps in a format like this, so I ended up writing one. This will grab all pages from a multipage recap and make them into a single article.

The following code can be used in a few ways:

1) Grab the latest recaps for all active shows from the RSS feed. This is the default configuration.

2) You can also grab the latest recaps from a specific show by adding its RSS feed to the feeds list. http://www.televisionwithoutpity.com...W-NAME/rss.xml is the usual format.

3) By making a couple of small modifications, you can instead pull down a show's entire collection of recaps. I did this with parse_index because the individual show feeds don't contain links to all episodes. If you do this, I'd recommend uncommenting reverse_article_order as well, so you get the recaps in show order.
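For option 3, the edits amount to a few lines inside the recipe class. This is just a sketch of the three changes described above, using the names from the recipe itself (SHOW-NAME-HERE is a placeholder, as in the recipe):

```python
# Whole-show mode: the edits inside the TelevisionWithoutPity class (sketch).
SHOW = 'http://www.televisionwithoutpity.com/show/SHOW-NAME-HERE/recaps/'

# feeds = [...]                # 1. comment out the RSS feeds list
reverse_article_order = True   # 2. uncomment so recaps appear in episode order
# 3. uncomment the parse_index() method further down in the recipe
```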

Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe
from BeautifulSoup import Tag

class TelevisionWithoutPity(BasicNewsRecipe):
    title          = u'Television Without Pity'
    language       = 'en'
    __author__     = 'Snarkastica'
    SHOW = 'http://www.televisionwithoutpity.com/show/SHOW-NAME-HERE/recaps/' # Used for pulling down an entire show, not just the RSS feed
    oldest_article = 7 #days
    max_articles_per_feed = 25
    #reverse_article_order=True # Useful for entire show, to display in episode order
    #encoding = 'cp1252'
    use_embedded_content = False

    preprocess_regexps = [
        (re.compile(r'<span class="headline_recap_title .*?>', re.DOTALL | re.IGNORECASE),
         lambda match: '<span class="headline_recap_title">'),
    ]
    keep_only_tags = [
        dict(name='span', attrs={'class': 'headline_recap_title'}),
        dict(name='p', attrs={'class': 'byline'}),
        dict(name='div', attrs={'class': 'body_recap'}),
        dict(name='h1'),
    ]
    no_stylesheets = True

    # Comment this out and uncomment parse_index() below to retrieve a single show
    feeds = [
        ('Latest Recaps', 'http://www.televisionwithoutpity.com/rss.xml'),
    ]

    '''
    This method can be used to grab all recaps for a single show.
    Set the SHOW constant at the top of this recipe to the URL of the show's
    recap page (the page listing all recaps, usually of the form
    http://www.televisionwithoutpity.com/show/SHOW-NAME/recaps/
    where SHOW-NAME is the hyphenated name of the show).

    To use:
    1. Comment out feeds = [...] earlier in this file
    2. Set the SHOW constant to the show's recap page
    3. Uncomment the following function
    '''

    '''
    def parse_index(self):
        soup = self.index_to_soup(self.SHOW)
        feeds = []
        articles = []
        showTitle = soup.find('h1').string
        recaps = soup.find('table')
        for ep in recaps.findAll('tr'):
            epData = ep.findAll('td')
            epNum = epData[0].find(text=True).strip()
            if not epNum == "Ep.":
                epT = self.tag_to_string(epData[1].find('em')).strip()
                epST = " (or " + self.tag_to_string(epData[1].find('h3')).strip() + ")"
                epTitle = epNum + ": " + epT + epST
                epData[1].find('em').extract()
                epURL = epData[1].find('a', href=True)
                epURL = epURL['href']
                epSum = self.tag_to_string(epData[1].find('p')).strip()
                epDate = epData[2].find(text=True).strip()
                epAuthor = self.tag_to_string(epData[4].find('p')).strip()
                articles.append({'title':epTitle, 'url':epURL, 'description':epSum, 'date':epDate, 'author':epAuthor})
        feeds.append((showTitle, articles))
        #self.abort_recipe_processing("test")
        return feeds
    '''

    # Adds subsequent pages of a multipage recap to a single article page
    def append_page(self, soup, appendtag, position):
        pages = soup.find('p', attrs={'class': 'pages'})  # absent on single-page recaplets
        if pages:
            pager = pages.find(text='Next')
            if pager:
                nexturl = pager.parent['href']
                soup2 = self.index_to_soup(nexturl)
                texttag = soup2.find('div', attrs={'class': 'body_recap'})
                for it in texttag.findAll(style=True):
                    del it['style']
                newpos = len(texttag.contents)
                self.append_page(soup2, texttag, newpos)
                texttag.extract()
                appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return soup

    # Remove the multipage links (we had to keep them in for append_page(), but
    # they can go away now). Could have used CSS to hide them, but some readers
    # ignore CSS.
    def postprocess_html(self, soup, first_fetch):
        for p in soup.findAll('p', attrs={'class': 'pages'}):
            p.extract()

        # TODO: Convert the headline_recap_title span into a heading 1
        #titleTag = Tag(soup, "h1")
        #repTag = soup.find('span', attrs={'class':'headline_recap_title'})
        #titleTag.insert(0, repTag.contents[0])
        #repTag.extract()
        #soup.body.insert(1, titleTag)
        return soup
This is the first recipe I've done, so maybe there are a few things I could do differently, but it worked for my purposes. If anyone has suggestions, I'm happy to learn.

There are a couple of TODOs for the next version: converting the episode headline into a heading 1, and, when pulling an entire show, making each season its own section instead of putting all episodes in one.
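For the first TODO, one possible approach (an untested sketch, not part of the recipe above) is to add a second pass to preprocess_regexps rather than manipulating the soup: the existing rule already normalises the headline span's class attribute, so a follow-up substitution can promote it to an h1. Since keep_only_tags already keeps h1 tags, the headline would survive the pruning step. The substitution itself looks like this:

```python
import re

# Hypothetical extra entry for preprocess_regexps: promote the normalised
# headline span to an <h1>.
headline_to_h1 = (
    re.compile(r'<span class="headline_recap_title">(.*?)</span>', re.DOTALL),
    lambda match: '<h1>' + match.group(1) + '</h1>',
)

# Standalone demonstration of the substitution on a sample fragment:
pattern, repl = headline_to_h1
html = '<span class="headline_recap_title">Pilot</span>'
print(pattern.sub(repl, html))  # <h1>Pilot</h1>
```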

Hope this works for you. Let me know if you have questions.