MobileRead Forums - View Single Post

Starson17 · 04-15-2011, 05:17 PM

Quote:

Originally Posted by Starson17

I'll check my main system this weekend and try to post it here for you and Kovid (when his eye gets better).

There were some errors in the RSS feed, and I recall I was holding off because I thought they'd eventually fix them. They didn't, so I worked around them in the recipe.
Try this:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
import re

class BigOven(BasicNewsRecipe):
    title               = 'BigOven'
    __author__          = 'Starson17'
    description         = 'Recipes for the Foodie in us all. Registration is free. A fake username and password just gives smaller photos.'
    language            = 'en'
    category            = 'news, food, recipes, gourmet'
    publisher           = 'Starson17'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    cover_url           = 'http://www.software.com/images/products/BigOven%20Logo_177_216.JPG'
    max_articles_per_feed = 30
    needs_subscription  = True

    conversion_options = {'linearize_tables'  : True
                        , 'comment'           : description
                        , 'tags'              : category
                        , 'publisher'         : publisher
                        , 'language'          : language
                        }
    
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://www.bigoven.com/account/login?ReturnUrl=/')
            br.select_form(nr=1)
            br['Email']  = self.username
            br['Password'] = self.password
            br.submit()
        return br

    remove_attributes = ['style', 'font']

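    def get_article_url(self, article):
        url = article.get('feedburner_origlink', article.get('link', None))
        # The feed emits links with the host doubled ('comhttp//www.bigoven.com').
        # Collapse that to a single 'com', and leave well-formed links untouched.
        if url:
            front, middle, end = url.partition('comhttp//www.bigoven.com')
            if middle:
                url = front + 'com' + end
        return url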

    keep_only_tags = [dict(name='div', attrs={'id':['nosidebar_main']})]

    remove_tags_after = [dict(name='div', attrs={'class':['display-field']})]
    
    remove_tags =  [dict(name='ul', attrs={'class':['tabs']})]
     
    preprocess_regexps = [
        # Escape '?' so it matches literally instead of making the final 'n' optional.
        (re.compile(r'Want detailed nutrition information\?', re.DOTALL), lambda match: ''),
        (re.compile(r'\(You could win \$100 in our ', re.DOTALL), lambda match: ''),
    ]
   
    def preprocess_html(self, soup):
        for tag in soup.findAll(name='a', text=re.compile(r'.*View Metric.*', re.DOTALL)):
            tag.parent.parent.extract()
        for tag in soup.findAll(text=re.compile(r'.*Try BigOven Pro for Free.*', re.DOTALL)):
            tag.extract()
        for tag in soup.findAll(text=re.compile(r'.*Add my photo of this recipe.*', re.DOTALL)):
            tag.parent.extract()
        for tag in soup.findAll(name='a', text=re.compile(r'.*photo contest.*', re.DOTALL)):
            tag.parent.extract()
        for tag in soup.findAll(name='a', text='Remove ads'):
            tag.parent.parent.extract()
        for tag in soup.findAll(name='ol', attrs={'class':['recipe-tags']}):
            tag.parent.extract()
        return soup

    feeds = [(u'Recent Raves', u'http://www.bigoven.com/rss/recentraves'),
                   (u'Recipe Of The Day', u'http://feeds.feedburner.com/bigovencom-RecipeOfTheDay')]

If you see anything that needs fixing, let me know. The site has changed significantly, so I may have missed some cleanup. I was showing someone how to write recipes, so this has a variety of methods of removing junk. It may not be the most efficient in all cases, but it works.
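
If you want to check the feed-link repair and the regex cleanup without running a full calibre fetch, something like the following may help. It is only a rough sketch: the helper names `repair_link` and `strip_junk` are made up for illustration and are not part of calibre, and the sample URLs and HTML are invented. It assumes the malformed links look like the marker string used in `get_article_url`.

```python
import re

# Marker left in links by the malformed feed (the string the recipe looks for).
BAD_MARKER = 'comhttp//www.bigoven.com'

# The two junk phrases from preprocess_regexps, written as raw strings
# with '?' escaped so it matches a literal question mark.
JUNK_PATTERNS = [
    re.compile(r'Want detailed nutrition information\?'),
    re.compile(r'\(You could win \$100 in our '),
]


def repair_link(url):
    """Collapse the doubled host in a malformed link; leave other links alone."""
    if not url:
        return url
    front, marker, end = url.partition(BAD_MARKER)
    # partition() returns an empty marker when the separator is absent,
    # so only rebuild the URL when the marker was actually found.
    return front + 'com' + end if marker else url


def strip_junk(html):
    """Delete every occurrence of the junk phrases from raw HTML text."""
    for pattern in JUNK_PATTERNS:
        html = pattern.sub('', html)
    return html


if __name__ == '__main__':
    # Hypothetical sample values, not taken from the real feed.
    print(repair_link('http://www.bigoven.comhttp//www.bigoven.com/recipe/example'))
    print(repair_link('http://www.bigoven.com/recipe/example'))
    print(strip_junk('<p>Want detailed nutrition information? Soup</p>'))
```

The guard on the marker matters because a link that is already well formed would otherwise get a stray 'com' appended to it.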

If it seems to work for you, let us know, and I'm sure Kovid will update the built-in recipe when he's feeling better.