03-15-2010, 10:15 PM   #1609
Hamlet53
Nameless Being
 
Revised SFBG recipe

I had requested a recipe for the San Francisco Bay Guardian, and it was included in the latest release of Calibre. Unfortunately, the stock recipe downloads only a small part of the weekly paper. I understand why: on the main RSS page of the SFBG web site, the feed labeled "Main Site (everything)" does not actually contain everything. Using the stock recipe as a guide, I have prepared the expanded version below, which still does not fetch everything but does fetch a lot more, in case anyone else is interested.

Spoiler:

from calibre.web.feeds.news import BasicNewsRecipe

class SanFranciscoBayGuardian(BasicNewsRecipe):
    title = u'San Francisco Bay Guardian'
    language = 'en'
    __author__ = 'Krittika Goyal'
    oldest_article = 31 #days
    max_articles_per_feed = 25
    #encoding = 'latin1'

    no_stylesheets = True
    #remove_tags_before = dict(name='div', attrs={'id':'story_header'})
    #remove_tags_after = dict(name='div', attrs={'id':'shirttail'})
    remove_tags = [
        dict(name='iframe'),
        #dict(name='div', attrs={'class':'related-articles'}),
        #dict(name='div', attrs={'id':['story_tools', 'toolbox', 'shirttail', 'comment_widget']}),
        #dict(name='ul', attrs={'class':'article-tools'}),
        #dict(name='ul', attrs={'id':'story_tabs'}),
    ]

    feeds = [
        ('sfbg', 'http://www.sfbg.com/rss.xml'),
        ('politics', 'http://www.sfbg.com/politics/rss.xml'),
        ('blogs', 'http://www.sfbg.com/blog/rss.xml'),
        ('pixel_vision', 'http://www.sfbg.com/pixel_vision/rss.xml'),
        ('bruce', 'http://www.sfbg.com/bruce/rss.xml'),
    ]

    #def preprocess_html(self, soup):
        #story = soup.find(name='div', attrs={'id':'story_body'})
        #td = heading.findParent(name='td')
        #td.extract()
        #soup = BeautifulSoup('<html><head><title>t</title></head><body></body></html>')
        #body = soup.find(name='body')
        #body.insert(0, story)
        #return soup
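
If anyone wants to add more sections, the feeds list can also be generated rather than typed out. This is only a rough sketch: it assumes every section feed follows the http://www.sfbg.com/<section>/rss.xml pattern seen in the list above, and build_feeds, BASE_URL and SECTIONS are names I made up for illustration, not part of Calibre.

```python
# Hypothetical helper, not part of calibre: builds the (title, url) pairs
# that a recipe's `feeds` attribute expects.
BASE_URL = 'http://www.sfbg.com'

def build_feeds(sections):
    """Return a feeds list: the main site feed plus one feed per section.

    `sections` is a list of (feed_title, url_path) pairs. The
    <BASE_URL>/<url_path>/rss.xml pattern is an assumption based on the
    feeds in the recipe above.
    """
    feeds = [('sfbg', BASE_URL + '/rss.xml')]
    for title, path in sections:
        feeds.append((title, '%s/%s/rss.xml' % (BASE_URL, path)))
    return feeds

# The sections used in the recipe above; note 'blogs' lives under /blog/.
SECTIONS = [
    ('politics', 'politics'),
    ('blogs', 'blog'),
    ('pixel_vision', 'pixel_vision'),
    ('bruce', 'bruce'),
]

feeds = build_feeds(SECTIONS)
```

Inside the recipe class you would then write feeds = build_feeds(SECTIONS) in place of the literal list. Whether other sections actually publish a feed at that address would have to be checked on the site.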