The Appleton Post Crescent Recipe - Take Two
Hope I did this right.
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2009 Kovid Goyal <kovid at kovidgoyal.net>'

import re

from calibre.web.feeds.news import BasicNewsRecipe


class AppletonPostCrescent(BasicNewsRecipe):
    title = u'Appleton Post Crescent'
    oldest_article = 2
    language = 'en'
    __author__ = 'Joseph Kitzmiller and Sujata Raman'
    max_articles_per_feed = 25
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    encoding = 'cp1252'
    cover_url = u'http://www.postcrescent.com/ic/assets/frontpage.pdf'
    publisher = 'Appleton Post Crescent, Gannett'
    category = 'news, Appleton, Fox Cities, Wisconsin'

    # Styling applied to the downloaded articles.
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-size:large; color:#0E5398; }
        h2{color:#666666;}
        .blog_title{color:#4E0000; font-family:Georgia,"Times New Roman",Times,serif; font-size:large;}
        .sidebar-photo{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:30%;}
        .blog_post{font-family:Arial,Helvetica,sans-serif; color:#222222; font-size:xx-small;}
        .article-bodytext{font-family:Arial,Helvetica,sans-serif; font-size:xx-small; color:#222222;font-weight:normal;}
        .ratingbyline{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:50%;}
        .author{font-family:Arial,Helvetica,sans-serif; color:#777777; font-size:50%;}
        .date{font-family:Arial,Helvetica,sans-serif; color:#777777; font-size:50%;}
        .padding{font-family:Arial,Helvetica,sans-serif; font-size:70%; color:#222222; font-weight:normal;}
    '''

    # Strip an empty paragraph followed by a div. Note that '.*' is greedy,
    # so with DOTALL this removes everything up to the last '</div>'.
    preprocess_regexps = [
        (re.compile(r'<p></p><div.*</div>', re.IGNORECASE | re.DOTALL), lambda match: ''),
    ]

    # Keep only the article body and its sidebar photo.
    keep_only_tags = [dict(name='div', attrs={'class': ['padding', 'sidebar-photo']})]

    # Drop embedded media, comments, tools and related-links clutter.
    remove_tags = [
        dict(name=['object', 'link', 'table', 'embed', 'script', 'noscript']),
        dict(name='div', attrs={'id': ['pluckcomments', 'StoryChat']}),
        dict(name='div', attrs={'class': ['article-tools', 'padding article-sidebar', 'articleflex-container', 'poster-container', 'newslist', 'footer-container', 'sidebar-related', 'sub']}),
        dict(name='p', attrs={'class': ['posted', 'tags']}),
    ]

    feeds = [
        (u'Breaking News', u'http://www.postcrescent.com/apps/pbcs.dll/misc?URL=/templates/RSSbreaking.pbs&mime=xml'),
        (u'Latest Headlines', u'http://www.postcrescent.com/apps/pbcs.dll/misc?URL=/templates/RSSlatest.pbs&mime=xml'),
        (u'Local News', u'http://www.postcrescent.com/apps/pbcs.dll/misc?URL=/templates/RSSlocal.pbs&mime=xml'),
        (u'Sports', u'http://www.postcrescent.com/apps/pbcs.dll/misc?URL=/templates/RSSsports.pbs&mime=xml'),
        (u'Buzz Blog', u'http://sitelife.postcrescent.com/ver1.0/Blog/BlogRss?plckBlogId=Blog:9a8980f0-f726-439c-8c4e-1dc0f788941e'),
        (u'Weekend Blog', u'http://sitelife.postcrescent.com/ver1.0/Blog/BlogRss?plckBlogId=Blog:9dbf4deb-0468-41b7-a0c7-3a777c03d64c'),
    ]

    def preprocess_html(self, soup):
        # Remove inline styles and font faces so extra_css takes effect.
        for item in soup.findAll(style=True):
            del item['style']
        for item in soup.findAll(face=True):
            del item['face']
        return soup
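One thing I'm unsure about is the pattern in preprocess_regexps. Since '.*' is greedy and re.DOTALL lets it cross newlines, it may swallow more than the one div I meant to drop. Here is a small standalone check that needs only the re module, not calibre. The sample HTML is made up for illustration and is not copied from the Post Crescent site, and the non-greedy variant is just an alternative I'm considering, not something I've confirmed works on the real pages.

```python
import re

# Same pattern as in the recipe's preprocess_regexps.
greedy = re.compile(r'<p></p><div.*</div>', re.IGNORECASE | re.DOTALL)

# Hypothetical alternative: '.*?' stops at the first closing </div>.
lazy = re.compile(r'<p></p><div.*?</div>', re.IGNORECASE | re.DOTALL)

# Made-up fragment: an unwanted div sits between two real paragraphs.
sample = (
    '<div class="padding"><p>First paragraph.</p>'
    '<p></p><div class="ad">junk</div>'
    '<p>Second paragraph.</p></div>'
)

cleaned_greedy = greedy.sub('', sample)
cleaned_lazy = lazy.sub('', sample)

print(cleaned_greedy)  # the second paragraph is gone as well
print(cleaned_lazy)    # only the unwanted div is removed
```

On this sample the greedy pattern removes everything through the final closing div, including the second paragraph, while the non-greedy one removes only the junk div. Whether that matters depends on how the real article pages are laid out.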
As for the API page you referenced, I did look it over. I, too, tried using filter_regexps, to no avail. I'll admit I haven't studied that page thoroughly, thanks to a combination of confusion, frustration and tiredness. However, if you're still willing to share your expertise with parse_index, that would be wonderful.
Edit: FYI, I've been studying the pages' HTML using Firebug in Firefox, if that helps.