Old 07-04-2010, 01:48 AM   #2243
schnortz
The Appleton Post Crescent Recipe - Take Two
Hope I did this right

Spoiler:
Code:
#!/usr/bin/env python
__license__   = 'GPL v3'
__copyright__ = '2009 Kovid Goyal <kovid at kovidgoyal.net>'

# re is needed for preprocess_regexps below
import re

from calibre.web.feeds.news import BasicNewsRecipe

class AppletonPostCrescent(BasicNewsRecipe):
    title          = u'Appleton Post Crescent'
    oldest_article = 2
    language = 'en'

    __author__     = 'Joseph Kitzmiller and Sujata Raman'
    max_articles_per_feed = 25
    no_stylesheets        = True
    use_embedded_content  = False
    remove_javascript     = True
    encoding = 'cp1252'
    cover_url  = u'http://www.postcrescent.com/ic/assets/frontpage.pdf'
    publisher              = 'Appleton Post Crescent, Gannett'
    category               = 'news, Appleton, Fox Cities, Wisconsin'

    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-size:large; color:#0E5398; }
                    h2{color:#666666;}
                   .blog_title{color:#4E0000; font-family:Georgia,"Times New Roman",Times,serif; font-size:large;}
                   .sidebar-photo{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:30%;}
                   .blog_post{font-family:Arial,Helvetica,sans-serif; color:#222222; font-size:xx-small;}
                   .article-bodytext{font-family:Arial,Helvetica,sans-serif; font-size:xx-small; color:#222222;font-weight:normal;}
                   .ratingbyline{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:50%;}
                   .author{font-family:Arial,Helvetica,sans-serif; color:#777777; font-size:50%;}
                   .date{font-family:Arial,Helvetica,sans-serif; color:#777777; font-size:50%;}
                   .padding{font-family:Arial,Helvetica,sans-serif; font-size:70%; color:#222222; font-weight:normal;}
                    '''

    preprocess_regexps = [
                         (re.compile(r'<p></p><div.*</div>', re.IGNORECASE | re.DOTALL), lambda match : r''),
                         ]
				
    keep_only_tags = [dict(name='div', attrs={'class':['padding','sidebar-photo']})]

    remove_tags = [ dict(name=['object','link','table','embed','script', 'noscript'])
                    ,dict(name='div',attrs={'id':["pluckcomments","StoryChat"]})
                    ,dict(name='div',attrs={'class':['article-tools',"padding article-sidebar",'articleflex-container','poster-container','newslist','footer-container','sidebar-related','sub']})
                    ,dict(name='p',attrs={'class':['posted','tags']})]

    feeds = [(u'Breaking News', u'http://www.postcrescent.com/apps/pbcs.dll/misc?URL=/templates/RSSbreaking.pbs&mime=xml'),
             (u'Latest Headlines', u'http://www.postcrescent.com/apps/pbcs.dll/misc?URL=/templates/RSSlatest.pbs&mime=xml'),
             (u'Local News', u'http://www.postcrescent.com/apps/pbcs.dll/misc?URL=/templates/RSSlocal.pbs&mime=xml'),
             (u'Sports', u'http://www.postcrescent.com/apps/pbcs.dll/misc?URL=/templates/RSSsports.pbs&mime=xml'),
             (u'Buzz Blog', u'http://sitelife.postcrescent.com/ver1.0/Blog/BlogRss?plckBlogId=Blog:9a8980f0-f726-439c-8c4e-1dc0f788941e'),
             (u'Weekend Blog', u'http://sitelife.postcrescent.com/ver1.0/Blog/BlogRss?plckBlogId=Blog:9dbf4deb-0468-41b7-a0c7-3a777c03d64c')]
				

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        for item in soup.findAll(face=True):
            del item['face']
        return soup


As for the API page you referenced, I did look it over. I, too, tried using filter_regexps, to no avail. I'll admit I haven't studied that page thoroughly, thanks to a combination of confusion, frustration and tiredness. However, if you're still willing to share your expertise with parse_index, that would be wonderful.

Edit: FYI, I've been studying the pages' HTML using Firebug in Firefox, if that helps.
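
For anyone following along, here is a rough sketch of what the preprocess_regexps entry in the recipe does when run on its own, outside calibre. Only the pattern and flags come from the recipe; the sample HTML and the names sample, strip_block and lazy_pattern are made up for illustration, and the real pages may well look different. Because `.*` is greedy and DOTALL lets it cross newlines, the match runs from the first `<p></p><div` to the last `</div>` on the page.

```python
import re

# Same pattern and flags as the recipe's preprocess_regexps entry.
pattern = re.compile(r'<p></p><div.*</div>', re.IGNORECASE | re.DOTALL)

# Hypothetical alternative (not from the recipe): a non-greedy variant
# that stops at the first closing </div> instead of the last one.
lazy_pattern = re.compile(r'<p></p><div.*?</div>', re.IGNORECASE | re.DOTALL)

# Invented page fragment, for illustration only.
sample = (
    '<div class="padding"><p>Story text.</p>'
    '<p></p><div class="ad">Ad one</div>\n'
    '<p>More story text.</p>'
    '<div class="ad">Ad two</div></div>'
)


def strip_block(regex, html):
    """Replace every match with an empty string, as the recipe's lambda does."""
    return regex.sub(lambda match: '', html)


cleaned = strip_block(pattern, sample)
lazy_cleaned = strip_block(lazy_pattern, sample)

print(cleaned)       # greedy: everything after the first story paragraph is gone
print(lazy_cleaned)  # non-greedy: only the first ad block is removed
```

On this made-up input the greedy pattern also swallows the second paragraph, so if article text goes missing on the real pages, that pattern might be one place to look. I can't say whether the non-greedy version suits the actual site.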