05-17-2010, 05:10 PM   #1929
mwheinz
award-winning bozo
Posts: 241
Karma: 157113
Join Date: Sep 2009
Location: Philadelphia
Device: Sony PRS-600
American Prospect Recipe

sdow1, try this recipe. It's very simple and, at the moment, strips out all formatting.

Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1273850169(BasicNewsRecipe):
    title          = u'American Prospect'
    oldest_article = 7
    max_articles_per_feed = 100
    recursions = 0
    no_stylesheets = True
    remove_javascript = True

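    # Keep only paragraphs and images; everything else is dropped.
    keep_only_tags = [dict(name=['p','img'])]

    # Regexp substitutions run on the downloaded HTML, in order.
    preprocess_regexps = [
        # Strip carriage returns.
        (re.compile('\r'),lambda match: ''),
        # Reduce <head> to just the <title> element.
        (re.compile(r'<head.*?<title>', re.DOTALL|re.IGNORECASE), lambda match: '<head><title>'),
        (re.compile(r'</title>.*?</head>', re.DOTALL|re.IGNORECASE), lambda match: '</title></head>'),
        # Drop everything in <body> ahead of the article's container div.
        (re.compile(r'<body.*?<div class="pad_10L10R">', re.DOTALL|re.IGNORECASE), lambda match: '<body><div>'),
        # Greedy match: drop everything from the first </div> up to </body>.
        (re.compile(r'</div>.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</div></body>'),
    ]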

    feeds       = [(u'Articles', u'feed://www.prospect.org/articles_rss.jsp')]
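
If you want to see what the preprocess_regexps rules do without running a full fetch, here is a rough standalone sketch that runs the same five patterns over a made-up page. The sample HTML and the apply_rules helper are hypothetical, written only for illustration. It assumes calibre applies the rules in list order with something equivalent to pattern.sub, which I have not checked against its source.

```python
import re

# Same (pattern, replacement) pairs as in the recipe.
preprocess_regexps = [
    (re.compile('\r'), lambda match: ''),
    (re.compile(r'<head.*?<title>', re.DOTALL | re.IGNORECASE), lambda match: '<head><title>'),
    (re.compile(r'</title>.*?</head>', re.DOTALL | re.IGNORECASE), lambda match: '</title></head>'),
    (re.compile(r'<body.*?<div class="pad_10L10R">', re.DOTALL | re.IGNORECASE), lambda match: '<body><div>'),
    (re.compile(r'</div>.*</body>', re.DOTALL | re.IGNORECASE), lambda match: '</div></body>'),
]


def apply_rules(html, rules):
    """Hypothetical helper: apply each substitution in list order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Made-up page; only the pad_10L10R class name comes from the recipe.
sample = (
    '<html><head><meta charset="utf-8">\r\n'
    '<title>Test Article</title>\r\n'
    '<link rel="stylesheet" href="x.css"></head>\r\n'
    '<body><div id="nav">menu</div>\r\n'
    '<div class="pad_10L10R"><p>Story text.</p></div>\r\n'
    '<div id="footer">footer</div></body></html>'
)

cleaned = apply_rules(sample, preprocess_regexps)
print(cleaned)
```

On this sample the nav and footer divs disappear and only the title and the story paragraph are left. Note that the last pattern is greedy, so it cuts at the first closing div after the article container opens; an article body with nested divs would get truncated there.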

Last edited by mwheinz; 05-17-2010 at 07:44 PM.