MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

09-16-2010, 03:35 PM	#2729
TonytheBookworm Addict Posts: 264 Karma: 62 Join Date: May 2010 Device: kindle 2, kindle 3, Kindle fire

Starson17,

Hey, sorry to ask this question yet again. I simply am not understanding it yet, even after reading the documentation and some of the code you have posted. Basically I'm wondering why this will not work...

Code:
    def preprocess_html(self, soup):
        for credit_tag in soup.findAll('span', attrs={'class':['imageCredit rightFloat']}):
            p = Tag(soup, 'p')
            span.replaceWith(p)
            p.insert(0, span)
        return soup

What I'm trying to do is search for all the span tags that contain imageCredit... and then make the span tag a <p> tag so it will format better. As a result, though, I get no soup and the article is blank.

Here is the full code. I was just trying to clean up the ajc recipe a little bit.

[code]
class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'The AJC'
    __author__ = 'TonytheBookworm'
    description = 'News from Atlanta and USA'
    publisher = 'The Atlanta Journal'
    category = 'news, politics, USA'
    oldest_article = 1
    max_articles_per_feed = 100
    no_stylesheets = True

    masthead_url = 'http://gawand.org/wp-content/uploads/2010/06/ajc-logo.gif'
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''

    keep_only_tags = [
                         dict(name='div', attrs={'class':['cxArticleHeader']})
                        ,dict(attrs={'id':['cxArticleText']})
                     ]

    remove_tags = [
                     dict(name='div' , attrs={'class':'cxArticleList' })
                    ,dict(name='div' , attrs={'class':'cxFeedTease' })
                    ,dict(name='div' , attrs={'class':'cxElementEnlarge' })
                    ,dict(name='div' , attrs={'id':'cxArticleTools' })
                  ]

    feeds = [
              ('Breaking News', 'http://www.ajc.com/genericList-rss.do?source=61499'),
              # -------------------------------------------------------------------
              # Here are the different area feeds. Choose which ever one you wish to
              # read by simply removing the pound sign from it. I currently have it
              # set to only get the Cobb area
              # --------------------------------------------------------------------
              #('Atlanta & Fulton', 'http://www.ajc.com/section-rss.do?source=atlanta'),
              #('Clayton', 'http://www.ajc.com/section-rss.do?source=clayton'),
              #('DeKalb', 'http://www.ajc.com/section-rss.do?source=dekalb'),
              #('Gwinnett', 'http://www.ajc.com/section-rss.do?source=gwinnett'),
              #('North Fulton', 'http://www.ajc.com/section-rss.do?source=north-fulton'),
              #('Metro', 'http://www.ajc.com/section-rss.do?source=news'),
              #('Cherokee', 'http://www.ajc.com/section-rss.do?source=cherokee'),
              ('Cobb', 'http://www.ajc.com/section-rss.do?source=cobb'),
              #('Fayette', 'http://www.ajc.com/section-rss.do?source=fayette'),
              #('Henry', 'http://www.ajc.com/section-rss.do?source=henry'),
              #('Q & A', 'http://www.ajc.com/genericList-rss.do?source=77197'),
              ('Opinions', 'http://www.ajc.com/section-rss.do?source=opinion'),
              ('Ga Politics', 'http://www.ajc.com/section-rss.do?source=georgia-politics-elections'),
              # ------------------------------------------------------------------------
              # Here are the different sports feeds. I only follow the Falcons, and Highschool
              # but again
              # You can enable which ever team you like by removing the pound sign
              # ------------------------------------------------------------------------
              #('Sports News', 'http://www.ajc.com/genericList-rss.do?source=61510'),
              #('Braves', 'http://www.ajc.com/genericList-rss.do?source=61457'),
              ('Falcons', 'http://www.ajc.com/genericList-rss.do?source=61458'),
              #('Hawks', 'http://www.ajc.com/genericList-rss.do?source=61522'),
              #('Dawgs', 'http://www.ajc.com/genericList-rss.do?source=61492'),
              #('Yellowjackets', 'http://www.ajc.com/genericList-rss.do?source=61523'),
              ('Highschool', 'http://www.ajc.com/section-rss.do?source=high-school'),
              ('Events', 'http://www.accessatlanta.com/section-rss.do?source=events'),
              ('Music', 'http://www.accessatlanta.com/section-rss.do?source=music'),
            ]

    def preprocess_html(self, soup):
        for credit_tag in soup.findAll('span', attrs={'class':['imageCredit rightFloat']}):
            p = Tag(soup, 'p')
            span.replaceWith(p)
            p.insert(0, span)
        return soup

    #def print_version(self, url):
    #    return url.partition('?')[0] +'?printArticle=y'
[/code]
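
One thing that stands out in the preprocess_html posted above: the loop variable is named credit_tag, but the loop body refers to span, which is never defined, so Python would raise a NameError as soon as a matching tag is found. The posted recipe also shows no import for Tag; in calibre recipes of this era it was usually imported from calibre.ebooks.BeautifulSoup, though that is an assumption here. Using credit_tag in both the replaceWith and insert calls looks like the likely fix, but it has not been tested against the AJC pages. Below is a minimal sketch of the same "swap in a new p, then put the span inside it" pattern. It uses the standard library's xml.etree.ElementTree as a stand-in for calibre's BeautifulSoup so it can run outside calibre; the function name wrap_credit_spans and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def wrap_credit_spans(root):
    """Wrap every <span class="imageCredit rightFloat"> in a new <p>."""
    # ElementTree elements have no parent pointer, so map child -> parent first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # findall returns a list, so the tree can be edited inside the loop.
    for credit_tag in root.findall(".//span[@class='imageCredit rightFloat']"):
        parent = parent_of[credit_tag]
        position = list(parent).index(credit_tag)
        p = ET.Element('p')
        # Text that followed the span should now follow the new <p>.
        p.tail, credit_tag.tail = credit_tag.tail, None
        parent[position] = p   # plays the role of credit_tag.replaceWith(p)
        p.append(credit_tag)   # plays the role of p.insert(0, credit_tag)
    return root


# Hypothetical sample markup, not taken from the AJC site.
sample = ("<div><img src='photo.jpg'/>"
          "<span class='imageCredit rightFloat'>Photo: staff</span>"
          "<p>Article text.</p></div>")
root = wrap_credit_spans(ET.fromstring(sample))
print(ET.tostring(root, encoding='unicode'))
```

The point of the sketch is that the same name (credit_tag) is used for the tag being replaced and the tag being re-inserted. Note that this wraps the span in a p rather than renaming it, which is also what the posted code would do once the variable name is corrected.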