Old 08-20-2011, 07:16 PM   #1
davidnye
Converting non-ASCII characters

I wrote a recipe for Madison.com (content from Cap Times and Wisconsin State Journal) which works well except that all apostrophes appear to have been replaced with 'â ', both in titles and in articles. I've tried using preprocess_regexps to correct this, which seems to work for about the first 75 pages of news. After that, although the titles listed under 'Contents' have apostrophes, the title at the start of the article and the article content have 'â 's. Any ideas? Is there a better way to replace all instances of certain characters or phrases in downloaded content? Here is the recipe:

Spoiler:
import re

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1313121904(BasicNewsRecipe):
    title = u'Madison.com'
    oldest_article = 2
    max_articles_per_feed = 20
    no_stylesheets = True
    remove_empty_feeds = True
    # Attempted workaround: swap the garbled sequence for a plain apostrophe.
    preprocess_regexps = [(re.compile(r'â ', re.DOTALL|re.IGNORECASE), lambda match: '\''),]

    feeds = [
        (u'National News', u'http://host.madison.com/search/?f=rss&t=article&l=25&s=start_time&sd=desc&c=news/national*'),
        (u'State and Regional', u'http://host.madison.com/search/?f=rss&t=article&c=news/state_and_regional&q=%23wsj&l=25&s=start_time&sd=desc'),
        (u'Editorials', u'http://host.madison.com/search/?f=rss&t=article&c=news/opinion/editorial&q=%23ct&l=25&s=start_time&sd=desc'),
        (u'Columns', u'http://host.madison.com/search/?f=rss&t=article&l=25&s=start_time&sd=desc&c=news/opinion/column*&q=%23ct'),
        (u'Education', u'http://host.madison.com/search/?f=rss&t=article&c=news/local/education&q=%23wsj&l=25&s=start_time&sd=desc'),
        (u'Science and Nature', u'http://host.madison.com/search/?f=rss&t=article&c=news/local/environment&q=%23wsj&l=25&s=start_time&sd=desc'),
        (u'Health and Medicine', u'http://host.madison.com/search/?f=rss&t=article&c=news/local/health_med_fit&q=%23wsj&l=25&s=start_time&sd=desc'),
        (u'Tech', u'http://host.madison.com/search/?f=rss&t=article&c=business/technology&q=%23ct&l=25&s=start_time&sd=desc')
    ]

    def print_version(self, url):
        # Request the printer-friendly page for each article.
        return url.replace('html', 'html?print=1')
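
For anyone hitting the same symptom: a curly apostrophe turning into 'â' plus junk is the usual signature of UTF-8 bytes being decoded with a single-byte codec. The sketch below is plain Python, not calibre code, and it only reproduces that effect. It assumes the site actually serves UTF-8 and that the wrong codec is cp1252, neither of which has been confirmed for Madison.com. The helper names `mis_decode` and `repair` are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Illustration only: reproduces the suspected mis-decoding outside calibre.
# Assumption: the page is UTF-8 but gets decoded as cp1252.

def mis_decode(text, wrong='cp1252'):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode('utf-8').decode(wrong)


def repair(text, wrong='cp1252'):
    """Reverse mis_decode: recover the bytes, then decode them as UTF-8."""
    return text.encode(wrong).decode('utf-8')


original = u'It\u2019s'           # "It's" with a right single quotation mark
garbled = mis_decode(original)    # the apostrophe becomes three characters
restored = repair(garbled)

print(repr(garbled))
print(restored == original)
```

If that is what is happening here, decoding the pages correctly in the first place would likely be more reliable than patching the text afterwards with preprocess_regexps. BasicNewsRecipe has an `encoding` attribute for overriding a site's declared charset; whether setting `encoding = 'utf-8'` fixes this particular recipe is untested.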