Old 08-20-2011, 07:16 PM   #1
davidnye
Converting non-ASCII characters

I wrote a recipe for Madison.com (content from the Cap Times and the Wisconsin State Journal). It works well, except that every apostrophe appears as 'â ', both in titles and in articles. I tried correcting this with preprocess_regexps, which seems to work for roughly the first 75 pages of news. After that, the titles listed under 'Contents' still show apostrophes, but the title at the start of each article and the article body show 'â ' instead. Any ideas? Is there a better way to replace all instances of certain characters or phrases in downloaded content? Here is the recipe:

Spoiler:
import re

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1313121904(BasicNewsRecipe):
    title = u'Madison.com'
    oldest_article = 2
    max_articles_per_feed = 20
    no_stylesheets = True
    remove_empty_feeds = True
    # Workaround: replace the garbled sequence with a plain apostrophe.
    preprocess_regexps = [(re.compile(r'â ', re.DOTALL|re.IGNORECASE), lambda match: '\''),]

    feeds = [
        (u'National News', u'http://host.madison.com/search/?f=rss&t=article&l=25&s=start_time&sd=desc&c=news/national*'),
        (u'State and Regional', u'http://host.madison.com/search/?f=rss&t=article&c=news/state_and_regional&q=%23wsj&l=25&s=start_time&sd=desc'),
        (u'Editorials', u'http://host.madison.com/search/?f=rss&t=article&c=news/opinion/editorial&q=%23ct&l=25&s=start_time&sd=desc'),
        (u'Columns', u'http://host.madison.com/search/?f=rss&t=article&l=25&s=start_time&sd=desc&c=news/opinion/column*&q=%23ct'),
        (u'Education', u'http://host.madison.com/search/?f=rss&t=article&c=news/local/education&q=%23wsj&l=25&s=start_time&sd=desc'),
        (u'Science and Nature', u'http://host.madison.com/search/?f=rss&t=article&c=news/local/environment&q=%23wsj&l=25&s=start_time&sd=desc'),
        (u'Health and Medicine', u'http://host.madison.com/search/?f=rss&t=article&c=news/local/health_med_fit&q=%23wsj&l=25&s=start_time&sd=desc'),
        (u'Tech', u'http://host.madison.com/search/?f=rss&t=article&c=business/technology&q=%23ct&l=25&s=start_time&sd=desc')
    ]

    def print_version(self, url):
        # Fetch the printer-friendly version of each article.
        return url.replace('html', 'html?print=1')
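
One possible explanation, offered as a guess and not verified against the site: the pages may be UTF-8 but are being decoded with a single-byte encoding, which would turn a typographic apostrophe (U+2019) into a short run of junk characters starting with 'â'. The standalone Python 3 sketch below reproduces that effect and reverses it. The helper names `garble` and `fix_mojibake` are hypothetical, and `cp1252` is only an assumed stand-in for whatever wrong encoding is actually in play.

```python
def garble(text, wrong_encoding="cp1252"):
    """Simulate the suspected bug: UTF-8 bytes decoded with a single-byte codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def fix_mojibake(text, wrong_encoding="cp1252"):
    """Undo the mis-decoding; return the text unchanged if it cannot be round-tripped."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except UnicodeError:
        # Text that was decoded correctly in the first place ends up here.
        return text


original = "Madison\u2019s news"  # contains a typographic apostrophe (U+2019)
garbled = garble(original)

print(repr(garbled))                       # the apostrophe becomes several junk characters
print(fix_mojibake(garbled) == original)   # the round trip restores the original text
```

If this guess is right, the cleaner fix would be to get the decoding right instead of patching the output with regular expressions. BasicNewsRecipe has an `encoding` attribute for sites with an incorrect charset declaration, so setting `encoding = 'utf-8'` in the recipe might be worth trying.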