Problem with preprocess_regexps and Unicode

mccande · 12-16-2008, 11:03 AM

I am preparing a recipe for a Belgian newspaper in which I have to replace a styled apostrophe with a plain one (Unicode characters 0x92 and 0x27).

The expression I use is:

preprocess_regexps = [
(re.compile(ru'\0092'), lambda match: ru'\u0027')
]

but I cannot get feeds2disk to start. I always receive the following error message:
C:\Documents and Settings\Denis\test>feeds2disk --debug --test libe.py
Traceback (most recent call last):
  File "main.py", line 167, in <module>
  File "main.py", line 162, in main
  File "main.py", line 133, in run_recipe
  File "calibre\web\feeds\recipes\__init__.pyo", line 80, in compile_recipe
  File "c:\docume~1\denis\locals~1\temp\calibre_0.4.115_s _e8f1_recipes\recipe1.py", line 4, in <module>
    libe.py
NameError: name 'libe' is not defined

What is wrong with the use of regexp?

kovidgoyal · 12-16-2008, 11:39 AM

Post the full recipe.

mccande · 12-17-2008, 10:53 AM

Here is the recipe which works without the regex part.

kovidgoyal · 12-17-2008, 11:58 AM

The first thing I see wrong is

(re.compile(ru'\0092'), lambda match: ru'\u0027')
should be

(re.compile(ru'\u0092'), lambda match: ru'\u0027')

Note the missing u after the backslash in the first string.
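To see why the missing u matters, here is a small sketch in Python 3 syntax. It uses plain literals without the r prefix so that only the escape itself is compared, and the variable names are purely illustrative.

```python
# '\0092' is parsed as the octal escape \00 (a NUL character)
# followed by the two ordinary digits "9" and "2".
without_u = '\0092'

# '\u0092' is a single character, code point U+0092.
with_u = '\u0092'

print([hex(ord(c)) for c in without_u])  # three code points
print([hex(ord(c)) for c in with_u])     # one code point
```

A pattern built from the first literal therefore never matches the character the recipe is looking for.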

mccande · 12-18-2008, 03:24 AM

Thanks but it still does not work

kiklop74 · 12-18-2008, 04:28 AM

Quote:

Originally Posted by mccande

Thanks but it still does not work

You are missing this at the start of your script:

Code:

import string, re

class AdvancedUserRecipe1229426345(BasicNewsRecipe):
....

mccande · 12-18-2008, 04:53 PM

This still does not start

import string, re

class AdvancedUserRecipe1229426345(BasicNewsRecipe):
    title = u'La Libre Belgique'
    __author__ = 'Denis McCann'
    oldest_article = 1
    max_articles_per_feed = 100
    use_embedded_content = False
    no_stylesheets = True
    simultaneous_downloads = 1

    remove_tags_after = [dict(id='articleText')]

    preprocess_regexps = [
        (re.compile(ru'\u0092'), lambda match: ru'\u0027')
    ]

    keep_only_tags = [
        dict(name='p', attrs={'id':'avantTitre'}),
        dict(name='p', attrs={'id':'writer'}),
        dict(name='p', attrs={'id':'publicationDate'}),
        dict(name='div', attrs={'id':'articleHat'}),
        dict(name='div', attrs={'id':'c'}),
        dict(name='div', attrs={'id':'articleText'})
    ]

    feeds = [
        (u'A la Une', u'http://www.lalibre.be/rss/?section=10'),
        (u'Belgique', u'http://www.lalibre.be/rss/?section=10&subsection=90'),
        (u'Europe', u'http://www.lalibre.be/rss/?section=10&subsection=91'),
        (u'Bruxelles', u'http://www.lalibre.be/rss/?section=10&subsection=1083'),
        (u'Brabant', u'http://www.lalibre.be/rss/?section=10&subsection=1106'),
        (u'Economie', u'http://www.lalibre.be/rss/?section=3'),
        (u'Opinion', u'http://www.lalibre.be/rss/?section=11&subsection=118')
    ]

kiklop74 · 12-18-2008, 05:18 PM

Change the regular expression to look like this and it will work:

Code:

    
preprocess_regexps = [(re.compile(u'\u0092'), lambda match: u'\u0027')]

Note the absence of r. A string can be unicode or raw, but not both.
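A minimal sketch of the working rule applied to a sample string follows. The helper `apply_rules` and the sample text are hypothetical; the sketch assumes each rule is applied as `pattern.sub(function, text)`, which may differ from what calibre does internally. On the prefix itself: as far as the language grammar goes, Python 2 accepted the combined prefix `ur` but not `ru`, and Python 3 accepts neither, so the second half of the sketch simply asks the running interpreter whether it accepts each literal.

```python
import re

# The rule from the post above: a (compiled pattern, replacement function) pair.
preprocess_regexps = [(re.compile(u'\u0092'), lambda match: u'\u0027')]

def apply_rules(text, rules):
    """Hypothetical helper: run each (pattern, function) rule over the text in order."""
    for pattern, func in rules:
        text = pattern.sub(func, text)
    return text

sample = u'l\u0092article'          # made-up input containing U+0092
cleaned = apply_rules(sample, preprocess_regexps)
print(cleaned)

def literal_compiles(source):
    """Return True if the interpreter accepts `source` as an expression."""
    try:
        compile(source, '<literal>', 'eval')
        return True
    except SyntaxError:
        return False

ru_ok = literal_compiles("ru'\\u0092'")  # the prefix used in the failing recipe
u_ok = literal_compiles("u'\\u0092'")    # the prefix used in the working rule
print(ru_ok, u_ok)
```

Because a rejected prefix is a syntax error, the recipe fails while it is being compiled, before any article is downloaded.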

mccande · 12-19-2008, 09:26 AM

Thanks a lot. That works and will be useful for other feeds.

The syntax of this function is far from obvious.
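For other feeds it may help to know where U+0092 can come from. Byte 0x92 is the right single quotation mark in Windows-1252; if the same byte is decoded as Latin-1 it becomes the control character U+0092 instead of U+2019. The sketch below uses a made-up byte string and a character class that covers both cases. Which of the two a given feed actually contains is an assumption to check against the real page.

```python
import re

raw = b'l\x92article'                # made-up bytes with a Windows-1252 apostrophe

as_latin1 = raw.decode('latin-1')    # byte 0x92 becomes U+0092
as_cp1252 = raw.decode('cp1252')     # byte 0x92 becomes U+2019

# One rule that handles either outcome and replaces it with a plain apostrophe.
rule = (re.compile(u'[\u0092\u2019]'), lambda match: u"'")

fixed = [rule[0].sub(rule[1], text) for text in (as_latin1, as_cp1252)]
print(fixed)
```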
