01-18-2012, 11:36 AM   #1
hiperlink
Device: Kindle 3 Wifi only
Replacing items while using auto_cleanup = True

Hi All,

I'm developing a new recipe for a subscription-only Hungarian website, and it is almost finished: the feed is generated from the index, and fetching the articles works.

I'm using auto_cleanup = True to create readable articles. It works rather well and I'm happy with the output.

My only remaining issue is with some regex-based removal that I set up like this:

Code:
# Needs "import re" at the top of the recipe.
preprocess_regexps = [
    # Strip HTML comments
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda m: ''),
    # Drop the left alignment from paragraphs
    (re.compile(r'<p align="left"'), lambda m: '<p'),
    # Remove the site logo link
    (re.compile(r'<a href="/"><img src="images/logo.jpg".*?/></a>'), lambda m: ''),
    # Remove the font-size switcher links
    (re.compile(r'<a href="javascript:changeFontSize.*?/></a>', re.DOTALL), lambda m: ''),
    # Cut the site name off the end of the page title
    (re.compile(r'\| ÉLET ÉS IRODALOM</title>'), lambda m: '</title>'),
]


It looks like none of these replacements are applied (the last one in particular), and I don't know why.
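One way to narrow this down is to run the same patterns over a sample string outside calibre. The following is only a minimal sketch: the `apply_regexps` helper and the sample HTML are made up for illustration and are not part of calibre or taken from the real site.

```python
import re

# Same (pattern, replacement) pairs as in the recipe.
preprocess_regexps = [
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda m: ''),
    (re.compile(r'<p align="left"'), lambda m: '<p'),
    (re.compile(r'<a href="/"><img src="images/logo.jpg".*?/></a>'), lambda m: ''),
    (re.compile(r'<a href="javascript:changeFontSize.*?/></a>', re.DOTALL), lambda m: ''),
    (re.compile(r'\| ÉLET ÉS IRODALOM</title>'), lambda m: '</title>'),
]


def apply_regexps(html, regexps):
    """Hypothetical helper: apply each (pattern, func) pair in order."""
    for pattern, func in regexps:
        html = pattern.sub(func, html)
    return html


# Made-up sample page, used only to exercise the patterns.
sample = (
    '<html><head><title>Cikk címe | ÉLET ÉS IRODALOM</title></head>'
    '<body><!-- ad --><p align="left">Szöveg</p></body></html>'
)

cleaned = apply_regexps(sample, preprocess_regexps)
print(cleaned)
```

If the patterns match here but not inside the recipe, the cause is more likely to be when the regexps are applied, or how the downloaded page is encoded, than the patterns themselves.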

This matters because I noticed that the article titles come from the pages' <title> tags. For some reason the original <title> tag on each article page contains that unnecessary uppercase text (with a | in front of it). Can someone give me a hint on how to remove it?