Garbled characters and advertising in feeds


lordvetinari2
10-19-2010, 08:09 PM
In the attached sample, 2 articles out of 4 have corrupted characters. I can only guess that this is binary data being (mis)parsed as text. BTW, I am not the only one (http://www.mobileread.com/forums/showthread.php?t=102859) seeing this behaviour.

Sometimes the articles are not corrupted, and sometimes there are different articles being corrupted. So it is not an issue with the individual articles.

While looking for this "binary data", I noticed that advertising is being injected into the RSS feed (http://feeds.feedburner.com/PublicoPolitica). So I have been trying to delete articles with "PUBLICIDADE" (advertising) in the title from the RSS feed.

Following the sticky, I have used:

def parse_feeds (self):
    feeds = BasicNewsRecipe.parse_feeds(self)
    for feed in feeds:
        for article in feed.articles[:]:
            print 'article.title is: ', article.title
            if 'PUBLICIDADE' in article.title.upper():
                feed.articles.remove(article)
    return feeds

It doesn't work. I have also tried adding the colon "PUBLICIDADE:", but it didn't work, either.

The recipe I am using is this one:

#!/usr/bin/env python
__author__ = u'Jordi Balcells'
__license__ = 'GPL v3'
description = u'Jornal portugu\xeas - v1.04 (October 2010)'
__docformat__ = 'restructuredtext en'

'''
publico.pt
'''

from calibre.web.feeds.news import BasicNewsRecipe

class PublicoPT(BasicNewsRecipe):
    __author__ = u'Jordi Balcells'
    description = u'Jornal portugu\xeas'
    publisher = u'P\xfablico Comunica\xe7\xe3o Social, SA'
    cover_url = 'http://static.publico.pt/files/header/img/publico.gif'
    title = u'Publico.PT'
    category = 'News, politics, culture, economy, general interest'

    language = 'pt'
    timefmt = '[%a, %d %b, %Y]'

    oldest_article = 2
    max_articles_per_feed = 5

    use_embedded_content = False
    recursion = 5

    remove_javascript = True
    no_stylesheets = True

    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    keep_only_tags = [dict(attrs={'class':['content-noticia-title','artigoHeader','ECOSFERA_MANCHETE','noticia ','textoPrincipal','ECOSFERA_texto_01']})]
    remove_tags = [dict(attrs={'class':['options','subcoluna']})]

    feeds = [
        (u'Pol\xedtica', u'http://feeds.feedburner.com/PublicoPolitica')
    ]

def parse_feeds (self):
    feeds = BasicNewsRecipe.parse_feeds(self)
    for feed in feeds:
        for article in feed.articles[:]:
            print 'article.title is: ', article.title
            if 'PUBLICIDADE' in article.title.upper():
                feed.articles.remove(article)
    return feeds


Starson17
10-20-2010, 10:27 AM
> Following the sticky, I have used:
>
> def parse_feeds (self):
>     feeds = BasicNewsRecipe.parse_feeds(self)
>     for feed in feeds:
>         for article in feed.articles[:]:
>             print 'article.title is: ', article.title
>             if 'PUBLICIDADE' in article.title.upper():
>                 feed.articles.remove(article)
>     return feeds
>
> It doesn't work.

It would if you indented it correctly with four more spaces before each line. By not indenting it, it is outside your main recipe class and is never executed.
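
To make the difference concrete, here is a minimal sketch that runs without calibre. BaseRecipe is a made-up stand-in for BasicNewsRecipe, and build() is a hypothetical name for whatever framework code ends up calling parse_feeds; neither is calibre's real API. The feed is reduced to a plain list of titles.

```python
class BaseRecipe(object):
    """Hypothetical stand-in for calibre's BasicNewsRecipe."""

    def parse_feeds(self):
        return ['News item', 'PUBLICIDADE: ad', 'Another item']

    def build(self):
        # The framework looks the method up on the instance, so it only
        # sees overrides that live inside the subclass body.
        return self.parse_feeds()


class Unindented(BaseRecipe):
    title = 'demo'

# Not indented: this is a plain module-level function. Nothing ever
# calls it, so the base class version keeps running.
def parse_feeds(self):
    titles = BaseRecipe.parse_feeds(self)
    return [t for t in titles if 'PUBLICIDADE' not in t.upper()]


class Indented(BaseRecipe):
    title = 'demo'

    # Indented: part of the class body, so it overrides the base method.
    def parse_feeds(self):
        titles = BaseRecipe.parse_feeds(self)
        return [t for t in titles if 'PUBLICIDADE' not in t.upper()]


unfiltered = Unindented().build()
filtered = Indented().build()
print(unfiltered)
print(filtered)
```

Only the Indented class drops the advertising entry. Python raises no error for the unindented version, which is why the recipe ran but the filter silently did nothing.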

lordvetinari2
10-20-2010, 12:01 PM
> It would if you indented it correctly with four more spaces before each line. By not indenting it, it is outside your main recipe class and is never executed.

Thank you! I can filter articles now. I did not know indentation was vital in Python. I have only programmed a little in Java, and only used indentation to improve readability.

However, the problem is still not solved. Something else is corrupting every character in specific articles. Will keep researching.
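
One possible way to test the binary-data guess from the first post, as a rough sketch only: take the raw bytes of a downloaded article and check whether they look like text at all. The helper name looks_binary, the 10% threshold and the sample bytes are all assumptions made for illustration; how to get at the raw bytes inside a recipe is not shown here.

```python
def looks_binary(raw, threshold=0.10):
    """Heuristic: True if raw bytes contain a NUL or many control bytes."""
    data = bytearray(raw)
    if not data:
        return False
    if 0 in data:
        # Ordinary text in UTF-8 or Latin-1 does not contain NUL bytes.
        return True
    allowed = (9, 10, 13)  # tab, line feed, carriage return
    control = sum(1 for b in data if b < 32 and b not in allowed)
    return control / float(len(data)) > threshold


# Made-up samples: Portuguese text as UTF-8, and some arbitrary non-text bytes.
text_sample = u'Pol\xedtica: not\xedcias do dia'.encode('utf-8')
binary_sample = bytes(bytearray([0x00, 0x01, 0x02, 0xff, 0xfe]))

text_is_binary = looks_binary(text_sample)
binary_is_binary = looks_binary(binary_sample)
print(text_is_binary)
print(binary_is_binary)
```

A True result for a corrupted article would support the guess. A False result would point towards a plain encoding mismatch instead, which this check cannot detect.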

Starson17
10-20-2010, 12:48 PM
> Thank you! I can filter articles now.
You're welcome.
> I did not know indentation was vital in Python.
It caught me by surprise, too, when I first began using Python. Indentation problems are the most common problems I run into.
Wait until you find a file with embedded tabs. They can drive you crazy.
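
A small sketch for hunting those down: it lists the lines whose leading whitespace contains a tab. The function name lines_with_tab_indent and the sample source are made up for illustration. In practice you would pass in the text of your recipe file.

```python
def lines_with_tab_indent(source):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    hits = []
    for number, line in enumerate(source.splitlines(), 1):
        # Everything before the first non-whitespace character.
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            hits.append(number)
    return hits


# Made-up sample: line 3 is indented with a tab, line 4 mixes spaces and a tab.
sample = "def f():\n    a = 1\n\tb = 2\n    \treturn a + b\n"
tab_lines = lines_with_tab_indent(sample)
print(tab_lines)
```

Most editors can also show whitespace or convert tabs to spaces, which is usually the quicker fix once you know which lines are affected.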
> Something else is corrupting every character in specific articles. Will keep researching.
Let us know what you find.