In the attached sample, 2 articles out of 4 have corrupted characters. I can only guess that this is binary data being (mis)parsed as text. BTW, I am not the only one seeing this behaviour.
Sometimes the articles are not corrupted, and sometimes there are different articles being corrupted. So it is not an issue with the individual articles.
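One way to test that guess is sketched below. It assumes the "binary data" is gzip-compressed content, which is only an assumption on my part; nothing in the output confirms it. The helper names (`looks_gzipped`, `to_text`) and the sample text are made up for illustration. The only hard fact used is that a gzip stream starts with the bytes 0x1f 0x8b.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def looks_gzipped(data):
    """Return True if the raw bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def to_text(data, encoding='utf-8'):
    """Decompress gzip data if needed, then decode the bytes to text."""
    if looks_gzipped(data):
        data = gzip.decompress(data)
    return data.decode(encoding, errors='replace')


# Hypothetical article body, compressed the way a server might send it.
original = u'Pol\xedtica: not\xedcia de exemplo'
compressed = gzip.compress(original.encode('utf-8'))

garbled = compressed.decode('latin-1')  # compressed bytes mis-parsed as text
recovered = to_text(compressed)         # decompressed first, then decoded
```

If the corrupted articles start with those two bytes when saved raw, the guess would be supported; if not, the binary data is something else and this check does not apply.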
Looking for this "binary data", I noticed that advertising is being injected in the RSS feeds. So I have been trying to remove any article with "PUBLICIDADE" (advertising) in its title from the RSS feed.
Following the sticky, I have used:
It doesn't work. I have also tried adding the colon ("PUBLICIDADE:"), but that didn't work either.
The recipe I am using is this one:
Code:
#!/usr/bin/env python
__author__ = u'Jordi Balcells'
__license__ = 'GPL v3'
description = u'Jornal portugu\xeas - v1.04 (October 2010)'
__docformat__ = 'restructuredtext en'

'''
publico.pt
'''

from calibre.web.feeds.news import BasicNewsRecipe


class PublicoPT(BasicNewsRecipe):
    __author__ = u'Jordi Balcells'
    description = u'Jornal portugu\xeas'
    publisher = u'P\xfablico Comunica\xe7\xe3o Social, SA'
    cover_url = 'http://static.publico.pt/files/header/img/publico.gif'
    title = u'Publico.PT'
    category = 'News, politics, culture, economy, general interest'

    language = 'pt'
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 2
    max_articles_per_feed = 5
    use_embedded_content = False
    recursion = 5

    remove_javascript = True
    no_stylesheets = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    # Keep only the article containers, then strip the option bars and side columns.
    keep_only_tags = [dict(attrs={'class':['content-noticia-title','artigoHeader','ECOSFERA_MANCHETE','noticia','textoPrincipal','ECOSFERA_texto_01']})]
    remove_tags = [dict(attrs={'class':['options','subcoluna']})]

    feeds = [
        (u'Pol\xedtica', u'http://feeds.feedburner.com/PublicoPolitica')
    ]

    def parse_feeds(self):
        # Let calibre build the feeds, then drop the advertising entries.
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            # Iterate over a copy so that removing from the list is safe.
            for article in feed.articles[:]:
                # Debug output; this form works as a statement or a function call.
                print('article.title is: %s' % article.title)
                if 'PUBLICIDADE' in article.title.upper():
                    feed.articles.remove(article)
        return feeds
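
To check whether the removal loop itself is at fault, the same logic can be run outside calibre against stand-in objects. This is only a sketch: `StubArticle`, `StubFeed`, `drop_ads` and the sample titles are made up for illustration, and the real feed and article objects come from calibre and may behave differently.

```python
class StubArticle(object):
    """Stand-in for a calibre article; only the title attribute is modelled."""

    def __init__(self, title):
        self.title = title


class StubFeed(object):
    """Stand-in for a calibre feed; only the articles list is modelled."""

    def __init__(self, titles):
        self.articles = [StubArticle(t) for t in titles]


def drop_ads(feeds, marker=u'PUBLICIDADE'):
    """Remove every article whose upper-cased title contains the marker."""
    for feed in feeds:
        # Iterate over a copy so that removing from the list is safe.
        for article in feed.articles[:]:
            if marker in article.title.upper():
                feed.articles.remove(article)
    return feeds


# Hypothetical titles: two adverts and one real article.
sample = StubFeed([
    u'PUBLICIDADE: anuncio',
    u'Governo aprova or\xe7amento',
    u'Publicidade',
])
feeds = drop_ads([sample])
remaining = [a.title for a in feeds[0].articles]
print(remaining)
```

If the stand-ins are filtered as expected, the loop logic is probably not the problem, and it would be worth checking what `article.title` actually contains for the advertising entries.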