Old 10-19-2010, 07:09 PM   #1
lordvetinari2
Device: Kindle 3G+WiFi & Galaxy Note
Garbled characters and advertising in feeds

In the attached sample, 2 articles out of 4 have corrupted characters. I can only guess that this is binary data being (mis)parsed as text. BTW, I am not the only one seeing this behaviour.

Sometimes none of the articles are corrupted, and at other times different articles are affected. So the problem is not tied to individual articles.

While looking for this "binary data", I noticed that advertising is being injected into the RSS feeds. To get rid of it, I have been trying to delete every article with "PUBLICIDADE" (advertising) in its title from the RSS feed.

Following the sticky, I have used:

Spoiler:
Code:
def parse_feeds(self):
    feeds = BasicNewsRecipe.parse_feeds(self)
    for feed in feeds:
        for article in feed.articles[:]:
            print 'article.title is: ', article.title
            if 'PUBLICIDADE' in article.title.upper():
                feed.articles.remove(article)
    return feeds


It doesn't work. I have also tried matching with the colon included ("PUBLICIDADE:"), but that didn't work either.
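To rule out the filter itself, the loop can be exercised outside calibre. This is only a minimal sketch, not calibre code: `FakeArticle`, `FakeFeed` and `drop_ads` are hypothetical stand-ins that assume nothing beyond what the snippet above uses, namely a `title` attribute on each article and an `articles` list on each feed. The sample titles are made up.

```python
class FakeArticle(object):
    """Hypothetical stand-in: only the .title attribute is modelled."""
    def __init__(self, title):
        self.title = title


class FakeFeed(object):
    """Hypothetical stand-in: only the .articles list is modelled."""
    def __init__(self, articles):
        self.articles = articles


def drop_ads(feeds, marker='PUBLICIDADE'):
    """Remove, in place, every article whose title contains marker."""
    for feed in feeds:
        # Iterate over a copy so removing from the original list is safe.
        for article in feed.articles[:]:
            if marker in article.title.upper():
                feed.articles.remove(article)
    return feeds


# Made-up sample data: one advertising entry and one regular article.
feeds = [FakeFeed([FakeArticle(u'Publicidade: oferta'),
                   FakeArticle(u'Governo aprova or\xe7amento')])]
remaining = [a.title for a in drop_ads(feeds)[0].articles]
```

If the advertising entry disappears from `remaining` here, the loop logic is probably sound, and the cause is more likely that the method is never called or that the real titles do not contain the marker.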

The recipe I am using is this one:

Spoiler:
Code:
#!/usr/bin/env  python
__author__    = u'Jordi Balcells'
__license__   = 'GPL v3'
description   = u'Jornal portugu\xeas - v1.04 (October 2010)'
__docformat__ = 'restructuredtext en'

'''
publico.pt
'''

from calibre.web.feeds.news import BasicNewsRecipe

class PublicoPT(BasicNewsRecipe):
    __author__   = u'Jordi Balcells'
    description  = u'Jornal portugu\xeas'
    publisher    = u'P\xfablico Comunica\xe7\xe3o Social, SA'
    cover_url    = 'http://static.publico.pt/files/header/img/publico.gif'
    title        = u'Publico.PT'
    category     = 'News, politics, culture, economy, general interest'

    language       = 'pt'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article = 2
    max_articles_per_feed = 5

    use_embedded_content  = False
    recursion             = 5

    remove_javascript = True
    no_stylesheets = True


    extra_css             = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    keep_only_tags = [dict(attrs={'class':['content-noticia-title','artigoHeader','ECOSFERA_MANCHETE','noticia','textoPrincipal','ECOSFERA_texto_01']})]
    remove_tags    = [dict(attrs={'class':['options','subcoluna']})]

    feeds = [
                        (u'Pol\xedtica', u'http://feeds.feedburner.com/PublicoPolitica')
            ]

def parse_feeds(self):
    feeds = BasicNewsRecipe.parse_feeds(self)
    for feed in feeds:
        for article in feed.articles[:]:
            print 'article.title is: ', article.title
            if 'PUBLICIDADE' in article.title.upper():
                feed.articles.remove(article)
    return feeds
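
One thing worth checking, assuming the indentation shown above matches the actual recipe file: `parse_feeds` starts at column 0, so Python treats it as a module-level function rather than a method of `PublicoPT`, and the inherited version would run instead. The sketch below shows the effect with a hypothetical `Base` class standing in for `BasicNewsRecipe`; none of these names come from calibre.

```python
class Base(object):
    """Hypothetical stand-in for BasicNewsRecipe."""
    def parse_feeds(self):
        return 'base'


class Unindented(Base):
    title = 'demo'

# Defined at column 0: a plain module-level function, NOT part of the class.
def parse_feeds(self):
    return 'override'


class Indented(Base):
    title = 'demo'

    # Indented under the class: this one really overrides Base.parse_feeds.
    def parse_feeds(self):
        return 'override'


unindented_result = Unindented().parse_feeds()
indented_result = Indented().parse_feeds()
```

Here `unindented_result` is `'base'` and `indented_result` is `'override'`, so only the indented definition replaces the inherited method.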


Attached Files
File Type: mobi Publico_PT.mobi (51.4 KB, 376 views)