Old 10-19-2010, 07:09 PM   #1
lordvetinari2
Zealot
Posts: 137
Karma: 61
Join Date: Jun 2006
Location: Gijón, Spain
Device: Kindle 3G+WiFi & Galaxy Note
Garbled characters and advertising in feeds

In the attached sample, 2 articles out of 4 have corrupted characters. I can only guess that this is binary data being (mis)parsed as text. BTW, I am not the only one seeing this behaviour.

Sometimes the articles are not corrupted, and sometimes there are different articles being corrupted. So it is not an issue with the individual articles.

Looking for this "binary data", I noticed that advertising is being injected in the RSS feeds. To this end, I have been trying to delete articles with "PUBLICIDADE" (advertising) in the title from the RSS feed.

Following the sticky, I have used:

Code:
def parse_feeds (self): 
      feeds = BasicNewsRecipe.parse_feeds(self) 
      for feed in feeds:
        for article in feed.articles[:]:
          print 'article.title is: ', article.title
          if 'PUBLICIDADE' in article.title.upper():
            feed.articles.remove(article)
      return feeds


It doesn't work. I have also tried adding the colon "PUBLICIDADE:", but it didn't work, either.

The recipe I am using is this one:

Code:
#!/usr/bin/env  python
__author__    = u'Jordi Balcells'
__license__   = 'GPL v3'
description   = u'Jornal portugu\xeas - v1.04 (October 2010)'
__docformat__ = 'restructuredtext en'

'''
publico.pt
'''

from calibre.web.feeds.news import BasicNewsRecipe

class PublicoPT(BasicNewsRecipe):
    __author__   = u'Jordi Balcells'
    description  = u'Jornal portugu\xeas'
    publisher    = u'P\xfablico Comunica\xe7\xe3o Social, SA'
    cover_url    = 'http://static.publico.pt/files/header/img/publico.gif'
    title        = u'Publico.PT'
    category     = 'News, politics, culture, economy, general interest'

    language       = 'pt'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article = 2
    max_articles_per_feed = 5

    use_embedded_content  = False
    recursion             = 5

    remove_javascript = True
    no_stylesheets = True


    extra_css             = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    keep_only_tags = [dict(attrs={'class':['content-noticia-title','artigoHeader','ECOSFERA_MANCHETE','noticia','textoPrincipal','ECOSFERA_texto_01']})]
    remove_tags    = [dict(attrs={'class':['options','subcoluna']})]

    feeds = [
                        (u'Pol\xedtica', u'http://feeds.feedburner.com/PublicoPolitica')
            ]

def parse_feeds (self): 
      feeds = BasicNewsRecipe.parse_feeds(self) 
      for feed in feeds:
        for article in feed.articles[:]:
          print 'article.title is: ', article.title
          if 'PUBLICIDADE' in article.title.upper():
            feed.articles.remove(article)
      return feeds


Attached Files
File Type: mobi, Publico_PT.mobi (51.4 KB)
Old 10-20-2010, 09:27 AM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by lordvetinari2
Following the sticky, I have used:

Code:
def parse_feeds (self): 
      feeds = BasicNewsRecipe.parse_feeds(self) 
      for feed in feeds:
        for article in feed.articles[:]:
          print 'article.title is: ', article.title
          if 'PUBLICIDADE' in article.title.upper():
            feed.articles.remove(article)
      return feeds
It doesn't work.
It would if you indented it correctly with four more spaces before each line. By not indenting it, it is outside your main recipe class and is never executed.
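
To illustrate the indentation point, here is a minimal sketch of the method placed inside the class body. The `Article`, `Feed` and `BasicNewsRecipe` definitions below are small stand-ins (assumptions, not calibre's real classes) so the snippet runs without calibre installed; in a real recipe you would keep the `from calibre.web.feeds.news import BasicNewsRecipe` line instead.

```python
# Stand-ins so the sketch runs on its own. These are NOT calibre's real classes.
class Article:
    def __init__(self, title):
        self.title = title


class Feed:
    def __init__(self, articles):
        self.articles = articles


class BasicNewsRecipe:
    def parse_feeds(self):
        # Hypothetical sample data standing in for the downloaded feed.
        return [Feed([
            Article('PUBLICIDADE: Oferta especial'),
            Article('Publicidade'),
            Article('Debate no Parlamento'),
        ])]


class PublicoPT(BasicNewsRecipe):
    title = u'Publico.PT'

    # Indented four spaces: this is now a method of PublicoPT, so it
    # overrides the base-class parse_feeds and is actually called.
    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            # Iterate over a copy so removing items from the list is safe.
            for article in feed.articles[:]:
                if 'PUBLICIDADE' in article.title.upper():
                    feed.articles.remove(article)
        return feeds


remaining_titles = [a.title for f in PublicoPT().parse_feeds() for a in f.articles]
print(remaining_titles)
```

The same function written at column zero after the class would simply define an unrelated module-level function named `parse_feeds`, which nothing ever calls.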
Old 10-20-2010, 11:01 AM   #3
lordvetinari2
Quote:
Originally Posted by Starson17
It would if you indented it correctly with four more spaces before each line. By not indenting it, it is outside your main recipe class and is never executed.
Thank you! I can filter articles now. I did not know indentation was vital in Python. I have only programmed a little in Java, and only used indentation to improve readability.

However, the problem is still not solved. Something else is corrupting every character in specific articles. Will keep researching.
Old 10-20-2010, 11:48 AM   #4
Starson17
Quote:
Originally Posted by lordvetinari2
Thank you! I can filter articles now.
You're welcome.
Quote:
I did not know indentation was vital in Python.
It caught me by surprise, too, when I first began using Python. Indentation problems are the most common problems I run into.
Wait until you find a file with embedded tabs. They can drive you crazy.
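
For the embedded-tab case, a rough sketch of one way to spot the offending lines is below. The helper name `lines_with_tab_indent` and the sample text are made up for illustration; this only checks leading whitespace and is not a full indentation checker.

```python
def lines_with_tab_indent(source):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Everything before the first non-whitespace character is the indent.
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            hits.append(number)
    return hits


# Hypothetical sample: the third line is indented with a tab instead of spaces.
sample = "def f():\n    a = 1\n\tb = 2\n    return a\n"
tab_lines = lines_with_tab_indent(sample)
print(tab_lines)
```

Python's standard library also ships the `tabnanny` module, which performs a more thorough check for ambiguous indentation.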
Quote:
Something else is corrupting every character in specific articles. Will keep researching.
Let us know what you find.

Last edited by Starson17; 10-20-2010 at 12:12 PM.