Old 06-25-2012, 07:50 AM   #1
Read&Write has learned how to buy an e-book online
Posts: 26
Karma: 86
Join Date: Jun 2012
Device: Onyx M92
getting rid of images: remove_tags has no effect?

I am using the builtin recipe to fetch the RSS feeds from Süddeutsche Zeitung. The recipe is very useful. However, I have been unable to strip the graphical elements from the articles. I thought this could be achieved by inspecting the site's HTML and adding the specific elements to the remove_tags list. However, this does not help. As a workaround, I tried to make the unwanted elements invisible using the extra_css option. Still no success. Did I make some sort of gigantic mistake? After tinkering with the recipe for days and having learned quite a lot of useful things, I am stuck. I humbly request your help in this matter.

In order to get a better understanding of the problem, please refer to the following sources:

An example of an article that I want to strip the picture (along with its caption) from:

The recipe I am using. It is based on the builtin one called "Sü" and unfortunately includes all the images I want to get rid of:

# -*- coding: utf-8 -*-
__license__   = 'GPL v3'
__copyright__ = '2012, Kovid Goyal <kovid at>' # 2012-01-26 AGe change to actual Year

from calibre.web.feeds.news import BasicNewsRecipe

class Sueddeutsche(BasicNewsRecipe):

    title                 = u'Sü'                 # 2012-01-26 AGe Correct Title
    description           = 'News from Germany, Access to online content' # 2012-01-26 AGe
    __author__            = 'Oliver Niesner and Armin Geller' #Update AGe 2012-01-26
    publisher             = u'Süddeutsche Zeitung'             # 2012-01-26 AGe add
    category              = 'news, politics, Germany'         # 2012-01-26 AGe add
    timefmt               = ' [%a, %d %b %Y]'                 # 2012-01-26 AGe add %a
    oldest_article        = 2
    max_articles_per_feed = 100
    simultaneous_downloads = 75
    language              = 'de'
    encoding              = 'utf-8'
    publication_type      = 'newspaper'                         # 2012-01-26 add
    cover_source          = '' # 2012-01-26 AGe add from Darko Miletic paid content source
    masthead_url          = '' # 2012-01-26 AGe add

    use_embedded_content  = False
    no_stylesheets        = True
    remove_javascript     = True
    auto_cleanup          = True

    feeds = [
              (u'Politik', u''),
              (u'Wirtschaft', u''),
              (u'Geld', u''),
              (u'Kultur', u''),
              (u'Leben', u''),
              (u'Karriere', u''),
              (u'Bildung', u''),         #2012-01-26 AGe New
              (u'Gesundheit', u''),   #2012-01-26 AGe New
              (u'Medien', u''),
              (u'Digital', u''),
              (u'Auto', u''),
              (u'Wissen', u''),
              (u'Reise', u''),
              (u'Technik', u''), # sometimes only
            ]

# AGe 2011-12-16 Problem of Handling redirections solved by a solution of Recipes-Re-usable code from kiklop74.
# Feed is:          
# Article download source is: (Ski Alpin: Der Erfolg kommt, der Trainer geht)
# Article source is:
# Article printversion is:
    def print_version(self, url):
        # Insert the '/2.220/' print-view segment before the last path component.
        main, sep, article_id = url.rpartition('/')
        return main + '/2.220/' + article_id

    # My attempts to strip the images and captions (neither has any visible effect)
    remove_tags = [dict(name='img'), dict(name='figure'), dict(name='div', attrs={'class': ['headslot']})]
    extra_css = 'figure, img, .headslot, .zoomable {display: none;}'
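
To make clear what I expect these rules to do, here is a small standalone sketch. It does not use calibre or its parser, so it only shows my mental model, not what calibre actually does. The sample HTML, the helper names (RULES, matches, Stripper, strip_unwanted) and the way I interpret the class list (every listed class must be present on the element) are my own assumptions.

```python
from html.parser import HTMLParser

# The same rules as in my remove_tags list, written as plain dicts.
RULES = [
    {"name": "img"},
    {"name": "figure"},
    {"name": "div", "attrs": {"class": ["headslot"]}},
]

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


def matches(tag, attrs):
    """Return True if a start tag matches one of RULES (my interpretation)."""
    classes = (dict(attrs).get("class") or "").split()
    for rule in RULES:
        if rule["name"] != tag:
            continue
        wanted = rule.get("attrs", {}).get("class", [])
        if all(name in classes for name in wanted):
            return True
    return False


class Stripper(HTMLParser):
    """Re-emit well-formed HTML, skipping every element that matches RULES."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # greater than 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if self.depth or matches(tag, attrs):
            if tag not in VOID_TAGS:
                self.depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img/> never open a nested level.
        if not (self.depth or matches(tag, attrs)):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
        else:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.depth:
            self.out.append("&#%s;" % name)


def strip_unwanted(html):
    parser = Stripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Made-up sample, loosely modelled on the markup I see in the articles.
SAMPLE = (
    '<div class="article"><div class="headslot zoomable">'
    '<img src="pic.jpg"><p>caption</p></div>'
    '<figure><img src="x.jpg"/><figcaption>cap</figcaption></figure>'
    '<p>Body text.</p></div>'
)

print(strip_unwanted(SAMPLE))
# prints: <div class="article"><p>Body text.</p></div>
```

That is the result I was hoping to get from the recipe: the picture and its caption gone, the body text untouched.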
I would be grateful for your help. I would definitely like to learn about this issue so I can avoid the problem in the future.