getting rid of images: remove_tags has no effect?

Read&Write · 06-25-2012, 08:50 AM

I am using the built-in recipe to fetch the RSS feeds from sueddeutsche.de. The recipe is very useful. However, I found myself unable to strip the articles of their graphical elements. I thought this could be achieved by inspecting the site's HTML and adding the specific elements to the remove_tags list. However, this does not help. As a workaround, I tried to make the unwanted elements invisible using the extra_css option. Still no success. Did I make some sort of gigantic mistake? After tinkering with the recipe for days and having learned quite a lot of useful things, I am stuck. I humbly request your help in this matter.

To get a better understanding of the problem, please refer to the following sources:

An example of an article that I want to strip the picture (along with its caption) from:
http://www.sueddeutsche.de/politik/2...tion-1.1392387

The recipe I am using. It is based on the built-in one called "Süddeutsche.de" and unfortunately still includes all the images I want to get rid of:

Code:

# -*- coding: utf-8 -*-
__license__   = 'GPL v3'
__copyright__ = '2012, Kovid Goyal <kovid at kovidgoyal.net>' # 2012-01-26 AGe change to actual Year

'''
Fetch sueddeutsche.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
class Sueddeutsche(BasicNewsRecipe):

    title                 = u'Süddeutsche.de'                 # 2012-01-26 AGe Correct Title
    description           = 'News from Germany, Access to online content' # 2012-01-26 AGe
    __author__            = 'Oliver Niesner and Armin Geller' #Update AGe 2012-01-26
    publisher             = u'Süddeutsche Zeitung'             # 2012-01-26 AGe add
    category              = 'news, politics, Germany'         # 2012-01-26 AGe add
    timefmt               = ' [%a, %d %b %Y]'                 # 2012-01-26 AGe add %a
    oldest_article        = 2
    max_articles_per_feed = 100
    simultaneous_downloads = 75
    language              = 'de'
    encoding              = 'utf-8'
    publication_type      = 'newspaper'                         # 2012-01-26 add
    cover_source          = 'http://www.sueddeutsche.de/verlag' # 2012-01-26 AGe add from Darko Miletic paid content source
    masthead_url          = 'http://www.sueddeutsche.de/static_assets/build/img/sdesiteheader/logo_homepage.441d531c.png' # 2012-01-26 AGe add

    use_embedded_content  = False
    no_stylesheets        = True
    remove_javascript     = True
    auto_cleanup          = True

    feeds = [
              (u'Politik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EPolitik%24?output=rss'),
              (u'Wirtschaft', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWirtschaft%24?output=rss'),
              (u'Geld', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EGeld%24?output=rss'),
              (u'Kultur', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKultur%24?output=rss'),
              (u'Leben', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ELeben%24?output=rss'),
              (u'Karriere', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKarriere%24?output=rss'),
              (u'Bildung', u'http://rss.sueddeutsche.de/rss/bildung'),         #2012-01-26 AGe New
              (u'Gesundheit', u'http://rss.sueddeutsche.de/rss/gesundheit'),   #2012-01-26 AGe New
              (u'Medien', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EMedien%24?output=rss'),
              (u'Digital', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EDigital%24?output=rss'),
              (u'Auto', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EAuto%24?output=rss'),
              (u'Wissen', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWissen%24?output=rss'),
              (u'Reise', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EReise%24?output=rss'),
              (u'Technik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ETechnik%24?output=rss'), # sometimes only
            ]
# AGe 2011-12-16 Problem of Handling redirections solved by a solution of Recipes-Re-usable code from kiklop74.
# Feed is:                    http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ESport%24?output=rss
# Article download source is: http://sz.de/1.1237295 (Ski Alpin: Der Erfolg kommt, der Trainer geht)
# Article source is:          http://www.sueddeutsche.de/sport/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295
# Article printversion is:    http://www.sueddeutsche.de/sport/2.220/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295
    def print_version(self, url):
        n_url=self.browser.open_novisit(url).geturl()
        main, sep, id = n_url.rpartition('/')
        return main + '/2.220/' + id

remove_tags = [dict(name='img'), dict(name='figure'), dict(name='div', attrs={'class':["headslot"] })]
extra_css = 'figure, img, .headslot, .zoomable{display:none;}'

I would be grateful for your help. I would definitely like to learn about this issue so I can avoid the problem in the future.

NotTaken · 06-25-2012, 02:59 PM

Your indentation is wrong. You need to indent remove_tags and extra_css four spaces so they become class members.
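To illustrate why the indentation matters, here is a minimal sketch that runs without calibre. The BasicNewsRecipe class below is only a hypothetical stand-in for the real base class, and the recipe names Broken and Fixed are made up for the example. An unindented assignment creates a module-level variable that the class never sees, so the recipe keeps the inherited default.

```python
# Hypothetical stand-in for calibre's BasicNewsRecipe, so the sketch runs
# on its own. The defaults here are assumptions made for illustration.
class BasicNewsRecipe:
    remove_tags = []
    extra_css = None


class Broken(BasicNewsRecipe):
    title = 'Broken'

# Not indented: this is a module-level variable, NOT an attribute of Broken.
remove_tags = [dict(name='img'), dict(name='figure')]


class Fixed(BasicNewsRecipe):
    title = 'Fixed'
    # Indented four spaces: these are class attributes and override the defaults.
    remove_tags = [dict(name='img'), dict(name='figure')]
    extra_css = 'figure, img {display:none;}'


# Broken still carries the inherited (empty) default; Fixed has its own list.
print('Broken:', Broken.remove_tags, 'remove_tags' in vars(Broken))
print('Fixed: ', Fixed.remove_tags, 'remove_tags' in vars(Fixed))
```

Anything that reads the setting from the recipe class only ever sees what Fixed defines, which is why the unindented lines at the bottom of your recipe have no effect.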

Read&Write · 06-26-2012, 02:27 PM

Thank you, that was the problem. The most basic of mistakes, it seems.
