View Single Post
Old 06-25-2012, 08:50 AM   #1
Read&Write
Enthusiast
Read&Write has learned how to buy an e-book online
 
Posts: 26
Karma: 86
Join Date: Jun 2012
Device: Onyx M92
getting rid of images: remove_tags has no effect?

I am using the builtin recipe to acquire the rss from sueddeutsche.de. The recipe is very useful. However, I found myself unable to the articles from any graphical elements. I thought this could be achieved by inspecting the sites html and adding the specific elements to the remove_tags-list. However, this does not help. My workaround was to make the unwanted elements invisible using the extra_css option. Still, no sucess. Did I make some sort of gigantic mistake? After tinkering with the recipe for days and having learned quite a lot of useful things, I am stuck. I humbly request your help in this matter.

In order to get a better understandig of the problem, please refer to the following sources:

An example of an article that I want to strip the picture (along with its caption) from:
http://www.sueddeutsche.de/politik/2...tion-1.1392387

The recipe I am using. It is based on the builtin one called "Süddeutsche.de" and unfortunately includes all the images I want to get rid of:

Code:
# -*- coding: utf-8 -*-
__license__   = 'GPL v3'
__copyright__ = '2012, Kovid Goyal <kovid at kovidgoyal.net>' # 2012-01-26 AGe change to actual Year

'''
Fetch sueddeutsche.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
class Sueddeutsche(BasicNewsRecipe):

    title                 = u'Süddeutsche.de'                 # 2012-01-26 AGe Correct Title
    description           = 'News from Germany, Access to online content' # 2012-01-26 AGe
    __author__            = 'Oliver Niesner and Armin Geller' #Update AGe 2012-01-26
    publisher             = u'Süddeutsche Zeitung'             # 2012-01-26 AGe add
    category              = 'news, politics, Germany'         # 2012-01-26 AGe add
    timefmt               = ' [%a, %d %b %Y]'                 # 2012-01-26 AGe add %a
    oldest_article        = 2
    max_articles_per_feed = 100
    simultaneous_downloads = 75
    language              = 'de'
    encoding              = 'utf-8'
    publication_type      = 'newspaper'                         # 2012-01-26 add
    cover_source          = 'http://www.sueddeutsche.de/verlag' # 2012-01-26 AGe add from Darko Miletic paid content source
    masthead_url          = 'http://www.sueddeutsche.de/static_assets/build/img/sdesiteheader/logo_homepage.441d531c.png' # 2012-01-26 AGe add

    use_embedded_content  = False
    no_stylesheets        = True
    remove_javascript     = True
    auto_cleanup          = True

    feeds = [
              (u'Politik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EPolitik%24?output=rss'),
              (u'Wirtschaft', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWirtschaft%24?output=rss'),
              (u'Geld', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EGeld%24?output=rss'),
              (u'Kultur', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKultur%24?output=rss'),
              (u'Leben', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ELeben%24?output=rss'),
              (u'Karriere', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKarriere%24?output=rss'),
              (u'Bildung', u'http://rss.sueddeutsche.de/rss/bildung'),         #2012-01-26 AGe New
              (u'Gesundheit', u'http://rss.sueddeutsche.de/rss/gesundheit'),   #2012-01-26 AGe New
              (u'Medien', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EMedien%24?output=rss'),
              (u'Digital', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EDigital%24?output=rss'),
              (u'Auto', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EAuto%24?output=rss'),
              (u'Wissen', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWissen%24?output=rss'),
              (u'Reise', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EReise%24?output=rss'),
              (u'Technik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ETechnik%24?output=rss'), # sometimes only
            ]
# AGe 2011-12-16 Problem of Handling redirections solved by a solution of Recipes-Re-usable code from kiklop74.
# Feed is:                    http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ESport%24?output=rss
# Article download source is: http://sz.de/1.1237295 (Ski Alpin: Der Erfolg kommt, der Trainer geht)
# Article source is:          http://www.sueddeutsche.de/sport/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295
# Article printversion is:    http://www.sueddeutsche.de/sport/2.220/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295
    def print_version(self, url):
        n_url=self.browser.open_novisit(url).geturl()
        main, sep, id = n_url.rpartition('/')
        return main + '/2.220/' + id

remove_tags = [dict(name='img'), dict(name='figure'), dict(name='div', attrs={'class':["headslot"] })]
extra_css = 'figure, img, .headslot, .zoomable{display:none;}'
I would be grateful for your help. I would definitely like to learn about this issue so I can avoid the problem in the future.
Read&Write is offline   Reply With Quote