#1 |
Enthusiast
Posts: 26
Karma: 86
Join Date: Jun 2012
Device: Onyx M92
|
getting rid of images: remove_tags has no effect?
I am using the built-in recipe to fetch the RSS feeds from sueddeutsche.de. The recipe is very useful. However, I have been unable to strip the graphical elements from the articles. I thought this could be achieved by inspecting the site's HTML and adding the specific elements to the remove_tags list, but this does not help. My workaround was to make the unwanted elements invisible using the extra_css option. Still no success. Did I make some sort of gigantic mistake? After tinkering with the recipe for days and having learned quite a lot of useful things, I am stuck. I humbly request your help in this matter.
To give a better understanding of the problem, please refer to the following sources:

An example of an article that I want to strip the picture (along with its caption) from: http://www.sueddeutsche.de/politik/2...tion-1.1392387

The recipe I am using. It is based on the built-in one called "Süddeutsche.de" and unfortunately still includes all the images I want to get rid of:
Code:
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2012, Kovid Goyal <kovid at kovidgoyal.net>' # 2012-01-26 AGe change to actual Year

'''
Fetch sueddeutsche.de
'''
from calibre.web.feeds.news import BasicNewsRecipe

class Sueddeutsche(BasicNewsRecipe):

    title = u'Süddeutsche.de' # 2012-01-26 AGe Correct Title
    description = 'News from Germany, Access to online content' # 2012-01-26 AGe
    __author__ = 'Oliver Niesner and Armin Geller' #Update AGe 2012-01-26
    publisher = u'Süddeutsche Zeitung' # 2012-01-26 AGe add
    category = 'news, politics, Germany' # 2012-01-26 AGe add
    timefmt = ' [%a, %d %b %Y]' # 2012-01-26 AGe add %a
    oldest_article = 2
    max_articles_per_feed = 100
    simultaneous_downloads = 75
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper' # 2012-01-26 add
    cover_source = 'http://www.sueddeutsche.de/verlag' # 2012-01-26 AGe add from Darko Miletic paid content source
    masthead_url = 'http://www.sueddeutsche.de/static_assets/build/img/sdesiteheader/logo_homepage.441d531c.png' # 2012-01-26 AGe add
    use_embedded_content = False
    no_stylesheets = True
    remove_javascript = True
    auto_cleanup = True

    feeds = [
        (u'Politik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EPolitik%24?output=rss'),
        (u'Wirtschaft', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWirtschaft%24?output=rss'),
        (u'Geld', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EGeld%24?output=rss'),
        (u'Kultur', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKultur%24?output=rss'),
        (u'Leben', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ELeben%24?output=rss'),
        (u'Karriere', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKarriere%24?output=rss'),
        (u'Bildung', u'http://rss.sueddeutsche.de/rss/bildung'), #2012-01-26 AGe New
        (u'Gesundheit', u'http://rss.sueddeutsche.de/rss/gesundheit'), #2012-01-26 AGe New
        (u'Medien', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EMedien%24?output=rss'),
        (u'Digital', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EDigital%24?output=rss'),
        (u'Auto', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EAuto%24?output=rss'),
        (u'Wissen', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWissen%24?output=rss'),
        (u'Reise', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EReise%24?output=rss'),
        (u'Technik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ETechnik%24?output=rss'), # sometimes only
    ]

    # AGe 2011-12-16 Problem of Handling redirections solved by a solution of Recipes-Re-usable code from kiklop74.
    # Feed is: http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ESport%24?output=rss
    # Article download source is: http://sz.de/1.1237295 (Ski Alpin: Der Erfolg kommt, der Trainer geht)
    # Article source is: http://www.sueddeutsche.de/sport/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295
    # Article printversion is: http://www.sueddeutsche.de/sport/2.220/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295

    def print_version(self, url):
        n_url=self.browser.open_novisit(url).geturl()
        main, sep, id = n_url.rpartition('/')
        return main + '/2.220/' + id

remove_tags = [dict(name='img'), dict(name='figure'), dict(name='div', attrs={'class':["headslot"] })]
extra_css = 'figure, img, .headslot, .zoomable{display:none;}'
#2 |
Connoisseur
Posts: 65
Karma: 4640
Join Date: Aug 2011
Device: kindle
|
Your indentation is wrong. You need to indent remove_tags and extra_css four spaces so they become class members.
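To illustrate why the indentation matters, here is a minimal sketch that runs without calibre. The `BasicNewsRecipe` class below is a hypothetical stand-in (not the real calibre class), and its defaults of an empty `remove_tags` list and `extra_css = None` are assumptions made for the demonstration. The `Broken` and `Fixed` class names are also made up.

```python
class BasicNewsRecipe:
    """Hypothetical stand-in for calibre's BasicNewsRecipe (assumed defaults)."""
    remove_tags = []
    extra_css = None


class Broken(BasicNewsRecipe):
    title = 'Broken'

# Not indented: these are module-level globals, not attributes of Broken,
# so the class still carries the inherited defaults.
remove_tags = [dict(name='img')]
extra_css = 'img{display:none;}'


class Fixed(BasicNewsRecipe):
    title = 'Fixed'
    # Indented four spaces: these are class members and override the defaults.
    remove_tags = [dict(name='img'), dict(name='figure'),
                   dict(name='div', attrs={'class': ['headslot']})]
    extra_css = 'figure, img, .headslot, .zoomable{display:none;}'


print(Broken.remove_tags)      # inherited default, the globals are ignored
print(len(Fixed.remove_tags))  # the three rules defined inside the class
```

Only attributes defined inside the class body are visible when the recipe class is inspected, which is presumably why the unindented settings in the posted recipe had no effect.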
#3 |
Enthusiast
Posts: 26
Karma: 86
Join Date: Jun 2012
Device: Onyx M92
|
Problem solved
Thank you, that was the problem. The most basic of mistakes, it seems.