|
|
#1 |
|
Enthusiast
![]() Posts: 26
Karma: 86
Join Date: Jun 2012
Device: Onyx M92
|
getting rid of images: remove_tags has no effect?
I am using the builtin recipe to acquire the rss from sueddeutsche.de. The recipe is very useful. However, I found myself unable to the articles from any graphical elements. I thought this could be achieved by inspecting the sites html and adding the specific elements to the remove_tags-list. However, this does not help. My workaround was to make the unwanted elements invisible using the extra_css option. Still, no sucess. Did I make some sort of gigantic mistake? After tinkering with the recipe for days and having learned quite a lot of useful things, I am stuck. I humbly request your help in this matter.
In order to get a better understandig of the problem, please refer to the following sources: An example of an article that I want to strip the picture (along with its caption) from: http://www.sueddeutsche.de/politik/2...tion-1.1392387 The recipe I am using. It is based on the builtin one called "Süddeutsche.de" and unfortunately includes all the images I want to get rid of: Code:
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2012, Kovid Goyal <kovid at kovidgoyal.net>' # 2012-01-26 AGe change to actual Year
'''
Fetch sueddeutsche.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
class Sueddeutsche(BasicNewsRecipe):
title = u'Süddeutsche.de' # 2012-01-26 AGe Correct Title
description = 'News from Germany, Access to online content' # 2012-01-26 AGe
__author__ = 'Oliver Niesner and Armin Geller' #Update AGe 2012-01-26
publisher = u'Süddeutsche Zeitung' # 2012-01-26 AGe add
category = 'news, politics, Germany' # 2012-01-26 AGe add
timefmt = ' [%a, %d %b %Y]' # 2012-01-26 AGe add %a
oldest_article = 2
max_articles_per_feed = 100
simultaneous_downloads = 75
language = 'de'
encoding = 'utf-8'
publication_type = 'newspaper' # 2012-01-26 add
cover_source = 'http://www.sueddeutsche.de/verlag' # 2012-01-26 AGe add from Darko Miletic paid content source
masthead_url = 'http://www.sueddeutsche.de/static_assets/build/img/sdesiteheader/logo_homepage.441d531c.png' # 2012-01-26 AGe add
use_embedded_content = False
no_stylesheets = True
remove_javascript = True
auto_cleanup = True
feeds = [
(u'Politik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EPolitik%24?output=rss'),
(u'Wirtschaft', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWirtschaft%24?output=rss'),
(u'Geld', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EGeld%24?output=rss'),
(u'Kultur', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKultur%24?output=rss'),
(u'Leben', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ELeben%24?output=rss'),
(u'Karriere', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKarriere%24?output=rss'),
(u'Bildung', u'http://rss.sueddeutsche.de/rss/bildung'), #2012-01-26 AGe New
(u'Gesundheit', u'http://rss.sueddeutsche.de/rss/gesundheit'), #2012-01-26 AGe New
(u'Medien', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EMedien%24?output=rss'),
(u'Digital', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EDigital%24?output=rss'),
(u'Auto', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EAuto%24?output=rss'),
(u'Wissen', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWissen%24?output=rss'),
(u'Reise', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EReise%24?output=rss'),
(u'Technik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ETechnik%24?output=rss'), # sometimes only
]
# AGe 2011-12-16 Problem of Handling redirections solved by a solution of Recipes-Re-usable code from kiklop74.
# Feed is: http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ESport%24?output=rss
# Article download source is: http://sz.de/1.1237295 (Ski Alpin: Der Erfolg kommt, der Trainer geht)
# Article source is: http://www.sueddeutsche.de/sport/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295
# Article printversion is: http://www.sueddeutsche.de/sport/2.220/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295
def print_version(self, url):
n_url=self.browser.open_novisit(url).geturl()
main, sep, id = n_url.rpartition('/')
return main + '/2.220/' + id
remove_tags = [dict(name='img'), dict(name='figure'), dict(name='div', attrs={'class':["headslot"] })]
extra_css = 'figure, img, .headslot, .zoomable{display:none;}'
|
|
|
|
|
|
#2 |
|
Connoisseur
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 65
Karma: 4640
Join Date: Aug 2011
Device: kindle
|
Your indentation is wrong. You need to indent remove_tags and extra_css four spaces so they become class members.
|
|
|
|
| Advert | |
|
|
|
|
#3 |
|
Enthusiast
![]() Posts: 26
Karma: 86
Join Date: Jun 2012
Device: Onyx M92
|
Problem solved
Thank you, that was the problem. The most basic of mistakes, it seems
|
|
|
|
![]() |
| Tags |
| recipe |
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Priority between keep_only_tags and remove_tags | BruceBerry | Recipes | 1 | 11-19-2011 04:10 PM |
| Affect or effect | mr ploppy | Writers' Corner | 6 | 07-20-2011 05:00 PM |
| remove_tags does not work | JFS-NMF | Recipes | 1 | 03-04-2011 02:56 PM |
| Help Please: remove_tags doesn't work in WSJ Chinese | Jmot | Recipes | 5 | 02-21-2011 05:10 AM |
| Effect of MR Promotion, or not? | ASparrow | Writers' Corner | 51 | 11-26-2010 06:23 PM |