Old 06-25-2012, 08:50 AM   #1
Read&Write
Enthusiast
 
Posts: 26
Karma: 86
Join Date: Jun 2012
Device: Onyx M92
getting rid of images: remove_tags has no effect?

I am using the built-in recipe to fetch the RSS feeds from sueddeutsche.de. The recipe is very useful, but I have been unable to strip the graphical elements from the articles. I thought this could be achieved by inspecting the site's HTML and adding the specific elements to the remove_tags list. However, this has no effect. As a workaround, I tried to make the unwanted elements invisible using the extra_css option. Still no success. Did I make some sort of gigantic mistake? After tinkering with the recipe for days and learning quite a lot of useful things, I am stuck. I humbly request your help in this matter.

To get a better understanding of the problem, please refer to the following sources:

An example of an article that I want to strip the picture (along with its caption) from:
http://www.sueddeutsche.de/politik/2...tion-1.1392387

The recipe I am using. It is based on the built-in one called "Süddeutsche.de"; unfortunately, its output still includes all the images I want to get rid of:

Code:
# -*- coding: utf-8 -*-
__license__   = 'GPL v3'
__copyright__ = '2012, Kovid Goyal <kovid at kovidgoyal.net>' # 2012-01-26 AGe change to actual Year

'''
Fetch sueddeutsche.de
'''
from calibre.web.feeds.news import BasicNewsRecipe
class Sueddeutsche(BasicNewsRecipe):

    title                 = u'Süddeutsche.de'                 # 2012-01-26 AGe Correct Title
    description           = 'News from Germany, Access to online content' # 2012-01-26 AGe
    __author__            = 'Oliver Niesner and Armin Geller' #Update AGe 2012-01-26
    publisher             = u'Süddeutsche Zeitung'             # 2012-01-26 AGe add
    category              = 'news, politics, Germany'         # 2012-01-26 AGe add
    timefmt               = ' [%a, %d %b %Y]'                 # 2012-01-26 AGe add %a
    oldest_article        = 2
    max_articles_per_feed = 100
    simultaneous_downloads = 75
    language              = 'de'
    encoding              = 'utf-8'
    publication_type      = 'newspaper'                         # 2012-01-26 add
    cover_source          = 'http://www.sueddeutsche.de/verlag' # 2012-01-26 AGe add from Darko Miletic paid content source
    masthead_url          = 'http://www.sueddeutsche.de/static_assets/build/img/sdesiteheader/logo_homepage.441d531c.png' # 2012-01-26 AGe add

    use_embedded_content  = False
    no_stylesheets        = True
    remove_javascript     = True
    auto_cleanup          = True

    feeds = [
              (u'Politik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EPolitik%24?output=rss'),
              (u'Wirtschaft', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWirtschaft%24?output=rss'),
              (u'Geld', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EGeld%24?output=rss'),
              (u'Kultur', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKultur%24?output=rss'),
              (u'Leben', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ELeben%24?output=rss'),
              (u'Karriere', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKarriere%24?output=rss'),
              (u'Bildung', u'http://rss.sueddeutsche.de/rss/bildung'),         #2012-01-26 AGe New
              (u'Gesundheit', u'http://rss.sueddeutsche.de/rss/gesundheit'),   #2012-01-26 AGe New
              (u'Medien', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EMedien%24?output=rss'),
              (u'Digital', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EDigital%24?output=rss'),
              (u'Auto', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EAuto%24?output=rss'),
              (u'Wissen', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EWissen%24?output=rss'),
              (u'Reise', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EReise%24?output=rss'),
              (u'Technik', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ETechnik%24?output=rss'), # sometimes only
            ]
# AGe 2011-12-16 Problem of Handling redirections solved by a solution of Recipes-Re-usable code from kiklop74.
# Feed is:                    http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ESport%24?output=rss
# Article download source is: http://sz.de/1.1237295 (Ski Alpin: Der Erfolg kommt, der Trainer geht)
# Article source is:          http://www.sueddeutsche.de/sport/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295
# Article printversion is:    http://www.sueddeutsche.de/sport/2.220/ski-alpin-der-erfolg-kommt-der-trainer-geht-1.1237295
    def print_version(self, url):
        n_url=self.browser.open_novisit(url).geturl()
        main, sep, id = n_url.rpartition('/')
        return main + '/2.220/' + id

remove_tags = [dict(name='img'), dict(name='figure'), dict(name='div', attrs={'class':["headslot"] })]
extra_css = 'figure, img, .headslot, .zoomable{display:none;}'
I would be grateful for your help. I would definitely like to learn about this issue so I can avoid the problem in the future.
Old 06-25-2012, 02:59 PM   #2
NotTaken
Connoisseur
 
Posts: 65
Karma: 4640
Join Date: Aug 2011
Device: kindle
Your indentation is wrong. You need to indent remove_tags and extra_css four spaces so they become class members.
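
A minimal sketch of why the indentation matters. To keep it runnable without calibre installed, it uses a stand-in BasicNewsRecipe class with made-up defaults (an assumption, not calibre's real class), and the names BrokenRecipe and FixedRecipe are hypothetical. The remove_tags and extra_css values are the ones from the recipe above.

```python
# Stand-in for calibre.web.feeds.news.BasicNewsRecipe (an assumption, so this
# runs without calibre installed). The defaults here are illustrative only.
class BasicNewsRecipe(object):
    remove_tags = []
    extra_css = None


class BrokenRecipe(BasicNewsRecipe):
    title = u'Broken'

# Not indented: these are plain module-level variables, not class members,
# so the recipe class never sees them.
remove_tags = [dict(name='img'), dict(name='figure')]
extra_css = 'figure, img {display:none;}'


class FixedRecipe(BasicNewsRecipe):
    title = u'Fixed'

    # Indented four spaces: now they override the base-class attributes.
    remove_tags = [dict(name='img'), dict(name='figure'),
                   dict(name='div', attrs={'class': ['headslot']})]
    extra_css = 'figure, img, .headslot, .zoomable{display:none;}'


print(BrokenRecipe.remove_tags)   # still the base-class default
print(FixedRecipe.remove_tags)    # the list defined inside the class
```

In the posted recipe, that would mean moving the last two lines to the same indentation level as feeds and the def print_version line.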
Old 06-26-2012, 02:27 PM   #3
Read&Write
Enthusiast
 
Posts: 26
Karma: 86
Join Date: Jun 2012
Device: Onyx M92
Problem solved

Thank you, that was the problem. The most basic of mistakes, it seems.