Hi there,
I am very frustrated right now and really hope someone can help me out here.
I am trying to fetch news from www.sueddeutsche.de, which works fine for some articles but not for others. The paragraph text is somehow hidden in the HTML code and doesn't get extracted.
I have a subscription, so the articles should be visible even though they are behind a paywall. But the fetching process fails only on some articles, regardless of whether they are behind a paywall or not.
For example, this article is not working:
view-source:https://www.sueddeutsche.de/politik/...215?print=true
I use the print=true parameter because the output is much cleaner than...
I am really looking forward to any idea or code example.
If we can figure this out here, I'd be happy to share the recipe via calibre, because the last recipes I found are quite old...
Thank you!!
Code:
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

# Imports
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre import strftime


# SZ
class Sueddeutsche(BasicNewsRecipe):
    title = u'SZ8'
    description = 'News from Germany'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 10
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_empty_feeds = True
    needs_subscription = True
    use_embedded_content = False
    no_stylesheets = True
    remove_javascript = False
    auto_cleanup = True
    #simultaneous_downloads = 1
    #articles_are_obfuscated = True

    # Add login
    def get_browser(self):
        browser = BasicNewsRecipe.get_browser(self)
        # Login
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(nr=0)  # first form
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser

    feeds = [
        (u'Politik', u'http://rss.sueddeutsche.de/rss/Politik'),
    ]

    def print_version(self, url):
        return url + '?print=true'
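
One thing that may be worth ruling out (this is an assumption, not a confirmed cause of the missing text): print_version appends '?print=true' blindly, so if a feed link already carries a query string or a fragment, the resulting URL would be malformed. Below is a minimal sketch of a more defensive variant. The helper name add_print_param is hypothetical, and it assumes Python 3 (calibre 5 or later), where urllib.parse is available.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_print_param(url):
    """Return url with print=true set, keeping any existing query and fragment.

    Hypothetical helper for illustration; not part of the calibre API.
    """
    parts = urlsplit(url)
    # Keep existing parameters, but drop any earlier 'print' value
    # so that the parameter is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != 'print'
    ]
    query.append(('print', 'true'))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Inside the recipe, print_version could then simply return add_print_param(url). If the feed links never contain a query string, this behaves the same as the original one-liner, so the extraction problem would have to lie elsewhere, for example in how auto_cleanup treats those pages.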