MobileRead Forums - View Single Post

Sushi5675 · 03-10-2023, 05:21 AM

Hi there,

I am very frustrated right now and really hope someone can help me out here.

I am trying to fetch news from www.sueddeutsche.de, which works fine for some articles but not for others. In the failing articles, the paragraph text seems to be hidden somewhere in the HTML and doesn't get extracted.

I have a subscription, so the articles should be visible even if they are behind the paywall. However, fetching fails only for some articles, regardless of whether they are behind the paywall or not.

For example, this article does not work:

view-source:https://www.sueddeutsche.de/politik/...215?print=true

I use the print=true parameter because the resulting page is much cleaner.
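One way to narrow this down is to check whether the paragraph text is present at all in the HTML the server returns, or whether it only shows up later in the browser. The following is a minimal sketch using only the Python standard library; `extract_paragraphs` and `ParagraphCollector` are hypothetical helper names, not part of calibre, and the idea that the text might be missing from the static HTML is only an assumption to test.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the visible text of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buf = None   # None while outside a <p>, a list of chunks inside
        self._skip = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1
        elif tag == 'p':
            # A new <p> implicitly closes any paragraph that is still open
            self._flush()
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1
        elif tag == 'p':
            self._flush()

    def handle_data(self, data):
        if self._buf is not None and not self._skip:
            self._buf.append(data)

    def _flush(self):
        if self._buf is not None:
            # Collapse runs of whitespace and drop empty paragraphs
            text = ' '.join(''.join(self._buf).split())
            if text:
                self.paragraphs.append(text)
        self._buf = None


def extract_paragraphs(html):
    """Return the text of all non-empty <p> elements found in html."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing paragraph that was never closed
    return parser.paragraphs
```

Saving the page source of a working and a failing article to disk and running `extract_paragraphs` on both would show whether the failing one contains any paragraph text in its static HTML.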

I am really looking forward to any ideas or code examples.
If we can figure this out here, I'd be happy to share the recipe via calibre, because the last recipes I found are quite old.

Thank you!!

Code (Python):

# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

# Imports
from calibre.web.feeds.news import BasicNewsRecipe


# Recipe for Süddeutsche Zeitung (SZ)
class Sueddeutsche(BasicNewsRecipe):
    title = u'SZ8'
    description = 'News from Germany'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 10
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_empty_feeds = True
    needs_subscription = True
    use_embedded_content = False
    no_stylesheets = True
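    remove_javascript = False
    # auto_cleanup makes calibre extract the article text heuristically
    # (readability algorithms) instead of using manual cleanup rules
    auto_cleanup = True
    # simultaneous_downloads = 1
    # articles_are_obfuscated = True

    # Log in with the subscriber credentials before fetching any articles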
    def get_browser(self):
        browser = BasicNewsRecipe.get_browser(self)
        # Login
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(nr=0)  # first form
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser

    feeds = [
        (u'Politik', u'http://rss.sueddeutsche.de/rss/Politik'),
    ]

    # Request the print view of each article, which has a cleaner layout
    def print_version(self, url):
        return url + '?print=true'
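
Appending `'?print=true'` directly only yields a valid URL when the article link has no query string or fragment of its own. Whether the links in the SZ feed ever carry one is not known here, so the following is just a defensive sketch; `add_print_param` is a hypothetical helper name, and it uses only `urllib.parse` from the standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_print_param(url):
    """Return url with print=true set, keeping any existing query and fragment."""
    parts = urlsplit(url)
    # Keep existing parameters, but drop an earlier 'print' to avoid duplicates
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'print']
    query.append(('print', 'true'))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))
```

Inside the recipe, `print_version` could then simply return `add_print_param(url)`.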

03-10-2023, 05:21 AM · #1
Sushi5675 · Junior Member · Posts: 8 · Karma: 10 · Join Date: Mar 2023 · Device: Kindle Paperwhite
Thread: My SZ Recipe does not fetch all articles