My SZ Recipe does not fetch all articles

Sushi5675 · 03-10-2023, 05:21 AM

Hi there,

I am very frustrated right now and really hope someone can help me out here.

I am trying to fetch news from www.sueddeutsche.de, which works fine for some articles but not for others. The paragraph text is somehow hidden in the HTML code and doesn't get extracted.

I have a subscription, so the articles should be visible even though they are behind a paywall. But fetching fails for only some articles, regardless of whether they are behind a paywall or not.

This article, for example, is not working:

view-source:https://www.sueddeutsche.de/politik/...215?print=true

I use the print=true parameter because the output is much cleaner than...

I am really looking forward to any idea or code example.
If we can figure this out here, I'd be happy to share the recipe via calibre, because the last recipes I found are quite old...

Thank you!!

Code:

# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

#import
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre import strftime

##SZ
class Sueddeutsche(BasicNewsRecipe):
    title = u'SZ8'
    description = 'News from Germany'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 10
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_empty_feeds = True
    needs_subscription = True
    use_embedded_content = False
    no_stylesheets = True
    remove_javascript = False
    auto_cleanup = True
    #simultaneous_downloads = 1
    #articles_are_obfuscated = True

    
    #add login

    def get_browser(self):
        browser = BasicNewsRecipe.get_browser(self)
        # Login
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(nr=0)  # first form
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser

    feeds = [  
        (u'Politik', u'http://rss.sueddeutsche.de/rss/Politik'),
    ]
    
    
    def print_version(self, url):
        return url + '?print=true'

unkn0wn · 03-10-2023, 01:11 PM

Maybe auto_cleanup fails.

You can check by removing auto_cleanup; the recipe will then load everything from the page. That should tell you whether the fetched link actually has no content, or whether it's a login issue.
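
One rough way to tell "the page really has no article text" apart from "cleanup removed it" is to count how much paragraph text the raw HTML contains before any cleanup runs. The sketch below uses only the standard library. The names ParagraphText and paragraph_chars are made up for illustration, and nothing here is specific to sueddeutsche.de.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a <p>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def paragraph_chars(html):
    """Return the number of characters of <p> text in the given HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return len(''.join(parser.chunks).strip())
```

Inside a recipe, this could be called from preprocess_raw_html(self, raw_html, url) to log the count per URL before returning raw_html unchanged. A count near zero for an article that is long in a normal browser would point to a login or paywall problem rather than to auto_cleanup.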

Sushi5675 · 03-10-2023, 05:13 PM

Thank you for your reply. Unfortunately, it doesn't work with auto_cleanup = False or with the setting removed.

I've rechecked and tested via console output and debugging mode.

And now it seems to be a login issue after all.

But what is wrong with the "def get_browser(self)" section?

Sushi5675 · 03-19-2023, 04:19 AM

Hi,

unfortunately I am not able to resolve the issue.

When I enter the login data on
https://id.sueddeutsche.de/login
and press Enter, it does not log in automatically. I have to click the Login button.

In addition, in the settings of my SZ profile I can see my logged-in sessions, but not the browser from the calibre news recipe.

Does that mean that the browser.submit() function is probably not working either, and I am not logged in after all?

Is there an alternative to the browser.submit() function?

Here is the form from id.sueddeutsche.de/login:

Code:

<div id="loginbox">
  <form class="top-boxes" id="login-form" method="post" role="form" action="/login">
    <div class="form-group floating-label js-required">
      <label for="id_login">E-Mail Adresse</label>
      <input type="text" name="login" id="login_login-form" class="form-control" />
    </div>
    <div class="form-group floating-label js-required">
      <label for="id_password">Passwort</label>
      <input type="password" name="password" id="password_login-form" class="form-control" />
      <div class="field-help help"><a href="&#x2F;resetpassword">Passwort vergessen</a></div>
    </div>
    <div class="form-group rememberme checkbox-group">
      <div class="table-box">
        <div class="custom-checkbox">
          <input type="checkbox" name="remember_me" id="id_remember_me" value="on" class="form-control" checked="checked" />
          <div class="box"><div class="tick"></div></div>
        </div>
        <div class="label-box"><label for="id_remember_me">Angemeldet bleiben</label></div>
      </div>
    </div>
    <div class="form-group hidden">
      <input type="hidden" name="login_ticket" id="login_ticket_login-form" value="LT-l0wVOXDqTgF9GfUzQhy7HuN63LIni" />
    </div>
    <div id="creTracking-login"></div>
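
On the question of an alternative to submit(): the form above carries a hidden login_ticket field, so any hand-built POST would presumably have to send that value along with the credentials. The sketch below only shows how such a payload could be assembled from the page source with the standard library. FormFields and login_payload are hypothetical names, and whether the site accepts a request built this way is untested.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFields(HTMLParser):
    """Collect name/value pairs of <input> elements inside one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.inside = False     # True while inside the wanted <form>
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and attrs.get('id') == self.form_id:
            self.inside = True
        elif tag == 'input' and self.inside and attrs.get('name'):
            # Inputs without a value attribute (text boxes) start out empty.
            self.fields[attrs['name']] = attrs.get('value') or ''

    def handle_endtag(self, tag):
        if tag == 'form':
            self.inside = False


def login_payload(html, username, password, form_id='login-form'):
    """Return a URL-encoded POST body with credentials plus hidden fields."""
    parser = FormFields(form_id)
    parser.feed(html)
    parser.close()
    fields = parser.fields
    fields['login'] = username
    fields['password'] = password
    return urlencode(fields)
```

The resulting string could then be passed as the data argument when opening the form's action URL with the recipe browser. That step is an assumption about how the site behaves and is not something confirmed in this thread.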

Sushi5675 · 03-19-2023, 04:22 AM

Hi,

unfortunately my recipe does not work and I can't figure out how to solve it.

When I log in manually at url = https://id.sueddeutsche.de/login and press Enter after filling in the fields, nothing happens.

Maybe the browser.submit() function is not working either?
Is there an alternative to submit() for logging in with the browser session in the news recipe?

Sushi5675 · 03-19-2023, 04:24 AM

Also, in my SZ profile I can see all my logged-in devices.
But the calibre browser session is not listed, so I assume my login does not work.

unkn0wn · 03-21-2023, 12:03 PM

try

Code:

    def get_browser(self):
        
        def is_form_login(form):
            return "id" in form.attrs and form.attrs['id'] == "login-form"

        browser = BasicNewsRecipe.get_browser(self)
        # Login
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(predicate=is_form_login)
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser
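
For context on the snippet above: select_form(predicate=...) calls the given function once per form on the page and selects the form for which it returns a true value, whereas nr=0 simply takes the first form, which may not be the login form. The predicate can be tried outside calibre with stand-in objects. In the sketch below, the SimpleNamespace forms are made-up test data, and attrs.get() is just a shorter spelling of the same check.

```python
from types import SimpleNamespace


def is_form_login(form):
    # form.attrs maps the <form> tag's attribute names to their values;
    # .get() returns None when the form has no id at all.
    return form.attrs.get('id') == 'login-form'


# Stand-ins for the form objects the browser would pass in (hypothetical data).
forms = [
    SimpleNamespace(attrs={'id': 'search-form'}),
    SimpleNamespace(attrs={'id': 'login-form', 'method': 'post'}),
    SimpleNamespace(attrs={}),
]

# Only the form whose id is "login-form" passes the predicate.
selected = [form for form in forms if is_form_login(form)]
```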

Sushi5675 · 04-11-2023, 01:05 PM

Thanks unkn0wn, the notification of your post didn't work, so please excuse my late reply.

I debugged the output and the login works.
In the console output I can read my profile ID, which is only visible after a successful login.

But unfortunately only two or three articles are readable.

The strange thing is that some articles behind the paywall are readable and others are not. The rest of the articles are truncated.

Any ideas?

Code:

# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

#import
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre import strftime
import time


##SZ
class Sueddeutsche(BasicNewsRecipe):
    title = u'SZ8'
    description = 'News from Germany'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 10
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_empty_feeds = True
    needs_subscription = True

    
    simultaneous_downloads = 1
    recursions = 0

    feeds = [  
        #(u'Politik', u'http://rss.sueddeutsche.de/rss/Politik'),
        
        (u'SZ', u'https://www.sueddeutsche.de/news/rss?search=&sort=date&dep%5B%5D=politik&typ%5B%5D=article&all%5B%5D=sys&all%5B%5D=time&sys%5B%5D=sz&catsz%5B%5D=szTopThemes'),
    ]
    
    def get_browser(self):
        def is_form_login(form):
            return "id" in form.attrs and form.attrs['id'] == "login-form"
        browser = BasicNewsRecipe.get_browser(self)
        # Login
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(predicate=is_form_login)
        #browser.select_form(nr=0)  # first form
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser

    def print_version(self, url):
        if '?' in url:
            new_url = self.browser.open(url + '&print=true').geturl()
        else:
            new_url = self.browser.open(url + '?print=true').geturl()
        return new_url
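
The '?' check in print_version above can also be written with urllib.parse, which keeps fragments intact and avoids adding print=true twice. This is only a sketch of the URL handling; with_print_flag is a made-up name. Unlike the recipe code, it does not open the URL, so it does not follow redirects the way the geturl() call does.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_print_flag(url):
    """Return url with print=true in its query string, added at most once."""
    parts = urlsplit(url)
    # Keep every existing parameter except an earlier print=... entry.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != 'print'
    ]
    query.append(('print', 'true'))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Inside the recipe, print_version(self, url) could then simply return with_print_flag(url), assuming the extra request to resolve redirects is not needed.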

unkn0wn · 04-12-2023, 01:34 AM

Maybe don't use the print_version part.
Check once without it.
If it works, you can add auto_cleanup = True.

Why is the SZ feed link so long?
Just use (u'SZ', u'https://www.sueddeutsche.de/news/rss'),

Sushi5675 · 05-02-2023, 01:36 PM

Sorry again for my late reply.

I've tested again in every way I could think of, but the problem persists.

The feed I was using narrows the articles down to a specific type and source. But downloading articles that are restricted, and not simple dpa news, still doesn't work.

Any other suggestions?

unkn0wn · 05-03-2023, 02:20 AM

Pm me your login details. Attach the recipe, I can check.

Sushi5675 · 05-21-2023, 02:23 AM

Hi,

I still can't get it to work... Thanks @unkn0wn for all your input.

The initial login procedure works, but the session is probably not staying logged in (without JavaScript?). Maybe we need something similar to the WSJ or Irish Times recipes?

Current status is:

Code:

# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

'''
Fetch sueddeutsche.de
'''
from calibre.web.feeds.news import BasicNewsRecipe, classes

class Sueddeutsche(BasicNewsRecipe):

    title = u'SZ'
    description = 'News from Germany, Access to online content'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics, Germany'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 100
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_attributes = ['style', 'height', 'width']
    needs_subscription = True
    use_embedded_content = False
    no_stylesheets = True
    
    def get_browser(self):
        
        def is_form_login(form):
            return "id" in form.attrs and form.attrs['id'] == "login-form"
        
        browser = BasicNewsRecipe.get_browser(self)

        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)

        browser.select_form(predicate=is_form_login)
        #browser.select_form(nr=0)  
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()

        return browser
    
    keep_only_tags = [
        classes('lp_is_start custom-1qvpywd')
    ]
    
    remove_tags = [
        dict(name=['button', 'aside', 'nav']),
        classes('teaserable-layout teaserable-layout--teaser')
    ]

    feeds = [
        (u'SZ', u'https://www.sueddeutsche.de/news/rss'),
    ]
    
    def preprocess_html(self, soup):
        for pic in soup.findAll('picture'):
            if nos := pic.find('noscript'):
                nos.name = 'div'
        for img in soup.findAll('img', attrs={'src':lambda n: n and n.startswith('data:')}):
            img.extract()
        return soup
    
    def print_version(self, url):
        return url.split('?')[0]
