Request: Recipe for ChangeX.de

bowbow · 06-01-2011, 02:55 PM

Hey there,

I am a subscriber of ChangeX.de, and I wonder how to build a recipe for this website:

The site offers RSS feeds for all its articles, but you have to be logged in to view most of its content in full.

Can you help me put that recipe together?
Thanks!

Here is what I can contribute:
- Overview of RSS feeds: http://www.changex.de/Page/Feed
- Login-Site: http://www.changex.de/Login
- The content of an article sits entirely within <div class="nl-online">; you can strip all other divs
- To try it out without a login, use the partner section: http://www.changex.de/Feed/Partnerforum/RSS20
- Example article page: http://www.changex.de/Article/rezens...r_coachingwahn
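
The content hint above maps onto calibre's keep_only_tags recipe attribute, which takes the same kind of tag specification as remove_tags. A minimal sketch, assuming the article body really is confined to <div class="nl-online"> as described; the list is shown as plain data and would sit as a class attribute inside a recipe:

```python
# Sketch: tag specification for calibre's keep_only_tags recipe attribute.
# Assumption: the article body lives entirely in <div class="nl-online">,
# so everything outside that div can be discarded.
keep_only_tags = [
    dict(name='div', attrs={'class': 'nl-online'}),
]
```

With such a whitelist in place, most remove_tags entries would probably become unnecessary, though that has not been checked against the live site.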

bowbow · 06-01-2011, 06:37 PM

I managed to build a decent, very clean recipe for the free section, with the help of the tagesschau.de recipe:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306964283(BasicNewsRecipe):
    title          = u'ChangeX'
    oldest_article = 7
    max_articles_per_feed = 100

    cover_url = 'https://7012901881146393470-a-1802744773732722657-s-sites.googlegroups.com/site/banglabeltze/Home/changex.png?attachauth=ANoY7coFJ1S94rp0tfSsNy40Vkvjz8v2yvVH6ivi5d_wHHwGKbwT9x3wTDGE-SNvpHN9dCG7oC6vEvGFZz7Z75qO5Ho_iXE2_Fr7jqzCBP8kmfRwmGkUlGJMCnQKO52m3u12QHbzEaydSpELKDDc_tKHnOj6OZ-ZRCLuiJYUBM4xYVX43sIh9hvp9mGrlvzPc6mWOYPQAOhmu1p28mLRDOASkEUG9ZZc0w%3D%3D&attredirects=1'


    remove_tags = [
        dict(name='div', attrs={'class':['right','optionbox']}),
        dict(name='div', attrs={'id':['header','footer']}),
        dict(name='a', attrs={'class':['top']}),
    ]

    # replace every link that contains only text with that plain text
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    feeds          = [(u'XPartner', u'http://www.changex.de/Feed/Partnerforum/RSS20')]

Any help with the subscription part is highly appreciated. I can also provide a (temporary) login!

Cheers!

bowbow · 06-01-2011, 07:04 PM

Some ideas on the login procedure:

I have no clue about Python programming, but the logic behind the login is as follows:

- RSS for all articles: http://www.changex.de/Feed/Home/RSS20
- Leads to first subscribers-only page: http://www.changex.de/Article/report...g_fuer_bildung
- IF not logged in, THEN the page contains <div class="subscribers weiterlesen">
- IF <div class="subscribers weiterlesen"> is present, THEN follow its link (a) to the page + ?login, e.g. http://www.changex.de/Article/report..._bildung?login
- WHEN ?login, THEN prompt for username & password AND fill <input id="nutzername" type="text" value="" name="username"> AND fill <input id="passwort" type="password" value="" name="password">
- WHEN filled in, THEN submit via <button class="login-send" type="submit">
- You should now get the page with its full content

Could anyone help me translate this into Python?

Cheers!
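
The first steps of that logic (spot the subscribers-only teaser, then build the ?login address) can be sketched in plain Python 3 with only the standard library. This is an illustration outside calibre, not a recipe; the helper names and the sample snippets are made up, and only the class names and the ?login suffix come from the description above:

```python
from html.parser import HTMLParser


class PaywallDetector(HTMLParser):
    """Sets .locked to True once a <div class="subscribers weiterlesen"> is seen."""

    def __init__(self):
        super().__init__()
        self.locked = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names.
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and 'subscribers' in classes and 'weiterlesen' in classes:
            self.locked = True


def is_locked(html):
    """Return True if the page shows the subscribers-only teaser."""
    detector = PaywallDetector()
    detector.feed(html)
    return detector.locked


def login_url(article_url):
    """Append the ?login marker described above to an article URL."""
    return article_url + '?login'


# Hypothetical page snippets, for illustration only.
teaser = '<div class="subscribers weiterlesen"><a href="?login">Weiterlesen</a></div>'
full = '<div class="nl-online"><p>Full text</p></div>'
print(is_locked(teaser), is_locked(full))
```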

Starson17 · 06-02-2011, 11:40 AM

Quote:

Originally Posted by bowbow

Some ideas on the login procedure:

I have no clue about Python programming, but the logic behind the login is as follows:

- RSS for all articles: http://www.changex.de/Feed/Home/RSS20
- Leads to first subscribers-only page: http://www.changex.de/Article/report...g_fuer_bildung
- IF not logged in, THEN the page contains <div class="subscribers weiterlesen">
- IF <div class="subscribers weiterlesen"> is present, THEN follow its link (a) to the page + ?login, e.g. http://www.changex.de/Article/report..._bildung?login
- WHEN ?login, THEN prompt for username & password AND fill <input id="nutzername" type="text" value="" name="username"> AND fill <input id="passwort" type="password" value="" name="password">
- WHEN filled in, THEN submit via <button class="login-send" type="submit">
- You should now get the page with its full content

Could anyone help me translate this into Python?

Cheers!

Look at any of the subscription recipes or here.

AFAICT, you haven't described how the site determines if you are logged in. Usually it's cookies, so the recipe needs to go to the login page before following any of the article links. The link above shows you how to go to the right page, send the user/password and the recipe then sets the correct login cookies and never sees the redirect to the login page that you seem to be describing. If it's something other than cookies (headers, etc.) then you may need more than the basic tools already built in for handling this.
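
To make the cookie idea concrete, here is a rough standard-library sketch (plain Python 3, outside calibre) of "log in first, then fetch articles with the same cookie store". It assumes the login form is submitted by POST to http://www.changex.de/Login with the fields username and password; the function names are hypothetical, and the request is only prepared, not sent:

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = 'http://www.changex.de/Login'


def make_opener():
    """Build an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(username, password):
    """Prepare (but do not send) the POST that submits the login form."""
    data = urllib.parse.urlencode({'username': username, 'password': password})
    return urllib.request.Request(LOGIN_URL, data=data.encode('utf-8'))


opener, jar = make_opener()
request = build_login_request('user', 'secret')  # placeholder credentials
print(request.get_method(), request.full_url)
# To log in for real, call opener.open(request). Any session cookie the site
# sets ends up in the jar, and later opener.open(article_url) calls send it
# back automatically.
```

Inside a recipe, the browser object returned by get_browser plays the role of this opener, which is why logging in once before any article is fetched is enough.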

bowbow · 06-02-2011, 05:13 PM

Hey,

thanks for your help. I studied all the linked material and made some progress.

- Login is managed via cookies (JSESSIONID), so basically your linked solution should work

- I built a recipe that should do it, but there is (I think) one problem left to solve. Here is the setup:

Code:

import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class AdvancedUserRecipe1307036487(BasicNewsRecipe):
    title          = u'ChangeX Subscription'
    oldest_article = 7
    max_articles_per_feed = 100
    needs_subscription = True

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        br.open('http://www.changex.de/')
        if self.username is not None and self.password is not None:
            br.open('http://www.changex.de/Login')
            br.select_form(name='login')
            br['nutzername']   = self.username
            br['passwort'] = self.password
            br.submit()
        return br

    feeds          = [(u'Arbeit und Leben', u'http://www.changex.de/Feed/ArbeitUndLeben/RSS20'), (u'Wirtschaft und Management', u'http://www.changex.de/Feed/WirtschaftUndManagement/RSS20'), (u'Wissen und Lernen', u'http://www.changex.de/Feed/WissenUndLernen/RSS20')]

The problematic section is

Code:

            br.select_form(name='login')

Because I get the error message

Code:

calibre, version 0.8.3
  File "/var/folders/fF/fFUreYAQGJWaokB+Pqrd+U+++TI/-Tmp-/calibre_0.8.3_tmp_doxSSg/calibre_0.8.3_xq8CvS_recipes/recipe0.py", line 20, in get_browser
    br.select_form(name='login')
  File "site-packages/mechanize/_mechanize.py", line 524, in select_form
mechanize._mechanize.FormNotFoundError: no form matching name 'login'

The problem might be that the login form on changex.de/Login does not have a name attribute. It is marked up as follows:

Code:

<div class="leftside">
<h1>changeX Login (JavaScript-frei)</h1>
<form method="post" action="/Login">
<p>
<label for="nutzername">Nutzername:*</label>
<input id="nutzername" type="text" value="" name="username">
</p>
<p>
<label for="passwort">Passwort:*</label>
<input id="passwort" type="password" value="" name="password">
</p>
<p>
<button type="submit">
</p>
</form>
</div>

Unfortunately, I could not find out how mechanize can process that very form, so I am not sure whether my setup would work otherwise.

Do you know how to direct mechanize to a form without a name?

kovidgoyal · 06-02-2011, 05:22 PM

use

select_form(nr=1)

This selects by form number (i.e. the order in which forms appear on the page). Change the number to whatever your page has.
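
mechanize counts forms from zero, so nr=1 is the second form on the page. If it is unclear which number a page needs, the forms can be listed in page order. The following is a small standard-library sketch (plain Python 3, not mechanize itself); the leading search form in the sample page is an assumption made up for the example, while the login fields are the ones quoted earlier in the thread:

```python
from html.parser import HTMLParser


class FormIndexer(HTMLParser):
    """Records, for each form in page order, the names of its input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one set of field names per <form>
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.forms.append(set())
            self._in_form = True
        elif tag == 'input' and self._in_form:
            name = dict(attrs).get('name')
            if name:
                self.forms[-1].add(name)

    def handle_endtag(self, tag):
        if tag == 'form':
            self._in_form = False


def form_number(html, required_fields):
    """Return the zero-based index of the first form holding all given fields."""
    indexer = FormIndexer()
    indexer.feed(html)
    for nr, fields in enumerate(indexer.forms):
        if set(required_fields) <= fields:
            return nr
    return None


# Hypothetical page: a search form first, then the login form from the site.
page = '''
<form method="get" action="/Search"><input type="text" name="q"></form>
<form method="post" action="/Login">
<input id="nutzername" type="text" value="" name="username">
<input id="passwort" type="password" value="" name="password">
</form>
'''
print(form_number(page, ['username', 'password']))
```

The number printed for a real page is the value to pass as nr to select_form.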

bowbow · 06-03-2011, 08:47 AM

Hey,

that was the decisive tip; now the recipe works. It could be submitted to the repository. I have just one question left; you will find it below the code.

Code:

import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class AdvancedUserRecipe1307036487(BasicNewsRecipe):
    title          = u'ChangeX Subscription'
    oldest_article = 7
    max_articles_per_feed = 100
    needs_subscription = True

    cover_url = 'https://7012901881146393470-a-1802744773732722657-s-sites.googlegroups.com/site/banglabeltze/Home/changex.png?attachauth=ANoY7coFJ1S94rp0tfSsNy40Vkvjz8v2yvVH6ivi5d_wHHwGKbwT9x3wTDGE-SNvpHN9dCG7oC6vEvGFZz7Z75qO5Ho_iXE2_Fr7jqzCBP8kmfRwmGkUlGJMCnQKO52m3u12QHbzEaydSpELKDDc_tKHnOj6OZ-ZRCLuiJYUBM4xYVX43sIh9hvp9mGrlvzPc6mWOYPQAOhmu1p28mLRDOASkEUG9ZZc0w%3D%3D&attredirects=1'


    remove_tags = [
        dict(name='div', attrs={'class':['right','optionbox','center']}),
        dict(name='div', attrs={'id':['header','footer']}),
        dict(name='a', attrs={'class':['top']}),
    ]

    # replace every link that contains only text with that plain text
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        br.open('http://www.changex.de/')
        if self.username is not None and self.password is not None:
            br.open('http://www.changex.de/Login')
            br.select_form(nr=1)
            # mechanize addresses form fields by their name attribute
            # ('username', 'password'), not by their id ('nutzername', 'passwort')
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    feeds          = [(u'Arbeit und Leben', u'http://www.changex.de/Feed/ArbeitUndLeben/RSS20'), (u'Wirtschaft und Management', u'http://www.changex.de/Feed/WirtschaftUndManagement/RSS20'), (u'Wissen und Lernen', u'http://www.changex.de/Feed/WissenUndLernen/RSS20')]

Just one question left:
- I wanted to create an archive of past articles first, so I set the limits as follows:

Code:

    oldest_article = 900
    max_articles_per_feed = 1000

When I open the RSS feed (e.g. http://www.changex.de/Feed/ArbeitUndLeben/RSS20) with Google Reader, I get a long list of articles reaching back to 01.01.2010. Calibre, however, only processes entries back to Nov 23 2010, and even leaves out some entries from that very day. That seems very odd to me.

I know recipes are meant only for frequent downloading.
Do you nonetheless have an idea how to get all articles from that feed?

Thanks so much for your support!

kovidgoyal · 06-03-2011, 11:02 AM

RSS feeds only publish a few articles at a time. Google Reader caches those, giving you the illusion that they are all there.
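
One way to see this for a given feed is to count the items the raw XML actually contains and look at their date range. A minimal sketch in plain Python 3, assuming an RSS 2.0 feed whose items carry a pubDate; the function name and the two-item sample feed are made up for illustration:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def feed_span(rss_xml):
    """Return (item count, oldest pubDate, newest pubDate) for an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    dates = [
        parsedate_to_datetime(item.findtext('pubDate'))
        for item in root.iter('item')
        if item.findtext('pubDate')
    ]
    return len(dates), min(dates), max(dates)


# Made-up two-item feed, for illustration only.
sample = '''<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><pubDate>Tue, 23 Nov 2010 10:00:00 +0100</pubDate></item>
<item><title>B</title><pubDate>Fri, 03 Jun 2011 08:00:00 +0200</pubDate></item>
</channel></rss>'''
count, oldest, newest = feed_span(sample)
print(count, oldest.date(), newest.date())
```

Run against the downloaded feed text, the oldest date shows how far back the feed itself reaches, independent of any reader's cache.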

bowbow · 06-03-2011, 11:15 AM

Quote:

Originally Posted by kovidgoyal

RSS feeds only publish a few articles at a time. Google Reader caches those, giving you the illusion that they are all there.

And Calibre cannot cache those feeds, right? Is there any workaround to fetch older feed articles, either within or outside Calibre?

kovidgoyal · 06-03-2011, 11:17 AM

Find a source that has those articles and write a recipe with a custom parse_index method to get them.
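
A parse_index method has to return a list of (feed title, list of articles) tuples, where each article is a dictionary with keys such as title, url, date and description. The sketch below builds that structure from an archive page using only the standard library (plain Python 3). The archive markup, the helper names and the assumption that article links start with /Article/ are illustrative guesses, not taken from the site:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = 'http://www.changex.de/'


class ArticleLinkCollector(HTMLParser):
    """Collects one article dict per link whose href starts with /Article/."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._href = None   # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href') or ''
        if tag == 'a' and href.startswith('/Article/'):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.articles.append({
                'title': ''.join(self._text).strip(),
                'url': urljoin(BASE_URL, self._href),
                'date': '',
                'description': '',
            })
            self._href = None


def build_index(feed_title, archive_html):
    """Return data in the shape calibre expects from parse_index()."""
    collector = ArticleLinkCollector()
    collector.feed(archive_html)
    return [(feed_title, collector.articles)]


# Hypothetical archive page, for illustration only.
archive = '<ul><li><a href="/Article/example_one">Example one</a></li></ul>'
print(build_index('Archive', archive))
```

In an actual recipe the same list would be returned from parse_index(self), with the page typically fetched through the recipe's index_to_soup helper instead of a hand-written parser.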



