Old 06-01-2011, 02:55 PM   #1
bowbow
Member
bowbow began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
Request: Recipe for ChangeX.de

Hey there,

I am a subscriber of ChangeX.de, and I would like to build a recipe for this website.

The site offers RSS feeds for all its articles, but you have to be logged in to view most of the content in full.

Can you help me put that recipe together?
Thanks!

Here is what I can contribute:
- Overview of RSS feeds: http://www.changex.de/Page/Feed
- Login page: http://www.changex.de/Login
- The article content sits solely within <div class="nl-online">; all other divs can be stripped
- To try it out without a login, use the partner section: http://www.changex.de/Feed/Partnerforum/RSS20
- Example article page: http://www.changex.de/Article/rezens...r_coachingwahn
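
The "keep only nl-online" rule could probably be expressed with calibre's keep_only_tags recipe attribute. This is only a minimal sketch: it assumes the class name is spelled exactly as in the list above, and inside a real recipe the line would sit in the recipe class body.

```python
# Keep only the article body; everything outside this div is dropped.
# Assumption: the div's class attribute is exactly "nl-online".
keep_only_tags = [dict(name='div', attrs={'class': 'nl-online'})]
```

With this in place, most remove_tags entries may become unnecessary, because anything outside the kept div is already discarded.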
Old 06-01-2011, 06:37 PM   #2
bowbow
Member
bowbow began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
I managed to get a decent, very clean recipe for the free section, with the help of the tagesschau.de recipe:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306964283(BasicNewsRecipe):
    title          = u'ChangeX'
    oldest_article = 7
    max_articles_per_feed = 100

    cover_url = 'https://7012901881146393470-a-1802744773732722657-s-sites.googlegroups.com/site/banglabeltze/Home/changex.png?attachauth=ANoY7coFJ1S94rp0tfSsNy40Vkvjz8v2yvVH6ivi5d_wHHwGKbwT9x3wTDGE-SNvpHN9dCG7oC6vEvGFZz7Z75qO5Ho_iXE2_Fr7jqzCBP8kmfRwmGkUlGJMCnQKO52m3u12QHbzEaydSpELKDDc_tKHnOj6OZ-ZRCLuiJYUBM4xYVX43sIh9hvp9mGrlvzPc6mWOYPQAOhmu1p28mLRDOASkEUG9ZZc0w%3D%3D&attredirects=1'

    # Remove page elements that are not part of the article
    remove_tags = [
        dict(name='div', attrs={'class':['right','optionbox']}),
        dict(name='div', attrs={'id':['header','footer']}),
        dict(name='a', attrs={'class':['top']}),
    ]

    # Remove all hyperlinks, keeping only their text
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    feeds          = [(u'XPartner', u'http://www.changex.de/Feed/Partnerforum/RSS20')]

Any help with the subscription part is highly appreciated. I can also provide a (temporary) login!

Cheers!
Old 06-01-2011, 07:04 PM   #3
bowbow
Member
bowbow began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
Some ideas on the login procedure:

I have no clue about Python programming, but the logic behind the login is the following:

- RSS for all articles: http://www.changex.de/Feed/Home/RSS20
- Leads to the first subscribers-only page: http://www.changex.de/Article/report...g_fuer_bildung
- IF not logged in, THEN the page contains <div class="subscribers weiterlesen">
- IF <div class="subscribers weiterlesen"> is present, THEN follow its link to the page + ?login, e.g. http://www.changex.de/Article/report..._bildung?login
- WHEN on ?login, THEN prompt for username & password AND fill <input id="nutzername" type="text" value="" name="username"> AND fill <input id="passwort" type="password" value="" name="password">
- WHEN filled in, THEN submit via <button class="login-send" type="submit">
- You should now get the page with its full content

Can anyone help me translate this into Python?

Cheers!
Old 06-02-2011, 11:40 AM   #4
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by bowbow
Some ideas on the login procedure:

I have no clue about Python programming, but the logic behind the login is the following:

- RSS for all articles: http://www.changex.de/Feed/Home/RSS20
- Leads to the first subscribers-only page: http://www.changex.de/Article/report...g_fuer_bildung
- IF not logged in, THEN the page contains <div class="subscribers weiterlesen">
- IF <div class="subscribers weiterlesen"> is present, THEN follow its link to the page + ?login, e.g. http://www.changex.de/Article/report..._bildung?login
- WHEN on ?login, THEN prompt for username & password AND fill <input id="nutzername" type="text" value="" name="username"> AND fill <input id="passwort" type="password" value="" name="password">
- WHEN filled in, THEN submit via <button class="login-send" type="submit">
- You should now get the page with its full content

Can anyone help me translate this into Python?

Cheers!
Look at any of the subscription recipes or here.

AFAICT, you haven't described how the site determines if you are logged in. Usually it's cookies, so the recipe needs to go to the login page before following any of the article links. The link above shows you how to go to the right page, send the user/password and the recipe then sets the correct login cookies and never sees the redirect to the login page that you seem to be describing. If it's something other than cookies (headers, etc.) then you may need more than the basic tools already built in for handling this.
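
To illustrate the cookie idea outside calibre, here is a minimal standard-library sketch. It is not how a recipe does it (a recipe uses the browser returned by get_browser). The login URL comes from the first post, and the field names are an assumption taken from the input tags quoted above.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = 'http://www.changex.de/Login'  # login page named in the first post


def make_session():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(username, password):
    """Build the POST request that submits the login form."""
    # Assumption: the form fields are called "username" and "password".
    data = urllib.parse.urlencode(
        {'username': username, 'password': password}
    ).encode('utf-8')
    # Supplying data makes urllib send a POST instead of a GET.
    return urllib.request.Request(LOGIN_URL, data=data)


# Usage (needs network access and a valid account, so it is left commented out):
# opener, jar = make_session()
# opener.open(build_login_request('user', 'secret'))  # server sets its cookies
# html = opener.open(article_url).read()              # cookies are sent along
```

The point is the order of operations: log in once with a cookie-aware client, then fetch the articles with that same client.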
Old 06-02-2011, 05:13 PM   #5
bowbow
Member
bowbow began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
Hey,

Thanks for your help. I studied all the linked content and made some progress.

- Login is managed via cookies (JSESSIONID), so basically your linked solution should work.

- I built a recipe that should do it, but there is (at least) one problem left to solve. Here is the setup:

Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class AdvancedUserRecipe1307036487(BasicNewsRecipe):
    title          = u'ChangeX Subscription'
    oldest_article = 7
    max_articles_per_feed = 100
    needs_subscription = True

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        br.open('http://www.changex.de/')
        if self.username is not None and self.password is not None:
            br.open('http://www.changex.de/Login')
            br.select_form(name='login')
            br['nutzername']   = self.username
            br['passwort'] = self.password
            br.submit()
        return br

    feeds          = [(u'Arbeit und Leben', u'http://www.changex.de/Feed/ArbeitUndLeben/RSS20'), (u'Wirtschaft und Management', u'http://www.changex.de/Feed/WirtschaftUndManagement/RSS20'), (u'Wissen und Lernen', u'http://www.changex.de/Feed/WissenUndLernen/RSS20')]
The problematic section is
Code:
            br.select_form(name='login')
Because I get the error message
Code:
calibre, version 0.8.3
  File "/var/folders/fF/fFUreYAQGJWaokB+Pqrd+U+++TI/-Tmp-/calibre_0.8.3_tmp_doxSSg/calibre_0.8.3_xq8CvS_recipes/recipe0.py", line 20, in get_browser
    br.select_form(name='login')
  File "site-packages/mechanize/_mechanize.py", line 524, in select_form
mechanize._mechanize.FormNotFoundError: no form matching name 'login'
The problem might be that the original login form on changex.de/login does not have a name. It is marked up as follows:

Code:
<div class="leftside">
<h1>changeX Login (JavaScript-frei)</h1>
<form method="post" action="/Login">
<p>
<label for="nutzername">Nutzername:*</label>
<input id="nutzername" type="text" value="" name="username">
</p>
<p>
<label for="passwort">Passwort:*</label>
<input id="passwort" type="password" value="" name="password">
</p>
<p>
<button type="submit">
</p>
</form>
</div>
Unfortunately, I could not find out how mechanize can process that very form, so I am not sure whether my setup would work otherwise.

Do you know how to direct mechanize to a form without a name?
Old 06-02-2011, 05:22 PM   #6
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Use

select_form(nr=1)

This selects by form number (i.e. the order in which forms appear on the page). Change the number to match whatever your page has.
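
To find the right number, it can help to list the forms on the page. Below is a minimal standard-library sketch. Note that mechanize counts nr from 0, so nr=1 is the second form. The sample page is an assumption: the login form is copied from the post above, while the search form in front of it is purely hypothetical.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the input names of every <form>, in page order."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.forms.append({'action': attrs.get('action'), 'inputs': []})
            self._in_form = True
        elif tag == 'input' and self._in_form:
            self.forms[-1]['inputs'].append(attrs.get('name'))

    def handle_endtag(self, tag):
        if tag == 'form':
            self._in_form = False


def find_form_nr(html, field):
    """Return the zero-based index of the first form with an input named field."""
    parser = FormLister()
    parser.feed(html)
    for nr, form in enumerate(parser.forms):
        if field in form['inputs']:
            return nr
    return None


# Hypothetical page: a search form first, then the nameless login form.
SAMPLE = '''
<form action="/Search"><input type="text" name="q"></form>
<form method="post" action="/Login">
<input id="nutzername" type="text" value="" name="username">
<input id="passwort" type="password" value="" name="password">
</form>
'''

print(find_form_nr(SAMPLE, 'password'))
```

The index returned for the real login page is the value to pass as nr.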
Old 06-03-2011, 08:47 AM   #7
bowbow
Member
bowbow began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
Hey,

That was the decisive tip, now the recipe works. It could be submitted to the repository. Just one question left; you will find it below the code.

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe

class AdvancedUserRecipe1307036487(BasicNewsRecipe):
    title          = u'ChangeX Subscription'
    oldest_article = 7
    max_articles_per_feed = 100
    needs_subscription = True

    cover_url = 'https://7012901881146393470-a-1802744773732722657-s-sites.googlegroups.com/site/banglabeltze/Home/changex.png?attachauth=ANoY7coFJ1S94rp0tfSsNy40Vkvjz8v2yvVH6ivi5d_wHHwGKbwT9x3wTDGE-SNvpHN9dCG7oC6vEvGFZz7Z75qO5Ho_iXE2_Fr7jqzCBP8kmfRwmGkUlGJMCnQKO52m3u12QHbzEaydSpELKDDc_tKHnOj6OZ-ZRCLuiJYUBM4xYVX43sIh9hvp9mGrlvzPc6mWOYPQAOhmu1p28mLRDOASkEUG9ZZc0w%3D%3D&attredirects=1'

    # Remove page elements that are not part of the article
    remove_tags = [
        dict(name='div', attrs={'class':['right','optionbox','center']}),
        dict(name='div', attrs={'id':['header','footer']}),
        dict(name='a', attrs={'class':['top']}),
    ]

    # Remove all hyperlinks, keeping only their text
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    def get_browser(self):
        # Newer calibre versions may require BasicNewsRecipe.get_browser(self)
        br = BasicNewsRecipe.get_browser()
        br.open('http://www.changex.de/')
        if self.username is not None and self.password is not None:
            br.open('http://www.changex.de/Login')
            # The login form has no name, so select it by position
            br.select_form(nr=1)
            # Fields are addressed by their name attribute, not their id
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    feeds = [
        (u'Arbeit und Leben', u'http://www.changex.de/Feed/ArbeitUndLeben/RSS20'),
        (u'Wirtschaft und Management', u'http://www.changex.de/Feed/WirtschaftUndManagement/RSS20'),
        (u'Wissen und Lernen', u'http://www.changex.de/Feed/WissenUndLernen/RSS20'),
    ]
Just one question left:
- I wanted to build an archive of past articles first, so I raised the limits as follows:
Code:
    oldest_article = 900
    max_articles_per_feed = 1000
When I open the RSS feed (e.g. http://www.changex.de/Feed/ArbeitUndLeben/RSS20) with Google Reader, I get a bunch of articles reaching back to 01.01.2010. However, Calibre only processes entries back to Nov 23 2010, and even skips some entries from that very same day. That seems very odd to me.

I know recipes are only meant for regular downloads.
Do you nonetheless have an idea how to get all articles from that feed?

Thanks so much for your support!
Old 06-03-2011, 11:02 AM   #8
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
RSS feeds only publish a few articles at a time. Google Reader caches those, giving you the illusion that they are all there.
Old 06-03-2011, 11:15 AM   #9
bowbow
Member
bowbow began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
Quote:
Originally Posted by kovidgoyal
RSS feeds only publish a few articles at a time. Google Reader caches those, giving you the illusion that they are all there.
And Calibre cannot cache those feeds, right? Is there any workaround to fetch older feed articles, either within or outside Calibre?
Old 06-03-2011, 11:17 AM   #10
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Find a source that has those articles and write a recipe with a custom parse_index method to get them.
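
As a rough sketch of what such a method has to hand back: parse_index returns a list of (feed title, list of articles) tuples, where each article is a dict with the keys title, url, date, description and content. The helper name build_index and the example link below are hypothetical. How the (title, url) pairs are scraped depends entirely on the archive page you find.

```python
def build_index(section_title, links):
    """Turn (title, url) pairs into the structure parse_index must return."""
    articles = []
    for title, url in links:
        articles.append({
            'title': title,
            'url': url,
            'date': '',         # optional, may stay empty
            'description': '',  # optional, may stay empty
            'content': '',      # empty means: download the page at url
        })
    return [(section_title, articles)]


# Inside a recipe this might be wired up roughly like so (not run here,
# ARCHIVE_URL and the link extraction are assumptions):
#
#     def parse_index(self):
#         soup = self.index_to_soup(ARCHIVE_URL)
#         links = [(a.string, a['href']) for a in soup.findAll('a', href=True)]
#         return build_index('Archive', links)

index = build_index('Archive', [('Example article', 'http://example.com/a')])
print(index[0][0], len(index[0][1]))
```

When parse_index is defined, the recipe builds its article list from this structure rather than from the feeds list.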