06-01-2011, 02:55 PM | #1 |
Member
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
|
Request: Recipe for ChangeX.de
Hey there,
I am a subscriber to ChangeX.de and would like to build a recipe for the site. It offers RSS feeds for all its articles, but you have to be logged in to view most of the content in full. Can you help me put that recipe together? Thanks! Here is what I can contribute:
- Overview of RSS feeds: http://www.changex.de/Page/Feed
- Login page: http://www.changex.de/Login
- Article content sits solely within <div class="nl-online">; all other divs can be stripped
- To try things out without a login, use the partner section: http://www.changex.de/Feed/Partnerforum/RSS20
- Example article page: http://www.changex.de/Article/rezens...r_coachingwahn |
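The note that article content lives only in <div class="nl-online"> maps onto calibre's keep_only_tags recipe attribute. A minimal sketch, assuming the class name from the post is still what the live site uses:

```python
# Sketch only: this assignment belongs inside the BasicNewsRecipe subclass.
# The class name 'nl-online' is taken from the post, not verified against the site.
keep_only_tags = [
    dict(name='div', attrs={'class': 'nl-online'}),
]
```

With this in place, explicit remove_tags entries for the surrounding page furniture may become unnecessary, though that depends on what else sits inside the content div.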
06-01-2011, 06:37 PM | #2 |
Member
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
|
I managed to get a decent, very clean recipe for the free section, with the help of the tagesschau.de recipe:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306964283(BasicNewsRecipe):
    title = u'ChangeX'
    oldest_article = 7
    max_articles_per_feed = 100
    cover_url = 'https://7012901881146393470-a-1802744773732722657-s-sites.googlegroups.com/site/banglabeltze/Home/changex.png?attachauth=ANoY7coFJ1S94rp0tfSsNy40Vkvjz8v2yvVH6ivi5d_wHHwGKbwT9x3wTDGE-SNvpHN9dCG7oC6vEvGFZz7Z75qO5Ho_iXE2_Fr7jqzCBP8kmfRwmGkUlGJMCnQKO52m3u12QHbzEaydSpELKDDc_tKHnOj6OZ-ZRCLuiJYUBM4xYVX43sIh9hvp9mGrlvzPc6mWOYPQAOhmu1p28mLRDOASkEUG9ZZc0w%3D%3D&attredirects=1'

    # Strip sidebars, option boxes, header, footer and "back to top" links
    remove_tags = [
        dict(name='div', attrs={'class':['right','optionbox']}),
        dict(name='div', attrs={'id':['header','footer']}),
        dict(name='a', attrs={'class':['top']}),
    ]

    # Remove all hyperlinks, keeping only their text
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    feeds = [(u'XPartner', u'http://www.changex.de/Feed/Partnerforum/RSS20')]
Any help on the subscription part is highly appreciated. I can also provide a (temporary) login! Cheers! |
06-01-2011, 07:04 PM | #3 |
Member
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
|
Some ideas on the login-procedure:
I have no clue about Python programming, but the logic behind the login is the following:
- RSS for all articles: http://www.changex.de/Feed/Home/RSS20
- This leads to the first subscribers-only page: http://www.changex.de/Article/report...g_fuer_bildung
- IF not logged in, THEN the page contains <div class="subscribers weiterlesen">
- IF <div class="subscribers weiterlesen"> is present, THEN its <a> link points to the page + ?login, e.g. http://www.changex.de/Article/report..._bildung?login
- WHEN ?login is requested, THEN prompt for username & password AND fill <input id="nutzername" type="text" value="" name="username"> AND fill <input id="passwort" type="password" value="" name="password">
- WHEN filled in, THEN submit via <button class="login-send" type="submit">
- You should now get the page with its full content
Could anyone help me translate this into Python? Cheers! |
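The first steps of that logic (spot the subscribers-only teaser, then build the ?login URL) can be sketched in plain Python. This is only an illustration under the assumptions in the post: the marker div and the ?login suffix come from the description above, while the function names and the example article path are hypothetical.

```python
import re

# Marker div described in the post; shown to visitors who are not logged in.
PAYWALL_RE = re.compile(
    r'<div[^>]*class="[^"]*\bsubscribers\s+weiterlesen\b[^"]*"'
)

def needs_login(page_html):
    """Return True if the page contains the subscribers-only teaser div."""
    return PAYWALL_RE.search(page_html) is not None

def login_url(article_url):
    """Append the ?login marker that, per the post, requests the login form.

    Assumes the article URL carries no query string of its own.
    """
    return article_url + '?login'

# Hypothetical article path, for illustration only.
print(login_url('http://www.changex.de/Article/example'))
```

In a calibre recipe this per-article check is usually unnecessary: logging in once up front lets the browser carry the session for every later request.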
06-02-2011, 11:40 AM | #4 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
AFAICT, you haven't described how the site determines if you are logged in. Usually it's cookies, so the recipe needs to go to the login page before following any of the article links. The link above shows you how to go to the right page, send the user/password and the recipe then sets the correct login cookies and never sees the redirect to the login page that you seem to be describing. If it's something other than cookies (headers, etc.) then you may need more than the basic tools already built in for handling this. |
|
06-02-2011, 05:13 PM | #5 |
Member
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
|
Hey,
thanks for your help. I studied all the linked content and made some progress.
- Login is managed via cookies (JSESSIONID), so basically the solution you linked should work
- I built a recipe that should do it, but I have one problem left to solve (at least I think so). Here is the setup:
Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class AdvancedUserRecipe1307036487(BasicNewsRecipe):
    title = u'ChangeX Subscription'
    oldest_article = 7
    max_articles_per_feed = 100
    needs_subscription = True

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        br.open('http://www.changex.de/')
        if self.username is not None and self.password is not None:
            br.open('http://www.changex.de/Login')
            br.select_form(name='login')
            br['nutzername'] = self.username
            br['passwort'] = self.password
            br.submit()
        return br

    feeds = [(u'Arbeit und Leben', u'http://www.changex.de/Feed/ArbeitUndLeben/RSS20'),
             (u'Wirtschaft und Management', u'http://www.changex.de/Feed/WirtschaftUndManagement/RSS20'),
             (u'Wissen und Lernen', u'http://www.changex.de/Feed/WissenUndLernen/RSS20')]
Code:
br.select_form(name='login')
Code:
calibre, version 0.8.3
  File "/var/folders/fF/fFUreYAQGJWaokB+Pqrd+U+++TI/-Tmp-/calibre_0.8.3_tmp_doxSSg/calibre_0.8.3_xq8CvS_recipes/recipe0.py", line 20, in get_browser
    br.select_form(name='login')
  File "site-packages/mechanize/_mechanize.py", line 524, in select_form
mechanize._mechanize.FormNotFoundError: no form matching name 'login'
Code:
<div class="leftside">
  <h1>changeX Login (JavaScript-frei)</h1>
  <form method="post" action="/Login">
    <p>
      <label for="nutzername">Nutzername:*</label>
      <input id="nutzername" type="text" value="" name="username">
    </p>
    <p>
      <label for="passwort">Passwort:*</label>
      <input id="passwort" type="password" value="" name="password">
    </p>
    <p>
      <button type="submit">
    </p>
  </form>
</div>
Do you know how to direct mechanize to a form without a name? |
06-02-2011, 05:22 PM | #6 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use
select_form(nr=1)
This selects by form number (i.e. the order in which forms appear on the page, counting from 0). Change the number to match whatever your page has. |
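To work out which number to pass, one option is to list the forms on the login page in document order and look for the one that carries the expected field names. The following is a rough sketch using only the standard library. The helper names and the sample page (a search form followed by the login form) are hypothetical, and mechanize may count forms slightly differently on unusual markup.

```python
from html.parser import HTMLParser

class FormIndexer(HTMLParser):
    """Collect, for each <form> in document order, the names of its inputs."""

    def __init__(self):
        super().__init__()
        self.forms = []          # one set of input names per form
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.forms.append(set())
            self._in_form = True
        elif tag == 'input' and self._in_form and attrs.get('name'):
            self.forms[-1].add(attrs['name'])

    def handle_endtag(self, tag):
        if tag == 'form':
            self._in_form = False

def find_form_nr(page_html, required_names):
    """Return the zero-based index of the first form with all given names."""
    parser = FormIndexer()
    parser.feed(page_html)
    for nr, names in enumerate(parser.forms):
        if set(required_names) <= names:
            return nr
    return None

# Hypothetical page: a search form first, then the login form from the post.
SAMPLE = '''
<form action="/Search"><input type="text" name="q"></form>
<form method="post" action="/Login">
  <input id="nutzername" type="text" value="" name="username">
  <input id="passwort" type="password" value="" name="password">
</form>
'''

print(find_form_nr(SAMPLE, ['username', 'password']))  # prints 1
```

Note that the lookup uses the inputs' name attributes (username, password), not their ids (nutzername, passwort); mechanize addresses form controls by name as well.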
06-03-2011, 08:47 AM | #7 |
Member
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
|
Hey,
that was the decisive tip; the recipe works now. It could be submitted to the repository. Just one question is left; you will find it below the code. Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class AdvancedUserRecipe1307036487(BasicNewsRecipe):
    title = u'ChangeX Subscription'
    oldest_article = 7
    max_articles_per_feed = 100
    needs_subscription = True
    cover_url = 'https://7012901881146393470-a-1802744773732722657-s-sites.googlegroups.com/site/banglabeltze/Home/changex.png?attachauth=ANoY7coFJ1S94rp0tfSsNy40Vkvjz8v2yvVH6ivi5d_wHHwGKbwT9x3wTDGE-SNvpHN9dCG7oC6vEvGFZz7Z75qO5Ho_iXE2_Fr7jqzCBP8kmfRwmGkUlGJMCnQKO52m3u12QHbzEaydSpELKDDc_tKHnOj6OZ-ZRCLuiJYUBM4xYVX43sIh9hvp9mGrlvzPc6mWOYPQAOhmu1p28mLRDOASkEUG9ZZc0w%3D%3D&attredirects=1'

    remove_tags = [
        dict(name='div', attrs={'class':['right','optionbox','center']}),
        dict(name='div', attrs={'id':['header','footer']}),
        dict(name='a', attrs={'class':['top']}),
    ]

    # Remove all hyperlinks, keeping only their text
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    # Log in once so the session cookie is set before articles are fetched
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        br.open('http://www.changex.de/')
        if self.username is not None and self.password is not None:
            br.open('http://www.changex.de/Login')
            br.select_form(nr=1)
            br['username'] = self.username  # br['nutzername'] = self.username
            br['password'] = self.password  # br['passwort'] = self.password
            br.submit()
        return br

    feeds = [(u'Arbeit und Leben', u'http://www.changex.de/Feed/ArbeitUndLeben/RSS20'),
             (u'Wirtschaft und Management', u'http://www.changex.de/Feed/WirtschaftUndManagement/RSS20'),
             (u'Wissen und Lernen', u'http://www.changex.de/Feed/WissenUndLernen/RSS20')]
I wanted to create an archive of past articles first, so I set the limits as follows:
Code:
oldest_article = 900
max_articles_per_feed = 1000
With Google Reader I get a bunch of articles reaching back to 01.01.2010. However, calibre only processes articles back to Nov 23, 2010, leaving out even entries from that very same day. That seems very odd to me. I know recipes are meant only for frequent downloading. Do you nonetheless have an idea how to get all articles from that feed? Thanks so much for your support! |
06-03-2011, 11:02 AM | #8 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
RSS feeds only publish a few articles at a time. Google Reader caches them, giving you the illusion that they are all still in the feed.
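One way to see this for a given feed is to count the items it actually carries and find the oldest publication date. A small sketch with the standard library follows; the function name and the two-item sample feed are made up for illustration, and a real check would download the feed XML first.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def feed_span(rss_xml):
    """Return (item_count, oldest_pubdate) for an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = root.findall('./channel/item')
    dates = []
    for item in items:
        pub = item.findtext('pubDate')
        if pub:
            # RSS 2.0 dates use the RFC 822 format that email.utils parses.
            dates.append(parsedate_to_datetime(pub))
    return len(items), (min(dates) if dates else None)

# Made-up sample feed with two items.
SAMPLE_FEED = '''<?xml version="1.0"?>
<rss version="2.0"><channel><title>Sample</title>
<item><title>A</title><pubDate>Fri, 03 Jun 2011 08:00:00 +0200</pubDate></item>
<item><title>B</title><pubDate>Tue, 23 Nov 2010 09:30:00 +0100</pubDate></item>
</channel></rss>'''

count, oldest = feed_span(SAMPLE_FEED)
print(count, oldest)
```

If the oldest date in the live feed matches where calibre stops, the cut-off comes from the feed itself and raising oldest_article will not help.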
|
06-03-2011, 11:15 AM | #9 |
Member
Posts: 10
Karma: 10
Join Date: Jun 2011
Device: kindle 3
|
|
06-03-2011, 11:17 AM | #10 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Find a source that has those articles and write a recipe with a custom parse_index method to get them.
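A parse_index method returns a list of (feed title, list of article dicts) pairs, where each dict carries at least a title and a url. The sketch below builds that structure from the HTML of an archive page. It assumes, hypothetically, that such a page exists and links to articles under /Article/; the helper names are made up, and a real recipe would fetch the page inside parse_index and return the result.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ArticleLinkParser(HTMLParser):
    """Collect (href, link text) pairs for links pointing at /Article/ pages."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href') or ''
        if tag == 'a' and '/Article/' in href:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

def build_index(archive_html, base_url, feed_title='ChangeX archive'):
    """Return data in the shape parse_index is expected to return."""
    parser = ArticleLinkParser()
    parser.feed(archive_html)
    articles = [
        {'title': text, 'url': urljoin(base_url, href),
         'date': '', 'description': ''}
        for href, text in parser.links
    ]
    return [(feed_title, articles)]

# Made-up archive snippet: one article link and one unrelated link.
html = '<a href="/Article/one">First</a><a href="/Login">Login</a>'
print(build_index(html, 'http://www.changex.de/'))
```

Because parse_index replaces the feeds list entirely, the oldest_article and max_articles_per_feed limits no longer decide what gets downloaded; the method itself does.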
|