#1
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2010
Device: Kindle2(i)
I have been working on a recipe for www.nikkei.com, a Japanese economic news site.
After several attempts I have run into an issue. Any suggestions? The recipe builds the index correctly, but for each article the site returns an HTML page containing an auto-submitting POST form, which it uses to process the login state. The essential part of the recipe is as follows:
Code:
    import string, re, sys
    from calibre import strftime
    from calibre.web.feeds.recipes import BasicNewsRecipe

    class NikkeiNet_subscription(BasicNewsRecipe):
        # Title reads "Nikkei Shimbun Electronic Edition"
        title = u'\u65e5\u7d4c\u65b0\u805e\u96fb\u5b50\u7248'
        __author__ = 'Hiroshi Miura'
        description = 'News and current market affairs from Japan'
        needs_subscription = True
        oldest_article = 2
        max_articles_per_feed = 20
        language = 'ja'
        recursions = 3
        remove_javascript = False

        feeds = [
            (u'\u65e5\u7d4c\u4f01\u696d', u'http://www.zou3.net/php/rss/nikkei2rss.php?head=sangyo')
        ]

        def get_browser(self):
            br = BasicNewsRecipe.get_browser()
            if self.username is not None and self.password is not None:
                br.open('https://id.nikkei.com/lounge/nl/base/LA0010.seam')
                # Turn the markup between the j_id48 input and gm_home_on.gif
                # into an HTML comment before selecting the login form
                response = br.response()
                response.set_data(response.get_data().replace("<input id=\"j_id48\"", "<!-- "))
                response.set_data(response.get_data().replace("gm_home_on.gif\" />", " -->"))
                br.set_response(response)
                br.select_form(name='LA0010Form01')
                br['LA0010Form01:LA0010Email'] = self.username
                br['LA0010Form01:LA0010Password'] = self.password
                res = br.submit()
                raw = res.read()
                # The marker text means "To the list of Nikkei ID services"
                if '日経IDのサービス一覧へ' not in raw:
                    raise ValueError('Failed to log in to nikkei.net, check your username(email address) and password')
                br.open('http://www.nikkei.com/')
                br.select_form(nr=0)
                res = br.submit()
                print res.read()
            return br

Code:
    <?xml version='1.0' encoding='utf-8'?>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="ja">
    <head>
    <meta http-equiv="Content-Style-Type" content="text/css"/>
    <meta http-equiv="Content-Script-Type" content="text/javascript"/>
    <meta http-equiv="Pragma" content="no-cache"/>
    <meta http-equiv="Cache-Control" content="no-cache"/>
    <meta http-equiv="Expires" content="0"/>
    <title/>
    <meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/>
    <link href="../../stylesheet.css" type="text/css" rel="stylesheet"/>
    <style type="text/css">@page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }</style>
    </head>
    <body onload="document.autoPostForm.submit()" class="calibre">
    <div class="calibrenavbar">| <a href="../article_1/index.html" class="calibre5">Next</a> | <a href="../index.html#article_0" class="calibre5">Section Menu</a> | <a href="../../index.html#feed_0" class="calibre5">Main Menu</a> | <hr class="calibre6"/></div>
    <form action="https://id.nikkei.com/lounge/ep/authonly" method="post" name="autoPostForm" class="calibre7">
    <div class="calibre7">
    <input type="hidden" name="rpid" value="DS"/>
    <input type="hidden" name="pxep" value="https://regist.nikkei.com/ds/etc/accounts/auth?url=http%3A%2F%2Fwww.nikkei.com%2Fnews%2Fcategory%2Farticle%2Fg%3D96958A9C93819594E2EAE2E79C8DE2E4E3E3E0E2E3E29F9FE2E2E2E2%3Bat%3DDGXZZO0195165008122009000000"/>
    <input type="hidden" name="rtur" value=""/>
    <input type="hidden" name="clg" value="715319105982499111506319898"/>
    <input type="hidden" name="dps" value="3"/>
    <input type="hidden" name="xp0" value=""/>
    </div>
    <input type="submit" class="calibre8"/>
    </form>
    <div class="calibrenavbar">
    <hr class="calibre6"/>
    <p class="calibre9">This article was downloaded by <strong class="calibre10">calibre</strong> from <a href="http://www.nikkei.com/news/category/article/g=96958A9C93819594E2EAE2E79C8DE2E4E3E3E0E2E3E29F9FE2E2E2E2;at=DGXZZO0195165008122009000000" class="calibre5">http://www.nikkei.com/news/category/article/g=96958A9C93819594E2EAE2E79C8DE2E4E3E3E0E2E3E29F9FE2E2E2E2;at=DGXZZO0195165008122009000000</a></p>
    <br class="calibre7"/><br class="calibre7"/>
    | <a href="../index.html#article_0" class="calibre5">Section Menu</a> | <a href="../../index.html#feed_0" class="calibre5">Main Menu</a> |
    </div></body>
    </html>

There seems to be no good method or function in the calibre API for handling this situation.

Hiroshi
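One possible way to deal with such a page is to pull the form's action URL and its hidden fields out of the returned HTML, so that the POST can be replayed by the recipe instead of by the browser's `onload` handler. The following is only a minimal sketch using the Python 3 standard library, not a calibre API: the names `AutoPostFormParser` and `extract_auto_post_form` are hypothetical, and the sample markup is a shortened copy of the page shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class AutoPostFormParser(HTMLParser):
    """Collect the action URL and named <input> fields of one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.in_form = False
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and attrs.get('name') == self.form_name:
            self.in_form = True
            self.action = attrs.get('action')
        elif tag == 'input' and self.in_form and attrs.get('name'):
            # Inputs without a name (e.g. the submit button) are skipped
            self.fields[attrs['name']] = attrs.get('value') or ''

    def handle_endtag(self, tag):
        if tag == 'form':
            self.in_form = False


def extract_auto_post_form(html, form_name='autoPostForm'):
    """Return (action, fields) for the named form, or None if it is absent."""
    parser = AutoPostFormParser(form_name)
    parser.feed(html)
    if parser.action is None:
        return None
    return parser.action, parser.fields


# Shortened sample based on the page quoted in this post
sample = '''<body onload="document.autoPostForm.submit()">
<form action="https://id.nikkei.com/lounge/ep/authonly" method="post" name="autoPostForm">
<input type="hidden" name="rpid" value="DS"/>
<input type="hidden" name="rtur" value=""/>
<input type="hidden" name="dps" value="3"/>
<input type="submit"/>
</form></body>'''

action, fields = extract_auto_post_form(sample)
body = urlencode(fields)  # URL-encoded POST body built from the hidden fields
```

In a recipe, the resulting `action` and `body` could then be sent with the logged-in browser returned by `get_browser()`. Whether replaying this one form is enough for nikkei.com has not been verified here.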
#2
Junior Member
In the previous recipe, a part
Code:
    br.open('http://www.nikkei.com/')
    br.select_form(nr=0)
    res = br.submit()
    print res.read()
#3
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Sorry, but I'm not quite able to understand your problem. Is the site asking for authentication on each page of each article?
#4
Junior Member
Each page of each article asks for authentication through a POST request, but the request is generated automatically by the site as an auto-submitting hidden form.
If I have a session with login information, the site sends me the following: Quote:
Without a login it works fine, because the site sends me only a teaser.
#5
Junior Member
I want to share the whole script and the essential parts of the responses.
I have attached the whole recipe and the response log. It seems the site returns the auth cookie inside a JavaScript snippet, and the cookie is set through JS. Quote:
Last edited by miurahr; 11-20-2010 at 08:17 PM. Reason: previous post may mislead understandings
#6
Junior Member
It works!
At last, it works, using irregular cookie handling and several form submissions in get_browser(). Please see the attachment.
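The attached recipe is not reproduced in this thread, so the following is only a rough illustration of what handling a cookie that arrives inside JavaScript might look like, not the code that was actually used. It assumes the response sets the cookie with a plain `document.cookie = "NAME=value; ..."` string assignment; the cookie name `EXAMPLE_AUTH`, the helper names, and the sample script are all hypothetical. It uses the Python 3 standard library.

```python
import re
from http.cookiejar import Cookie, CookieJar

# ASSUMPTION: cookies are set with simple quoted string assignments such as
#   document.cookie = "NAME=value; path=/";
COOKIE_ASSIGNMENT = re.compile(r'''document\.cookie\s*=\s*(["'])(.*?)\1''')


def cookies_from_script(script_text):
    """Return {name: value} for every document.cookie assignment found."""
    found = {}
    for _quote, cookie_string in COOKIE_ASSIGNMENT.findall(script_text):
        # Only the first "name=value" pair matters; the rest are attributes
        name_value = cookie_string.split(';', 1)[0]
        name, sep, value = name_value.partition('=')
        if sep:
            found[name.strip()] = value.strip()
    return found


def add_to_jar(jar, name, value, domain):
    """Store one session cookie in a standard-library CookieJar."""
    jar.set_cookie(Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    ))


# Hypothetical script text; the real response is in the attached log only
sample_script = 'document.cookie = "EXAMPLE_AUTH=abc123; path=/";'

jar = CookieJar()
for name, value in cookies_from_script(sample_script).items():
    add_to_jar(jar, name, value, '.nikkei.com')
```

Since nothing executes the page's JavaScript during a download, a cookie delivered this way has to be extracted and stored by the recipe itself, which may be what "irregular cookie handling" refers to.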