Old 11-07-2010, 05:41 PM   #1
miurahr
Junior Member
miurahr began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Jul 2010
Device: Kindle2(i)
Nikkei: problematic site that needs a form POST before processing

I have been working on a recipe for www.nikkei.com, a Japanese economic news site.
After several attempts I have run into an issue. Any suggestions?

The recipe builds an index that works well, but for each article the site returns an HTML page containing an auto-submitting POST form, which it uses to check the login state.

The essential part of the recipe is as follows:

Code:
import string, re, sys
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe

class NikkeiNet_subscription(BasicNewsRecipe):
    title          = u'\u65e5\u7d4c\u65b0\u805e\u96fb\u5b50\u7248'
    __author__     = 'Hiroshi Miura'
    description    = 'News and current market affairs from Japan'
    needs_subscription = True
    oldest_article = 2
    max_articles_per_feed = 20
    language       = 'ja'
    recursions  = 3
    remove_javascript = False

    feeds          = [
        (u'\u65e5\u7d4c\u4f01\u696d', u'http://www.zou3.net/php/rss/nikkei2rss.php?head=sangyo')
    ]

    def get_browser(self):
        # get_browser is an instance method, so self must be passed explicitly
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('https://id.nikkei.com/lounge/nl/base/LA0010.seam')
            response = br.response()
            response.set_data(response.get_data().replace("<input id=\"j_id48\"", "<!-- "))
            response.set_data(response.get_data().replace("gm_home_on.gif\" />", " -->"))
            br.set_response(response)
            br.select_form(name='LA0010Form01')
            br['LA0010Form01:LA0010Email']   = self.username
            br['LA0010Form01:LA0010Password'] = self.password
            res = br.submit()
            raw = res.read()
            if '日経IDのサービス一覧へ' not in raw:
                raise ValueError('Failed to log in to nikkei.net, check your username(email address) and password')
            br.open('http://www.nikkei.com/')
            br.select_form(nr=0)
            res = br.submit()
            print res.read()
        return br
The output looks like this (taken from the debug output):

Code:
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="ja">
<head>
<meta http-equiv="Content-Style-Type" content="text/css"/>
<meta http-equiv="Content-Script-Type" content="text/javascript"/>
<meta http-equiv="Pragma" content="no-cache"/>
<meta http-equiv="Cache-Control" content="no-cache"/>
<meta http-equiv="Expires" content="0"/>
<title/>
<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/><link href="../../stylesheet.css" type="text/css" rel="stylesheet"/><style type="text/css">@page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }</style></head>
<body onload="document.autoPostForm.submit()" class="calibre">

<div class="calibrenavbar">| <a href="../article_1/index.html" class="calibre5">Next</a>
 | <a href="../index.html#article_0" class="calibre5">Section Menu</a> | <a href="../../index.html#feed_0" class="calibre5">Main Menu</a> | <hr class="calibre6"/></div>

<form action="https://id.nikkei.com/lounge/ep/authonly" method="post" name="autoPostForm" class="calibre7">
<div class="calibre7">
<input type="hidden" name="rpid" value="DS"/>
<input type="hidden" name="pxep" value="https://regist.nikkei.com/ds/etc/accounts/auth?url=http%3A%2F%2Fwww.nikkei.com%2Fnews%2Fcategory%2Farticle%2Fg%3D96958A9C93819594E2EAE2E79C8DE2E4E3E3E0E2E3E29F9FE2E2E2E2%3Bat%3DDGXZZO0195165008122009000000"/>
<input type="hidden" name="rtur" value=""/>
<input type="hidden" name="clg" value="715319105982499111506319898"/>
<input type="hidden" name="dps" value="3"/>
<input type="hidden" name="xp0" value=""/>
</div>
<input type="submit" class="calibre8"/>
</form>
<div class="calibrenavbar">
<hr class="calibre6"/>
<p class="calibre9">This article was downloaded by <strong class="calibre10">calibre</strong> from <a href="http://www.nikkei.com/news/category/article/g=96958A9C93819594E2EAE2E79C8DE2E4E3E3E0E2E3E29F9FE2E2E2E2;at=DGXZZO0195165008122009000000" class="calibre5">http://www.nikkei.com/news/category/article/g=96958A9C93819594E2EAE2E79C8DE2E4E3E3E0E2E3E29F9FE2E2E2E2;at=DGXZZO0195165008122009000000</a></p>
<br class="calibre7"/><br class="calibre7"/> | <a href="../index.html#article_0"
 class="calibre5">Section Menu</a> | <a href="../../index.html#feed_0" class="calibre5">Main Menu</a> | </div></body>
</html>
The non-subscriber version of this recipe works fine.

There seems to be no good method or function in the calibre API for handling this situation.

Hiroshi
Old 11-07-2010, 05:43 PM   #2
miurahr
In the previous recipe, the part

Code:
            br.open('http://www.nikkei.com/')
            br.select_form(nr=0)
            res = br.submit()
            print res.read()
in get_browser() is intended to solve this problem for the index page, but the same thing happens on each article, too.
Old 11-07-2010, 09:54 PM   #3
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Sorry, but I'm not quite able to understand your problem. Is the site asking for authentication on each page of each article?
Old 11-20-2010, 07:52 AM   #4
miurahr
Each page of each article asks for authentication via the POST method, but the request is generated automatically by the site as an auto-submitting hidden form.

If I have a session with login information, the site sends me:

Quote:
<body onload="document.autoPostForm.submit()" >
<form action="https://id.nikkei.com/lounge/ep/authonly" method="post" name="autoPostForm">
<input type="hidden" name="rpid" value="DS"/>
<input type="hidden" name="pxep" value="https://regist.nikkei.com/ds/etc/accounts/auth?url=http%3A%2F%2Fwww.nikkei.com%2Fnews%2Fcategory%2Farticle%2Fg%3D96958A9C93819594E2EAE2E79C8DE2E4E3E3E0E2E3E29F9FE2E2E2E2%3Bat%3DDGXZZO0195165008122009000000"/>
<input type="hidden" name="rtur" value=""/>
<input type="hidden" name="clg" value="715319105982499111506319898"/>
<input type="hidden" name="dps" value="3"/>
<input type="hidden" name="xp0" value=""/>
<input type="submit"/>
</form>
In a regular browser, submitting the hidden form reloads each page as the full-length article.

It works without login, because the site then sends only a teaser.
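
As an illustration of what a browser does with the hidden form quoted above, here is a minimal sketch that pulls the action URL and the input fields out of such a page and encodes them as a POST body. It uses only the Python 3 standard library, not calibre or mechanize. The names `AutoPostFormParser` and `parse_auto_post_form` are made up for this example, and the sample page is a shortened copy of the quoted form.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class AutoPostFormParser(HTMLParser):
    """Collect the action URL and named input fields of one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.action = None
        self.fields = []      # (name, value) pairs in document order
        self._inside = False  # True while we are inside the wanted form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and attrs.get('name') == self.form_name:
            self._inside = True
            self.action = attrs.get('action')
        elif tag == 'input' and self._inside and attrs.get('name'):
            # Inputs without a name (the bare submit button) are skipped.
            self.fields.append((attrs['name'], attrs.get('value') or ''))

    def handle_endtag(self, tag):
        if tag == 'form':
            self._inside = False


def parse_auto_post_form(html, form_name='autoPostForm'):
    """Return (action, fields) for the named form, or None if it is absent."""
    parser = AutoPostFormParser(form_name)
    parser.feed(html)
    parser.close()
    if parser.action is None:
        return None
    return parser.action, parser.fields


# Shortened copy of the page quoted in this post (assumption: same layout).
SAMPLE = '''<body onload="document.autoPostForm.submit()">
<form action="https://id.nikkei.com/lounge/ep/authonly" method="post" name="autoPostForm">
<input type="hidden" name="rpid" value="DS"/>
<input type="hidden" name="rtur" value=""/>
<input type="hidden" name="dps" value="3"/>
<input type="submit"/>
</form>
</body>'''

action, fields = parse_auto_post_form(SAMPLE)
body = urlencode(fields)  # the POST body a browser would send to `action`
```

Sending `body` to `action` with the logged-in session's cookies is the step the browser performs on page load. How that request is issued from inside a recipe is not shown here.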
Old 11-20-2010, 08:09 AM   #5
miurahr
Debug info

I want to share the whole script and the essential parts of the responses.

I have attached the whole recipe and a response log.

It seems the site delivers the auth cookie inside a JavaScript snippet, and the cookie is set by that script.

Quote:
Code:
    function redirect() {
      if (isCookieEnabled() == false) {
        var formdiv = document.getElementById('form-div');
        formdiv.innerHTML = 'please enable Browsers cookie';
        return;
      }

      var checkKey = 'redirectFlag=';
      var checkValue = '1290300579870';
      var cookieCount = document.cookie.length;
      var cookieArray = new Array();
      cookieArray = document.cookie.split('; ');

      var found = false;
      var getValue = '';
      var i = 0;
      while(cookieArray[i]) {
        if (cookieArray[i].substr(0, checkKey.length) == checkKey) {
          getValue = cookieArray[i].substr(checkKey.length, cookieArray[i].length);
          if (getValue == checkValue) {
            found = true;
          }
          break;
        }
        i++;
      }

      if (found == false) {
        document.cookie = checkKey + checkValue + ';';
      }
        var redirectionForm = document.forms['proxy-redirection'];
        if (redirectionForm.action != '') {
            redirectionForm.submit();
        }
    }
    </script>

    <style type="text/css">
    body    {
        color:          #FFFFFF;
        }
    </style>

</head>
<body onload="redirect();">

    <div id="form-div">
        <form method="post" action="https://regist.nikkei.com/ds/etc/accounts/auth?url=http%3A%2F%2Fwww.nikkei.com%2F" id="proxy-redirection"><table>
<tbody>
<tr>
<td>
                    <input type="hidden" name="aa" value="2d33ccde71a2e4e82e85bceb02260addb5705595aecf1e6b" /></td>
</tr>
</tbody>
</table>
<input id="continueButton" type="submit" name="continueButton" value="Continue" style="display: none;;" />
        </form>
    </div>
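
Since the recipe's browser does not run JavaScript, the cookie that the script above sets has to be reproduced by hand. The following is a rough sketch, assuming the page always uses the `checkKey` / `checkValue` variable names shown in the quote. It only extracts the cookie name and value with regular expressions; handing the pair to the browser's cookie jar is left out. `extract_js_cookie` is a hypothetical helper name.

```python
import re

# Patterns for the two variable assignments seen in the quoted script.
CHECK_KEY_RE = re.compile(r"var\s+checkKey\s*=\s*'([^']*)'")
CHECK_VALUE_RE = re.compile(r"var\s+checkValue\s*=\s*'([^']*)'")


def extract_js_cookie(page_source):
    """Return (name, value) of the cookie the script would set, or None."""
    key = CHECK_KEY_RE.search(page_source)
    value = CHECK_VALUE_RE.search(page_source)
    if not key or not value:
        return None
    # The script stores the name with a trailing '=' ('redirectFlag=').
    return key.group(1).rstrip('='), value.group(1)


# Two lines copied from the quoted script.
SAMPLE_JS = """
      var checkKey = 'redirectFlag=';
      var checkValue = '1290300579870';
"""

cookie = extract_js_cookie(SAMPLE_JS)
cookie_header = '%s=%s' % cookie  # what document.cookie would be given
```

The value changes per response (it looks like a timestamp), so it would have to be extracted from each page rather than hard-coded.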
Attached Files
File Type: txt response.txt (11.5 KB, 379 views)
File Type: txt nikkei_sub.recipe.txt (2.1 KB, 261 views)

Last edited by miurahr; 11-20-2010 at 08:17 PM. Reason: previous post may mislead understandings
Old 11-21-2010, 12:58 AM   #6
miurahr
It works!

At last, it works, using irregular cookie handling and several form submissions in get_browser().

Please see the attachment.
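
The attached recipe is not reproduced in this thread, so the following is not its code. It is only a sketch of the general pattern of "several form submissions": keep re-submitting whatever interstitial form comes back until a page without one arrives. The function `follow_auto_posts` and its two callables are hypothetical; in a real recipe `fetch` would wrap the logged-in browser and `find_form` would parse the page.

```python
def follow_auto_posts(fetch, find_form, url, max_hops=5):
    """Fetch url, re-submitting interstitial forms until real content arrives.

    fetch(url, fields) -> html      (fields is None for a plain GET)
    find_form(html)    -> (action, fields), or None if no auto-post form
    """
    html = fetch(url, None)
    hops = 0
    while True:
        form = find_form(html)
        if form is None:
            return html  # no interstitial form left: this is the content
        if hops == max_hops:
            raise RuntimeError('still on an auto-post page after %d hops' % max_hops)
        action, fields = form
        html = fetch(action, fields)
        hops += 1


# --- Stand-ins for the network so the sketch can be run offline ---
PAGES = {
    ('http://example.invalid/article', False): 'FORM:http://example.invalid/auth',
    ('http://example.invalid/auth', True): 'FORM:http://example.invalid/article-full',
    ('http://example.invalid/article-full', True): '<p>full article</p>',
}
calls = []


def fake_fetch(url, fields):
    calls.append((url, fields))
    return PAGES[(url, fields is not None)]


def fake_find_form(html):
    # A page starting with 'FORM:' stands for an auto-submitting form page.
    if html.startswith('FORM:'):
        return html[len('FORM:'):], [('token', 'x')]
    return None


result = follow_auto_posts(fake_fetch, fake_find_form,
                           'http://example.invalid/article')
```

The `max_hops` limit is there so that a failed login, which would keep returning the same form, ends in an error instead of an endless loop.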
Attached Files
File Type: txt nikkei_sub.recipe.txt (7.1 KB, 428 views)
Old 11-21-2010, 01:27 PM   #7
Starson17
Quote:
Originally Posted by miurahr View Post
At last, it works
Thank you for posting.