07-29-2010, 05:52 PM   #1
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
Need some help creating a login for a recipe

[[SOLVED]]

Hi all,

I have been trying to create a recipe (modified from an outdated one by Mr. Mellink) to download articles from the Dutch newspaper Volkskrant.

I have been able to adapt the title parsing and index creation of the original script to the newspaper's new layout. However, the login part doesn't seem to work. This may have something to do with the fact that the site uses specific functions for its form. Perhaps you can help me.

I have pasted the login code of the recipe and the HTML of the login form below. The form loads (I can print the hidden values), but after the submit command, articles still refer to the login page (i.e. not logged in).

I just can't seem to persuade the login to work correctly. Does anyone have any ideas?

The part of the recipe that loads the form:
Code:
class Volkskrant_full(BasicNewsRecipe):
    title                 = strftime('Volkskrant: %Y%m%d')
    __author__            = u'Jaap Mellink'
    description           = u"Volkskrant"
    oldest_article        = 30
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    simultaneous_downloads = 1
    delay = 1
    needs_subscription = True
    INDEX_MAIN = strftime('http://www.volkskrant.nl/vk-online/VK/%Y%m%d___/1_001/#text')
    INDEX_ARTICLE = strftime('http://www.volkskrant.nl/vk-online/VK/%Y%m%d___/1_001/')
    LOGIN = 'http://www.volkskrant.nl/gatekeeper/login.jsp'
    remove_tags = [dict(name='address')]
    		
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()

        if self.username is not None and self.password is not None:
            br.open(self.LOGIN)
            # Select the login form by its name attribute, then fill it in.
            br.select_form(name="UserLogin")
            br['userName'] = self.username
            br['password'] = self.password
            br.submit()
        return br
Then there is the form from the website (source at http://www.volkskrant.nl/gatekeeper/login.jsp):
Code:
<form action="/action" method="post" name="UserLogin" id="UserLogin">
			<input type="hidden" name="action" value="login"/>
			<input type="hidden" name="goto" value="/gatekeeper/view-profile.jsp"/>
			<input type="hidden" name="source" value="/gatekeeper/login.jsp"/>
		        <input type="hidden" name="entree" value="nwsl"/>
			<input type="hidden" name="success" value="/gatekeeper/view-profile.jsp"/>

			<div class="left">
				<h3 class="">Gebruikersnaam:<span class="mandatory">*</span></h3>
			</div>
			<div class="right">
				<input type="text" name="userName" id="userName" size="30" maxlength="28" value="" class="formfield"/>

			</div>
		<br />

			<div class="left">
				<h3 class="">Wachtwoord:<span class="mandatory">*</span></h3>
			</div>
			<div class="right">
				<input type="password" name="password" id="password" maxlength="28" size="30" class="formfield"/>

			</div>

			<div class="clear"></div>

			<div class="plain">
				<input name="saveuserIdPassword" type="checkbox" value="yes" checked="checked"/>Uw gebruikersnaam en wachtwoord opslaan op deze computer (aanbevolen)
			</div>

			<br/>
			<input type="image" src="/gatekeeper/images/but-login.gif" alt="login"  onclick="validateData()"/>
			<div class="plain">
				<ul class="links">
					<li><b><a href="/gatekeeper/register_only.jsp">Nog geen login? Registreren</a></b></li>
				</ul>
			</div>
			<br/>
			<br/>
		</form>
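To see which name/value pairs a client without JavaScript would post for this form, the fields can be listed with Python's standard library. This is a minimal sketch, not part of the recipe: it is written for Python 3 (the recipe itself targets the Python 2 era Calibre), `FormFields` is a hypothetical helper, the form is a trimmed copy of the markup above, and the credentials are placeholders. The `onclick="validateData()"` handler never runs in such a client, and a browser would also send click coordinates for the image button; neither is modelled here.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Trimmed copy of the login form quoted above (layout markup omitted).
FORM_HTML = """
<form action="/action" method="post" name="UserLogin" id="UserLogin">
  <input type="hidden" name="action" value="login"/>
  <input type="hidden" name="goto" value="/gatekeeper/view-profile.jsp"/>
  <input type="hidden" name="source" value="/gatekeeper/login.jsp"/>
  <input type="hidden" name="entree" value="nwsl"/>
  <input type="hidden" name="success" value="/gatekeeper/view-profile.jsp"/>
  <input type="text" name="userName" id="userName" value=""/>
  <input type="password" name="password" id="password"/>
  <input name="saveuserIdPassword" type="checkbox" value="yes" checked="checked"/>
  <input type="image" src="/gatekeeper/images/but-login.gif" alt="login" onclick="validateData()"/>
</form>
"""


class FormFields(HTMLParser):
    """Collect the named <input> fields of one form, in document order."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.in_form = False
        self.action = None
        self.fields = []  # list of (name, value) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self.in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self.in_form and "name" in attrs:
            # Browsers do not submit unchecked checkboxes.
            if attrs.get("type") == "checkbox" and "checked" not in attrs:
                return
            self.fields.append((attrs["name"], attrs.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


parser = FormFields("UserLogin")
parser.feed(FORM_HTML)

payload = dict(parser.fields)
payload["userName"] = "example-user"  # placeholder, not a real account
payload["password"] = "example-pass"  # placeholder

print("POST to:", parser.action)
print(urlencode(payload))
```

Comparing this list with the request body that Firefox actually sends shows quickly whether a field is missing or has a different value.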
And finally, for reference, the full recipe so far:
Code:
from calibre import strftime
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from BeautifulSoup import BeautifulStoneSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Volkskrant_full(BasicNewsRecipe):
    title                 = strftime('Volkskrant: %Y%m%d')
    __author__            = u'Jaap Mellink'
    description           = u"Volkskrant"
    oldest_article        = 30
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    simultaneous_downloads = 1
    delay = 1
    needs_subscription = True
    INDEX_MAIN = strftime('http://www.volkskrant.nl/vk-online/VK/%Y%m%d___/1_001/#text')
    INDEX_ARTICLE = strftime('http://www.volkskrant.nl/vk-online/VK/%Y%m%d___/1_001/')
    LOGIN = 'http://www.volkskrant.nl/gatekeeper/login.jsp'
    #TEST = 'http://www.volkskrant.nl/vk-online/VK/20100109___/1_001/article9_text.html'
	
    #keep_only_tags = [ dict(name='div', attrs={'class':'page'})] 
    #keep_only_tags = []
    #remove_tags = [{'class':['info']}, dict(name='address')]
    remove_tags = [dict(name='address')]
    #keep_only_tags = [{'class':['article HorizontalHeader',
    #    'articlecontent','photoBox', 'article columnist first']}, ]
		

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()

        if self.username is not None and self.password is not None:
            br.open(self.LOGIN)
            # Select the login form by its name attribute, then fill it in.
            br.select_form(name="UserLogin")
            br['userName'] = self.username
            br['password'] = self.password
            br.submit()
        return br
        
    def parse_index(self):
        krant = []

        def strip_title(_title):
            # Keep the text before the first colon. A title without a
            # colon is returned unchanged instead of raising IndexError.
            return _title.split(':', 1)[0]

        print('Processing ' + self.INDEX_MAIN)
        soup = self.index_to_soup(self.INDEX_MAIN)
        # The page selector lists one <option> per newspaper section.
        mainsoup = soup.find('td', attrs={'id': 'select_page_top'})
        for option in mainsoup.findAll('option'):
            articles = []
            _INDEX = strftime('http://www.volkskrant.nl/vk-online/VK/%Y%m%d___/') + option['value'] + '/#text'
            _INDEX_ARTICLE = strftime('http://www.volkskrant.nl/vk-online/VK/%Y%m%d___/') + option['value'] + '/'
            print('Processing ' + option['value'])
            soup = self.index_to_soup(_INDEX)
            for item in soup.findAll('area'):
                # Derive the class of the article's <div> from the <area> class.
                art_nr = item['class']
                attrname = art_nr[0:11] + '_section' + option['value'][0:1] + '_' + art_nr[12:]
                index_title = soup.find('div', attrs={'class': attrname})
                get_title = index_title['title']
                url = _INDEX_ARTICLE + attrname + '.html#text'
                title = strip_title(get_title)
                if title != '':
                    articles.append({
                        'title': title,
                        'date': strftime(' %B %Y'),
                        'url': url,
                        'description': '',
                    })
            krant.append((option.string, articles))

        return krant
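
The title handling in `parse_index` keeps the part of the `title` attribute that precedes the first colon. A standalone sketch of just that step follows. It assumes a title without a colon should be returned unchanged, which the thread does not state, and the sample titles are invented for illustration.

```python
def strip_title(title):
    """Return the text before the first colon, or the whole title if there is none."""
    head, _sep, _tail = title.partition(":")
    return head


# Invented sample titles, not taken from the newspaper.
for sample in ["Binnenland: Kabinet valt", "Geen dubbele punt", ""]:
    print(repr(strip_title(sample)))
```

Using `str.partition` avoids walking the string by index, so an input without a colon cannot run past the end of the string.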

Last edited by Selcal; 07-30-2010 at 06:48 AM. Reason: Problem solved
07-29-2010, 07:16 PM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Selcal
-- the login part doesn't seem to work.
Either replicate in a browser (Firefox) what's happening in Calibre, or replicate in Calibre what's happening in the browser. To do the former, use the Live HTTP Headers add-on. To do the latter, use this in the recipe:
Code:
        # Log information about HTTP redirects and Refreshes.
        br.set_debug_redirects(True)
        # Log HTTP response bodies (ie. the HTML, most of the time).
        br.set_debug_responses(True)
        # Print HTTP headers.
        br.set_debug_http(True)
It could be a problem with the referer. Try turning the referer off in Firefox and see if that replicates what happens in Calibre. Compare the HTTP headers in the two cases.
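Comparing two header captures by eye is error-prone, so a small script can help. The sketch below assumes both header blocks have been pasted in as plain text. `parse_headers` and `diff_headers` are hypothetical helper names, the sample header values are made up for illustration, and the code is written for Python 3 using only the standard library.

```python
def parse_headers(raw):
    """Turn pasted 'Name: value' lines into a dict keyed by lowercase header name.

    Repeated headers (for example several Set-Cookie lines) keep only the
    last value, which is enough for a first comparison.
    """
    headers = {}
    for line in raw.strip().splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        # Skip request/status lines such as "POST /action HTTP/1.1".
        if not sep or " " in name:
            continue
        headers[name.lower()] = value.strip()
    return headers


def diff_headers(left, right):
    """Return {name: (left_value, right_value)} for headers that differ or are missing."""
    names = sorted(set(left) | set(right))
    return {n: (left.get(n), right.get(n)) for n in names if left.get(n) != right.get(n)}


# Illustrative captures only; real ones come from Live HTTP Headers and
# from the output enabled by br.set_debug_http(True).
FIREFOX = """
POST /action HTTP/1.1
Host: www.volkskrant.nl
Connection: keep-alive
Referer: http://www.volkskrant.nl/gatekeeper/login.jsp
"""

CALIBRE = """
POST /action HTTP/1.1
Host: www.volkskrant.nl
Connection: close
"""

differences = diff_headers(parse_headers(FIREFOX), parse_headers(CALIBRE))
for name, (in_firefox, in_calibre) in differences.items():
    print(name, "| Firefox:", in_firefox, "| Calibre:", in_calibre)
```

Header names are lowercased before comparison because HTTP header names are case-insensitive.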
07-29-2010, 11:32 PM   #3
Selcal
Thanks very much for trying to help me out here. Though I'm starting to think this may be beyond my skills ...

But I had a go. The main differences I see between the headers:

In Calibre, "Connection: close" appears where Firefox shows "Connection: keep-alive", but this may be normal?

In Firefox, after the login there is a description of the cookies that are set. This does not show in Calibre:

Code:
Set-Cookie: userId=kglazenburg1805; Domain=.volkskrant.nl; Expires=Sat, 30-Jul-2011 03:25:18 GMT; Path=/
Set-Cookie: userId=kglazenburg1805; Expires=Sat, 30-Jul-2011 03:25:18 GMT; Path=/
Set-Cookie: password=xxxxxxxdeletedxxxxx; Expires=Sat, 30-Jul-2011 03:25:18 GMT; Path=/
Set-Cookie: subscriptions=Abonnee+registrant; Domain=.volkskrant.nl; Expires=Sat, 30-Jul-2011 03:25:18 GMT; Path=/
Set-Cookie: temporaryAccesses=; Domain=.volkskrant.nl; Expires=Sat, 30-Jul-2011 03:25:18 GMT; Path=/
Set-Cookie: strippen=0; Domain=.volkskrant.nl; Expires=Sat, 30-Jul-2011 03:25:18 GMT; Path=/
Set-Cookie: displayUserName=kglazenburg1805; Domain=.volkskrant.nl; Expires=Sat, 30-Jul-2011 03:25:18 GMT; Path=/
Set-Cookie: iPlanetDirectoryPro=AQIC5wM2LY4Sfcy61w2X+zQy5aRwtskeQOFWCDxdW10ivvY=@AAJTSQACMDE=#; Domain=.volkskrant.nl; Path=/
Set-Cookie: sessionId=JSESSIONID=AQIC5wM2LY4Sfcy61w2X%2BzQy5aRwtskeQOFWCDxdW10ivvY%3D%40AAJTSQACMDE%3D%23; Domain=.volkskrant.nl; Path=/
Should the cookie be visible in Calibre?
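One way to make this comparison concrete is to pull the cookie names out of the `Set-Cookie` lines from each capture and list the ones that never show up on the Calibre side. This is only a sketch: `cookie_names` is a hypothetical helper, the cookie values are replaced by placeholders, `CALIBRE_LINES` is left empty as an assumption about what the debug log shows, and the code is Python 3 with no third-party modules.

```python
def cookie_names(header_lines):
    """Return cookie names from 'Set-Cookie: name=value; ...' lines, without duplicates."""
    names = []
    for line in header_lines:
        header, sep, rest = line.partition(":")
        if not sep or header.strip().lower() != "set-cookie":
            continue
        # The cookie itself is the first ';'-separated part: name=value.
        name = rest.split(";", 1)[0].partition("=")[0].strip()
        if name and name not in names:
            names.append(name)
    return names


# Shortened from the Firefox capture above, with values replaced by placeholders.
FIREFOX_LINES = [
    "Set-Cookie: userId=<value>; Domain=.volkskrant.nl; Path=/",
    "Set-Cookie: userId=<value>; Path=/",
    "Set-Cookie: password=<value>; Path=/",
    "Set-Cookie: subscriptions=Abonnee+registrant; Domain=.volkskrant.nl; Path=/",
    "Set-Cookie: sessionId=JSESSIONID=<value>; Domain=.volkskrant.nl; Path=/",
]

# Assumption: paste any Set-Cookie lines from the Calibre debug output here.
CALIBRE_LINES = []

expected = cookie_names(FIREFOX_LINES)
seen = cookie_names(CALIBRE_LINES)
missing = [name for name in expected if name not in seen]
print("Cookies set in Firefox but not seen in Calibre:", missing)
```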
07-30-2010, 05:56 AM   #4
Selcal
Oh and I tried disabling the referer in Firefox but the site works normally without it set.
07-30-2010, 06:47 AM   #5
Selcal
Maybe not so bad after all.

I cleaned up the code (I really have to get used to Python and its rules) and the cookie worked. From there it was a matter of correcting some download mistakes here and there, and then it worked.

Now I'll test it over the next few days to see if it keeps working for every edition.

Thanks for the help!
07-30-2010, 07:45 AM   #6
Starson17
Quote:
Originally Posted by Selcal
Maybe not so bad after all.

I cleaned up the code (I really have to get used to Python and its rules) and the cookie worked.
That's great!