07-15-2016, 06:50 AM | #1 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Irish Times - Problems Entering Subscription
Hello all,
I'm looking for help entering email & password details into the following page: http://www.irishtimes.com/signin
I've been trying to use code from other recipes with subscription models, but without much success. So far I've come up with the following modified recipe: Code:
__license__ = 'GPL v3'
__copyright__ = "2008, Derry FitzGerald. 2009 Modified by Ray Kinsella and David O'Callaghan, 2011 Modified by Phil Burns, 2013 Tom Scholl"
'''
irishtimes.com
'''
import urlparse, re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile


class IrishTimes(BasicNewsRecipe):
    title = u'The Irish Times'
    __author__ = "Derry FitzGerald, Ray Kinsella, David O'Callaghan and Phil Burns, Tom Scholl"
    description = 'Daily news from The Irish Times'
    needs_subscription = True

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://www.irishtimes.com/signin')
            br.form = br.forms().next()
            br['email'] = self.username
            br['password'] = self.password
            raw = br.submit().read()
            if 'Please try again' in raw:
                raise Exception('Your username and password are incorrect')
        return br

    language = 'en_IE'
    masthead_url = 'http://www.irishtimes.com/assets/images/generic/website/logo_theirishtimes.png'
    encoding = 'utf-8'
    oldest_article = 1.0
    max_articles_per_feed = 100
    remove_empty_feeds = True
    no_stylesheets = True
    temp_files = []
    articles_are_obfuscated = True

    feeds = [
        ('News', 'http://www.irishtimes.com/cmlink/the-irish-times-news-1.1319192'),
        ('World', 'http://www.irishtimes.com/cmlink/irishtimesworldfeed-1.1321046'),
        ('Politics', 'http://www.irishtimes.com/cmlink/irish-times-politics-rss-1.1315953'),
        ('Business', 'http://www.irishtimes.com/cmlink/the-irish-times-business-1.1319195'),
        ('Culture', 'http://www.irishtimes.com/cmlink/the-irish-times-culture-1.1319213'),
        # Not interested in sport so commented out..
        # ('Sport', 'http://www.irishtimes.com/cmlink/the-irish-times-sport-1.1319194'),
        ('Debate', 'http://www.irishtimes.com/cmlink/debate-1.1319211'),
        ('Life & Style', 'http://www.irishtimes.com/cmlink/the-irish-times-life-style-1.1319214'),
    ]

    def get_obfuscated_article(self, url):
        # Insert a pic from the original url, but use content from the print url
        pic = None
        pics = self.index_to_soup(url)
        div = pics.find('div', {'class' : re.compile('image-carousel')})
        if div:
            pic = div.img
            if pic:
                try:
                    pic['src'] = urlparse.urljoin(url, pic['src'])
                    pic.extract()
                except:
                    pic = None

        content = self.index_to_soup(url + '?mode=print&ot=example.AjaxPageLayout.ot')
        if pic:
            content.p.insert(0, pic)

        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(content.prettify())
        self.temp_files[-1].close()
        return self.temp_files[-1].name

Can anyone point me in the right direction?
Thanks, Leo |
07-16-2016, 06:17 AM | #2 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Some progress: I'm now getting a response from the website. However, it says that the username or password is invalid (even when the correct ones are used), probably because the fields aren't being filled in correctly.
Perhaps I'm not selecting the correct form (I think it's 'itPaywall'). Code:
__license__ = 'GPL v3'
__copyright__ = "2008, Derry FitzGerald. 2009 Modified by Ray Kinsella and David O'Callaghan, 2011 Modified by Phil Burns, 2013 Tom Scholl"
'''
irishtimes.com
'''
import urlparse, re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile


class IrishTimes(BasicNewsRecipe):
    title = u'The Irish Times'
    __author__ = "Derry FitzGerald, Ray Kinsella, David O'Callaghan and Phil Burns, Tom Scholl"
    description = 'Daily news from The Irish Times'
    needs_subscription = True
    language = 'en_IE'
    masthead_url = 'http://www.irishtimes.com/assets/images/generic/website/logo_theirishtimes.png'
    encoding = 'utf-8'
    oldest_article = 1.0
    max_articles_per_feed = 100
    remove_empty_feeds = True
    no_stylesheets = True
    temp_files = []
    articles_are_obfuscated = True

    feeds = [
        ('News', 'http://www.irishtimes.com/cmlink/the-irish-times-news-1.1319192'),
        ('World', 'http://www.irishtimes.com/cmlink/irishtimesworldfeed-1.1321046'),
        ('Politics', 'http://www.irishtimes.com/cmlink/irish-times-politics-rss-1.1315953'),
        ('Business', 'http://www.irishtimes.com/cmlink/the-irish-times-business-1.1319195'),
        ('Culture', 'http://www.irishtimes.com/cmlink/the-irish-times-culture-1.1319213'),
        # Not interested in sport so commented out..
        # ('Sport', 'http://www.irishtimes.com/cmlink/the-irish-times-sport-1.1319194'),
        ('Debate', 'http://www.irishtimes.com/cmlink/debate-1.1319211'),
        ('Life & Style', 'http://www.irishtimes.com/cmlink/the-irish-times-life-style-1.1319214'),
    ]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://www.irishtimes.com/signin')
            # is the correct form being selected below????
            br.form = br.forms().next()
            br['email'] = self.username
            br['password'] = self.password
            raw = br.submit().read()
            #print raw
            if 'Invalid email or password' in raw:
                raise Exception('Your username and password are incorrect')
        return br

    def get_obfuscated_article(self, url):
        # Insert a pic from the original url, but use content from the print url
        pic = None
        pics = self.index_to_soup(url)
        div = pics.find('div', {'class' : re.compile('image-carousel')})
        if div:
            pic = div.img
            if pic:
                try:
                    pic['src'] = urlparse.urljoin(url, pic['src'])
                    pic.extract()
                except:
                    pic = None

        content = self.index_to_soup(url + '?mode=print&ot=example.AjaxPageLayout.ot')
        if pic:
            content.p.insert(0, pic)

        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(content.prettify())
        self.temp_files[-1].close()
        return self.temp_files[-1].name
|
07-16-2016, 07:45 AM | #3 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
|
Yes, I think that should select the right form (the first one). Although you could also try this command if you are in doubt:
Code:
br.select_form(nr=0)
Code:
if 'Invalid email or password' in raw:
    raise Exception('Your username and password are incorrect')
|
07-16-2016, 03:35 PM | #4 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Hello,
Many thanks for your reply! You're correct, that 'Your username and password are incorrect' is present in the page before the submit button is pushed so I edited that section out. Is there a simple way to verify it properly? As suggested I added the snippet of code for the form & verified that the correct form was selected (by printing it to screen). It outputted: Code:
<POST https://www.irishtimes.com/signin# application/x-www-form-urlencoded
  <TextControl(email=)>
  <PasswordControl(password=)>
  <SubmitButtonControl(<None>=) (readonly)>>

Do I need to handle importing text from the command line argument? I haven’t added anything in that regard. Anything else you can think of?
Thanks again for looking, Leo |
07-18-2016, 03:09 PM | #5 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Importing the password or username from the command line is not an issue.
|
07-18-2016, 03:35 PM | #6 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
After running with the -vv option, it looks like it may be a recipe issue rather than a login problem.
I'm seeing a lot of occurrences of: Code:
13% Article download failed: UK’s Trident nuclear programme splits Labour three ways
Failed to download article: Nice attack: ‘No words describe hell of bringing one’s child to the cemetery’ from http://www.irishtimes.com/news/world...tery-1.2725455
Traceback (most recent call last):
  File "site-packages/calibre/utils/threadpool.py", line 95, in run
  File "site-packages/calibre/web/feeds/news.py", line 1125, in fetch_obfuscated_article
  File "<string>", line 89, in get_obfuscated_article
ValueError: I/O operation on closed file
Code:
Could not fetch image file:///polopoly_fs/1.2723622.1468599710!/image/image.jpg_gen/derivatives/landscape_140/image.jpg
Traceback (most recent call last):
  File "site-packages/calibre/web/fetch/simple.py", line 377, in process_images
  File "site-packages/calibre/web/fetch/simple.py", line 229, in fetch_url
IOError: [Errno 2] No such file or directory: u'/polopoly_fs/1.2723622.1468599710!/image/image.jpg_gen/derivatives/landscape_140/image.jpg'
Fetching file:///assets/images/icons/apps/app-store.png
Code:
20% Article download failed: Half of Irish consumers using contactless payments
Failed to download article: EU re-introduces milk supply controls barely a year after quotas from http://www.irishtimes.com/business/a...otas-1.2726088
Traceback (most recent call last):
  File "site-packages/calibre/utils/threadpool.py", line 95, in run
  File "site-packages/calibre/web/feeds/news.py", line 1125, in fetch_obfuscated_article
  File "<string>", line 89, in get_obfuscated_article
ValueError: I/O operation on closed file
|
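The `ValueError: I/O operation on closed file` in the traces above may come from `get_obfuscated_article` reaching its temp file through `self.temp_files[-1]`. The traceback passes through `threadpool.py`, so articles appear to be fetched by several threads at once, and another thread could append its own file between one thread's `append` and its `write` or `close`. This has not been confirmed in the thread; the following is only a minimal sketch of the idea, assuming that is the cause. It is written for Python 3 and uses the standard library's `tempfile.NamedTemporaryFile` as a stand-in for calibre's `PersistentTemporaryFile`; `save_article` and `worker` are hypothetical names.

```python
import tempfile
import threading

# Shared list, kept only so the files can be found and cleaned up later.
temp_files = []


def save_article(html):
    """Write html to its own temp file and return that file's path."""
    # Keep the file in a local variable. Indexing temp_files[-1] instead
    # could pick up a file appended by another thread in the meantime.
    tf = tempfile.NamedTemporaryFile(
        mode='w', suffix='_fa.html', delete=False, encoding='utf-8')
    temp_files.append(tf)
    tf.write(html)
    tf.close()
    return tf.name


def worker(results, index):
    # Each thread stores the path of the file it wrote.
    results[index] = save_article('<p>article %d</p>' % index)


results = {}
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the recipe, the equivalent change would be to assign the `PersistentTemporaryFile` to a local name, write and close it through that name, and return its `.name`.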
12-03-2016, 04:14 PM | #7 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Just getting back to this after a break. It looks like some issue around the submit button.
I've read up a little on the br.submit() command. Could it be that some javascript needs to be executed to verify the login details after the button press, which mechanize is unable to handle? Should I try to use POST instead? Any help appreciated. Leo |
12-03-2016, 10:23 PM | #8 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yes, generally when a plain submit() does not work, it means there is javascript behind the scenes. What you do then is use the developer tools in a regular browser to see the requests generated by the login page when you click submit, and clone them in the recipe. An example of doing that is in the WSJ recipe.
|
12-05-2016, 04:37 PM | #9 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Many thanks,
I managed to capture the js: Code:
Host: www.irishtimes.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: https://www.irishtimes.com/signin
Content-Length: 106
Cookie: IT_cookiepopup=1; pw_meter_news=14815732..................8edbe; pw_cache=0....1480968432.IE.0.0...0xd12fffc3543.........6bb793bc2d38; IT_UUID=69164............b0758e
DNT: 1
Connection: keep-alive

Leo
Last edited by leo738; 12-05-2016 at 04:48 PM. |
12-07-2016, 07:24 AM | #10 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Managed to get something going:
Code:
__license__ = 'GPL v3'
__copyright__ = "2008, Derry FitzGerald. 2009 Modified by Ray Kinsella and David O'Callaghan, 2011 Modified by Phil Burns, 2013 Tom Scholl"
'''
irishtimes.com
'''
import urlparse, re
import json
from mechanize import Request
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'


class IrishTimes(BasicNewsRecipe):
    title = u'The Irish Times'
    __author__ = "Derry FitzGerald, Ray Kinsella, David O'Callaghan and Phil Burns, Tom Scholl"
    description = 'Daily news from The Irish Times'
    needs_subscription = True
    language = 'en_IE'
    masthead_url = 'http://www.irishtimes.com/assets/images/generic/website/logo_theirishtimes.png'
    encoding = 'utf-8'
    oldest_article = 1.0
    max_articles_per_feed = 100
    simultaneous_downloads = 5
    remove_empty_feeds = True
    no_stylesheets = True
    temp_files = []
    articles_are_obfuscated = True

    feeds = [
        ('News', 'https://www.irishtimes.com/cmlink/the-irish-times-news-1.1319192'),
        ('World', 'https://www.irishtimes.com/cmlink/irishtimesworldfeed-1.1321046'),
        ('Politics', 'https://www.irishtimes.com/cmlink/irish-times-politics-rss-1.1315953'),
        ('Business', 'https://www.irishtimes.com/cmlink/the-irish-times-business-1.1319195'),
        ('Culture', 'https://www.irishtimes.com/cmlink/the-irish-times-culture-1.1319213'),
        # Not interested in sport so commented out..
        # ('Sport', 'https://www.irishtimes.com/cmlink/the-irish-times-sport-1.1319194'),
        ('Debate', 'https://www.irishtimes.com/cmlink/debate-1.1319211'),
        ('Life & Style', 'https://www.irishtimes.com/cmlink/the-irish-times-life-style-1.1319214'),
    ]

    def get_browser(self):
        # To understand the signin logic read signin javascript from submit button from
        # https://www.irishtimes.com/signin
        br = BasicNewsRecipe.get_browser(self, user_agent=USER_AGENT)
        url = 'https://www.irishtimes.com/signin'
        br.set_debug_http(True)
        br.open(url).read()
        rurl = 'https://www.irishtimes.com/auth-rest-api/v1/paywall/login'
        rq = Request(rurl, headers={
            'Accept': '*/*',
            'Accept-Language': 'en-US,en;q=0.5',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Referer': url,
            'X-Requested-With': 'XMLHttpRequest',
        }, data=json.dumps({
            'username': self.username,
            'password': self.password,
            'deviceid': '53c835787f4d2406131985553c1842d0',
            'persistent': 'on',
        }))
        r = br.open(rq)
        if r.code != 200:
            raise ValueError('Failed to login, check username and password')
        data = json.loads(r.read())
        print(data)
        #if data.get('result') != 'success':
        #    raise ValueError(
        #        'Failed to login (XHR failed), check username and password')
        #br.set_cookie('m', data['username'], '.wsj.com')
        #r = br.open(data['url'])
        #self.wsj_itp_page = raw = r.read()
        #if b'>Sign Out<' not in raw:
        #    raise ValueError(
        #        'Failed to login (auth URL failed), check username and password')
        # open('/t/raw.html', 'w').write(raw)
        return br

    def get_obfuscated_article(self, url):
        # Insert a pic from the original url, but use content from the print url
        pic = None
        pics = self.index_to_soup(url)
        div = pics.find('div', {'class' : re.compile('image-carousel')})
        if div:
            pic = div.img
            if pic:
                try:
                    pic['src'] = urlparse.urljoin(url, pic['src'])
                    pic.extract()
                except:
                    pic = None

        content = self.index_to_soup(url + '?mode=print&ot=example.AjaxPageLayout.ot')
        if pic:
            content.p.insert(0, pic)

        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(content.prettify())
        self.temp_files[-1].close()
        return self.temp_files[-1].name

Any pointers what it is??
Thanks, Leo |
12-07-2016, 07:32 AM | #11 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It's likely an id that is generated using browser fingerprinting and helps track users. You can probably just use a random string for it, in the same format as the one you got for your browser.
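The device id in the recipe above is 32 hexadecimal characters. As a rough sketch of the "random string in the same format" suggestion, assuming the server only looks at the shape of the value (nothing in this thread confirms that), a fresh id could be generated with Python 3's standard library. `random_device_id` is a hypothetical helper name.

```python
import secrets


def random_device_id():
    """Return a random id shaped like the captured one: 32 lowercase hex characters."""
    # 16 random bytes render as 32 hexadecimal characters.
    return secrets.token_hex(16)


device_id = random_device_id()
print(device_id)
```

The result would then be passed as the `deviceid` field of the login request in place of the hard-coded string.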
|
12-09-2016, 07:30 AM | #12 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Thanks for the reply but not getting very far on this..
On hitting the 'signin' button the following POST is sent to: https://www.irishtimes.com/auth-rest-api/v1/paywall/login Code:
Host: www.irishtimes.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: https://www.irishtimes.com/signin
Content-Length: 106
Cookie: IT_UUID=1150a714-be0a-11e6-b6a8-005056b0758e; IT_cookiepopup=1
DNT: 1
Connection: keep-alive
Code:
username=ABCDEF%40gmail.com&password=123456&deviceid=53c835787f4d2406131985633c1942d0&persistent=on Code:
def get_browser(self):
    # To understand the signin logic read signin javascript from submit button from
    # https://www.irishtimes.com/signin
    br = BasicNewsRecipe.get_browser(self, user_agent=USER_AGENT)
    url = 'https://www.irishtimes.com/signin'
    br.set_debug_http(True)
    br.open(url).read()
    rurl = 'https://www.irishtimes.com/auth-rest-api/v1/paywall/login'
    rq = Request(rurl, headers={
        'Accept': '*/*',
        'Accept-Language': 'en-US,en;q=0.5',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': url,
        'X-Requested-With': 'XMLHttpRequest',
    }, data=json.dumps({
        'username': self.username,
        'password': self.password,
        'deviceid': '53c835787f4d2406131985633c1842d0',
        'persistent': 'on',
    }))
    r = br.open(rq)
    if r.code != 200:
        raise ValueError('Failed to login, check username and password')
    data = json.loads(r.read())
    print(data)
    #if data.get('result') != 'success':
    #    raise ValueError(
    #        'Failed to login (XHR failed), check username and password')
    #br.set_cookie('m', data['username'], '.wsj.com')
    #r = br.open(data['url'])
    #self.wsj_itp_page = raw = r.read()
    #if b'>Sign Out<' not in raw:
    #    raise ValueError(
    #        'Failed to login (auth URL failed), check username and password')
    # open('/t/raw.html', 'w').write(raw)
    return br
Code:
send: 'GET /signin HTTP/1.1\r\nAccept-Encoding: identity\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0\r\nHost: www.irishtimes.com\r\nAccept: */*\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Apache-Coyote/1.1
header: Content-Type: text/html;charset=utf-8
header: Last-Modified: Fri, 09 Dec 2016 12:26:34 GMT
header: X-Cacheable: YES
header: Content-Length: 72338
header: Accept-Ranges: bytes
header: Date: Fri, 09 Dec 2016 12:27:43 GMT
header: Connection: keep-alive
header: X-Pw-Hits: 1
header: Set-Cookie: IT_UUID=e23fb6da-be0a-11e6-bd74-005056a02a54; domain=.irishtimes.com; expires=Thu, 01 Jan 2099 00:00:01 GMT; path=/;
header: Pragma: no-cache
header: Cache-Control: no-cache, no-store, must-revalidate
header: Expires: Thu, 1 Jan 1970 00:00:00 GMT
send: 'POST /auth-rest-api/v1/paywall/login HTTP/1.1\r\nAccept-Encoding: identity\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0\r\nContent-Length: 126\r\nReferer: https://www.irishtimes.com/signin\r\nConnection: close\r\nX-Requested-With: XMLHttpRequest\r\nAccept: */*\r\nHost: www.irishtimes.com\r\nContent-Type: application/x-www-form-urlencoded; charset=UTF-8\r\nCookie: IT_UUID=e23fb6da-be0a-11e6-bd74-005056a02a54\r\nAccept-Language: en-US,en;q=0.5\r\n\r\n{"password": "123456", "deviceid": "53c835787f4d2406131955633c1842d0", "username": "ABCDEF@gmail.com", "persistent": "on"}'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Apache/2.4.10 (Debian)
header: Cache-Control: max-age=300
header: Expires: Fri, 09 Dec 2016 12:32:43 GMT
header: Content-Type: application/json
header: Last-Modified: Fri, 09 Dec 2016 12:27:43 GMT
header: Content-Length: 51
header: Accept-Ranges: bytes
header: Date: Fri, 09 Dec 2016 12:27:43 GMT
header: Connection: keep-alive
header: X-Pw-Hits: 0
<response_seek_wrapper at 0x7f650b587f80 whose wrapped object = <closeable_response at 0x7f650b50c638 whose fp = <socket._fileobject object at 0x7f650e597cd0>>>
{u'error_number': u'1', u'error_message': u'Login failed'}

Thanks, Leo |
12-09-2016, 07:35 AM | #13 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Just noticed that the POST from the Irish Times is using:
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Whereas the WSJ uses:
Content-Type: application/json
So it looks like I shouldn't be using the json stuff! How do I add the data instead? |
12-10-2016, 02:56 PM | #14 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Found an example of a similar login (available on github repo):
calibre/recipes/hbr.recipe Code:
from urllib import urlencode  # Python 2 import, needed for urlencode below

rq = Request(rurl, headers={
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Referer': url,
    'X-Requested-With': 'XMLHttpRequest',
}, data=urlencode({'username': self.username, 'password': self.password, 'deviceid': deviceid, 'persistent': 'on'}))

Regards, Leo |
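To make the difference between the two request bodies concrete, here is a small sketch comparing them, using the placeholder credentials and device id from the browser capture earlier in the thread. It is written for Python 3, where `urlencode` lives in `urllib.parse` (the recipes in this thread run on Python 2, where it is imported from `urllib`). The names `fields`, `form_body` and `json_body` are illustrative only.

```python
import json
from urllib.parse import urlencode

# Placeholder values copied from the captured browser request.
fields = [
    ('username', 'ABCDEF@gmail.com'),
    ('password', '123456'),
    ('deviceid', '53c835787f4d2406131985633c1942d0'),
    ('persistent', 'on'),
]

# Body for Content-Type: application/x-www-form-urlencoded
form_body = urlencode(fields)

# Body for Content-Type: application/json (what json.dumps produces)
json_body = json.dumps(dict(fields))

print(form_body)
print(json_body)
```

The form-encoded string has the same shape as the body the browser sent (`username=...&password=...`, with `@` escaped as `%40`), whereas the JSON string is what the failing request carried under a form-encoded Content-Type header.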
12-11-2016, 03:43 PM | #15 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
I've put together an improved recipe but am still having issues. It successfully handles the signin; however, when it starts downloading the articles (via RSS) it returns:
Code:
header: X-Pw-Access: anonymous,subscribers.p_1_2901997.news.1..aac.1.1.5

I attach the recipe for anyone interested.
Leo |
|