Old 10-24-2023, 09:26 PM   #31
mjfriedman
Connoisseur
mjfriedman began at the beginning.
 
Posts: 64
Karma: 10
Join Date: Dec 2010
Device: Kindle Oasis
But with the lead story entirely missing…
Old 10-24-2023, 09:28 PM   #32
mjfriedman
Connoisseur
mjfriedman began at the beginning.
 
Posts: 64
Karma: 10
Join Date: Dec 2010
Device: Kindle Oasis
Sorry. Some stories come through correctly; others return an empty item entitled “Too many requests”.
Old 11-20-2023, 05:27 AM   #33
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,839
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
And just for posterity, this is how far I got with reversing the JS. I can extract the encrypted key, the iv and the encrypted data. The problem is in get_decryption_key(): for some reason the WSJ server isn't returning the decrypted key. The same request works in a browser, so I am guessing there is some cookie missing or the server does some TLS sniffing.

Code:
from html5_parser import parse
import json
from calibre import browser
from mechanize import Request
from urllib.parse import urlparse


def extract_json_data(raw_html):
    # The encrypted article payload is embedded in the Next.js page data
    root = parse(raw_html)
    d = json.loads(root.xpath('//script[@id="__NEXT_DATA__"]')[0].text)
    page_props = d['props']['pageProps']
    ed = page_props['encryptedDataHash']
    encrypted_data = ed['content']
    iv = ed['iv']
    encrypted_key = page_props['encryptedDocumentKey']
    url = root.xpath('//link[@rel="canonical"]')[0].get('href')
    return {'url': url, 'encrypted_data': encrypted_data, 'iv': iv, 'encrypted_key': encrypted_key}


def get_browser_for_wsj(*a, **kw):
    br = browser()
    br.set_cookie('wsjregion', 'na,us', '.wsj.com')
    br.set_cookie('gdprApplies', 'false', '.wsj.com')
    br.set_cookie('ccpaApplies', 'false', '.wsj.com')
    br.set_cookie('vcdpaApplies', 'false', '.wsj.com')
    br.set_cookie('regulationApplies', 'gdpr%3Afalse%2Ccpra%3Afalse%2Cvcdpa%3Afalse', '.wsj.com')
    br.set_handle_gzip(True)
    br.addheaders += [
        ('Accept', '*/*'),
        ('Accept-Language', 'en-GB,en-US;q=0.9,en;q=0.8'),
    ]
    return br


def get_decryption_key(br, data, referer):
    from pprint import pprint
    pprint
    purl = urlparse(referer)
    rq = Request('https://www.wsj.com/client', headers={
        'Cache-Control': 'max-age=0',
        'Referer': referer,
        'X-Encrypted-Document-Key': data['encrypted_key'],
        'X-Original-Host': 'www.wsj.com',
        'X-Original-Referrer': '',
        'X-Original-Url': purl.path,
    })
    br.set_debug_http(True)
    try:
        res = br.open(rq)
    except Exception as err:
        if hasattr(err, 'read'):
            raise Exception('decryption key request failed with error: {} and body: {}'.format(err, err.read().decode('utf-8', 'replace')))
        raise
    if res.code != 200:
        raise ValueError(f'decryption key request returned non OK HTTP result code: {res.code}')
    r = json.loads(res.read())
    key = r['documentKey']
    if not key:
        pprint(r)
        raise ValueError('No document key returned')
    # Hand the (still base64 encoded) key back to the caller
    return key


def get_wsj_article(url='https://www.wsj.com/world/middle-east/u-n-world-leaders-push-to-get-gaza-aid-flowing-after-biden-pledge-3b59283b'):
    br = get_browser_for_wsj()
    res = br.open(url)
    raw_html = res.read()
    data = extract_json_data(raw_html)
    get_decryption_key(br, data, res.geturl())



if __name__ == '__main__':
    get_wsj_article()
Old 11-20-2023, 01:47 PM   #34
unkn0wn
Evangelist
unkn0wn can do the Funky Gibbon.
 
Posts: 442
Karma: 82686
Join Date: May 2021
Device: kindle
in get_decryption_key (line 43) try 'Referer': 'https://www.drudgereport.com/'

Was able to get documentKey successfully
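
A minimal sketch of that change, assuming the header names used in get_decryption_key above. build_key_request_headers is a hypothetical helper, not part of calibre, and the article URL below is a placeholder. It keeps the Referer separate from the article URL, so X-Original-Url still carries the article path when the Referer is overridden. Whether the server actually checks X-Original-Url is not established in this thread.

```python
from urllib.parse import urlparse


def build_key_request_headers(encrypted_key, article_url,
                              referer='https://www.drudgereport.com/'):
    # Hypothetical helper: same header set as get_decryption_key(), but the
    # Referer no longer determines X-Original-Url.
    return {
        'Cache-Control': 'max-age=0',
        'Referer': referer,
        # Passed through unchanged, exactly as found in the page data
        'X-Encrypted-Document-Key': encrypted_key,
        'X-Original-Host': 'www.wsj.com',
        'X-Original-Referrer': '',
        'X-Original-Url': urlparse(article_url).path,
    }


# Placeholder key and article URL, for illustration only
headers = build_key_request_headers(
    'PLACEHOLDER_KEY', 'https://www.wsj.com/world/middle-east/example-article')
print(headers['Referer'])
print(headers['X-Original-Url'])
```

The resulting dict can be passed as the headers argument of mechanize's Request in place of the inline dict.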

Last edited by unkn0wn; 11-20-2023 at 01:52 PM.
Old 11-21-2023, 12:24 AM   #35
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,839
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Yeah, that works, but now the issue is how to decrypt using the key and iv. The obvious candidate, AES-CTR, doesn't seem to work.

Code:
import base64
import json
from html5_parser import parse
from mechanize import Request
from urllib.parse import urlparse

from calibre import browser


def extract_json_data(raw_html):
    # The payload and iv are base64 decoded here; the encrypted key is kept as text
    root = parse(raw_html)
    d = json.loads(root.xpath('//script[@id="__NEXT_DATA__"]')[0].text)
    page_props = d['props']['pageProps']
    ed = page_props['encryptedDataHash']
    encrypted_data = base64.standard_b64decode(ed['content'])
    iv = base64.standard_b64decode(ed['iv'])
    encrypted_key = page_props['encryptedDocumentKey']
    url = root.xpath('//link[@rel="canonical"]')[0].get('href')
    return {'url': url, 'encrypted_data': encrypted_data, 'iv': iv, 'encrypted_key': encrypted_key}


def get_browser_for_wsj(*a, **kw):
    br = browser()
    br.set_cookie('wsjregion', 'na,us', '.wsj.com')
    br.set_cookie('gdprApplies', 'false', '.wsj.com')
    br.set_cookie('ccpaApplies', 'false', '.wsj.com')
    br.set_cookie('vcdpaApplies', 'false', '.wsj.com')
    br.set_cookie('regulationApplies', 'gdpr%3Afalse%2Ccpra%3Afalse%2Cvcdpa%3Afalse', '.wsj.com')
    br.set_handle_gzip(True)
    br.addheaders += [
        ('Accept', '*/*'),
        ('Accept-Language', 'en-GB,en-US;q=0.9,en;q=0.8'),
    ]
    return br


def get_decryption_key(br, data, referer='https://www.drudgereport.com/'):
    from pprint import pprint
    pprint
    purl = urlparse(referer)
    rq = Request('https://www.wsj.com/client', headers={
        'Cache-Control': 'max-age=0',
        'Referer': referer,
        'X-Encrypted-Document-Key': data['encrypted_key'],
        'X-Original-Host': 'www.wsj.com',
        'X-Original-Referrer': '',
        'X-Original-Url': purl.path,
    })
    br.set_debug_http(True)
    try:
        res = br.open(rq)
    except Exception as err:
        if hasattr(err, 'read'):
            raise Exception('decryption key request failed with error: {} and body: {}'.format(err, err.read().decode('utf-8', 'replace')))
        raise
    if res.code != 200:
        raise ValueError(f'decryption key request returned non OK HTTP result code: {res.code}')
    r = json.loads(res.read())
    key = r['documentKey']
    if not key:
        pprint(r)
        raise ValueError('No document key returned')
    return base64.standard_b64decode(key)


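def decrypt_article(data):
    from Crypto.Cipher import AES
    from Crypto.Util import Counter
    ciphertext = data['encrypted_data']
    # ciphertext += b'\0' * (16 - len(ciphertext) % 16)
    # Treat the 16 byte iv as the initial 128-bit big-endian counter value.
    # An explicit byteorder keeps int.from_bytes() working on Python < 3.11.
    initial_value = int.from_bytes(data['iv'], 'big')
    print('ciphertext length:', len(ciphertext), 'iv length:', len(data['iv']))
    counter = Counter.new(nbits=128, initial_value=initial_value)
    cipher = AES.new(data['key'], AES.MODE_CTR, counter=counter)
    # NOTE: this AES-CTR attempt does not produce readable output
    return cipher.decrypt(ciphertext)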


def get_wsj_article(url='https://www.wsj.com/world/middle-east/u-n-world-leaders-push-to-get-gaza-aid-flowing-after-biden-pledge-3b59283b'):
    br = get_browser_for_wsj()
    res = br.open(url)
    raw_html = res.read()
    data = extract_json_data(raw_html)
    data['key'] = get_decryption_key(br, data)
    return decrypt_article(data)



if __name__ == '__main__':
    data = get_wsj_article()
    print(data)
    print(b'content' in data)
Old 11-21-2023, 03:32 AM   #36
unkn0wn
Evangelist
unkn0wn can do the Funky Gibbon.
 
Posts: 442
Karma: 82686
Join Date: May 2021
Device: kindle
In get_decryption_key, the 'X-Encrypted-Document-Key': data['encrypted_key'] value should not be base64 decoded; otherwise we will not get the documentKey.

I tried, but the decoded output is unreadable, maybe because it's still base64 encoded.

At this point I mostly don't understand how the decryption works here.

Last edited by unkn0wn; 11-21-2023 at 03:55 AM.
Old 11-21-2023, 05:50 AM   #37
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,839
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
yeah it's not base64 decoded, as it has to be sent in a header. As I said, it remains to figure out what decryption algorithm is used, either by stepping through the JS in a debugger or reversing it. From a quick read of the JS it looks like some variant of AES with a 16 byte "iv". I tried a few of the more obvious ones like AES-256-CTR but no luck.
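
One way to work through the "obvious ones" systematically is to try several AES modes in a loop and flag any output that looks like text. This is only a sketch under assumptions: try_aes_modes and looks_like_text are hypothetical names, the mode list is a guess, and nothing here claims that any of these modes is what wsj.com actually uses. It needs pycryptodome, imported lazily as in decrypt_article().

```python
# Printable ASCII plus tab, LF and CR
PRINTABLE = set(range(0x20, 0x7f)) | {0x09, 0x0a, 0x0d}


def looks_like_text(data, threshold=0.9):
    # Heuristic: a correct decryption of HTML/JSON is mostly printable ASCII,
    # while a wrong key or mode gives bytes that look random
    if not data:
        return False
    printable = sum(b in PRINTABLE for b in data)
    return printable / len(data) >= threshold


def try_aes_modes(key, iv, ciphertext):
    # Requires pycryptodome
    from Crypto.Cipher import AES
    from Crypto.Util import Counter

    def ctr():
        counter = Counter.new(128, initial_value=int.from_bytes(iv, 'big'))
        return AES.new(key, AES.MODE_CTR, counter=counter)

    candidates = {
        'CBC': lambda: AES.new(key, AES.MODE_CBC, iv=iv),
        'CFB': lambda: AES.new(key, AES.MODE_CFB, iv=iv),
        'OFB': lambda: AES.new(key, AES.MODE_OFB, iv=iv),
        'CTR': ctr,
        # decrypt() alone does not check the GCM authentication tag
        'GCM': lambda: AES.new(key, AES.MODE_GCM, nonce=iv),
    }
    readable = []
    for name, make_cipher in candidates.items():
        try:
            plaintext = make_cipher().decrypt(ciphertext)
        except ValueError:
            # e.g. CBC rejects ciphertext that is not a multiple of 16 bytes
            continue
        if looks_like_text(plaintext):
            readable.append(name)
    return readable
```

An empty result from try_aes_modes(key, iv, ciphertext) would suggest the scheme is something other than these plain modes, for example a different counter layout or an extra encoding layer.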
Old 11-24-2023, 03:56 AM   #38
unkn0wn
Evangelist
unkn0wn can do the Funky Gibbon.
 
Posts: 442
Karma: 82686
Join Date: May 2021
Device: kindle
If anyone's interested, try to figure out the decryption method used here:
Code:
{
    'url': 'https://www.wsj.com/tech/ai/openai-leadership-hangs-in-balance-as-sam-altmans-counte-rebellion-gains-steam-47276fa8',
    'encrypted_data': '',
    'iv': 'J0mre5ohZnHgK/RgHOTYhQ==',
    'encrypted_key': 'TY4XXz7TLdVFkd7pXhRZfqaRLYYdtpyCFrKnKe9EXfvaCfMOPo2dP/kC6TBmdCL7/IT7leMxY05OBv9gQkGVZgqCcI7lTLscMfvhhnmCjieb/NH3qbOkwwD+c0QXYosmf2aKYhUafSozz8ngBg6Q385j9pS36+sEfW6X3vFc/X+khJ7tChceWPIcM1JU8zs99bMomN451Vbhz6vUc+1W0bCk6hJ4yX1WGRlWRbM1vd88pEBmterZN+icij1+2g==',
    'key': 'bnQbZ9urHPWcAMC/RmPO/JyAXfKAGC6Jl7oqEjc1O+k='
}
encrypted_key is encryptedDocumentKey as per the code in wsj
key is documentKey
try to decrypt the encrypted_data to readable text/html.
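
As a quick sanity check on the sample above, the base64 values can be decoded and measured. This is a small sketch using only the standard library. The lengths are consistent with an AES-256 key and a 16 byte iv, but they do not by themselves identify the cipher mode.

```python
import base64

# Values copied from the sample above
sample = {
    'iv': 'J0mre5ohZnHgK/RgHOTYhQ==',
    'key': 'bnQbZ9urHPWcAMC/RmPO/JyAXfKAGC6Jl7oqEjc1O+k=',
}

iv = base64.standard_b64decode(sample['iv'])
key = base64.standard_b64decode(sample['key'])

print(len(iv))   # 16 bytes, one AES block
print(len(key))  # 32 bytes, the size of an AES-256 key
```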
All times are GMT -4.