Old 02-22-2023, 12:47 AM   #13
isarl
Addict
 
Posts: 287
Karma: 2534928
Join Date: Nov 2022
Location: Canada
Device: Kobo Aura 2
I spent some time this evening working out how to scrape my Kobo wishlist. It's a bit fragile (I wasn't able to find a reliable, portable way to grab browser cookies, but I did find one that works for my setup), but I was able to figure out exactly which URL to query and how to parse the responses.

I will be curious to see in the coming days whether the cookies I currently have hard-baked into my script will expire and need updating. But in the meantime, Scrapy lets me generate a CSV with all the items on my wishlist.

This is not really in a shareable state yet. However, if you are reasonably comfortable with your browser's dev tools and with tweaking code, here's the ground I've already covered to help you replicate my results for your own wishlist. It depends on Scrapy (installable with pip).

Code:
#!/usr/bin/env python
import json
import scrapy

class KoboWishlistSpider(scrapy.Spider):
    name = 'kobowishlist'
    allowed_domains = ['kobo.com']
    # Paste your browser's "Copy as cURL" string between the triple quotes (see below).
    magic_request_string = """replace the contents of this triple-quoted string as described below"""

    def parse(self, response):
        data = json.loads(response.body.decode())
        for item in data['Items']:
            yield {
                'authors': [a['Name'] for a in item['Authors']],
                'title': item['Title'],
                'price': item['Price'],
                'page': data['PageIndex'],
            }
        # If this wasn't the last page, re-issue the same request for the next page.
        if data['PageIndex'] < data['TotalNumPages']:
            yield response.request.replace(body=json.dumps({'pageIndex': data['PageIndex'] + 1}))

    def start_requests(self):
        # Rebuild the copied browser request, but always start from page 1 of the wishlist.
        yield scrapy.Request.from_curl(self.magic_request_string).replace(body='{"pageIndex": 1}')
Save this into a file like kobospider.py. In a browser logged into your Kobo account, visit your wishlist and open the developer tools. Click the Network tab, start recording, and then click one of the buttons at the bottom of your wishlist to change pages. In the list of requests recorded in the Network tab, look for the request to “fetch”. Right-click > Copy > Copy as cURL. (I believe this works in at least Chrome, Chromium, and Firefox.) Paste this value inside the triple quotes for magic_request_string, replacing the placeholder. You should now be able to run your spider from the command line like:

Code:
scrapy runspider kobospider.py -O items.csv
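For reference, here is a purely hypothetical sketch of what the pasted value might end up looking like inside the triple quotes. The endpoint path, headers, and cookie value below are placeholders, not Kobo's actual API, and the exact shape of the copied command varies by browser:

Code:
# Hypothetical illustration only: substitute the string copied from your own
# browser's "Copy as cURL" -- this just shows the general shape of such a command.
magic_request_string = """curl 'https://www.kobo.com/<region>/<wishlist-fetch-path>' -X POST -H 'Content-Type: application/json' -H 'Cookie: <your session cookies>' --data-raw '{"pageIndex": 2}'"""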
I would be much happier if this could grab your login cookies automatically. StackOverflow suggested a Python library called browser-cookie3, which aims to do what I need, but when I tried it out in practice it ran into issues with dbus.
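For anyone who wants to experiment with that route anyway, here is a minimal sketch of how browser-cookie3 could feed cookies into Scrapy, assuming the library works on your system (it did not on mine). The FETCH_URL placeholder and the rewritten start_requests are my own illustration, not part of the script above; parse() would stay the same.

Code:
#!/usr/bin/env python
# Sketch only: assumes browser-cookie3 can read your browser's cookie store
# (it hit dbus issues on my machine). FETCH_URL is a placeholder for the same
# "fetch" endpoint found in the dev tools.
import browser_cookie3
import scrapy

FETCH_URL = 'https://www.kobo.com/...'  # placeholder: take the real URL from dev tools

def kobo_cookies():
    # Read kobo.com cookies from Firefox; browser_cookie3.chrome() or
    # browser_cookie3.load() are alternatives for other setups.
    jar = browser_cookie3.firefox(domain_name='kobo.com')
    return {cookie.name: cookie.value for cookie in jar}

class KoboWishlistSpider(scrapy.Spider):
    name = 'kobowishlist'
    allowed_domains = ['kobo.com']

    def start_requests(self):
        # Build the first request directly with the harvested cookies instead of
        # replaying a copied cURL command.
        yield scrapy.Request(
            FETCH_URL,
            method='POST',
            headers={'Content-Type': 'application/json'},
            cookies=kobo_cookies(),
            body='{"pageIndex": 1}',
        )

    # parse() would be the same as in the script above.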

Last edited by isarl; 02-22-2023 at 12:58 AM. Reason: force script to start on first page of wishlist, regardless of which page is loaded by the copied request cURL string