Using Overdrive for covers? - Page 3

kovidgoyal · 03-01-2011, 10:06 PM

Sure go ahead, I'll be interested to see what you come up with.

I'm afraid I've only ever used addheaders; if that isn't working, I have no clue what else you could do.

ldolse · 03-01-2011, 10:18 PM

Quote:

Originally Posted by kovidgoyal

I'm afraid I've only ever used addheaders; if that isn't working, I have no clue what else you could do.

I'll keep digging then - they also have a legacy search interface that might work, I just liked the JSON option as no scraping was required.

kovidgoyal · 03-01-2011, 10:34 PM

Looking at the mechanize source code, all you have to do is construct a Request object and manually add the content-type header to it. If the Request object already has a content-type header, it will not be overridden.
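A minimal sketch of that idea, using the standard library's `urllib.request.Request` as a stand-in. mechanize ships a fork of urllib2, so the assumption here is that `add_header` behaves the same way on `mechanize.Request`. The URL and JSON payload are placeholders, not Overdrive's real endpoint, and nothing is sent over the network:

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload, for illustration only.
URL = "http://example.com/search.json"
payload = json.dumps({"title": "Example"}).encode("utf-8")

# Build the request by hand and set the content-type on it explicitly.
req = Request(URL, data=payload)
req.add_header("Content-Type", "application/json; charset=utf-8")

# urllib normalises header names with str.capitalize(), hence "Content-type".
print(req.get_header("Content-type"))
```

With mechanize the finished request would then be handed to the browser, e.g. `br.open(req)` or `br.open_novisit(req)`.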

ldolse · 03-01-2011, 11:05 PM

Quote:

Originally Posted by kovidgoyal

Looking at the mechanize source code, all you have to do is construct a Request object and manually add the content-type header to it. If the Request object already has a content-type header, it will not be overridden.

By "construct a Request object", do you mean something roughly equivalent to this (from gui2.update.py)?

Code:

    br = browser()
    req = mechanize.Request(URL)
    req.add_header('CALIBRE_VERSION', __version__)
    req.add_header('CALIBRE_OS',
            'win' if iswindows else 'osx' if isosx else 'oth')
    req.add_header('CALIBRE_INSTALL_UUID', prefs['installation_uuid'])
    version = br.open(req).read().strip()

kovidgoyal · 03-01-2011, 11:11 PM

Yes.

ldolse · 03-02-2011, 03:24 AM

That did the trick for the JSON query. On to the next, and hopefully final, major stumbling block.

Edit: I think maybe the best way to fix the problem below is to delete the last cookie in the cookiejar, br._ua_handlers['_cookies'].cookiejar. This is what the jar looks like printed as a string:

Code:

<cookielib.CookieJar[<Cookie ASP.NET_SessionId=jfvfj1554sbio555e3nrfwjd for search.overdrive.com/>, <Cookie expires=1298969952 for search.overdrive.com/>]>

Not sure how to go about actually doing that though, as it's an instance and not a list object. I tried to use cookielib's clear() function, but it doesn't seem to work, probably because this cookie is corrupted in the first place and doesn't use the structure mechanize/cookielib expects.
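One way to delete just the bad cookie, sketched with the standard library's `http.cookiejar` (the Python 3 name for cookielib). Judging from the repr above and the traceback further down, the nameless Set-Cookie header seems to end up stored as a cookie named `expires` with an integer value, and that non-string value appears to be what trips `_cookie_attrs`. That reading is an assumption, and the helper names `make_cookie` and `drop_malformed_cookies` are made up for this example. Calling `clear()` with a domain, path and name removes a single entry:

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain="search.overdrive.com"):
    """Hypothetical helper: build a session cookie by hand for the demo."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def drop_malformed_cookies(jar):
    """Remove cookies whose value is neither None nor a string."""
    # Collect first, then delete, so the jar is not modified while iterating.
    bad = [c for c in jar
           if c.value is not None and not isinstance(c.value, str)]
    for c in bad:
        jar.clear(c.domain, c.path, c.name)
    return len(bad)


jar = CookieJar()
jar.set_cookie(make_cookie("ASP.NET_SessionId", "jfvfj1554sbio555e3nrfwjd"))
jar.set_cookie(make_cookie("expires", 1298969952))  # mimics the bad entry

removed = drop_malformed_cookies(jar)
print(removed, [c.name for c in jar])
```

On Python 2 the string check would be against `basestring` instead of `str`. This has not been tried against the Overdrive site itself.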

The other option would be to create a separate copy of the cookiejar and use a separate browser object to load the bad page. But I'm struggling to figure out how to duplicate a cookiejar object as well. I've got the separate page loader working with urllib2.

== original description ==

Weird problem, not sure how to fix it. Basically one of the pages I have to retrieve sets a cookie with no name:

Code:

Set-Cookie: ; expires=Tue, 01-Mar-2011 08:15:21 GMT; path=/

And this causes mechanize to barf when it moves on to the next request:

Code:

Traceback (most recent call last):
  File "/Users/ldolse/calibredev/heuristics/src/calibre/ebooks/metadata/overdrive.py", line 112, in to_ovrdrv_data
    ovrdrv_data = find_ovrdrv_data(br, title, author, isbn)
  File "/Users/ldolse/calibredev/heuristics/src/calibre/ebooks/metadata/overdrive.py", line 95, in find_ovrdrv_data
    return overdrive_search(br, q, title, author)
  File "/Users/ldolse/calibredev/heuristics/src/calibre/ebooks/metadata/overdrive.py", line 53, in overdrive_search
    raw = br.open_novisit(xreq).read()
  File "site-packages/mechanize/_mechanize.py", line 199, in open_novisit
  File "site-packages/mechanize/_mechanize.py", line 230, in _mech_open
  File "site-packages/mechanize/_opener.py", line 188, in open
  File "site-packages/mechanize/_urllib2_fork.py", line 1188, in http_request
  File "lib/python2.7/cookielib.py", line 1331, in add_cookie_header
  File "lib/python2.7/cookielib.py", line 1290, in _cookie_attrs
TypeError: expected string or buffer

At least that's my assumption - this is the only page that sets a cookie header like that, and it sets it for a regular browser as well - it's not related to the plugin. Any way to get Mechanize to ignore the garbage set-cookie header?
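One possible way to have the garbage cookie ignored at the point it is set is a custom cookie policy, since the cookiejar asks its policy's `set_ok()` about every cookie before storing it. The sketch below uses the standard library's `http.cookiejar` (cookielib on Python 2). The class name `SkipMalformedCookies` and the helper `make_cookie` are made up for this example, and the guess that the bad cookie carries a non-string value is an assumption based on the traceback above:

```python
from http.cookiejar import Cookie, CookieJar, DefaultCookiePolicy
from urllib.request import Request


class SkipMalformedCookies(DefaultCookiePolicy):
    """Hypothetical policy: refuse cookies whose value is not a string."""

    def set_ok(self, cookie, request):
        if cookie.value is not None and not isinstance(cookie.value, str):
            return False  # never stored, so it cannot break later requests
        return DefaultCookiePolicy.set_ok(self, cookie, request)


def make_cookie(name, value, domain="search.overdrive.com"):
    """Hypothetical helper: build a session cookie by hand for the demo."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


policy = SkipMalformedCookies()
jar = CookieJar(policy=policy)

# No network traffic: the request object is only used for the policy checks.
req = Request("http://search.overdrive.com/")
good_ok = policy.set_ok(make_cookie("ASP.NET_SessionId", "abc123"), req)
bad_ok = policy.set_ok(make_cookie("expires", 1298969952), req)
print(good_ok, bad_ok)
```

Whether mechanize's own CookieJar accepts the same `policy` argument is assumed rather than checked here; if it does, a jar built this way could be attached to the browser with `br.set_cookiejar(jar)`.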

ldolse · 03-02-2011, 07:01 AM

I found a solution, not sure if it's the best one, but it's working. Figured out how to initialize a new cookiejar, copied the good cookie into that. Opened the bad page (corrupting the cookiejar) and replaced the corrupted cookiejar with the clean one.

Code:

    import copy

    # snapshot the good cookies into a fresh jar before the bad request
    goodcookies = br._ua_handlers['_cookies'].cookiejar
    clean_cj = mechanize.CookieJar()
    for cookie in goodcookies:
        clean_cj.set_cookie(copy.deepcopy(cookie))

    # request that corrupts the cookiejar
    br.open(q_init_search)

    # swap the clean jar back in for all later requests
    br.set_cookiejar(clean_cj)

kovidgoyal · 03-02-2011, 11:26 AM

Why not just set a new cookiejar on the browser object with set_cookiejar?

ldolse · 03-02-2011, 11:37 AM

Primarily because I didn't see that example in the Googling I was doing for possible solutions - 'copy cookiejar', 'new cookiejar', 'initialize', etc didn't return useful results. I'm not sure it would result in much less code though - part of what needs to happen is that the original session cookie needs to be maintained across all the requests. So it would still need to be copied into the new cookiejar. That said, I think that should let me avoid importing urllib2, so I'll give it a shot.

kovidgoyal · 03-02-2011, 11:40 AM

You don't want to use urllib2, as the calibre browser object automatically supports proxies and various other niceties.

ldolse · 03-02-2011, 11:44 AM

Yeah - set_cookiejar worked fine - only eliminated one line of code, but it does let me re-use the browser session and avoid urllib2.

Starson17 · 03-02-2011, 03:09 PM

Quote:

Originally Posted by ldolse

Yeah - set_cookiejar worked fine - only eliminated one line of code, but it does let me re-use the browser session and avoid urllib2.

I don't know if you will find it to be of any value, but defining Request objects, using cookiejars, addheader and add_header are used in a variety of recipes. Off the top of my head, the Economist and my Skeptic and GoComic recipes do some of those things.

ldolse · 03-02-2011, 08:31 PM

Quote:

Originally Posted by Starson17

I don't know if you will find it to be of any value, but defining Request objects, using cookiejars, addheader and add_header are used in a variety of recipes. Off the top of my head, the Economist and my Skeptic and GoComic recipes do some of those things.

I had assumed my searches of the source tree were including recipes when I was working through it. You just prompted me to double-check, and I see the .recipe extension wasn't considered a text file type to search... I've fixed that, and I do see some useful examples there for general scraping code now. A couple initialize their own cookie jars, but apparently this website is fairly unique in its ability to trip up mechanize, because none delete cookies or manipulate them the way I'm trying to do.

ldolse · 03-02-2011, 09:11 PM

I'm nearly done with the plugin now, basically just need to clean things up for more robust string handling.

However, what I've got working makes me wonder whether I should drop all the work I did with the library scraping. Basically, three HTTP requests directly to overdrive.com provide a list object that contains Title, Author, Series info, Publisher, Cover URL, Overdrive ID, ebook edition ISBN, and more.

The plugin doesn't work off of ISBN, it can't really, as Googlebooks/ISBNDB only seem to provide ISBNs for printed editions. Thus far in my testing there has never been an ISBN in their databases which matches the Overdrive ebook edition ISBN - I'm thinking now that this makes the xisbn cross referencing moot, correct? In that case Amazon's ASIN to ISBN combo matches one of the Googlebooks/ISBNDB records, but in this case it never does.

Since finding the record relies on Title/Author, and returns a fairly comprehensive list of Metadata, would this plugin be more appropriate to use in the discovery phase?

kovidgoyal · 03-02-2011, 10:39 PM

Yes, it sounds like a good match for discovery. I'd suggest you hold on for a bit; one of the goals of the new metadata infrastructure is to support the case of ebooks with a dedicated/no ISBN.
