03-01-2011, 10:06 PM | #31 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Sure, go ahead, I'll be interested to see what you come up with.
I'm afraid I've only ever used addheaders; if that isn't working, I have no clue what else you could do. |
03-01-2011, 10:18 PM | #32 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
|
03-01-2011, 10:34 PM | #33 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Looking at the mechanize source code, all you have to do is construct a Request object and manually add the content-type header to it. If the request object already has the content-type header, it will not be overridden.
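A minimal sketch of that approach. To keep it runnable without mechanize installed, it uses the standard library's urllib.request.Request (Python 3) as a stand-in for mechanize.Request; that swap is an assumption, as are the URL, the payload and the build_json_request helper name. The idea is the same: set Content-Type on the request yourself, so the handler has no reason to fill in its default.

```python
import json
from urllib.request import Request


def build_json_request(url, payload):
    """Build a POST request that carries its own Content-Type header.

    Hypothetical helper for illustration; not part of calibre or mechanize.
    """
    body = json.dumps(payload).encode("utf-8")
    req = Request(url, data=body)
    # Setting the header on the request itself means the HTTP handler
    # will not substitute its form-encoded default later on.
    req.add_header("Content-Type", "application/json; charset=utf-8")
    return req


# Placeholder URL and payload, nothing is sent over the network here.
req = build_json_request("http://example.invalid/query", {"title": "Example"})
print(req.get_method(), req.header_items())
```

Note that urllib stores header names capitalized as 'Content-type', so that is the spelling to use with has_header() and get_header().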
|
03-01-2011, 11:05 PM | #34 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
Code:
br = browser()
req = mechanize.Request(URL)
req.add_header('CALIBRE_VERSION', __version__)
req.add_header('CALIBRE_OS', 'win' if iswindows else 'osx' if isosx else 'oth')
req.add_header('CALIBRE_INSTALL_UUID', prefs['installation_uuid'])
version = br.open(req).read().strip()
|
|
03-01-2011, 11:11 PM | #35 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yes.
|
03-02-2011, 03:24 AM | #36 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
That did the trick for the JSON query. On to the next, and hopefully final, major stumbling block.
Edit: I think the best way to fix the problem below may be to delete the last cookie in the cookiejar, br._ua_handlers['_cookies'].cookiejar. Printed as a string, it looks like this:
Code:
<cookielib.CookieJar[<Cookie ASP.NET_SessionId=jfvfj1554sbio555e3nrfwjd for search.overdrive.com/>, <Cookie expires=1298969952 for search.overdrive.com/>]>
The other option would be to create a separate copy of the cookiejar and use a separate browser object to load the bad page, but I'm struggling to figure out how to duplicate a cookiejar object as well. I've got the separate page loader working with urllib2.
== original description ==
Weird problem, not sure how to fix it. Basically, one of the pages I have to retrieve sets a cookie with no name:
Code:
Set-Cookie: ; expires=Tue, 01-Mar-2011 08:15:21 GMT; path=/
Code:
Traceback (most recent call last):
  File "/Users/ldolse/calibredev/heuristics/src/calibre/ebooks/metadata/overdrive.py", line 112, in to_ovrdrv_data
    ovrdrv_data = find_ovrdrv_data(br, title, author, isbn)
  File "/Users/ldolse/calibredev/heuristics/src/calibre/ebooks/metadata/overdrive.py", line 95, in find_ovrdrv_data
    return overdrive_search(br, q, title, author)
  File "/Users/ldolse/calibredev/heuristics/src/calibre/ebooks/metadata/overdrive.py", line 53, in overdrive_search
    raw = br.open_novisit(xreq).read()
  File "site-packages/mechanize/_mechanize.py", line 199, in open_novisit
  File "site-packages/mechanize/_mechanize.py", line 230, in _mech_open
  File "site-packages/mechanize/_opener.py", line 188, in open
  File "site-packages/mechanize/_urllib2_fork.py", line 1188, in http_request
  File "lib/python2.7/cookielib.py", line 1331, in add_cookie_header
  File "lib/python2.7/cookielib.py", line 1290, in _cookie_attrs
TypeError: expected string or buffer
At least that's my assumption: this is the only page that sets a cookie header like that, and it sets it for a regular browser as well, so it's not related to the plugin. Is there any way to get mechanize to ignore the garbage Set-Cookie header?
Last edited by ldolse; 03-02-2011 at 06:28 AM. |
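The "delete the bad cookie" idea from the edit above could look roughly like the sketch below. It uses Python 3's http.cookiejar (the successor to cookielib) instead of the mechanize jar so that it runs standalone. It also assumes that the offending entry is the one printed as "expires=1298969952", i.e. a cookie named "expires" whose value is not a string; that reading of the traceback is a guess, not something confirmed in the thread. The make_cookie and drop_malformed_cookies helpers are hypothetical names.

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain="search.overdrive.com", path="/"):
    """Build a session cookie by hand (illustrative helper only)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def drop_malformed_cookies(jar):
    """Remove cookies whose value is neither None nor a string.

    Returns the names of the cookies that were removed.
    """
    # Collect first, then delete, so the jar is not mutated while iterating.
    bad = [c for c in jar
           if c.value is not None and not isinstance(c.value, str)]
    for c in bad:
        jar.clear(c.domain, c.path, c.name)
    return [c.name for c in bad]


jar = CookieJar()
jar.set_cookie(make_cookie("ASP.NET_SessionId", "jfvfj1554sbio555e3nrfwjd"))
# Simulated garbage entry, mirroring the jar printed above (assumption).
jar.set_cookie(make_cookie("expires", 1298969952))

removed = drop_malformed_cookies(jar)
print(removed, [c.name for c in jar])
```

With the mechanize jar from this thread, the same clear(domain, path, name) call should be available, since mechanize's CookieJar follows the cookielib interface, but I have not checked that against the mechanize version calibre ships.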
03-02-2011, 07:01 AM | #37 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I found a solution. I'm not sure it's the best one, but it's working. I figured out how to initialize a new cookiejar and copied the good cookie into it. Then I opened the bad page (which corrupts the browser's cookiejar) and replaced the corrupted cookiejar with the clean one.
Code:
import copy

# cookiejar currently attached to the browser; still clean at this point
goodcookies = br._ua_handlers['_cookies'].cookiejar
clean_cj = mechanize.CookieJar()
cookies_to_copy = []
for cookie in goodcookies:
    copied_cookie = copy.deepcopy(cookie)
    cookies_to_copy.append(copied_cookie)
for copied_cookie in cookies_to_copy:
    clean_cj.set_cookie(copied_cookie)
# request that corrupts the cookiejar
br.open(q_init_search)
# swap the clean copy back in
br.set_cookiejar(clean_cj)
Last edited by ldolse; 03-03-2011 at 01:33 AM. Reason: latest fix |
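For anyone who wants to try the snapshot-and-restore idea without a calibre checkout, here is a standalone sketch using Python 3's http.cookiejar in place of the mechanize jar (an assumption made for runnability). The make_cookie and snapshot_cookiejar helpers are hypothetical, and the "corrupting request" is simulated by adding a bad cookie by hand, since no real request is made.

```python
import copy
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain="search.overdrive.com", path="/"):
    """Build a session cookie by hand (illustrative helper only)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def snapshot_cookiejar(jar):
    """Return a new jar holding deep copies of every cookie in `jar`."""
    clean = CookieJar()
    for cookie in jar:
        clean.set_cookie(copy.deepcopy(cookie))
    return clean


# The jar as it looks before the problem page is loaded.
jar = CookieJar()
jar.set_cookie(make_cookie("ASP.NET_SessionId", "jfvfj1554sbio555e3nrfwjd"))

# Take the snapshot while the jar is still clean.
clean_cj = snapshot_cookiejar(jar)

# Stand-in for the request that corrupts the live jar.
jar.set_cookie(make_cookie("expires", 1298969952))

print([c.name for c in jar], [c.name for c in clean_cj])
```

The snapshot keeps only the session cookie, so handing it back to the browser (br.set_cookiejar(clean_cj) in the code above) keeps the same session going for later requests.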
03-02-2011, 11:26 AM | #38 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Why not just set a new cookiejar on the browser object with set_cookiejar?
|
03-02-2011, 11:37 AM | #39 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Primarily because I didn't see that example in the Googling I was doing for possible solutions - 'copy cookiejar', 'new cookiejar', 'initialize', etc. didn't return useful results. I'm not sure it would result in much less code though, because the original session cookie needs to be maintained across all the requests, so it would still need to be copied into the new cookiejar. That said, I think that should let me avoid importing urllib2, so I'll give it a shot.
|
03-02-2011, 11:40 AM | #40 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You don't want to use urllib2, as the calibre browser object automatically supports proxies and various other niceties.
|
03-02-2011, 11:44 AM | #41 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Yeah - set_cookiejar worked fine - only eliminated one line of code, but it does let me re-use the browser session and avoid urllib2.
|
03-02-2011, 03:09 PM | #42 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I don't know if you will find it to be of any value, but a variety of recipes define Request objects and use cookiejars, addheaders and add_header. Off the top of my head, the Economist and my Skeptic and GoComic recipes do some of those things.
|
03-02-2011, 08:31 PM | #43 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I had assumed my searches of the source tree were including recipes when I was working through it. You just prompted me to double-check, and I see the .recipe extension wasn't considered a text file type to search... I've fixed that, and I do see some useful examples there for general scraping code now. A couple of them initialize their own cookie jars, but apparently this website is fairly unique in its ability to trip up mechanize, because none of them delete cookies or manipulate them the way I'm trying to do.
|
03-02-2011, 09:11 PM | #44 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I'm nearly done with the plugin now; I basically just need to clean things up for more robust string handling.
However, what I've got working makes me wonder whether I should drop all the work I did with the library scraping. Basically, three HTTP requests directly to overdrive.com provide a list object that contains Title, Author, Series info, Publisher, Cover URL, Overdrive ID, ebook edition ISBN, and more. The plugin doesn't work off of ISBN, and it can't really, as Googlebooks/ISBNDB only seem to provide ISBNs for printed editions. Thus far in my testing there has never been an ISBN in their databases which matches the Overdrive ebook edition ISBN. I'm thinking now that this makes the xisbn cross referencing moot, correct? In Amazon's case the ASIN to ISBN combo matches one of the Googlebooks/ISBNDB records, but in this case it never does. Since finding the record relies on Title/Author and returns a fairly comprehensive list of metadata, would this plugin be more appropriate to use in the discovery phase?
Last edited by ldolse; 03-02-2011 at 09:19 PM. |
03-02-2011, 10:39 PM | #45 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yes, it sounds like a good match for discovery. I'd suggest you hold on for a bit; one of the goals of the new metadata infrastructure is to support the case of ebooks with a dedicated/no ISBN.
|