03-10-2010, 08:29 PM | #1 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Simulate Calibre recipe with browser?
I've got a recipe that works, but it pulls an image that is slightly different from the image I see when I go to the page the image came from. The log shows that the recipe is actually fetching the image I want, but Calibre doesn't get that image. It gets one similar, but different. When I use a browser to directly view the URL of the image that the recipe is fetching, I see the image I want, not the one that Calibre gets.
I've tried turning off cookies in the browser, clearing cookies, clearing the cache and fetching the image again, but I always get the image I want in the browser, and never get the image that Calibre's recipe gets. I've tried changing the user-agent string, blocking the referrer, etc., but I can't seem to simulate Calibre with the browser closely enough that I get the same image that Calibre gets. What am I missing? How does the site know that a Calibre recipe is grabbing the image? Comments? Thanks. |
03-11-2010, 12:26 AM | #2 |
creator of calibre
Posts: 43,853
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Try using the TamperData Firefox extension to see exactly what happens when you fetch with a browser.
|
03-11-2010, 10:59 AM | #3 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I've looked with Live HTTP Headers, but since I'm not sure what's happening with Calibre, it's hard to spot a difference. I suppose I could set up a packet sniffer, but that's a bit more effort than I want to expend. Alternatively, I may just use wget to see whether it pulls the same images that the browser pulls or the images that Calibre gets. That may give me a clue.
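
Once headers from two clients have been captured, spotting the difference can be left to a few lines of Python. This is only a rough sketch: the two header dumps are made-up placeholders, and it assumes only the "Name: value" lines are pasted in, without the request line.

```python
def parse_headers(raw):
    """Turn pasted 'Name: value' lines into a dict keyed by lowercase header name."""
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            headers[name.strip().lower()] = value.strip()
    return headers


def diff_headers(a, b):
    """Return {name: (value_in_a, value_in_b)} for headers that differ or are missing."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Hypothetical captures, for illustration only (not taken from the real site)
browser_dump = """\
Host: example.com
User-Agent: Mozilla/5.0
Referer: http://example.com/page.html
"""
other_dump = """\
Host: example.com
User-Agent: Mozilla/5.0
"""

differences = diff_headers(parse_headers(browser_dump), parse_headers(other_dump))
for name, (left, right) in differences.items():
    print(f"{name}: {left!r} vs {right!r}")
```

A header that appears on one side only shows up with None on the other side, which is the pattern to look for when one client silently omits something.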
|
03-11-2010, 11:53 AM | #4 |
creator of calibre
Posts: 43,853
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You can have calibre dump the headers of the requests it sends as well. I don't recall the exact commands for that off the top of my head, but just google python mechanize.
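
For what it's worth, the debug switches described in the mechanize documentation look roughly like the sketch below. The helper name enable_header_dump is made up for illustration, and it assumes the browser object passed in is a mechanize Browser (or a subclass), which is what a recipe's get_browser is expected to return.

```python
import logging
import sys


def enable_header_dump(br):
    """Switch on mechanize's debug output for an existing browser object.

    Assumption: br is a mechanize.Browser or a subclass of it.
    """
    # Send mechanize's own log messages to stdout so they are not lost
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.INFO)

    br.set_debug_http(True)       # print HTTP headers
    br.set_debug_redirects(True)  # log redirects and refreshes
    br.set_debug_responses(True)  # log response bodies
    return br


# Possible use inside a recipe (untested sketch):
#
#     def get_browser(self):
#         br = BasicNewsRecipe.get_browser(self)
#         return enable_header_dump(br)
```

Where the output ends up depends on how the recipe is run, so it may be necessary to run the download from a command line to see it.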
|
03-11-2010, 04:01 PM | #5 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
03-12-2010, 02:05 PM | #6 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
I had tried blocking the referrer with the RefControl plugin, but it turns out that Firefox (or the plugin) will still send the referrer unless you shut it down first. That's why I was having trouble getting Firefox to emulate Calibre's recipe and that was tricky part #1.

The second tricky part was that TamperData seems to lie about the referrer. Apparently, it was showing the referrer FF would have sent, if not for the blocking of RefControl. Live HTTP Headers, however, was showing what was actually being sent.

For Firefox to get the same images that Calibre was getting, I had to clear the cache, <block> referrer with RefControl, then close FF and restart. (I was also removing cookies, but I'm not sure if that was necessary). To see what was really happening in FF, I had to watch with Live HTTP Headers.

What I'm not sure about is what referrer, if any, Calibre sends as a default. I haven't yet figured out how to watch the handshaking with mechanize. I tried some get_browser mods in the recipe to use the correct referrer, but so far it hasn't worked. |
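
The browser can be taken out of the experiment entirely by fetching the image twice from a script, once with a Referer header and once without, and comparing the results. The sketch below uses only the Python standard library rather than Firefox or mechanize. The function names and the default user-agent string are made up for illustration, and the URLs passed in would be the image URL from the recipe log and the page it came from.

```python
import hashlib
import urllib.request


def build_request(url, referer=None, user_agent="Mozilla/5.0"):
    """Build a request that sends a Referer header only when one is given."""
    headers = {"User-Agent": user_agent}
    if referer is not None:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def fetch_digest(request, timeout=30):
    """Download the response body and return its SHA-256 hex digest."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return hashlib.sha256(response.read()).hexdigest()


def referer_changes_image(image_url, page_url):
    """Return True if the server sends different bytes depending on the Referer."""
    without_referer = fetch_digest(build_request(image_url))
    with_referer = fetch_digest(build_request(image_url, referer=page_url))
    return without_referer != with_referer
```

If referer_changes_image returns True for the image in question, the site is switching images on the Referer header, and the cache and cookie steps are probably not the deciding factor.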
|
03-12-2010, 02:14 PM | #7 |
creator of calibre
Posts: 43,853
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
When downloading articles it doesn't send any referrer. Each request leaves the browser state unchanged (this is so that the download can happen in multiple threads while using the same browser instance).
One possibility is to monkey patch the open_novisit method on the browser instance to send the required referrer. Something like this: Code:
def get_browser(self):
    br = BasicNewsRecipe.get_browser(self)
    orig_open_novisit = br.open_novisit
    def my_open_no_visit(self, url, **kwargs):
        data = # add the referrer to the header
        return orig_open_novisit(url, data=data)
    br.open_novisit = my_open_no_visit
    return br
|
03-13-2010, 04:42 PM | #8 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
    def my_open_no_visit(self, url, **kwargs):
had to be changed to:
    def my_open_no_visit(url, **kwargs):
(complaints about number of arguments), and the lines:
    data = # add the referrer to the header
    return orig_open_novisit(url, data=data)
were changed to:
    req = mechanize.Request(url, headers = {'Referer':'http://referer_site.com/'})
    return orig_open_novisit(req)
At least I got a chance to learn a bit more about mechanize. Thanks again for the tip, and enjoy your return home.
Last edited by Starson17; 03-13-2010 at 05:33 PM. |
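
Putting those changes together, the patch might be packaged as in the sketch below. The helper name send_referer and its request_class parameter are made up for illustration (in a recipe, mechanize.Request would be passed in), and the code assumes Python 3 and that open_novisit is normally called with a URL string. The wrapper takes no self argument because a plain function assigned to an instance attribute is not bound as a method, which would explain the complaint about the number of arguments.

```python
def send_referer(br, referer, request_class):
    """Patch br.open_novisit so every URL is fetched with the given Referer.

    Assumptions: br exposes open_novisit, and request_class accepts
    (url, headers=...) the way mechanize.Request does.
    """
    orig_open_novisit = br.open_novisit  # keep the original bound method

    def open_novisit_with_referer(url, *args, **kwargs):
        # Wrap plain URL strings in a request carrying the header;
        # anything that is already a request object is passed through.
        if isinstance(url, str):
            url = request_class(url, headers={"Referer": referer})
        return orig_open_novisit(url, *args, **kwargs)

    br.open_novisit = open_novisit_with_referer
    return br


# Possible use inside a recipe (untested sketch):
#
#     import mechanize
#
#     def get_browser(self):
#         br = BasicNewsRecipe.get_browser(self)
#         return send_referer(br, 'http://referer_site.com/', mechanize.Request)
```

Since the patch lives on the browser instance rather than on its state, it should not interfere with the same browser being shared across download threads.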
|