[Metadata Source Plugin] Goodreads - Page 16

davidfor · 10-26-2015, 07:41 AM

Thanks for the reports. I'll give it another couple of days before I arrange for the release.

ButtonsMom2003 · 11-17-2015, 12:56 PM

Thank you very much for this. I've been searching off and on for a few weeks to find out why I was no longer getting comments from Goodreads. When I ask Calibre to check for updates it told me there were none available for the Goodreads plugin (not the Goodreads sync one). I found davidfor's file and it fixed the problem. Yay!

seziki · 11-20-2015, 03:38 AM

I have a problem with the Goodreads plugin. I can't download metadata from Goodreads. When I try to open a book's link through calibre, I get error 403. If I wait a little, it opens. When I try to download metadata for multiple books, I can download metadata for 4-5 books and then it gives the same error. I wait a little, and then I can download metadata for another 4-5 books.

davidfor · 11-20-2015, 04:43 AM

A 403 error is "Forbidden". Unfortunately, I don't know what is forbidden about the request. But, as you had the error in the browser (clicking the link in calibre) and then in calibre, it isn't something specific to calibre. The only thing I can think of is that Goodreads is seeing too many requests from your IP and temporarily blocking them.

When it next happens, can you post the log?

davidfor · 11-20-2015, 04:45 AM

Quote:

Originally Posted by ButtonsMom2003

Thank you very much for this. I've been searching off and on for a few weeks to find out why I was no longer getting comments from Goodreads. When I ask Calibre to check for updates it told me there were none available for the Goodreads plugin (not the Goodreads sync one). I found davidfor's file and it fixed the problem. Yay!

Woops, I have completely forgotten about this. I'll try and arrange to get this and Goodreads Sync updated over the weekend. And maybe finish the changes for FictionDB.

seziki · 11-20-2015, 05:02 AM

Quote:

Originally Posted by davidfor

A 403 error is "Forbidden". Unfortunately, I don't know what is forbidden about the request. But, as you had the error in the browser (clicking the link in calibre) and then in calibre, it isn't something specific to calibre. The only thing I can think of is that Goodreads is seeing too many requests from your IP and temporarily blocking them.

When it next happens, can you post the log?

I tried at home and got the same error. I downloaded calibre on another PC and got the same error. I tried a different browser. It could be because of too many requests, since I'm organising my books. Also, I'm using the Turkish version of calibre and I don't know how to find the log.

davidfor · 11-20-2015, 05:45 AM

Quote:

Originally Posted by seziki

I tried at home and got the same error. I downloaded calibre on another PC and got the same error. I tried a different browser. It could be because of too many requests, since I'm organising my books. Also, I'm using the Turkish version of calibre and I don't know how to find the log.

When you get the error, the window should have a button for "Show log" or "Show Details" or something like that. Press that to see the log or full details of the error. Then press the "Copy to clipboard" button and post the contents of the clipboard.

Are you downloading metadata for one book at a time, or using the bulk download? If you are using the bulk download, can you try the single book?

seziki · 11-20-2015, 06:04 AM

When I click the link I get this page. It means 'Access denied to the Web page. You are not authorized. You may need to login.'
I have tried everything: one book, bulk download, re-downloading calibre, re-downloading the plugin, changing browser, changing my Goodreads user settings.
When I click the link, sometimes it begins with http://, sometimes https://
I don't think I can solve this problem.

[Attached image: Adsız.jpg]

davidfor · 11-20-2015, 06:51 AM

Quote:

Originally Posted by seziki

When I click the link I get this page. It means 'Access denied to the Web page. You are not authorized. You may need to login.'
I have tried everything: one book, bulk download, re-downloading calibre, re-downloading the plugin, changing browser, changing my Goodreads user settings.
When I click the link, sometimes it begins with http://, sometimes https://
I don't think I can solve this problem.

Firstly, the URL that calibre builds for Goodreads is HTTP. But, Goodreads is redirecting that to HTTPS. Or at least it is here.

If the error is happening in the browser, it probably means it is something in your connection. That error page doesn't look like it comes from Goodreads. Do you have a proxy between you and the web that you need to login to? Can you reach the Goodreads home page? Is using HTTPS different to HTTP?

Krazykiwi · 11-21-2015, 08:21 AM

These 403s have been happening for the last week, since Friday 13 November. It's a bit of a heisenbug, so I hadn't reported it. In any case, here's what I have figured out:

If you send any batch of requests ~>5 at a time, you will probably hit this. Single requests don't ever seem to hit it on first pull of a specific book. The bigger batch you send, the more you'll get: If you request 20 books at a time, in 3 batches, queued into the job manager, you'll probably get all or most of the books in the first batch, 10-12 in the second batch, and none by the third. At this point, all requests to GR fail for the next while - I didn't yet narrow down how long, but it's at least 15 minutes.

Once any specific book has failed, that book id will also pull a 403 in a browser. I've checked multiple browsers. Add any character to the end of the URL and it'll work. A - will do: GR URLs on the site usually include part of the title, but their webserver is actually only responding to the book id, so any text after the book id will work. So it's only the bare book id that fails; it's interpreting any variant as a separate, different request.

All this leads me to believe it's probably something to do with rate-limiting.

As I said, given enough time, any bare book ID url will work again too in the browser, and once it does, you can again pull that book from the plugin.

Just to be super clear what I mean here:
Bare book id url's, for books that were exhibiting this, but naturally aren't now:
http://www.goodreads.com/book/show/6902644
https://www.goodreads.com/book/show/45634
https://www.goodreads.com/book/show/553907

http vs https doesn't seem to matter, I specifically tried both. Once it's blocked it's apparently IP blocking me, because I actually tried one of those URLs from my tablet on wifi (behind the same router as this PC, so same external IP address) and at the same time from my phone via 3G. Tablet failed, phone worked. ETA: this is specific book id by book id. When one specific one starts failing with 403s, the rest of the site still works, but as I mentioned, internal URLs on GR are not bare ids; they always include some other text. The fact that it starts to decline *all* metadata requests from the plugin after it's failed enough times is interesting, again implying rate limiting, since it's affecting either only bare book id URLs, or only API requests.

Adding a - or any other character to those same URLs, during the period they were blocked with a 403 response, made them work fine.

This is really quite hard to debug, since it's so inconsistent, but at the same time it's also quite repeatable, if you throw enough books at GR.
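If the cause really is rate limiting, one gentle way to test that is to send books in small batches with a pause in between and see whether the 403s stop. The following is only a minimal sketch under that assumption: `fetch_in_batches`, `batch_size`, and `pause_seconds` are hypothetical names, the batch size of 4 is picked only because batches of more than about 5 seemed to trigger the block, and the 60 second pause is a guess, not a known Goodreads limit.

```python
import time


def fetch_in_batches(book_ids, fetch, batch_size=4, pause_seconds=60.0,
                     sleep=time.sleep):
    """Call fetch(book_id) for every id, pausing between small batches.

    `fetch` is whatever function actually downloads one book (hypothetical
    here). `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = {}
    for start in range(0, len(book_ids), batch_size):
        if start > 0:
            # Pause before every batch except the first one.
            sleep(pause_seconds)
        for book_id in book_ids[start:start + batch_size]:
            results[book_id] = fetch(book_id)
    return results
```

Passing a recording function as `sleep` makes it easy to confirm how many pauses a given list of ids would cause before running it against the real site.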

kiwidude · 11-22-2015, 09:36 AM

v1.1.0 Released

Changes in this release:

Site changes for the description/comments.

Thanks to davidfor for making the changes.

trying · 12-12-2015, 09:08 PM

As noted by Krazykiwi, you can get 403 errors if you try to download "bare" goodreads book urls too many times. To investigate this issue further, I looked into the details of how the Goodreads Metadata Source Plugin works.

When you first try to download metadata for a book that doesn't have a "goodreads:" (or "isbn:") entry in the identifiers field, the plugin does a goodreads search and then parses the HTML response to get the first matching book's url. This url is not a bare url so it shouldn't trigger the 403 error.

The next time you try to download metadata for that book, it will now have a "goodreads:" identifier, won't do a search, and attempts to get metadata by just directly downloading
www.goodreads.com/book/show/{TheGoodreadsID} (see __init__.py lines 114-115).

I speculate that this problem is more noticeable because the Description/Comments/Summary metadata broke recently and a new plugin version was required. So more people have been re-downloading Goodreads metadata for books that already have a "goodreads:" identifier.

You can fix this problem by changing the identify() method in __init__.py, line 115, to automatically do what Krazykiwi was doing manually. Just add a trailing "-" to the url as in the following:

Code:

  if goodreads_id:
      matches.append('%s/book/show/%s-' % (Goodreads.BASE_URL, goodreads_id))

Optionally, to see when the plugin is actually redoing a search to get a book url, right after __init__.py, line 238:

Code:

   result_url = Goodreads.BASE_URL + first_result_url_node[0]

you can add:

Code:

   log.info('First search results book url: %s' % result_url)

I just got metadata for 290 books and it took 13 minutes, 23 seconds (2.76 seconds per book). 16 books failed to get metadata but they were all "No matches found with query" errors.

The plugin does not use the Goodreads API but is instead scraping the book's html page so it's not limited to 1 request per second. I'm not sure why it's so slow (a custom C# metadata downloader I wrote can grab 500+ books in a few minutes)? I didn't bother to figure this out though since it would probably unfairly load down the goodreads servers.

davidfor · 12-13-2015, 01:39 AM

I doubt adding the dash to the end of the URL will really help. I think it is more likely that when the blocking is happening, the Goodreads site thinks this is a different URL. I would expect that it could get blocked and then the URL without the dash would work. Or you would need two dashes. You could automate this: try it and, if there was a 403, add the dash and try that. I don't like that, as there is a reason that Goodreads has blocked the URL and we probably should not sidestep that.

But, are people still seeing this problem? At the time it happened, there was another problem with getting related books. I was wondering if both were caused by a bad update to Goodreads.
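For reference, the "try it and, if there was a 403, add the dash" automation mentioned above could be sketched roughly as follows. This is not the plugin's code: it is a standalone Python 3 illustration using `urllib`, and `BASE_URL`, `default_open`, and `fetch_book_page` are hypothetical names. The reservation about sidestepping a deliberate block applies to it as much as to the manual workaround.

```python
import urllib.error
import urllib.request

# Assumed base URL; the plugin keeps its own value in Goodreads.BASE_URL.
BASE_URL = 'https://www.goodreads.com'


def default_open(url):
    """Download a URL and return the raw bytes of the response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_book_page(goodreads_id, open_url=default_open):
    """Try the bare book URL first; on a 403, retry once with a trailing dash."""
    bare_url = '%s/book/show/%s' % (BASE_URL, goodreads_id)
    try:
        return open_url(bare_url)
    except urllib.error.HTTPError as err:
        if err.code != 403:
            # Any other HTTP error is not the problem discussed here.
            raise
    # Only reached after a 403 on the bare URL.
    return open_url(bare_url + '-')
```

The `open_url` parameter exists only so the fallback logic can be exercised with a fake opener instead of real requests.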

trying · 12-13-2015, 12:19 PM

Quote:

Originally Posted by davidfor

I doubt adding the dash to the end of the URL will really help. I think it is more likely that when the blocking is happening, the Goodreads site thinks this is a different URL. I would expect that it could get blocked and then the URL without the dash would work. Or you would need two dashes. You could automate this: try it and, if there was a 403, add the dash and try that. I don't like that, as there is a reason that Goodreads has blocked the URL and we probably should not sidestep that.

But, are people still seeing this problem? At the time it happened, there was another problem with getting related books. I was wondering if both were caused by a bad update to Goodreads.

I got the 403 errors two days ago after I had downloaded metadata using the v1.1.7 version of the plugin, noticed that I was missing Comments, updated to v1.1.10, and then redownloaded metadata. Once goodreads was blocking a bare book url I would, as Krazykiwi mentions, get 403 errors from a browser for a particular bare url and 200 okay when adding the dash. I could repeatedly get a 403 or 200 for that url depending only on whether a dash was added.

To test your theory, I just downloaded metadata for 288 books that already had a "goodreads:" id using my patch. It took 17m:34s (3.7s per book) with no 403 errors, but I forgot to turn off the Amazon metadata plugin. Redoing it with just the Goodreads plugin took 6m:24s (1.3s per book) and failed for 2 books, but those were "No matches found with query" errors.

I then removed my patch, and downloaded the metadata for the same 288 books. It took 3m:38s (0.76s per book) but only successfully downloaded metadata for 57 books, and failed for 231 (with 229 "httperror_seek_wrapper: HTTP Error 403: Forbidden" errors). This matches the initial behavior that caused me to investigate the problem in the first place.

Just to be sure I put my patch back in, redownloaded (6m:20s, 1.3s per book), and again successfully got metadata for all the books for which metadata exists. So it seems your theory is wrong?

Krazykiwi · 12-15-2015, 12:46 PM

Could you make it add any random character at the end? a %s.%random-alpha-char% or %s-%randomalphachar ('scuse my lack of python-fu, but that ought to be a fairly simple little function even if it's not a built-in right?) That would make every request be treated as "the first time".
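A function like that is short in Python. The sketch below is only an illustration of the idea: `book_url_with_random_suffix` and `BASE_URL` are hypothetical names, not part of the plugin, and whether Goodreads really treats each suffixed URL as a separate request is the observation from this thread, not something confirmed.

```python
import random
import string

# Assumed base URL; the plugin keeps its own value in Goodreads.BASE_URL.
BASE_URL = 'https://www.goodreads.com'


def book_url_with_random_suffix(goodreads_id, rng=random):
    """Build a book URL ending in '-' plus one random lowercase letter."""
    suffix = rng.choice(string.ascii_lowercase)
    return '%s/book/show/%s-%s' % (BASE_URL, goodreads_id, suffix)
```

A single letter only gives 26 variants per book, so repeated requests for the same id would eventually repeat a URL; a longer random suffix would reduce that.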
