For me, Amazon is the main source for book covers. Conveniently, for old books there are sometimes user-contributed covers on Amazon, as with this book:
http://www.amazon.de/Hubschrauber-Da...3378637&sr=8-1
At the moment the code for parsing the cover url (in src/calibre/ebooks/metadata/sources/amazon.py) looks like this:
Code:
def parse_cover(self, root):
    imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
    if imgs:
        src = imgs[0].get('src')
        if '/no-image-avail' not in src:
            parts = src.split('/')
            if len(parts) > 3:
                bn = parts[-1]
                sparts = bn.split('_')
                if len(sparts) > 2:
                    bn = sparts[0] + sparts[-1]
                return ('/'.join(parts[:-1]))+'/'+bn
But in some cases (such as the example above), this only returns Amazon's loading indicator:
http://g-ecx.images-amazon.com/image...192546226_.gif
The correct cover image (in the above example) is inserted by JavaScript, so we also have to search through Amazon's JavaScript code. I suggest the following improved code:
Code:
def parse_cover(self, root, raw=""):
    imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
    if not imgs:
        imgs = root.xpath('//div[@class="main-image-inner-wrapper"]/img[@src]')
    if imgs:
        src = imgs[0].get('src')
        if 'loading-' in src:
            js_img = re.search(r'"largeImage":"(http://[^"]+)",',raw)
            if js_img:
                src = js_img.group(1)
        if '/no-image-avail' not in src \
                and 'loading-' not in src:
            self.log('Found image: %s' % src)
            parts = src.split('/')
            if len(parts) > 3:
                bn = parts[-1]
                sparts = bn.split('_')
                if len(sparts) > 2:
                    bn = (sparts[0] + sparts[-1]).replace("..",".")
                return ('/'.join(parts[:-1]))+'/'+bn
Please note that we have to pass the raw HTML to "self.parse_cover" (called in parse_details, line 306) so that it can inspect the JavaScript code.
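To show the JavaScript lookup in isolation, here is a minimal standalone sketch. The helper name find_js_cover and the sample page fragment are made up for illustration; only the regular expression is taken from the proposal above. Note that the pattern matches "http://" URLs only.

```python
import re

# Pattern taken from the proposed parse_cover; it matches http:// URLs only.
LARGE_IMAGE_RE = re.compile(r'"largeImage":"(http://[^"]+)",')


def find_js_cover(raw):
    """Return the first "largeImage" URL in the raw page source, or None.

    find_js_cover is a hypothetical helper, not part of calibre.
    """
    match = LARGE_IMAGE_RE.search(raw)
    return match.group(1) if match else None


# Made-up page fragment, used only to exercise the pattern.
sample = 'var data = {"largeImage":"http://example.com/images/I/abc._SL500_.jpg","thumb":"x"};'

print(find_js_cover(sample))
print(find_js_cover('<html>no script data here</html>'))
```

If nothing matches, the helper returns None, which corresponds to the case where the proposed parse_cover keeps the "loading-" URL and then rejects it.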
I added an additional check for the "main-image-inner-wrapper" because I recently ran into a book page where the "img" didn't have an id (unfortunately, I can't find that page at the moment).
Besides that, I corrected a small issue in the URL conversion in the last lines. At the moment the conversion works like this (note the "..jpg"):
Code:
http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL..jpg
With my code it will look like this:
Code:
http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL.jpg
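The conversion can be tried on its own with a small sketch like the following. The function name strip_size_modifier is hypothetical; the body mirrors the last lines of the proposed parse_cover, including the replace("..", ".") fix.

```python
def strip_size_modifier(src):
    """Drop the "_AA300_"-style size part from the file name of an image URL.

    Hypothetical helper that mirrors the tail of the proposed parse_cover.
    Returns None if the URL has too few path components.
    """
    parts = src.split('/')
    if len(parts) <= 3:
        return None
    bn = parts[-1]
    sparts = bn.split('_')
    if len(sparts) > 2:
        # Joining the first and last pieces leaves "..jpg", so collapse the dots.
        bn = (sparts[0] + sparts[-1]).replace('..', '.')
    return '/'.join(parts[:-1]) + '/' + bn


url = 'http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg'
print(strip_size_modifier(url))
```

A URL whose file name has no size part comes back unchanged.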
Could somebody from the dev team please have a look at this and commit it if they are fine with the changes? Thanks.