MobileRead Forums - View Single Post - metadata/sources/amazon.py: cover_parse, suggestions

tuxor · 03-15-2013, 04:34 PM

For me, Amazon is the main source for book covers. Conveniently, old books sometimes have user-contributed covers on Amazon, as with this book:

http://www.amazon.de/Hubschrauber-Da...3378637&sr=8-1

At the moment, the code for parsing the cover URL (in src/calibre/ebooks/metadata/sources/amazon.py) looks like this:

Code:

    def parse_cover(self, root):
        imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
        if imgs:
            src = imgs[0].get('src')
            if '/no-image-avail' not in src:
                parts = src.split('/')
                if len(parts) > 3:
                    bn = parts[-1]
                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        bn = sparts[0] + sparts[-1]
                        return ('/'.join(parts[:-1]))+'/'+bn

But in some cases (as with the above example), this only returns Amazon's load indicator: http://g-ecx.images-amazon.com/image...192546226_.gif

The correct cover image (in the above example) is inserted via JavaScript, so we have to search through Amazon's JavaScript code as well. I suggest the following improved code:

Code:

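    def parse_cover(self, root, raw=""):
        # raw is the unparsed HTML of the product page (relies on the re module)
        imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
        if not imgs:
            # Fallback: some pages have a cover <img> without an id
            imgs = root.xpath('//div[@class="main-image-inner-wrapper"]/img[@src]')
        if imgs:
            src = imgs[0].get('src')
            if 'loading-' in src:
                # The markup only holds the load indicator, so look for the
                # real cover URL in the inline JavaScript instead
                js_img = re.search(r'"largeImage":"(http://[^"]+)",',raw)
                if js_img:
                    src = js_img.group(1)
            if '/no-image-avail' not in src \
               and 'loading-' not in src:
                self.log('Found image: %s' % src)
                parts = src.split('/')
                if len(parts) > 3:
                    bn = parts[-1]
                    # Drop the size modifier between the underscores (e.g. _AA300_)
                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        # Joining the pieces leaves "..jpg", so collapse the doubled dot
                        bn = (sparts[0] + sparts[-1]).replace("..",".")
                        return ('/'.join(parts[:-1]))+'/'+bn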

Please note that the raw HTML has to be passed to "self.parse_cover" (called in parse_details, line 306) so that the JavaScript code can be inspected.
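
To illustrate the JavaScript lookup on its own, here is a minimal standalone sketch. The sample markup, the example.invalid host and the helper name find_js_cover are made up for illustration; only the regular expression is taken from the suggestion above, and real Amazon pages may well be structured differently.

```python
import re

# Hypothetical excerpt of a product page: the <img> only holds the load
# indicator, while the real cover URL sits in an inline script.
SAMPLE_RAW = '''
<img id="main-image" src="http://example.invalid/images/loading-bar._V1_.gif">
<script>
var data = {"largeImage":"http://example.invalid/images/I/21SnyDkzpWL._AA300_.jpg","thumb":"x"};
</script>
'''

# Same pattern as in the suggested parse_cover (plain http only).
LARGE_IMAGE_RE = re.compile(r'"largeImage":"(http://[^"]+)",')


def find_js_cover(raw):
    """Return the URL stored under "largeImage" in raw, or None if absent."""
    match = LARGE_IMAGE_RE.search(raw)
    return match.group(1) if match else None


if __name__ == '__main__':
    print(find_js_cover(SAMPLE_RAW))
```

With the default raw="" the search finds nothing, so callers that don't pass the raw HTML simply get the old behaviour.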

I added an additional check for the "main-image-inner-wrapper" because I recently ran into a book page where the "img" didn't have an id (unfortunately, I can't find that page at the moment).

Besides that, I corrected a small issue with the URL conversion in the last lines. At the moment the conversion works like this (note the "..jpg"):

Code:

http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL..jpg

With my code it will look like this:

Code:

http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL.jpg
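
The conversion can be tried outside of calibre with a small sketch like the following. The function name strip_size_suffix and the fix_dots switch are only for illustration and are not part of amazon.py; the body mirrors the last lines of parse_cover.

```python
def strip_size_suffix(src, fix_dots=True):
    """Remove the size modifier (e.g. _AA300_) from an image URL.

    Returns None when the URL doesn't have the expected shape, like the
    implicit None of parse_cover. fix_dots=False reproduces the current
    behaviour, which leaves a doubled dot in the file name.
    """
    parts = src.split('/')
    if len(parts) > 3:
        bn = parts[-1]
        sparts = bn.split('_')
        if len(sparts) > 2:
            bn = sparts[0] + sparts[-1]
            if fix_dots:
                bn = bn.replace('..', '.')
            return '/'.join(parts[:-1]) + '/' + bn
    return None


if __name__ == '__main__':
    url = 'http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg'
    print(strip_size_suffix(url, fix_dots=False))  # current conversion
    print(strip_size_suffix(url))                  # suggested conversion
```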

Could somebody from the dev team please have a look at this and commit it if they are fine with the changes? Thanks.
