03-15-2013, 04:34 PM   #1
tuxor
metadata/sources/amazon.py: parse_cover, suggestions

For me, Amazon is the main source for book covers. Conveniently, old books sometimes have user-contributed covers on Amazon, as with this book:

http://www.amazon.de/Hubschrauber-Da...3378637&sr=8-1

At the moment the code for parsing the cover url (in src/calibre/ebooks/metadata/sources/amazon.py) looks like this:
Code:
    def parse_cover(self, root):
        imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
        if imgs:
            src = imgs[0].get('src')
            if '/no-image-avail' not in src:
                parts = src.split('/')
                if len(parts) > 3:
                    bn = parts[-1]
                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        bn = sparts[0] + sparts[-1]
                        return ('/'.join(parts[:-1]))+'/'+bn
But in some cases (as with the example above), this only returns Amazon's loading indicator: http://g-ecx.images-amazon.com/image...192546226_.gif

The correct cover image (in the example above) is inserted by JavaScript, so we have to search through Amazon's JavaScript code as well. I suggest the following improved code:

Code:
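    def parse_cover(self, root, raw=""):
        # Note: uses re.search, so "re" has to be imported in amazon.py.
        imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
        if not imgs:
            # Fallback for pages where the cover <img> has no id.
            imgs = root.xpath('//div[@class="main-image-inner-wrapper"]/img[@src]')
        if imgs:
            src = imgs[0].get('src')
            if 'loading-' in src:
                # Only the loading indicator is in the HTML; look for the
                # real cover URL in the JavaScript part of the raw page.
                js_img = re.search(r'"largeImage":"(http://[^"]+)",',raw)
                if js_img:
                    src = js_img.group(1)
            if '/no-image-avail' not in src \
               and 'loading-' not in src:
                self.log('Found image: %s' % src)
                parts = src.split('/')
                if len(parts) > 3:
                    bn = parts[-1]
                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        # Drop the size marker (e.g. "_AA300_") and collapse
                        # the doubled dot left behind by the join.
                        bn = (sparts[0] + sparts[-1]).replace("..",".")
                        return ('/'.join(parts[:-1]))+'/'+bn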
Please note that we have to pass the raw HTML to "self.parse_cover" (called in parse_details, line 306) so that it can inspect the JavaScript code.
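
For anyone who wants to try the JavaScript lookup outside calibre, here is a minimal standalone sketch. The regular expression is the one from the code above; the helper name find_js_cover and the sample page fragment are made up for illustration, and the real Amazon markup may well differ.

```python
import re

# Same pattern as in the proposed parse_cover; the constant name is hypothetical.
LARGE_IMAGE_RE = re.compile(r'"largeImage":"(http://[^"]+)",')


def find_js_cover(raw):
    """Return the first "largeImage" URL found in the raw page source, or None."""
    match = LARGE_IMAGE_RE.search(raw)
    return match.group(1) if match else None


# Invented page fragment, only to exercise the pattern.
sample = (
    '<script>var data = {"largeImage":'
    '"http://example.com/images/I/abc._AA300_.jpg","thumb":"x"};</script>'
)

print(find_js_cover(sample))           # URL captured from the script block
print(find_js_cover('<html></html>'))  # no match, so None
```

The pattern only matches when the URL is followed by a comma, so a "largeImage" entry that is the last key of its object would be missed.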

I added an additional check for the "main-image-inner-wrapper" because I recently ran into a book page where the "img" didn't have an id (unfortunately, I can't find that page at the moment).
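
The two-step lookup can be sketched on its own as follows. This is only an approximation: it uses the standard library's xml.etree.ElementTree instead of the lxml tree that calibre passes in, so the single XPath with "or" is replaced by a loop over the ids (which checks ids in a fixed order rather than document order). The function name find_cover_img and the sample markup are assumptions.

```python
import xml.etree.ElementTree as ET

# The ids checked by the existing XPath expression.
KNOWN_IDS = ('prodImage', 'original-main-image', 'main-image')


def find_cover_img(root):
    """Return the cover <img> element, or None if no candidate has a src."""
    # First pass: an <img> carrying one of the known ids.
    for img_id in KNOWN_IDS:
        for img in root.findall(".//img[@id='%s']" % img_id):
            if img.get('src'):
                return img
    # Fallback: an id-less <img> directly inside the wrapper div.
    for img in root.findall(".//div[@class='main-image-inner-wrapper']/img"):
        if img.get('src'):
            return img
    return None


# Invented, well-formed markup where the <img> has no id.
page = ET.fromstring(
    '<html><body><div class="main-image-inner-wrapper">'
    '<img src="http://example.com/cover.jpg"/></div></body></html>'
)

print(find_cover_img(page).get('src'))
```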

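Besides that, I fixed a small issue in the URL conversion in the last lines. Splitting the file name on "_" leaves a first part that ends with "." and a last part that starts with ".", so joining them produces a doubled dot. At the moment the conversion looks like this (note the "..jpg"):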
Code:
http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL..jpg
With my code it will look like this:
Code:
http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL.jpg
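
The conversion can be checked in isolation with a small sketch like the one below. It copies the last lines of the proposed parse_cover into a free function; the name strip_size_suffix is made up for this example and does not exist in calibre.

```python
def strip_size_suffix(src):
    """Remove the size marker (e.g. "_AA300_") from an image URL.

    Returns None when the URL has no such marker, mirroring the
    implicit None of the original method.
    """
    parts = src.split('/')
    if len(parts) > 3:
        bn = parts[-1]            # file name, e.g. "21SnyDkzpWL._AA300_.jpg"
        sparts = bn.split('_')    # ["21SnyDkzpWL.", "AA300", ".jpg"]
        if len(sparts) > 2:
            # Joining first and last part gives "21SnyDkzpWL..jpg";
            # the replace collapses the doubled dot.
            bn = (sparts[0] + sparts[-1]).replace('..', '.')
            return '/'.join(parts[:-1]) + '/' + bn
    return None


url = 'http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg'
print(strip_size_suffix(url))
# http://ecx.images-amazon.com/images/I/21SnyDkzpWL.jpg
```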
Could somebody from the dev team please have a look at this and commit it if they are fine with the changes? Thanks.