metadata/sources/amazon.py: parse_cover, suggestions

tuxor · 03-15-2013, 04:34 PM

For me, Amazon is the main source for book covers. Conveniently, old books sometimes have user-contributed covers on Amazon, as with this book:

http://www.amazon.de/Hubschrauber-Da...3378637&sr=8-1

At the moment, the code for parsing the cover URL (in src/calibre/ebooks/metadata/sources/amazon.py) looks like this:

Code:

    def parse_cover(self, root):
        imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
        if imgs:
            src = imgs[0].get('src')
            if '/no-image-avail' not in src:
                parts = src.split('/')
                if len(parts) > 3:
                    bn = parts[-1]
                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        bn = sparts[0] + sparts[-1]
                        return ('/'.join(parts[:-1]))+'/'+bn

But in some cases (as with the above example), this returns only Amazon's loading indicator: http://g-ecx.images-amazon.com/image...192546226_.gif

The correct cover image (in the above example) is inserted by JavaScript, so we also have to search through Amazon's JavaScript code. I suggest the following improved code:

Code:

    def parse_cover(self, root, raw=""):
        # raw is the page's raw HTML; `re` must be imported in the module
        imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
        if not imgs:
            # Fallback for pages where the main image has no id
            imgs = root.xpath('//div[@class="main-image-inner-wrapper"]/img[@src]')
        if imgs:
            src = imgs[0].get('src')
            if 'loading-' in src:
                # Only the loading indicator is in the HTML; look in the javascript
                js_img = re.search(r'"largeImage":"(http://[^"]+)",',raw)
                if js_img:
                    src = js_img.group(1)
            if '/no-image-avail' not in src \
               and 'loading-' not in src:
                self.log('Found image: %s' % src)
                parts = src.split('/')
                if len(parts) > 3:
                    bn = parts[-1]
                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        # Drop the size tag (e.g. _AA300_), then collapse the ".."
                        bn = (sparts[0] + sparts[-1]).replace("..",".")
                        return ('/'.join(parts[:-1]))+'/'+bn

Please note that we have to pass the raw HTML to "self.parse_cover" (called in parse_details, line 306) in order to be able to inspect the JavaScript code.
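The JavaScript lookup can be tried on its own, outside calibre. The following is only a minimal sketch: the helper name find_js_cover, the sample markup, and the example.invalid placeholder URL are made up for illustration, and the real Amazon page may be structured differently. Only the regular expression is taken from the code above.

```python
import re

# Made-up sample: an <img> that only holds a loading indicator, plus an
# inline script that carries the real cover URL under "largeImage".
SAMPLE_RAW = '''
<img id="main-image" src="http://example.invalid/loading-indicator.gif">
<script>
var data = {"largeImage":"http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg","other":"x"};
</script>
'''


def find_js_cover(raw):
    """Return the "largeImage" URL found in the raw HTML, or None."""
    match = re.search(r'"largeImage":"(http://[^"]+)",', raw)
    return match.group(1) if match else None


print(find_js_cover(SAMPLE_RAW))
print(find_js_cover('<html></html>'))
```

Note that the pattern only matches when the URL is followed by `",`, so it would miss a "largeImage" entry that happens to be the last key of its object.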

I added an additional check for the "main-image-inner-wrapper" because I recently ran into a book page where the "img" didn't have an id (unfortunately, I can't find that page at the moment).

Besides that, I corrected a small issue with the URL conversion in the last lines. At the moment the conversion works like this (note the "..jpg"):

Code:

http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL..jpg

With my code it will look like this:

Code:

http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL.jpg
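To see where the double dot comes from, it helps to look at what split('_') produces for the file name. This is a small sketch; the helper name strip_size_tag is made up here and simply mirrors the two relevant lines of the current code.

```python
def strip_size_tag(basename):
    """Drop the '_..._' size tag the way the current code does (no dot cleanup)."""
    sparts = basename.split('_')
    if len(sparts) > 2:
        # The first piece keeps its trailing '.', the last keeps its leading '.'
        return sparts[0] + sparts[-1]
    return basename


bn = '21SnyDkzpWL._AA300_.jpg'
print(bn.split('_'))       # ['21SnyDkzpWL.', 'AA300', '.jpg']
print(strip_size_tag(bn))  # 21SnyDkzpWL..jpg
```

The dot before the size tag and the dot of the extension both survive, which is why the joined name ends in "..jpg".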

Would somebody from the dev team please have a look at this and commit it if they are fine with the changes? Thanks.

kovidgoyal · 03-16-2013, 01:33 AM

Merged.

I'm not comfortable with the replace('..', '.'): there may be other places in the URL where .. is needed, and since the .. does not cause any actual problems, it's safer to leave it in.

tuxor · 03-16-2013, 04:19 AM

Quote:

Originally Posted by kovidgoyal

Merged.

Thanks.

Quote:

Originally Posted by kovidgoyal

I'm not comfortable with the replace('..', '.'): there may be other places in the URL where .. is needed, and since the .. does not cause any actual problems, it's safer to leave it in.

I understand your concerns, but yes, it does cause actual problems. Look at this:

Code:

http://g-ecx.images-amazon.com/images/G/03/ciu/80/50/1b46f96642a0b333ce906110.L._AA300_.jpg
becomes
http://g-ecx.images-amazon.com/images/G/03/ciu/80/50/1b46f96642a0b333ce906110.L..jpg

But the link with the double dot doesn't work!

Please do at least something like this:

Code:

                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        bn = (sparts[0] + sparts[-1]).replace("..jpg",".jpg")
                        return ('/'.join(parts[:-1]))+'/'+bn
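As a standalone sketch of this narrower variant: the function name clean_cover_url is made up, and the example.invalid URL is a constructed case with a ".." path segment, not a real Amazon address. It only shows the string handling and says nothing about which URLs Amazon actually serves.

```python
def clean_cover_url(src):
    """Strip the size tag from the file name and fix only a trailing '..jpg'."""
    parts = src.split('/')
    if len(parts) > 3:
        bn = parts[-1]
        sparts = bn.split('_')
        if len(sparts) > 2:
            # Applied to the file name only, so the rest of the URL is untouched
            bn = (sparts[0] + sparts[-1]).replace('..jpg', '.jpg')
            return '/'.join(parts[:-1]) + '/' + bn
    return None


print(clean_cover_url(
    'http://g-ecx.images-amazon.com/images/G/03/ciu/80/50/'
    '1b46f96642a0b333ce906110.L._AA300_.jpg'))
# Constructed URL: the '..' path segment is left alone
print(clean_cover_url('http://example.invalid/a/../img/x._AA300_.jpg'))
```

Because the replacement is applied to the last path component only and targets "..jpg" specifically, a ".." elsewhere in the URL stays as it is. Covers with another extension would still keep the double dot.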

EDIT: By the way, this is that book's amazon details page: http://www.amazon.de/Bertelsmann-Jug...3422572&sr=8-1

kovidgoyal · 03-16-2013, 12:44 PM

OK.
