03-15-2013, 04:34 PM | #1 |
Addict
Posts: 320
Karma: 99999
Join Date: Oct 2011
Location: Germany
Device: Onyx Boox M92, Icarus Illumina E653
|
metadata/sources/amazon.py: parse_cover, suggestions
For me, Amazon is the main source for book covers. It's convenient that for old books there are sometimes user-contributed covers on Amazon, as with this book:
http://www.amazon.de/Hubschrauber-Da...3378637&sr=8-1
At the moment the code for parsing the cover URL (in src/calibre/ebooks/metadata/sources/amazon.py) looks like this:
Code:
def parse_cover(self, root):
    imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
    if imgs:
        src = imgs[0].get('src')
        if '/no-image-avail' not in src:
            parts = src.split('/')
            if len(parts) > 3:
                bn = parts[-1]
                sparts = bn.split('_')
                if len(sparts) > 2:
                    bn = sparts[0] + sparts[-1]
                return ('/'.join(parts[:-1]))+'/'+bn

The correct cover image (in the above example) is included using JavaScript, so we have to search through Amazon's JavaScript code as well. I suggest the following improved lines of code:
Code:
def parse_cover(self, root, raw=""):
    imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
    if not imgs:
        imgs = root.xpath('//div[@class="main-image-inner-wrapper"]/img[@src]')
    if imgs:
        src = imgs[0].get('src')
        if 'loading-' in src:
            js_img = re.search(r'"largeImage":"(http://[^"]+)",',raw)
            if js_img:
                src = js_img.group(1)
        if '/no-image-avail' not in src \
                and 'loading-' not in src:
            self.log('Found image: %s' % src)
            parts = src.split('/')
            if len(parts) > 3:
                bn = parts[-1]
                sparts = bn.split('_')
                if len(sparts) > 2:
                    bn = (sparts[0] + sparts[-1]).replace("..",".")
                return ('/'.join(parts[:-1]))+'/'+bn

I added an additional check for "main-image-inner-wrapper" because I recently ran into a book page where the "img" didn't have an id (unfortunately, I can't find that page at the moment).
Besides that, I corrected a small issue concerning the URL conversion in the last lines. At the moment the conversion is like this (note the "..jpg"):
Code:
http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL..jpg

With the suggested change, the conversion is like this instead:
Code:
http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL.jpg
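The JavaScript fallback in the suggested parse_cover can be tried in isolation. The following is only a minimal sketch: the page fragment in RAW and the placeholder URL are made up for illustration (real Amazon markup may look different), and cover_src_from_script is a hypothetical helper name, not a function in calibre. Only the regular expression and the 'loading-' check are taken from the code above.

```python
import re

# Hypothetical page fragment; the real Amazon markup may differ.
RAW = (
    '<div class="main-image-inner-wrapper">'
    '<img src="http://ecx.images-amazon.com/images/loading-large.gif"></div>'
    '<script>var data = {"largeImage":'
    '"http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg",'
    '"other":1};</script>'
)

# Same pattern as in the suggested parse_cover above.
LARGE_IMAGE = re.compile(r'"largeImage":"(http://[^"]+)",')


def cover_src_from_script(src, raw):
    """Hypothetical helper: if src is only a loading placeholder,
    try to pull the real image URL out of the page's script text."""
    if 'loading-' in src:
        match = LARGE_IMAGE.search(raw)
        if match:
            return match.group(1)
    # Either src was a normal image URL or nothing was found in raw.
    return src


placeholder = 'http://ecx.images-amazon.com/images/loading-large.gif'
print(cover_src_from_script(placeholder, RAW))
```

If the script text contains no "largeImage" entry, the placeholder is returned unchanged, and the later 'loading-' check in parse_cover would then reject it.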
03-16-2013, 01:33 AM | #2 |
creator of calibre
Posts: 43,826
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Merged.
I'm not comfortable with the replace('..', '.'): there may be other places in the URL where .. is needed, and since the .. does not cause any actual problems, it's safer to leave it in.
03-16-2013, 04:19 AM | #3 | |
Addict
Posts: 320
Karma: 99999
Join Date: Oct 2011
Location: Germany
Device: Onyx Boox M92, Icarus Illumina E653
|
Thanks.
Quote:
Code:
http://g-ecx.images-amazon.com/images/G/03/ciu/80/50/1b46f96642a0b333ce906110.L._AA300_.jpg
becomes
http://g-ecx.images-amazon.com/images/G/03/ciu/80/50/1b46f96642a0b333ce906110.L..jpg

Please do at least something like this:
Code:
sparts = bn.split('_')
if len(sparts) > 2:
    bn = (sparts[0] + sparts[-1]).replace("..jpg",".jpg")
return ('/'.join(parts[:-1]))+'/'+bn

Last edited by tuxor; 03-16-2013 at 04:29 AM.
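To compare the three variants discussed in this thread side by side, here is a small standalone sketch. strip_size_tag and its fix parameter are hypothetical names used only for this illustration; the split-and-join logic and the two replace calls are copied from the snippets above. The basename 'a..b._AA300_.jpg' is invented to show a ".." that is not part of the file extension; it is not a real Amazon file name.

```python
def strip_size_tag(basename, fix=None):
    """Hypothetical helper mirroring the last lines of parse_cover.

    fix=None     -> current behaviour (may leave "..jpg")
    fix='all'    -> blanket replace("..", ".")
    fix='suffix' -> narrower replace("..jpg", ".jpg")
    """
    sparts = basename.split('_')
    if len(sparts) > 2:
        basename = sparts[0] + sparts[-1]
        if fix == 'all':
            basename = basename.replace('..', '.')
        elif fix == 'suffix':
            basename = basename.replace('..jpg', '.jpg')
    return basename


names = [
    '21SnyDkzpWL._AA300_.jpg',                 # from the first post
    '1b46f96642a0b333ce906110.L._AA300_.jpg',  # from the post above
    'a..b._AA300_.jpg',                        # invented: ".." elsewhere
]
for name in names:
    print(name, '->',
          strip_size_tag(name),
          strip_size_tag(name, 'all'),
          strip_size_tag(name, 'suffix'))
```

For the two URLs quoted in the thread, both fixes give the same result. They differ only for the invented name, where the blanket replace also collapses the earlier "..". Note that the narrower variant handles only the .jpg extension.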
|
03-16-2013, 12:44 PM | #4 |
creator of calibre
Posts: 43,826
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
OK .