Old 03-15-2013, 04:34 PM   #1
tuxor
Addict
Posts: 320
Karma: 99999
Join Date: Oct 2011
Location: Germany
Device: Onyx Boox M92, Icarus Illumina E653
metadata/sources/amazon.py: parse_cover, suggestions

For me, Amazon is the main source for book covers. Conveniently, old books sometimes have user-contributed covers on Amazon, as with this book:

http://www.amazon.de/Hubschrauber-Da...3378637&sr=8-1

At the moment, the code for parsing the cover URL (in src/calibre/ebooks/metadata/sources/amazon.py) looks like this:
Code:
    def parse_cover(self, root):
        imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
        if imgs:
            src = imgs[0].get('src')
            if '/no-image-avail' not in src:
                parts = src.split('/')
                if len(parts) > 3:
                    bn = parts[-1]
                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        bn = sparts[0] + sparts[-1]
                        return ('/'.join(parts[:-1]))+'/'+bn
In some cases, however (as with the example above), this returns only Amazon's loading indicator: http://g-ecx.images-amazon.com/image...192546226_.gif

The correct cover image (in the example above) is inserted by JavaScript, so we also have to search through Amazon's JavaScript code. I suggest the following improved code:

Code:
    def parse_cover(self, root, raw=""):
        imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image" or @id="main-image") and @src]')
        if not imgs:
            imgs = root.xpath('//div[@class="main-image-inner-wrapper"]/img[@src]')
        if imgs:
            src = imgs[0].get('src')
            if 'loading-' in src:
                js_img = re.search(r'"largeImage":"(http://[^"]+)",',raw)
                if js_img:
                    src = js_img.group(1)
            if '/no-image-avail' not in src \
               and 'loading-' not in src:
                self.log('Found image: %s' % src)
                parts = src.split('/')
                if len(parts) > 3:
                    bn = parts[-1]
                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        bn = (sparts[0] + sparts[-1]).replace("..",".")
                        return ('/'.join(parts[:-1]))+'/'+bn
Please note that the raw HTML has to be passed to "self.parse_cover" (called in parse_details, line 306) so that the JavaScript code can be inspected.
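To show the JavaScript lookup in isolation, here is a minimal sketch that runs the same regular expression against a raw HTML string. The names find_js_cover and SAMPLE_RAW are hypothetical, and the sample markup is invented for illustration; the real Amazon page may be structured differently.

```python
import re

# Invented sample of a details page: the <img> only carries the loading
# indicator, while the real cover URL sits in inline JavaScript.
SAMPLE_RAW = '''
<img id="main-image" src="http://g-ecx.images-amazon.com/images/G/03/x/loading-large.gif">
<script>var data = {"largeImage":"http://ecx.images-amazon.com/images/I/21SnyDkzpWL.jpg","thumb":"x"};</script>
'''

# Same pattern as in the proposed parse_cover: it expects the value to be
# followed by a comma, i.e. "largeImage" must not be the last key.
LARGE_IMAGE_RE = re.compile(r'"largeImage":"(http://[^"]+)",')


def find_js_cover(raw):
    """Return the largeImage URL found in the raw HTML, or None."""
    match = LARGE_IMAGE_RE.search(raw)
    return match.group(1) if match else None


print(find_js_cover(SAMPLE_RAW))
```

Because the pattern only matches http:// URLs followed by a comma, it returns None rather than a wrong URL if the page layout changes.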

I added a fallback check for "main-image-inner-wrapper" because I recently ran into a book page where the "img" element didn't have an id (unfortunately, I can't find that page at the moment).

Besides that, I corrected a small issue in the URL conversion in the last lines. At the moment, the conversion works like this (note the "..jpg"):
Code:
http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL..jpg
With my code it will look like this:
Code:
http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg
becomes
http://ecx.images-amazon.com/images/I/21SnyDkzpWL.jpg
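For anyone who wants to reproduce the two conversions outside calibre, the following is a small standalone sketch of the last lines of parse_cover. The function name strip_size_modifier and the fix_double_dot flag are made up for this example; they are not part of amazon.py.

```python
def strip_size_modifier(src, fix_double_dot=False):
    """Drop the '_AA300_'-style size modifier from the last path segment.

    Returns None when the URL does not have the expected shape, which
    mirrors the implicit None returned by parse_cover in that case.
    """
    parts = src.split('/')
    if len(parts) > 3:
        bn = parts[-1]
        sparts = bn.split('_')
        if len(sparts) > 2:
            # Joining the first and last piece leaves the dots on both
            # sides of the modifier next to each other ("..jpg").
            bn = sparts[0] + sparts[-1]
            if fix_double_dot:
                bn = bn.replace('..', '.')
            return '/'.join(parts[:-1]) + '/' + bn
    return None


url = 'http://ecx.images-amazon.com/images/I/21SnyDkzpWL._AA300_.jpg'
print(strip_size_modifier(url))                       # current behaviour
print(strip_size_modifier(url, fix_double_dot=True))  # suggested behaviour
```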
Would somebody from the dev team please have a look at this and commit it if the changes look fine? Thanks.
Old 03-16-2013, 01:33 AM   #2
kovidgoyal
creator of calibre
Posts: 43,826
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Merged.

I'm not comfortable with the replace('..', '.'): there may be other places in the URL where .. is needed, and since the .. does not cause any actual problems, it's safer to leave it in.
Old 03-16-2013, 04:19 AM   #3
tuxor
Quote:
Originally Posted by kovidgoyal
Merged.
Thanks.

Quote:
Originally Posted by kovidgoyal
I'm not comfortable with the replace('..', '.'): there may be other places in the URL where .. is needed, and since the .. does not cause any actual problems, it's safer to leave it in.
I understand your concerns, but it does cause actual problems. Look at this:

Code:
http://g-ecx.images-amazon.com/images/G/03/ciu/80/50/1b46f96642a0b333ce906110.L._AA300_.jpg
becomes
http://g-ecx.images-amazon.com/images/G/03/ciu/80/50/1b46f96642a0b333ce906110.L..jpg
But the link with the double dot doesn't work!

Please at least do something like this:

Code:
                    sparts = bn.split('_')
                    if len(sparts) > 2:
                        bn = (sparts[0] + sparts[-1]).replace("..jpg",".jpg")
                        return ('/'.join(parts[:-1]))+'/'+bn
EDIT: By the way, this is that book's Amazon details page: http://www.amazon.de/Bertelsmann-Jug...3422572&sr=8-1
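The narrower replace above only covers ".jpg". As a possible alternative, here is a sketch that removes the duplicate dot only at the point where the two pieces are joined, so the rest of the file name is never touched and the extension does not matter. The helper name clean_basename is hypothetical, and whether Amazon serves other extensions with this pattern is an assumption on my part.

```python
def clean_basename(bn):
    """Remove the size modifier between underscores from a file name.

    Only the dot directly in front of the modifier is dropped, and only
    when the remaining extension brings its own leading dot.
    """
    sparts = bn.split('_')
    if len(sparts) > 2:
        head, tail = sparts[0], sparts[-1]
        if head.endswith('.') and tail.startswith('.'):
            head = head[:-1]  # avoid "L..jpg" without a global replace
        return head + tail
    return bn  # no modifier found, leave the name as it is


print(clean_basename('1b46f96642a0b333ce906110.L._AA300_.jpg'))
```

Since nothing outside the join point is rewritten, a ".." elsewhere in the name would survive, which should address the concern about the global replace.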

Last edited by tuxor; 03-16-2013 at 04:29 AM.
Old 03-16-2013, 12:44 PM   #4
kovidgoyal
OK.