Old 02-27-2012, 09:34 PM   #1
adoucette
Member
Posts: 24
Karma: 140
Join Date: Sep 2011
Device: Nook Color (rooted?)
Script to scrape page for a cover image for recipe?

I'm making a recipe to capture an online periodical. The cover image is always on the front page with a file name like "cover0212_227033.jpg", where "0212" is the month and year and "_227033" appears to be random. If this needs a regular expression, I think the script would need to look for
Code:
cover\d\d\d\d_.*\.jpg
(if that helps).
Is there a way to put this into the Calibre recipe to grab the cover?
Thanks,
Ari
Old 02-27-2012, 10:54 PM   #2
kovidgoyal
creator of calibre
Posts: 43,852
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Code:
def get_cover_url(self):
    soup = self.index_to_soup(url_of_cover_page)
    cov = soup.find('img', src=re.compile(r'.*cover\d{4}_\d{6}.jpg$'))
    if cov is not None:
         self.cover_url = cov['src']
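
A pattern like the one above can be sanity-checked outside calibre against the sample file name from the first post. This is only a minimal sketch using Python's standard re module; the function name, the host and path, and the second URL are made up for illustration. It also escapes the dot before "jpg", which the snippet above does not do.

```python
import re

# Same idea as the pattern above, but with the dot before "jpg" escaped
# so that it only matches a literal "." character.
COVER_RE = re.compile(r'.*cover\d{4}_\d{6}\.jpg$')


def looks_like_cover(src):
    """Return True if src ends in a file name like cover0212_227033.jpg."""
    return COVER_RE.match(src) is not None


# File name from the first post, on a hypothetical host and path.
print(looks_like_cover('http://www.abc.com/images/cover0212_227033.jpg'))
# Hypothetical non-cover image, for contrast.
print(looks_like_cover('http://www.abc.com/images/logo.jpg'))
```

If the random part is not always six digits, the `\d{6}` would need loosening.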

Last edited by kovidgoyal; 02-28-2012 at 09:37 AM.
Old 02-28-2012, 08:27 AM   #3
adoucette
Thanks for the reply Kovid - you have a great product and I've certainly donated.
The regexp works properly, but I get a syntax error when I try to add/update the recipe in Calibre. Here's the syntax I'm using:
Code:
    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg')
        if is not None:
            self.cover_url = cov['src']
and the error I get is
Code:
calibre, version 0.8.41
ERROR: Invalid input: <p>Could not create recipe. Error:<br>invalid syntax (<string>, line 13)

Last edited by adoucette; 02-29-2012 at 07:17 AM.
Old 02-28-2012, 08:56 AM   #4
kovidgoyal
if cov is not None
Old 02-28-2012, 09:35 AM   #5
adoucette
No, same error - I neglected 'cov' just here on the forum - my bad.
Here's the text of the recipe with the syntax error:
Code:
class AdvancedUserRecipe1330393641(BasicNewsRecipe):
    title          = u'abc'
    oldest_article = 30
    max_articles_per_feed = 100
    auto_cleanup = True
    feeds          = [I took these out for the forum post here to save space]
    def print_version(self, url):
       return url.replace('/article/', '/printarticle/')
    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg')
        if cov is not None:
            self.cover_url = cov['src']
and the error I get is
Code:
calibre, version 0.8.41
ERROR: Invalid input: <p>Could not create recipe. Error:<br>invalid syntax (<string>, line 13)

Last edited by adoucette; 02-29-2012 at 07:17 AM.
Old 02-28-2012, 09:37 AM   #6
kovidgoyal
Missing closing parenthesis at the end of the soup.find call:

cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
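
An error like "invalid syntax (<string>, line 13)" can be reproduced outside calibre by parsing the source with Python's standard ast module. The sketch below is an illustration only; the function name and the two snippets are assumptions made for this example. The reported line may be the line after the real mistake, because an unclosed parenthesis is often only noticed when the following line fails to parse, and the exact line and message vary between Python versions.

```python
import ast


def first_syntax_error(source):
    """Return (line, message) for the first syntax error, or None if it parses."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return err.lineno, err.msg
    return None


# The soup.find line from the recipe, missing its final closing parenthesis.
broken = r"""
cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg')
if cov is not None:
    url = cov['src']
"""

# The same lines with the parenthesis added.
fixed = r"""
cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
if cov is not None:
    url = cov['src']
"""

print(first_syntax_error(broken))
print(first_syntax_error(fixed))
```

Parsing only checks syntax, so names such as soup and re do not need to exist for this check.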
Old 02-28-2012, 10:11 PM   #7
adoucette
That fixes the syntax error, thanks. But it does not download a cover... I still get the default Calibre cover.
Code:
    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
        if cov is not None:
            self.cover_url = cov['src']
Any thoughts on that? I've tried it with the full url also, as
Code:
cov = soup.find('img', src=re.compile(r'http\S+?cover\w{1,22}\.jpg'))
and still get the same effect

Last edited by adoucette; 02-29-2012 at 07:17 AM.
Old 02-28-2012, 10:15 PM   #8
kovidgoyal
Print out the value of self.cover_url and check that it is correct. For example, it might be a relative URL, or the regex might need to be adjusted.
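
If the src does turn out to be relative, one way to handle it is to resolve it against the address of the page it came from. The following is a minimal sketch in Python 3 using the standard urllib.parse module; the function name and the relative path are hypothetical, and whether calibre needs this step for a given site is an assumption to verify by printing the value first.

```python
from urllib.parse import urljoin


def absolute_cover_url(page_url, src):
    """Resolve an img src against the URL of the page it was found on."""
    return urljoin(page_url, src)


# A relative src (hypothetical path) becomes a full URL.
print(absolute_cover_url('http://www.abc.com', '/images/cover0212_227033.jpg'))
# An src that is already absolute is returned unchanged.
print(absolute_cover_url('http://www.abc.com', 'http://media.abc.com/cover0212_227033.jpg'))
```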
Old 02-28-2012, 10:42 PM   #9
adoucette
I get this:
Code:
1% Trying to download cover...
At this point in the code the value of variable cov is:  <img alt="" src="http://media.abc.com/images/2012/01/30/cover0212_227033.jpg" />
At this point in the code the value of variable self.cover_url is:  http://media.abc.com/images/2012/01/30/cover0212_227033.jpg
which is a good image.
So is there something I'm missing as to how to actually assign the image as the cover? Is there some obvious thing I've left out of the end of the recipe?
Again, the whole recipe I have so far is: (with feeds list truncated to save space)
Code:
class AdvancedUserRecipe1330393641(BasicNewsRecipe):
    title          = u'abc'
    oldest_article = 30
    max_articles_per_feed = 100
    auto_cleanup = True
    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
        if cov is not None:
            self.cover_url = cov['src']
        print 'At this point in the code the value of variable cov is: ', cov
        print 'At this point in the code the value of variable self.cover_url is: ', self.cover_url
    feeds          = [(...)]
    def print_version(self, url):
       return url.replace('/article/', '/printarticle/')

Last edited by adoucette; 02-29-2012 at 07:16 AM.
Old 02-29-2012, 12:01 AM   #10
kovidgoyal
Sorry, rather than assigning self.cover_url = ..., you need to return the value from get_cover_url.
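
The difference between assigning and returning can be seen in a toy model. The classes below are hypothetical stand-ins written for this illustration, not calibre's actual code; they only assume, as the reply above implies, that the caller uses whatever get_cover_url() returns. The example.com address is a placeholder.

```python
class ToyRecipeBase:
    """Hypothetical stand-in for the framework; not calibre's real code."""

    def get_cover_url(self):
        return None

    def download_cover(self):
        # The caller only sees what get_cover_url() returns.
        return self.get_cover_url()


class AssignsOnly(ToyRecipeBase):
    def get_cover_url(self):
        # Stored on the instance, but the method implicitly returns None.
        self.cover_url = 'http://example.com/cover.jpg'


class Returns(ToyRecipeBase):
    def get_cover_url(self):
        return 'http://example.com/cover.jpg'


print(AssignsOnly().download_cover())  # the caller receives None
print(Returns().download_cover())      # the caller receives the URL
```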
Old 02-29-2012, 07:19 AM   #11
adoucette
Thanks Kovid, that did solve the issue. The final code for getting the cover is:
Code:
    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cover_item = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
        if cover_item:
            cover_url = cover_item['src']
        return cover_url
I'll PM you the final recipe if you want it for Calibre.
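
For anyone who wants to test the matching logic offline, here is a rough standalone sketch. It swaps in Python 3's standard html.parser in place of calibre's index_to_soup, so it is not the recipe code itself; the class name, function name, and sample HTML are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the final recipe code.
COVER_RE = re.compile(r'\w*?cover\w{1,22}\.jpg')


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def find_cover_src(html):
    """Return the first img src matching the cover pattern, or None."""
    parser = ImgSrcCollector()
    parser.feed(html)
    for src in parser.sources:
        if COVER_RE.search(src):
            return src
    return None


# Made-up page fragment using the img tag printed earlier in the thread.
sample = ('<p><img src="/logo.png">'
          '<img alt="" src="http://media.abc.com/images/2012/01/30/cover0212_227033.jpg" /></p>')
print(find_cover_src(sample))
```

Returning None when nothing matches mirrors the final recipe code, which leaves cover_url as None in that case.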
Old 02-29-2012, 08:15 AM   #12
kovidgoyal
If you think the recipe is of value to others, post it here, I will add it to calibre automatically.
Old 02-29-2012, 06:24 PM   #13
adoucette
Quote:
Originally Posted by kovidgoyal View Post
If you think the recipe is of value to others, post it here, I will add it to calibre automatically.
OK, thanks, I posted it here https://www.mobileread.com/forums/sho...d.php?t=170512 and have edited it with the recent update.