Script to scrape page for a cover image for recipe?

adoucette · 02-27-2012, 09:34 PM

I'm making a recipe to capture an online periodical. The cover image is always on the front page, with a file name like "cover0212_227033.jpg", where "0212" is the month and year and "_227033" is random. If it has to use a regular expression, I think the script would need to look for

Code:

cover\d\d\d\d_.*\.jpg

(if that helps).
Is there a way to put this into the Calibre recipe to grab the cover?
Thanks,
Ari
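
A pattern like this can be tried against a few file names outside calibre first. The sketch below assumes Python 3 and only the standard re module. Apart from cover0212_227033.jpg, the sample names are made up for illustration, and the tighter variant assumes the random part is always six digits, which the single example above suggests but does not prove.

```python
import re

# Pattern from the post: "cover", four digits, underscore, anything, ".jpg".
LOOSE = re.compile(r'cover\d\d\d\d_.*\.jpg')
# Tighter variant (assumption: the random part is always exactly six digits).
STRICT = re.compile(r'cover\d{4}_\d{6}\.jpg$')

samples = [
    'cover0212_227033.jpg',     # file name quoted in the post
    'cover0212_thumb.jpg.bak',  # hypothetical near miss
    'logo.jpg',                 # hypothetical unrelated image
]

def matches(pattern, name):
    """True if the pattern is found anywhere in the name."""
    return pattern.search(name) is not None

for name in samples:
    print(name, matches(LOOSE, name), matches(STRICT, name))
```

The loose pattern also accepts the near miss, because ".*" matches anything between the underscore and ".jpg" and nothing anchors the end of the name.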

kovidgoyal · 02-27-2012, 10:54 PM

Code:

def get_cover_url(self):
    soup = self.index_to_soup(url_of_cover_page)
    cov = soup.find('img', src=re.compile(r'.*cover\d{4}_\d{6}.jpg$'))
    if cov is not None:
        self.cover_url = cov['src']

adoucette · 02-28-2012, 08:27 AM

Thanks for the reply Kovid - you have a great product and I've certainly donated.
The regexp works properly, but I get a syntax error when I try to add/update the recipe in Calibre. Here's the syntax I'm using:

Code:

    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg')
        if is not None:
            self.cover_url = cov['src']

and the error I get is

Code:

calibre, version 0.8.41
ERROR: Invalid input: <p>Could not create recipe. Error:<br>invalid syntax (<string>, line 13)

kovidgoyal · 02-28-2012, 08:56 AM

The condition should read: if cov is not None

adoucette · 02-28-2012, 09:35 AM

No, same error. I left out 'cov' only when typing it here on the forum - my bad.
Here's the text of the recipe with the syntax error:

Code:

class AdvancedUserRecipe1330393641(BasicNewsRecipe):
    title          = u'abc'
    oldest_article = 30
    max_articles_per_feed = 100
    auto_cleanup = True
    feeds          = [I took these out for the forum post here to save space]
    def print_version(self, url):
        return url.replace('/article/', '/printarticle/')
    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg')
        if cov is not None:
            self.cover_url = cov['src']

and the error I get is

Code:

calibre, version 0.8.41
ERROR: Invalid input: <p>Could not create recipe. Error:<br>invalid syntax (<string>, line 13)

kovidgoyal · 02-28-2012, 09:37 AM

You are missing a closing bracket at the end of the soup.find line. It should be:

cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
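
Errors like this can be caught before pasting a recipe into calibre by compiling the source with Python's built-in compile(). The sketch below is only an illustration: check_syntax is a hypothetical helper, not part of calibre, and the shortened recipe text is made up. The line number Python reports for an unclosed bracket differs between Python versions, so treat it as a hint rather than the exact location.

```python
def check_syntax(source, filename='<recipe>'):
    """Return None if the source compiles, else a short error description."""
    try:
        # compile() only parses the code; nothing in it is executed.
        compile(source, filename, 'exec')
    except SyntaxError as err:
        return 'line %s: %s' % (err.lineno, err.msg)
    return None

# Shortened, hypothetical recipe method with one closing bracket missing.
broken = (
    "def get_cover_url(self):\n"
    "    cov = soup.find('img', src=re.compile(r'cover\\.jpg')\n"
    "    if cov is not None:\n"
    "        return cov['src']\n"
)
# Same text with the second closing bracket added.
fixed = broken.replace("\\.jpg')\n", "\\.jpg'))\n")

print(check_syntax(broken))
print(check_syntax(fixed))
```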

adoucette · 02-28-2012, 10:11 PM

That fixes the syntax error, thanks. But it does not download a cover... I still get the default Calibre cover.

Code:

    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
        if cov is not None:
            self.cover_url = cov['src']

Any thoughts on that? I've also tried it with a pattern for the full URL, as

Code:

cov = soup.find('img', src=re.compile(r'http\S+?cover\w{1,22}\.jpg'))

and still get the same result.

kovidgoyal · 02-28-2012, 10:15 PM

Print out the value of self.cover_url and check that it is correct. For example, it might be a relative URL, or the regex might need to be adjusted.
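
If the printed value turns out to be relative, it can be resolved against the page address with the standard library's urljoin. This is a minimal sketch, assuming Python 3 (under Python 2, urljoin lives in the urlparse module instead). The helper name absolute_cover_url and the image paths are invented for illustration.

```python
from urllib.parse import urljoin

def absolute_cover_url(page_url, src):
    """Resolve an <img> src against the URL of the page it was found on."""
    return urljoin(page_url, src)

page = 'http://www.abc.com/'  # placeholder site used in this thread

# A relative src is joined onto the page URL.
print(absolute_cover_url(page, '/images/cover0212_227033.jpg'))
# An already absolute src is returned unchanged.
print(absolute_cover_url(page, 'http://media.abc.com/cover0212_227033.jpg'))
```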

adoucette · 02-28-2012, 10:42 PM

I get this:

Code:

1% Trying to download cover...
At this point in the code the value of variable cov is:  <img alt="" src="http://media.abc.com/images/2012/01/30/cover0212_227033.jpg" />
At this point in the code the value of variable self.cover_url is:  http://media.abc.com/images/2012/01/30/cover0212_227033.jpg

which is a good image.
So is there something I'm missing about how to actually assign the image as the cover? Is there some obvious thing I've left out of the end of the recipe?
Again, the whole recipe I have so far (with the feeds list truncated to save space) is:

Code:

class AdvancedUserRecipe1330393641(BasicNewsRecipe):
    title          = u'abc'
    oldest_article = 30
    max_articles_per_feed = 100
    auto_cleanup = True
    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
        if cov is not None:
            self.cover_url = cov['src']
        print 'At this point in the code the value of variable cov is: ', cov
        print 'At this point in the code the value of variable self.cover_url is: ', self.cover_url
    feeds          = [(...)]
    def print_version(self, url):
        return url.replace('/article/', '/printarticle/')

kovidgoyal · 02-29-2012, 12:01 AM

Sorry, rather than assigning self.cover_url = .. in get_cover_url, you need to return the URL from the method.

adoucette · 02-29-2012, 07:19 AM

Thanks Kovid, that did solve the issue. The final code for getting the cover is:

Code:

    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cover_item = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
        if cover_item:
            cover_url = cover_item['src']
        return cover_url

I'll PM you the final recipe if you want it for Calibre.
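
The difference between assigning and returning can be reproduced without calibre. In the sketch below, FakeRecipe, ImgSrcFinder and find_cover_src are hypothetical stand-ins written for Python 3 (the real recipe uses self.index_to_soup and soup.find), and the HTML is the img tag from the debug output earlier in the thread. It only shows that a caller sees whatever get_cover_url returns, so a method that merely sets an attribute hands back None.

```python
import re
from html.parser import HTMLParser

COVER_RE = re.compile(r'\w*?cover\w{1,22}\.jpg')  # regex from the thread

class ImgSrcFinder(HTMLParser):
    """Remember the src of the first <img> whose src matches COVER_RE."""
    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get('src')
        if tag == 'img' and self.src is None and src and COVER_RE.search(src):
            self.src = src

def find_cover_src(html):
    """Stand-in for soup.find('img', src=...): return the src or None."""
    finder = ImgSrcFinder()
    finder.feed(html)
    return finder.src

class FakeRecipe:
    # The img tag printed in the debug output above.
    html = ('<img alt="" src="http://media.abc.com/images/2012/01/30/'
            'cover0212_227033.jpg" />')

    def get_cover_url_assign(self):
        # Mirrors the earlier attempt: sets an attribute, returns nothing.
        self.cover_url = find_cover_src(self.html)

    def get_cover_url_return(self):
        # Mirrors the final version: hands the URL back to the caller.
        return find_cover_src(self.html)

recipe = FakeRecipe()
print(recipe.get_cover_url_assign())  # the caller receives None
print(recipe.get_cover_url_return())  # the caller receives the image URL
```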

kovidgoyal · 02-29-2012, 08:15 AM

If you think the recipe is of value to others, post it here, I will add it to calibre automatically.

adoucette · 02-29-2012, 06:24 PM

Quote:

Originally Posted by kovidgoyal

If you think the recipe is of value to others, post it here, I will add it to calibre automatically.

OK, thanks, I posted it here https://www.mobileread.com/forums/sho...d.php?t=170512 and have edited it with the recent update.

