02-27-2012, 09:34 PM | #1 |
Member
Posts: 24
Karma: 140
Join Date: Sep 2011
Device: Nook Color (rooted?)
|
Script to scrape page for a cover image for recipe?
I'm making a recipe to capture an online periodical. The cover image is always on the front page with a file name like "cover0212_227033.jpg" where the "0212" portion is the month and year and the "_227033" is random. So if it had to use a regular expression, I think the script would need to look for
Code:
cover\d\d\d\d_.*\.jpg
Is there a way to put this into the Calibre recipe to grab the cover?
Thanks, Ari |
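Editor's sketch: the pattern can be tried against the sample file name with Python's standard `re` module before it goes into a recipe. The helper name `looks_like_cover` and the test paths are made up for illustration. Narrowing the random suffix to digits is an assumption based on the single sample name in the post.

```python
import re

# Pattern from the post, with the random suffix narrowed to digits.
# Assumption: the suffix is always numeric, as in "cover0212_227033.jpg".
COVER_RE = re.compile(r'cover\d{4}_\d+\.jpg')


def looks_like_cover(src):
    """Return True if the image path contains a cover-style file name."""
    return COVER_RE.search(src) is not None


# Hypothetical paths, used only to exercise the pattern.
print(looks_like_cover('/images/cover0212_227033.jpg'))
print(looks_like_cover('/images/logo.jpg'))
```

`search` is used rather than `match` so that the file name can sit anywhere inside a longer path or URL.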
02-27-2012, 10:54 PM | #2 |
creator of calibre
Posts: 44,351
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
def get_cover_url(self):
    soup = self.index_to_soup(url_of_cover_page)
    cov = soup.find('img', src=re.compile(r'.*cover\d{4}_\d{6}.jpg$'))
    if cov is not None:
        self.cover_url = cov['src']
Last edited by kovidgoyal; 02-28-2012 at 09:37 AM. |
02-28-2012, 08:27 AM | #3 |
Member
Posts: 24
Karma: 140
Join Date: Sep 2011
Device: Nook Color (rooted?)
|
Thanks for the reply Kovid - you have a great product and I've certainly donated.
The regexp works properly, but I get a syntax error when I try to add/update the recipe in Calibre. Here's the syntax I'm using:
Code:
def get_cover_url(self):
    cover_url = None
    soup = self.index_to_soup('http://www.abc.com')
    cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg')
    if is not None:
        self.cover_url = cov['src']
Code:
calibre, version 0.8.41
ERROR: Invalid input: <p>Could not create recipe. Error:<br>invalid syntax (<string>, line 13)
Last edited by adoucette; 02-29-2012 at 07:17 AM. |
02-28-2012, 08:56 AM | #4 |
creator of calibre
Posts: 44,351
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
if cov is not None
|
02-28-2012, 09:35 AM | #5 |
Member
Posts: 24
Karma: 140
Join Date: Sep 2011
Device: Nook Color (rooted?)
|
No, same error - I neglected 'cov' just here on the forum - my bad.
Here's the text of the recipe with the syntax error:
Code:
class AdvancedUserRecipe1330393641(BasicNewsRecipe):
    title = u'abc'
    oldest_article = 30
    max_articles_per_feed = 100
    auto_cleanup = True
    feeds = [I took these out for the forum post here to save space]
    def print_version(self, url):
        return url.replace('/article/', '/printarticle/')
    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg')
        if cov is not None:
            self.cover_url = cov['src']
Code:
calibre, version 0.8.41
ERROR: Invalid input: <p>Could not create recipe. Error:<br>invalid syntax (<string>, line 13)
Last edited by adoucette; 02-29-2012 at 07:17 AM. |
02-28-2012, 09:37 AM | #6 |
creator of calibre
Posts: 44,351
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Missing closing bracket after the soup.find
cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg')) |
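Editor's sketch: an unbalanced bracket like this can be caught outside calibre by parsing the source with the standard `ast` module. The helper `first_syntax_error` and the two snippets are illustrative only. The snippets are parsed, never executed, so `soup` and `re` do not need to exist.

```python
import ast


def first_syntax_error(source):
    """Return a short description of the first syntax error, or None."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return 'line %s: %s' % (err.lineno, err.msg)
    return None


# The soup.find line as posted, with one closing bracket missing.
broken = r"""cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg')
if cov is not None:
    pass
"""

# The same line with the bracket restored.
fixed = r"""cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
if cov is not None:
    pass
"""

print(first_syntax_error(broken))
print(first_syntax_error(fixed))
```

Depending on the Python version, the reported line may be the one after the unclosed bracket rather than the line that contains it, so it is worth checking the preceding line as well.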
02-28-2012, 10:11 PM | #7 |
Member
Posts: 24
Karma: 140
Join Date: Sep 2011
Device: Nook Color (rooted?)
|
That fixes the syntax error, thanks. But it does not download a cover... I still get the default Calibre cover.
Code:
def get_cover_url(self):
    cover_url = None
    soup = self.index_to_soup('http://www.abc.com')
    cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
    if cov is not None:
        self.cover_url = cov['src']
Code:
cov = soup.find('img', src=re.compile(r'http\S+?cover\w{1,22}\.jpg'))
Last edited by adoucette; 02-29-2012 at 07:17 AM. |
02-28-2012, 10:15 PM | #8 |
creator of calibre
Posts: 44,351
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Print out the value of self.cover_url and check that it is correct. For example, it might be a relative URL, or the regex might need to be adjusted.
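Editor's sketch: if the `src` attribute does turn out to be relative, it can be resolved against the page address with the standard library. The helper name `absolute_cover_url` and the relative path below are hypothetical. The import shown is the Python 3 location; on the Python 2 interpreter that calibre 0.8.x shipped with, `urljoin` lives in the `urlparse` module instead.

```python
from urllib.parse import urljoin


def absolute_cover_url(page_url, src):
    """Resolve an img src against the page it was found on.

    An src that is already absolute is returned unchanged.
    """
    return urljoin(page_url, src)


# Hypothetical relative src, for illustration only.
print(absolute_cover_url('http://www.abc.com', '/images/cover0212_227033.jpg'))

# An absolute src passes through untouched.
print(absolute_cover_url(
    'http://www.abc.com',
    'http://media.abc.com/images/2012/01/30/cover0212_227033.jpg'))
```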
|
02-28-2012, 10:42 PM | #9 |
Member
Posts: 24
Karma: 140
Join Date: Sep 2011
Device: Nook Color (rooted?)
|
I get this:
Code:
1% Trying to download cover...
At this point in the code the value of variable cov is: <img alt="" src="http://media.abc.com/images/2012/01/30/cover0212_227033.jpg" />
At this point in the code the value of variable self.cover_url is: http://media.abc.com/images/2012/01/30/cover0212_227033.jpg
So is there something I'm missing as to how to actually assign the image as the cover? Is there some obvious thing I've left out of the end of the recipe?
Again, the whole recipe I have so far is: (with feeds list truncated to save space)
Code:
class AdvancedUserRecipe1330393641(BasicNewsRecipe):
    title = u'abc'
    oldest_article = 30
    max_articles_per_feed = 100
    auto_cleanup = True
    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup('http://www.abc.com')
        cov = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
        if cov is not None:
            self.cover_url = cov['src']
        print 'At this point in the code the value of variable cov is: ', cov
        print 'At this point in the code the value of variable self.cover_url is: ', self.cover_url
    feeds = [(...)]
    def print_version(self, url):
        return url.replace('/article/', '/printarticle/')
Last edited by adoucette; 02-29-2012 at 07:16 AM. |
02-29-2012, 12:01 AM | #10 |
creator of calibre
Posts: 44,351
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Sorry, rather than assigning self.cover_url = ..., you need to return the URL: return ...
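Editor's sketch of why the assignment alone had no effect: a Python method with no `return` statement hands `None` back to its caller, whatever attributes it sets along the way. `ToyRecipe` and its two methods are a hypothetical stand-in and not calibre's `BasicNewsRecipe`.

```python
class ToyRecipe(object):
    """Hypothetical stand-in for a recipe class; not calibre code."""

    found = 'http://media.abc.com/images/2012/01/30/cover0212_227033.jpg'

    def get_cover_url_assign_only(self):
        # Sets an attribute but returns nothing, so the caller gets None.
        self.cover_url = self.found

    def get_cover_url_returning(self):
        # Hands the URL back to whoever called the method.
        return self.found


recipe = ToyRecipe()
print(recipe.get_cover_url_assign_only())
print(recipe.get_cover_url_returning())
```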
|
02-29-2012, 07:19 AM | #11 |
Member
Posts: 24
Karma: 140
Join Date: Sep 2011
Device: Nook Color (rooted?)
|
Thanks Kovid, that did solve the issue. The final code for getting the cover is:
Code:
def get_cover_url(self):
    cover_url = None
    soup = self.index_to_soup('http://www.abc.com')
    cover_item = soup.find('img', src=re.compile(r'\w*?cover\w{1,22}\.jpg'))
    if cover_item:
        cover_url = cover_item['src']
    return cover_url |
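Editor's sketch: the same lookup can be exercised outside calibre with only the standard library, which is handy for testing the regex against saved HTML. This swaps calibre's `index_to_soup` and BeautifulSoup for `html.parser`, so it is an approximation of the recipe's behaviour, not the recipe itself. `ImgSrcCollector`, `find_cover_url` and the sample markup are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the final recipe code.
COVER_RE = re.compile(r'\w*?cover\w{1,22}\.jpg')


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def find_cover_url(html):
    """Return the first img src matching COVER_RE, or None if none match."""
    collector = ImgSrcCollector()
    collector.feed(html)
    for src in collector.sources:
        if COVER_RE.search(src):
            return src
    return None


# Hypothetical page fragment built around the img tag quoted in the thread.
sample = ('<p><img src="http://media.abc.com/logo.jpg" />'
          '<img alt="" src="http://media.abc.com/images/2012/01/30/'
          'cover0212_227033.jpg" /></p>')
print(find_cover_url(sample))
```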
02-29-2012, 08:15 AM | #12 |
creator of calibre
Posts: 44,351
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If you think the recipe is of value to others, post it here and I will add it to calibre.
|
02-29-2012, 06:24 PM | #13 | |
Member
Posts: 24
Karma: 140
Join Date: Sep 2011
Device: Nook Color (rooted?)
|