07-07-2011, 09:12 PM | #1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2011
Device: Kindle3
|
Grabbing and including image from another url
Hi,
I am trying to improve the Spiegel Int'l recipe. So far I have got sections working and removed a lot of noise by using the print version of articles. However, the print versions do not include images. I like these and would prefer to include them. I have written the following simple function, which grabs the image URL from the non-print version of an article: Code:
def get_img(self, url):
    soup = BeautifulSoup(urllib2.urlopen(url))
    img = soup.find('div', {'class' : 'spGalleryBigPic'}).find('img')['src']
My questions are
These questions are probably obvious, but I could not find all of the documentation I wanted on recipes...
Cheers, Rasmus
PS: I wrote a recipe for the Economist's Daily Chart. Should I share it, or what is the custom for these things? Should I share my improved Spiegel recipe? |
07-07-2011, 10:49 PM | #2 |
creator of calibre
Posts: 43,857
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Implement preprocess_html in your recipe and insert the img elements there; calibre will do the downloading for you. Sharing your recipes is always a good thing.
Last edited by kovidgoyal; 07-07-2011 at 10:58 PM. |
07-08-2011, 05:54 AM | #3 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2011
Device: Kindle3
|
Thanks for the quick reply. I wasn't specific enough.
I do Code:
import urllib2
I'll get the new Spiegel Int'l recipe to you once I have integrated the images. Thanks for the tip on preprocess_html. And thanks for Calibre! I run the CLI tools from a server every morning and enjoy all of my favorite news when I get up. It is great!
--Rasmus |
07-08-2011, 09:19 AM | #4 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2011
Device: Kindle3
|
I will need a bit more help, it seems.
This is a clean Python example: Code:
>>> import urllib2
>>> from BeautifulSoup import BeautifulSoup
>>> u = 'http://www.spiegel.de/international/europe/0,1518,773071,00.html'
>>> v = urllib2.urlopen(u).read()
>>> soup = BeautifulSoup(v) # this should be identical to Calibre's Soup
>>> img = soup.find('div', {'class' : 'spGalleryBigPic'}).find('img')
>>> type(img)
<class 'BeautifulSoup.Tag'>
So I wrote the following preprocess function: Code:
def preprocess_html(self, soup):
    soup = soup.find('div', {'class' : 'spGalleryBigPic'}).find('img')
    return soup
I get the article using: Code:
def print_version(self, url):
    'from Spiegelde.recipe'
    # drop the '#...' fragment that the feed URLs carry
    rmt = url.rpartition('#')[0]
    # split off the last two comma-separated fields of the URL
    main, sep, rest = rmt.rpartition(',')
    rmain, rsep, rrest = main.rpartition(',')
    # prefix the second-to-last field with 'druck-' to get the print page
    purl = rmain + ',druck-' + rrest + ',' + rest
    return purl
|
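The URL rewriting done by print_version can be checked on its own, outside calibre. The following is only a sketch: to_print_url is a hypothetical standalone name, and the '#ref=rss' fragment on the sample URL is an assumption about what the feed links look like. It also uses partition instead of rpartition for the fragment, so a URL without any '#' is not reduced to an empty string.

```python
def to_print_url(url):
    """Rewrite a Spiegel article URL into its print ('druck') form.

    Hypothetical standalone variant of the print_version method above.
    """
    # partition() keeps the whole URL when there is no '#' fragment,
    # whereas rpartition('#')[0] would return an empty string.
    base = url.partition('#')[0]
    # Split off the last two comma-separated fields.
    main, _, rest = base.rpartition(',')
    rmain, _, rrest = main.rpartition(',')
    # Prefix the second-to-last field with 'druck-'.
    return rmain + ',druck-' + rrest + ',' + rest


# Sample URL from the thread; the '#ref=rss' fragment is an assumption.
sample = ('http://www.spiegel.de/international/europe/'
          '0,1518,773071,00.html#ref=rss')
print(to_print_url(sample))
```

With the sample URL this yields the same path with 'druck-773071' in place of '773071', which is the page the recipe then downloads.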
07-08-2011, 12:06 PM | #5 |
creator of calibre
Posts: 43,857
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use the index_to_soup() method instead of urllib2. What you want to do in preprocess_html is run index_to_soup, find the img element you want in the soup it returns, and insert it into the soup passed into preprocess_html. See the BeautifulSoup docs for how to do that. |
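The steps described above could look roughly like the sketch below. It is not a tested recipe: add_gallery_image is a hypothetical helper name, article_url (the non-print URL) is an assumed parameter, and how the recipe learns that URL inside preprocess_html is not covered in this thread. Placing the image at the top of the body is also just one possible choice. The only calibre call used is index_to_soup; find and insert are the usual BeautifulSoup tag methods.

```python
def add_gallery_image(recipe, soup, article_url):
    """Copy the gallery image of the full article into the print-page soup.

    recipe      -- the recipe object (anything with an index_to_soup method)
    soup        -- the soup passed into preprocess_html (the print page)
    article_url -- assumed: URL of the non-print version of the article
    """
    # Let calibre fetch and parse the full article instead of using urllib2.
    full_page = recipe.index_to_soup(article_url)

    # Same lookup as in the interactive session earlier in the thread.
    box = full_page.find('div', {'class': 'spGalleryBigPic'})
    img = box.find('img') if box is not None else None

    # Compare against None explicitly: a tag without children can be falsy.
    if img is not None and soup.body is not None:
        # Put the img element at the top of the article body.
        soup.body.insert(0, img)

    # preprocess_html must return the whole soup, not just the img tag.
    return soup
```

Inside the recipe, preprocess_html would then end with something like `return add_gallery_image(self, soup, url)`, where `url` stands for the non-print address of the article being processed.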
Tags |
calibre, images, receipt |