Grabbing and including image from another url

Rasmus · 07-07-2011, 09:12 PM

Hi,
I am trying to improve the Spiegel Int'l recipe.

So far I got sections going and removed a lot of noise by using the print version of articles.

However, the print versions of articles do not include images. I like the images and would prefer to include them.

I have written the following simple script which grabs an image from the non-print version of an article:

Code:

    def get_img(self, url):
        # Parse the non-print page and return the src of the gallery image.
        soup = BeautifulSoup(urllib2.urlopen(url))
        img = soup.find('div', {'class' : 'spGalleryBigPic'}).find('img')['src']
        return img

I assume that ebook-convert somehow knows the article url (it does in the print_version function).

My questions are

How do I include this picture in the actual ebook?
Am I 'allowed' to use urllib2 in recipes, or is some other method preferred? (Most likely it is)
Should I add some kind of robustness to the img lookup, or will Calibre handle it? (Probably not; I guess I could just wrap the img lookup in a try block to avoid exceptions)

These questions are probably obvious, but I could not find all of the documentation I wanted on recipes...

Cheers,
Rasmus

PS: I wrote a recipe for the Economist's Daily Chart. Should I share it, and what is the custom for these things? Should I share my improved Spiegel recipe?

kovidgoyal · 07-07-2011, 10:49 PM

Implement preprocess_html in your recipe and insert the img elements there; calibre will do the downloading for you. Sharing your recipes is always a good thing.

Rasmus · 07-08-2011, 05:54 AM

Thanks for the quick reply. I wasn't specific enough.

I do

Code:

import urllib2

in order to download the HTML page, which is then turned into soup. How could I do this with calibre's own tools instead?

I'll get the new Spiegel Int'l recipe to you once I have gotten images integrated. Thanks for the tip on preprocess_html.

And thanks for Calibre! I run the cli tools from a server every morning and enjoy all of my favorite news when I get up. It is great!

--Rasmus

Rasmus · 07-08-2011, 09:19 AM

I will need a bit more help, it seems.

This is a clean Python example:

Code:

>>> import urllib2
>>> from BeautifulSoup import BeautifulSoup
>>> u = 'http://www.spiegel.de/international/europe/0,1518,773071,00.html'
>>> v = urllib2.urlopen(u).read()
>>> soup = BeautifulSoup(v) # this should be identical to Calibre's Soup
>>> img = soup.find('div', {'class' : 'spGalleryBigPic'}).find('img')
>>> type(img)
<class 'BeautifulSoup.Tag'>

So basically, I want to insert this into the article (below the heading, but for now I just want to get it into the epub article).

So I wrote the following preprocess function:

Code:

    def preprocess_html(self, soup):
        soup = soup.find('div', {'class' : 'spGalleryBigPic'}).find('img')
        return soup

which returns what I called img above.

I get the article using:

Code:

    def print_version(self, url):
        'Taken from the Spiegel.de recipe.'
        rmt = url.rpartition('#')[0]
        main, sep, rest = rmt.rpartition(',')
        rmain, rsep, rrest = main.rpartition(',')
        purl = rmain + ',druck-' + rrest + ',' + rest
        return purl

But currently it only works when I do not use preprocess_html.
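For reference, the URL rewriting done by print_version above can be traced in isolation. The sketch below is a standalone variant, not the recipe's code: the function name to_print_url and the '#ref=rss' fragment on the sample URL are made up for illustration. It uses partition('#') rather than rpartition('#'), because rpartition('#')[0] returns an empty string when the URL has no fragment at all.

```python
def to_print_url(url):
    """Rewrite a Spiegel article URL into its 'druck-' (print) form."""
    # Drop any '#fragment'. partition() keeps the whole URL when there is
    # no '#', whereas rpartition('#')[0] would yield '' in that case.
    base = url.partition('#')[0]
    # Split off the last two comma-separated fields.
    main, _, rest = base.rpartition(',')      # rest  -> '00.html'
    rmain, _, rrest = main.rpartition(',')    # rrest -> '773071'
    return rmain + ',druck-' + rrest + ',' + rest


# Sample URL from this thread; the '#ref=rss' fragment is an assumption.
print(to_print_url(
    'http://www.spiegel.de/international/europe/0,1518,773071,00.html#ref=rss'))
# prints http://www.spiegel.de/international/europe/0,1518,druck-773071,00.html
```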

kovidgoyal · 07-08-2011, 12:06 PM

Use the index_to_soup() method instead of urllib2. What you want to do in preprocess_html is: run index_to_soup, find the img element you want in the soup it returns, and insert it into the soup passed into preprocess_html. See the BeautifulSoup docs for how to do that.
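A minimal sketch of the steps described above, not a tested recipe. The helper name insert_gallery_image is made up for illustration, and putting the image at the top of the body is an assumption. The helper relies only on the find() and insert() methods of BeautifulSoup tags. It also touches on the robustness question from the first post: if the gallery div or the img is missing, the article is returned unchanged instead of raising. How preprocess_html learns the URL of the non-print page is not covered in this thread, so full_url in the commented usage is a placeholder.

```python
def insert_gallery_image(article_soup, full_soup):
    """Copy the gallery <img> from full_soup into article_soup.

    article_soup: the soup calibre passes to preprocess_html (print page).
    full_soup:    the soup of the non-print page that contains the image.
    """
    # Look up the image step by step so a missing element gives None
    # rather than an AttributeError.
    box = full_soup.find('div', {'class': 'spGalleryBigPic'})
    img = box.find('img') if box is not None else None
    body = article_soup.find('body')
    if img is not None and body is not None:
        # Assumption: the image goes at the very top of the article body.
        body.insert(0, img)
    # preprocess_html must hand back the whole article soup.
    return article_soup


# Hypothetical use inside a recipe class (needs calibre, so left as comments):
#
#     def preprocess_html(self, soup):
#         full_soup = self.index_to_soup(full_url)  # full_url: placeholder
#         return insert_gallery_image(soup, full_soup)
```

Since calibre downloads images referenced in the returned soup, no separate download step should be needed once the img element is in place.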
