IOError: cannot identify image file

RobFreundlich · 03-28-2012, 08:39 AM

I am working on a recipe for the Boston Globe (with subscription). Certain images are often missing (for example, the editorial cartoon), and I don't understand why. Here's an example from today's attempt:

Code:

Fetching http://www.bostonglobe.com/opinion/2012/03/27/editorial-cartoon-romney-house-plans/GIkU9kAMRSFiNAxvKhb5ZO/story.html
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 346, in process_images
  File "site-packages\PIL\Image.py", line 1982, in open
IOError: cannot identify image file
Recursion limit reached. Skipping links in http://www.bostonglobe.com/opinion/2012/03/27/editorial-cartoon-romney-house-plans/GIkU9kAMRSFiNAxvKhb5ZO/story.html

The HTML for the image in question is this:

Code:

<img src="/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif" data-fullsrc="/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif" alt="
">

The image's URL is correct - going to

Code:

http://www.bostonglobe.com/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif

does display the image (provided you've got a subscription, of course, which I do).

One thought I had was that perhaps calibre doesn't support TIF files, but I couldn't find a list of supported image types anywhere.

If that's not the problem, does anyone have any ideas of what might be going on?

kovidgoyal · 03-28-2012, 10:13 AM

That indicates that the image is not recognized as a valid image. This can be either because the image is actually not valid or because the image format is not supported. I can't recall whether PIL supports the TIFF format; you can check by running:

calibre-debug -c "from PIL import Image; im = Image.open('somefile.tiff')"

RobFreundlich · 03-28-2012, 12:37 PM

Thanks - that helps me move forward a bit.

PIL.Image doesn't seem to like URLs for filenames, so I wrote the following script to fetch the image:

Code:

import mechanize
from PIL import Image

# Python 2 syntax (print statement), run through calibre-debug
br = mechanize.Browser()
response = br.open("http://www.bostonglobe.com/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif")
data = response.get_data()

# Show the first bytes so the file signature is visible
print data[0:32]

# Save the raw bytes, then let PIL identify the format from the content
with open("image.tif", "wb") as f:
    f.write(data)

im = Image.open("image.tif")
print "Format:", im.format, ", Mode:", im.mode, ", Size:", im.size

It outputs the following:

Code:

\377\330\377\340^@^PJFIF^@^A^A^@^@^A^@^A^@^@\377\355^@,Photosho
Format: JPEG , Mode: L , Size: (960, 750)

So the file is a bit weird - it's got a .tif extension, but it is actually a JPEG (as witnessed by both the Format value and the JFIF tag). But the Image class seems to handle it just fine anyway.
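The same check can be done without PIL by looking at the first few bytes of the download. This is only a rough sketch: sniff_image_type is a made-up helper name, and it covers just a handful of well-known signatures.

```python
def sniff_image_type(data):
    """Guess an image type from its leading bytes, ignoring the file extension.

    Hypothetical helper for illustration; only a few common signatures are checked.
    """
    if data[:3] == b"\xff\xd8\xff":
        return "JPEG"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "PNG"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF"
    return None


# First bytes of the downloaded .tif, as printed by the script above
header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
print(sniff_image_type(header))
```

For the bytes shown above this reports a JPEG, which agrees with what PIL says about the file.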

Is there a way I can turn on extra logging, or use the Python debugger to see what's going on in the recipe? I feel like if I could get to the point where it's failing and debug through it, I could figure out what's going on.

RobFreundlich · 03-28-2012, 12:46 PM

I wondered whether the server might be misreporting the file type (and whether that would lead to the error I'm seeing), so I ran Fiddler and went to the image's URL in my browser. Here are the headers and the beginning of the data:

Code:

HTTP/1.1 200 OK
Eomportal-Instance: 15
Last-Modified: Wed, 28 Mar 2012 04:52:44 GMT
Cache-Control: max-age=86400, must-revalidate
Content-Type: image/jpeg
Content-Length: 133218
Date: Wed, 28 Mar 2012 16:40:51 GMT
Connection: close
Server: BostonGlobe.com Frontend

�����JFIF���������,Photoshop 3.0�8BIM������,����,�������C�

The server is correctly identifying the image as a JPEG. So, to sum up what we know so far:

1. This is a valid JPEG file
2. The server correctly identifies its type
3. The file's extension incorrectly identifies its type
4. PIL.Image handles the file correctly as a JPEG
5. Using calibre-debug to execute a script that fetches the file and loads it using PIL.Image succeeds
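
Points 2 and 3 can also be checked in code by comparing the type implied by the URL's extension with the Content-Type header the server sends. A small sketch using only the standard library; extension_disagrees is a hypothetical helper name, not anything from calibre:

```python
import mimetypes


def extension_disagrees(url, content_type):
    """Return True when the URL's extension implies a different MIME type
    than the Content-Type header declares.

    Hypothetical helper for illustration only.
    """
    guessed, _encoding = mimetypes.guess_type(url)
    # Drop any parameters such as "; charset=..." before comparing
    declared = content_type.split(";")[0].strip().lower()
    return guessed is not None and guessed != declared


url = ("http://www.bostonglobe.com/rf/image_960w/Boston/2011-2020/2012/03/27/"
       "BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif")
print(extension_disagrees(url, "image/jpeg"))
```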

Incidentally, it does not appear that you need to have a Boston Globe subscription to fetch this file - the get_image script that I posted earlier doesn't do any login.

RobFreundlich · 03-28-2012, 02:25 PM

This wasn't an image format problem at all - it was a source problem. The <img> tag in question looked like this:

<img src="" data-fullsrc="/the/real/path/to/the/image"/>

rather than this:

<img src="/the/real/path/to/the/image" data-fullsrc="/the/real/path/to/the/image"/>

When I used Chrome (or any other browser) to look at the page containing the image, the src attribute was filled in properly. I don't know whether that's the browser being incredibly smart and figuring out the src or BeautifulSoup making a mistake and dropping it. But it's not really important, as long as I have a solution. Which I do.

Since this was so hard for me to debug, I thought I'd post what I did in case anyone else hits a similar problem. I figured out what was happening by doing two things:

1. Overriding image_url_processor() and looking at both baseurl and url - url was coming through as the empty string
2. Overriding preprocess_html() and looking at soup.findAll("img") - I could see that the src for this image was blank
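
Roughly, the two debugging overrides look like the sketch below. ImageDebugMixin is a made-up name so the snippet can stand alone; in a real recipe the two methods would sit directly on the BasicNewsRecipe subclass, and the print calls are just one way to get at the values.

```python
class ImageDebugMixin(object):
    """Hypothetical stand-in for a recipe class; only the two overrides matter."""

    @classmethod
    def image_url_processor(cls, baseurl, url):
        # Log every image URL the fetcher is about to download
        print("image_url_processor: baseurl=%r url=%r" % (baseurl, url))
        return url

    def preprocess_html(self, soup):
        # Dump the attributes of every <img> before calibre processes the page
        for img in soup.findAll("img"):
            print("img src=%r data-fullsrc=%r"
                  % (img.get("src"), img.get("data-fullsrc")))
        return soup
```

An empty url in the first log line, or a blank src in the second, points at the markup rather than at the image data.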

To fix it, I just have preprocess_html() find images with a blank src and set img["src"] = img["data-fullsrc"]. This works because the Globe's images all seem to have data-fullsrc.
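
A minimal sketch of that fix, assuming the soup object behaves like BeautifulSoup (findAll, plus get and item assignment on tags). fix_blank_src is a hypothetical helper name; the check for a missing data-fullsrc is an extra guard, not something the Globe's pages have needed so far.

```python
def fix_blank_src(soup):
    """Copy data-fullsrc into src for every <img> whose src is missing or blank.

    Hypothetical helper; returns the number of tags it changed.
    """
    fixed = 0
    for img in soup.findAll("img"):
        src = (img.get("src") or "").strip()
        fullsrc = img.get("data-fullsrc")
        if not src and fullsrc:
            img["src"] = fullsrc
            fixed += 1
    return fixed


# Inside the recipe class, the override would then be something like:
#
#     def preprocess_html(self, soup):
#         fix_blank_src(soup)
#         return soup
```

Images that already have a src are left alone, so the helper is safe to run on every page.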


