Old 03-28-2012, 08:39 AM   #1
RobFreundlich
Enthusiast
RobFreundlich began at the beginning.
 
Posts: 44
Karma: 10
Join Date: Jan 2012
Device: Kindle Fire
IOError: cannot identify image file

I am working on a recipe for the Boston Globe (with subscription). Certain images are often missing (for example, the editorial cartoon), and I don't understand why. Here's an example from today's attempt:

Code:
Fetching http://www.bostonglobe.com/opinion/2012/03/27/editorial-cartoon-romney-house-plans/GIkU9kAMRSFiNAxvKhb5ZO/story.html
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 346, in process_images
  File "site-packages\PIL\Image.py", line 1982, in open
IOError: cannot identify image file
Recursion limit reached. Skipping links in http://www.bostonglobe.com/opinion/2012/03/27/editorial-cartoon-romney-house-plans/GIkU9kAMRSFiNAxvKhb5ZO/story.html
The HTML for the image in question is this:

Code:
<img src="/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif" data-fullsrc="/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif" alt="
">
The image's URL is correct - going to

Code:
http://www.bostonglobe.com/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif
does display the image (provided you've got a subscription, of course, which I do).

One thought I had was that perhaps calibre doesn't support TIF files, but I couldn't find a list of supported image types anywhere.

If that's not the problem, does anyone have any ideas of what might be going on?
Old 03-28-2012, 10:13 AM   #2
kovidgoyal
creator of calibre
Posts: 25,400
Karma: 4961459
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
That indicates that the image is not recognized as a valid image. This can be either because the image is actually not valid or because the image format is not supported. I can't recall if PIL supports the TIFF format; you can check that by doing

calibre-debug -c "from PIL import Image; im = Image.open('somefile.tiff'); print(im.format)"
Old 03-28-2012, 12:37 PM   #3
RobFreundlich
Thanks - that helps me move forward a bit.

PIL.Image doesn't seem to like URLs for filenames, so I wrote the following script to fetch the image:

Code:
import mechanize
from PIL import Image

# Fetch the image over HTTP first, since PIL.Image did not accept the URL directly
br = mechanize.Browser()
response = br.open("http://www.bostonglobe.com/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif")
data = response.get_data()

# Show the first bytes so the file's real format is visible
print(data[0:32])

# Save the bytes to disk, then let PIL identify the file
with open("image.tif", "wb") as f:
    f.write(data)

im = Image.open("image.tif")
print("Format: %s , Mode: %s , Size: %s" % (im.format, im.mode, im.size))
It outputs the following:

Code:
\377\330\377\340^@^PJFIF^@^A^A^@^@^A^@^A^@^@\377\355^@,Photosho
Format: JPEG , Mode: L , Size: (960, 750)
So the file is a bit weird: it has a .tif extension, but it is actually a JPEG (as witnessed by both the Format value and the JFIF tag). The Image class seems to handle it just fine anyway.
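For anyone who wants to run the same check without PIL, here is a minimal sketch that identifies a file from its leading "magic" bytes rather than its extension. The function name sniff_image_format is made up for illustration, and it only knows a handful of common formats.

```python
def sniff_image_format(data):
    """Guess an image format from the first bytes of a file, ignoring its name."""
    signatures = [
        (b'\xff\xd8\xff', 'JPEG'),
        (b'\x89PNG\r\n\x1a\n', 'PNG'),
        (b'GIF87a', 'GIF'),
        (b'GIF89a', 'GIF'),
        (b'II*\x00', 'TIFF'),  # little-endian TIFF
        (b'MM\x00*', 'TIFF'),  # big-endian TIFF
    ]
    for magic, name in signatures:
        if data.startswith(magic):
            return name
    return None  # unknown or unsupported signature


# The first bytes printed by the script above, written as a bytes literal
header = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01'
print(sniff_image_format(header))
```

Fed the bytes shown above, this reports JPEG, which matches what PIL said about the file.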

Is there a way I can turn on extra logging, or use the Python debugger to see what's going on in the recipe? I feel like if I could get to the point where it's failing and debug through it, I could figure out what's going on.
Old 03-28-2012, 12:46 PM   #4
RobFreundlich
I wondered whether the server might be misreporting the file type (and whether that would lead to the error I'm seeing), so I ran Fiddler and went to the image's URL in my browser. Here are the headers and the beginning of the data:

Code:
HTTP/1.1 200 OK
Eomportal-Instance: 15
Last-Modified: Wed, 28 Mar 2012 04:52:44 GMT
Cache-Control: max-age=86400, must-revalidate
Content-Type: image/jpeg
Content-Length: 133218
Date: Wed, 28 Mar 2012 16:40:51 GMT
Connection: close
Server: BostonGlobe.com Frontend

�����JFIF���������,Photoshop 3.0�8BIM������,����,�������C�
The server is correctly identifying the image as a JPEG. So, to sum up what we know so far:

1. This is a valid JPEG file
2. The server correctly identifies its type
3. The file's extension incorrectly identifies its type
4. PIL.Image handles the file correctly as a JPEG
5. Using calibre-debug to execute a script that fetches the file and loads it using PIL.Image succeeds
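Points 2 and 3 can be checked mechanically. The following is only a rough sketch using Python's standard mimetypes module; the helper name extension_matches_content_type is hypothetical and not part of calibre.

```python
import mimetypes


def extension_matches_content_type(url, content_type):
    """Return True if the type implied by the URL's extension equals the served type."""
    guessed, _encoding = mimetypes.guess_type(url)
    # Drop any parameters such as "; charset=..." from the header value
    served = content_type.split(';')[0].strip().lower()
    return guessed == served


# File name and Content-Type taken from the response shown above
print(extension_matches_content_type('03.28ROMNEYHOUSE.tif', 'image/jpeg'))
```

A mismatch here only says that the extension and the header disagree. It does not say which one is right, so the magic bytes are still the deciding evidence.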

Incidentally, it does not appear that you need to have a Boston Globe subscription to fetch this file - the get_image script that I posted earlier doesn't do any login.
Old 03-28-2012, 02:25 PM   #5
RobFreundlich
SOLVED IOError

This wasn't an image format problem at all - it was a source problem. The <img> tag in question looked like this:

<img src="" data-fullsrc="/the/real/path/to/the/image"/>

rather than this:

<img src="/the/real/path/to/the/image" data-fullsrc="/the/real/path/to/the/image"/>

When I used Chrome (or any other browser) to look at the page containing the image, the src attribute was filled in properly. I don't know whether that's the browser being incredibly smart and figuring out the src or BeautifulSoup making a mistake and dropping it. But it's not really important, as long as I have a solution. Which I do.


Since this was so hard for me to debug, I thought I'd post what I did in case anyone else hits a similar problem. I figured out what was happening by doing two things:
  1. Overriding image_url_processor() and looking at both baseurl and url - url was coming through as the empty string
  2. Overriding preprocess_html() and looking at soup.findAll("img") - I could see that the src for this image was blank
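The second check can be sketched roughly like this. The helper report_images is a made-up name, and it assumes soup is the BeautifulSoup object that calibre passes to preprocess_html(); anything with a findAll() method returning dict-like tags would work the same way.

```python
def report_images(soup, log=print):
    """Log src and data-fullsrc for every <img> and return the tags whose src is blank."""
    blank = []
    for img in soup.findAll('img'):
        src = (img.get('src') or '').strip()
        log('img src=%r data-fullsrc=%r' % (src, img.get('data-fullsrc')))
        if not src:
            blank.append(img)
    return blank

# Assumed usage inside a recipe, for debugging only:
#     def preprocess_html(self, soup):
#         report_images(soup, log=self.log)
#         return soup
```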

To fix it, I just have preprocess_html() find images with a blank src and set img["src"] = img["data-fullsrc"]. This works because the Globe's images all seem to have data-fullsrc.
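In case it helps, the fix looks roughly like the sketch below. This is a reconstruction of the idea rather than the exact recipe code: fill_blank_src is a hypothetical helper name, and it assumes the soup object supports findAll() and dict-style access on tags, as BeautifulSoup does.

```python
def fill_blank_src(soup):
    """Copy data-fullsrc into src for every <img> whose src is missing or blank."""
    for img in soup.findAll('img'):
        src = (img.get('src') or '').strip()
        fullsrc = img.get('data-fullsrc')
        if not src and fullsrc:
            img['src'] = fullsrc
    return soup

# Assumed usage inside the recipe class:
#     def preprocess_html(self, soup):
#         return fill_blank_src(soup)
```

Images that already have a src are left alone, and images with neither attribute are skipped rather than raising a KeyError.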