#1 |
Connoisseur
Posts: 74
Karma: 10000010
Join Date: Jan 2012
Device: Android Tablet with Calibre Companion and Moon+ Reader Pro
I am working on a recipe for the Boston Globe (with subscription). Certain images are often missing (for example, the editorial cartoon), and I don't understand why. Here's an example from today's attempt:
Code:
Fetching http://www.bostonglobe.com/opinion/2012/03/27/editorial-cartoon-romney-house-plans/GIkU9kAMRSFiNAxvKhb5ZO/story.html
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 346, in process_images
  File "site-packages\PIL\Image.py", line 1982, in open
IOError: cannot identify image file
Recursion limit reached. Skipping links in http://www.bostonglobe.com/opinion/2012/03/27/editorial-cartoon-romney-house-plans/GIkU9kAMRSFiNAxvKhb5ZO/story.html
Code:
<img src="/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif" data-fullsrc="/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif" alt=" ">
Code:
http://www.bostonglobe.com/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif
One thought I had was that perhaps calibre doesn't support TIF files, but I couldn't find a list of supported image types anywhere. If that's not the problem, does anyone have any ideas of what might be going on?
#2 |
creator of calibre
Posts: 45,196
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
That indicates that the image is not recognized as a valid image. This can be either because the image is actually not valid or because the image format is not supported. I can't recall whether PIL supports the TIFF format; you can check that by running:
calibre-debug -c "from PIL import Image; im = Image.open('somefile.tiff')"
#3 |
Connoisseur
Thanks - that helps me move forward a bit.
PIL.Image doesn't seem to like URLs for filenames, so I wrote the following script to fetch the image:
Code:
import string
import mechanize
from PIL import Image

br = mechanize.Browser()
response = br.open("http://www.bostonglobe.com/rf/image_960w/Boston/2011-2020/2012/03/27/BostonGlobe.com/EditorialOpinion/Images/03.28ROMNEYHOUSE.tif")
data = response.get_data()
print data[0:32]
f = open("image.tif", "wb")
f.write(data)
f.close()
im = Image.open("image.tif")
print "Format:", im.format, ", Mode:", im.mode, ", Size:", im.size
Code:
\377\330\377\340^@^PJFIF^@^A^A^@^@^A^@^A^@^@\377\355^@,Photosho
Format: JPEG , Mode: L , Size: (960, 750)
Is there a way I can turn on extra logging, or use the Python debugger to see what's going on in the recipe? I feel like if I could get to the point where it's failing and debug through it, I could figure out what's going on.
#4 |
Connoisseur
I wondered whether the server might be misreporting the file type (and whether that would lead to the error I'm seeing), so I ran Fiddler and went to the image's URL in my browser. Here are the headers and the beginning of the data:
Code:
HTTP/1.1 200 OK
Eomportal-Instance: 15
Last-Modified: Wed, 28 Mar 2012 04:52:44 GMT
Cache-Control: max-age=86400, must-revalidate
Content-Type: image/jpeg
Content-Length: 133218
Date: Wed, 28 Mar 2012 16:40:51 GMT
Connection: close
Server: BostonGlobe.com Frontend

�����JFIF���������,Photoshop 3.0�8BIM������,����,�������C�
1. This is a valid JPEG file
2. The server correctly identifies its type
3. The file's extension incorrectly identifies its type
4. PIL.Image handles the file correctly as a JPEG
5. Using calibre-debug to execute a script that fetches the file and loads it using PIL.Image succeeds
Incidentally, it does not appear that you need to have a Boston Globe subscription to fetch this file - the get_image script that I posted earlier doesn't do any login.
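For anyone who runs into the same mismatch between a file's extension and its contents, here is a minimal sketch of checking the leading "magic" bytes directly instead of trusting the name. It is not part of the scripts above: the function name `sniff_image_format` is hypothetical, and the list of signatures is deliberately short (JPEG, TIFF, PNG, GIF only).

```python
def sniff_image_format(data):
    """Guess an image format from the first bytes of `data` (a bytes object).

    Hypothetical helper; returns None when no known signature matches.
    """
    if data[:3] == b'\xff\xd8\xff':
        return 'JPEG'
    if data[:4] in (b'II*\x00', b'MM\x00*'):
        # Little-endian and big-endian TIFF headers
        return 'TIFF'
    if data[:8] == b'\x89PNG\r\n\x1a\n':
        return 'PNG'
    if data[:6] in (b'GIF87a', b'GIF89a'):
        return 'GIF'
    return None


# First bytes of the downloaded ".tif" file, as printed by the earlier script
# (octal \377\330\377\340 is hex ff d8 ff e0, ^@ is 0x00, ^P is 0x10).
header = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'
print(sniff_image_format(header))  # prints JPEG
```

Only the first few bytes are needed, so the check can run on a partial download before the data is handed to PIL.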
#5 |
Connoisseur
This wasn't an image format problem at all - it was a source problem. The <img> tag in question looked like this:
<img src="" data-fullsrc="/the/real/path/to/the/image"/>
rather than this:
<img src="/the/real/path/to/the/image" data-fullsrc="/the/real/path/to/the/image"/>
When I used Chrome (or any other browser) to look at the page containing the image, the src attribute was filled in properly. I don't know whether that's the browser being incredibly smart and figuring out the src or BeautifulSoup making a mistake and dropping it. But it's not really important, as long as I have a solution. Which I do.
Since this was so hard for me to debug, I thought I'd post what I did in case anyone else hits a similar problem. I figured out what was happening by doing two things:
To fix it, I just have preprocess_html() find images with a blank src and set img["src"] = img["data-fullsrc"] (this works because the Globe's images all seem to have data-fullsrc).
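The fix described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual recipe code: the helper name `fill_blank_img_src` is hypothetical, and `soup` is assumed to be the BeautifulSoup tree that calibre passes to a recipe's preprocess_html().

```python
def fill_blank_img_src(soup):
    """Copy data-fullsrc into src for every <img> whose src is missing or blank.

    Hypothetical helper; `soup` is assumed to be a BeautifulSoup tree.
    """
    for img in soup.findAll('img'):
        src = (img.get('src') or '').strip()
        fullsrc = img.get('data-fullsrc')
        # Leave images alone if they already have a src or offer no fallback
        if not src and fullsrc:
            img['src'] = fullsrc
    return soup


# Inside the recipe class, the hook would then look roughly like:
#
#     def preprocess_html(self, soup):
#         return fill_blank_img_src(soup)
```

Images that already have a src are left untouched, and images with no data-fullsrc are skipped rather than raising a KeyError, in case a page contains an image that does not follow the Globe's usual pattern.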
Tags |
image, ioerror, tiff |