This wasn't an image format problem at all - it was a source problem. The <img> tag in question looked like this:
<img src="" data-fullsrc="/the/real/path/to/the/image"/>
rather than this:
<img src="/the/real/path/to/the/image" data-fullsrc="/the/real/path/to/the/image"/>
When I used Chrome (or any other browser) to look at the page containing the image, the src attribute was filled in properly. I don't know whether that's the browser being incredibly smart and figuring out the src or BeautifulSoup making a mistake and dropping it. But it's not really important, as long as I have a solution. Which I do.
Since this was so hard for me to debug, I thought I'd post what I did in case anyone else hits a similar problem. I figured out what was happening by doing two things:
- Overriding image_url_processor() and looking at both baseurl and url - url was coming through as the empty string
- Overriding preprocess_html() and looking at soup.findAll("img") - I could see that the src for this image was blank
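For anyone who wants to repeat the two debugging steps above, here is a rough sketch of what the overrides could look like. The class name DebugRecipe and the helper img_sources() are made up for illustration, and the try/except import is only there so the snippet also runs outside calibre. I'm assuming image_url_processor() is a classmethod taking baseurl and url, which is how the calibre recipe API documents it.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe  # available inside calibre
except ImportError:
    BasicNewsRecipe = object  # stand-in so the sketch can run outside calibre


def img_sources(soup):
    """Return (src, data-fullsrc) for every <img> in the soup (hypothetical helper)."""
    return [(img.get("src"), img.get("data-fullsrc")) for img in soup.findAll("img")]


class DebugRecipe(BasicNewsRecipe):  # hypothetical recipe name
    @classmethod
    def image_url_processor(cls, baseurl, url):
        # Print what calibre is about to fetch; a blank url shows up here as ''
        print("image_url_processor: baseurl=%r url=%r" % (baseurl, url))
        return url

    def preprocess_html(self, soup):
        # Print the attributes of each image as the parser sees them
        for src, fullsrc in img_sources(soup):
            print("img src=%r data-fullsrc=%r" % (src, fullsrc))
        return soup
```

Running the recipe with these in place should make an image with an empty src easy to spot in the output.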
To fix it, I just have preprocess_html() find images with a blank src and set img["src"] = img["data-fullsrc"]. This works because the Globe's images all seem to have data-fullsrc.
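In case it helps, the fix can be sketched roughly like this. It is not my exact recipe code: the class name GlobeImageFix and the helper fill_blank_src() are invented for the example, and the try/except import just lets the snippet run outside calibre. It only touches images that have no src but do have a data-fullsrc, so images that are already fine are left alone.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe  # available inside calibre
except ImportError:
    BasicNewsRecipe = object  # stand-in so the sketch can run outside calibre


def fill_blank_src(soup):
    """Copy data-fullsrc into src for each <img> whose src is missing or empty.

    Returns the number of images that were changed (hypothetical helper).
    """
    fixed = 0
    for img in soup.findAll("img"):
        if not img.get("src") and img.get("data-fullsrc"):
            img["src"] = img["data-fullsrc"]
            fixed += 1
    return fixed


class GlobeImageFix(BasicNewsRecipe):  # hypothetical recipe name
    def preprocess_html(self, soup):
        fill_blank_src(soup)
        return soup
```

Images that have neither attribute stay blank, so if the site ever drops data-fullsrc this would need another look.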