03-28-2012, 02:25 PM   #5
RobFreundlich
SOLVED IOError

This wasn't an image format problem at all - it was a source problem. The <img> tag in question looked like this:

<img src="" data-fullsrc="/the/real/path/to/the/image"/>

rather than this:

<img src="/the/real/path/to/the/image" data-fullsrc="/the/real/path/to/the/image"/>

When I used Chrome (or any other browser) to look at the page containing the image, the src attribute was filled in properly. I don't know whether that's the browser being incredibly smart and figuring out the src, or BeautifulSoup making a mistake and dropping it. But that's not really important as long as I have a solution, which I do.


Since this was so hard for me to debug, I thought I'd post what I did in case anyone else hits a similar problem. I figured out what was happening by doing two things:
  1. Overriding image_url_processor() and looking at both baseurl and url - url was coming through as the empty string
  2. Overriding preprocess_html() and looking at soup.findAll("img") - I could see that the src for this image was blank
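
In case it helps, here is roughly what those two debugging overrides could look like. This is only a sketch: DebugImagesMixin is a made-up name so the snippet stands on its own, and in a real recipe the two methods would sit directly on your BasicNewsRecipe subclass. I'm assuming image_url_processor() is a classmethod taking (baseurl, url), which is how I understand the recipe API.

```python
class DebugImagesMixin(object):
    """Hypothetical helper class; in practice these methods go on the recipe itself."""

    @classmethod
    def image_url_processor(cls, baseurl, url):
        # Print what the downloader is about to fetch; a blank url shows up as ''.
        print("image_url_processor: baseurl=%r url=%r" % (baseurl, url))
        return url  # return the url unchanged, we are only looking

    def preprocess_html(self, soup):
        # Print the src of every <img> so blank ones are easy to spot.
        for img in soup.findAll("img"):
            print("img src=%r" % img.get("src"))
        return soup  # preprocess_html must hand the soup back
```

Both methods only print and pass their input through, so they shouldn't change what the recipe downloads.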

To fix it, I just have preprocess_html() find images with blank src and set img["src"] = img["data-fullsrc"] (this will work because the Globe's images all seem to have data-fullsrc)
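
A minimal sketch of that fix, written as a plain function so it can be tried outside calibre. The function name fill_blank_src and the fallback_attr parameter are mine, not part of calibre or BeautifulSoup; the only thing it assumes about the tags is that they support .get() and item assignment, which BeautifulSoup tags do.

```python
def fill_blank_src(img_tags, fallback_attr="data-fullsrc"):
    """Copy fallback_attr into src for each tag whose src is missing or empty.

    Returns the number of tags that were changed.
    """
    fixed = 0
    for img in img_tags:
        # Only touch tags with a blank src that actually have a fallback value.
        if not img.get("src") and img.get(fallback_attr):
            img["src"] = img[fallback_attr]
            fixed += 1
    return fixed
```

In the recipe, preprocess_html(self, soup) would call fill_blank_src(soup.findAll("img")) and then return soup. Images that have neither a src nor a data-fullsrc are left alone, so they would still need separate handling.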