Quote:
Originally Posted by kovidgoyal
There isn't (recursion following happens in a whole different module). The only workaround is to use preprocess_html and index_to_soup to do it manually
Thank you. Your comment still saves me a lot of effort.
Another question (I know, they are endless. I will not be offended if you do not answer; I'm sure your time is better spent writing code than dragging me up the learning slope.)
I keep getting this error (below 'Processing images...') when trying to get food recipe images:
The image it reports as "Not Found," however, loads without trouble in Firefox. I've looked at the headers in a Firefox session and considered whether it might be a robots.txt, cookie, or user agent issue, but I can't figure it out. The image loads fine in Firefox when I block cookies, and AFAICT the fetch process uses a Firefox user agent and ignores robots.txt. I've even tried adding a delay. Do I need to use mechanize and fetch the image in a browser session, or am I missing something simpler?
Edit:
I think I've figured it out. There is an ASCII 0A (line feed) character in the middle of the link in the page source, right where it breaks after 'http://www.epicurious.com/images/recipesmenus/2010/2010_february/' and before '357252_116.jpg'.
I see another error in the output where it says it can't find 'http://www.epicurious.com/images/recipesmenus/2010/2010_february/%20357252_116.jpg' (note the %20 before the file name).
The problem seems to be in the page source, but I'm not sure why it works in Firefox. Perhaps Firefox is cleaning up the link somehow. Do I need to use preprocess_html to fix this?
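For reference, here is a minimal sketch of the cleanup I have in mind, assuming the only damage is stray whitespace (such as the 0A line feed) inside the src value. The helper name clean_image_url is my own invention and not part of calibre; it only handles the URL string and does not touch the recipe machinery.

```python
import re

# Matches any run of whitespace, including the line feed (ASCII 0A)
# found in the middle of the image link.
_WHITESPACE = re.compile(r'\s+')


def clean_image_url(src):
    """Hypothetical helper: drop all whitespace from an image URL.

    Assumes whitespace is never a legitimate part of the URL, so a
    correctly encoded %20 is left alone.
    """
    return _WHITESPACE.sub('', src)


broken = ('http://www.epicurious.com/images/recipesmenus/2010/2010_february/'
          '\n357252_116.jpg')
fixed = clean_image_url(broken)
print(fixed)
```

If this is the right approach, I assume it would be applied inside preprocess_html to the src of each img tag in the soup, for example img['src'] = clean_image_url(img['src']), before the images are fetched.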