Quote:
Originally Posted by kovidgoyal
Simply remove the remove_tags, keep_only_tags etc fields from the recipe.
Thanks; I should've mentioned that I've tried that. I should've also mentioned I'm on 0.7.50.
Quote:
Originally Posted by kovidgoyal
And if you want to look at downloaded html, implement the preprocess_html method in the recipe and save the soup yourself to a temp file.
Thanks; preprocess_html does not seem to get called at this stage, but I was able to put the equivalent soup-saving code in the parse_index method. (Well, preprocess_html is called now, for the articles returned from parse_index.)
Pretty-printing the 'ans' list returned from parse_index shows that it hasn't found any sections or articles: [('Front Page', [])]
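For reference, here is a minimal sketch of what I'm doing; the class name, the index URL, and the empty placeholder for the article-collecting loop are stand-ins, not the actual recipe:

Code:
import os
import tempfile
from pprint import pprint

from calibre.web.feeds.news import BasicNewsRecipe

class LWNDebug(BasicNewsRecipe):
    title = 'LWN.net (debug)'

    def parse_index(self):
        soup = self.index_to_soup('http://lwn.net/free/bigpage')
        # Save the downloaded HTML so it can be inspected offline,
        # since preprocess_html isn't reached at this point.
        fd, path = tempfile.mkstemp(suffix='.html')
        with os.fdopen(fd, 'wb') as f:
            f.write(str(soup).encode('utf-8'))
        self.log('Saved index soup to', path)
        ans = [('Front Page', [])]  # section/article collection goes here
        pprint(ans)  # currently prints [('Front Page', [])]
        return ans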
Ok, now that I'm actually digging into it and not just trying ad hoc debugging, I'm seeing that there are a number of problems. First, the article URLs in the actual content are relative, so re.compile('^http://lwn.net/Articles/') should be re.compile('^(http://lwn.net)?/Articles/'). But then at the end of the loop the recipe sets content='' in the article dict, so 'http://lwn.net' has to be prepended to the URL. That means that rather than just using the article text that's inline, we're re-downloading each article individually. Yuck.
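For concreteness, a sketch of the two fixes, assuming a loop over the anchors in the big page (the variable names here are illustrative, not the recipe's actual ones):

Code:
import re

# Accept both absolute and relative article links.
url_re = re.compile(r'^(http://lwn\.net)?/Articles/')

articles = []
for a in soup.findAll('a', attrs={'href': url_re}):
    url = a['href']
    if url.startswith('/'):
        # content='' below makes calibre fetch the URL,
        # so it has to be absolute.
        url = 'http://lwn.net' + url
    articles.append({
        'title': self.tag_to_string(a),
        'url': url,
        'date': '',
        'description': '',
        'content': '',  # empty: each article gets re-downloaded
    })

(The per-article re-download is the part I'd still like to avoid.)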
I'll submit an updated recipe when I have something satisfactory.