Quote:
Originally Posted by kovidgoyal
Simply remove the remove_tags, keep_only_tags etc fields from the recipe.
Thanks; I should've mentioned that I've tried that. I should've also mentioned I'm on 0.7.50.
Quote:
Originally Posted by kovidgoyal
And if you want to look at downloaded html, implement the preprocess_html method in the recipe and save the soup yourself to a temp file.
Thanks; preprocess_html does not seem to get called at this stage, but I was able to put the equivalent soup-saving code in the parse_index method. (Well, preprocess_html is called now, for the articles returned from parse_index.)
Pretty-printing the 'ans' list returned from parse_index shows that it hasn't found any sections or articles: [('Front Page', [])]
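For reference, here is a minimal sketch of what I'm doing; the class name, the index URL, and the empty placeholder for the article-collecting loop are stand-ins, not the actual recipe:

Code:
import os
import tempfile
from pprint import pprint

from calibre.web.feeds.news import BasicNewsRecipe

class LWNDebug(BasicNewsRecipe):
    title = 'LWN.net (debug)'

    def parse_index(self):
        soup = self.index_to_soup('http://lwn.net/free/bigpage')
        # Save the downloaded HTML so it can be inspected offline,
        # since preprocess_html isn't reached at this point.
        fd, path = tempfile.mkstemp(suffix='.html')
        with os.fdopen(fd, 'wb') as f:
            f.write(str(soup).encode('utf-8'))
        self.log('Saved index soup to', path)
        ans = [('Front Page', [])]  # section/article collection goes here
        pprint(ans)  # currently prints [('Front Page', [])]
        return ans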
Ok, now that I'm actually digging into it and not just trying ad hoc debugging, I'm seeing that there are a number of problems. First, the article URLs in the actual content are relative, so re.compile('^http://lwn.net/Articles/') should be re.compile('^(http://lwn.net)?/Articles/'). But then at the end of the loop the recipe sets content='' in the article dict, so 'http://lwn.net' has to be prepended to the URL. That means that rather than just using the article text that's inline, we're re-downloading each article individually. Yuck.
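For concreteness, a sketch of the two fixes, assuming a loop over the anchors in the big page (the variable names here are illustrative, not the recipe's actual ones):

Code:
import re

# Accept both absolute and relative article links.
url_re = re.compile(r'^(http://lwn\.net)?/Articles/')

articles = []
for a in soup.findAll('a', attrs={'href': url_re}):
    url = a['href']
    if url.startswith('/'):
        # content='' below makes calibre fetch the URL,
        # so it has to be absolute.
        url = 'http://lwn.net' + url
    articles.append({
        'title': self.tag_to_string(a),
        'url': url,
        'date': '',
        'description': '',
        'content': '',  # empty: each article gets re-downloaded
    })

(The per-article re-download is the part I'd still like to avoid.)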
I'll submit an updated recipe when I have something satisfactory.