Old 03-22-2011, 03:47 PM   #7
wcooley
Quote:
Originally Posted by kovidgoyal View Post
Simply remove the remove_tags, keep_only_tags etc fields from the recipe.
Thanks; I should've mentioned that I've tried that. I should've also mentioned I'm on 0.7.50.

Quote:
Originally Posted by kovidgoyal View Post
And if you want to look at downloaded html, implement the preprocess_html method in the recipe and save the soup yourself to a temp files.
Thanks; preprocess_html does not seem to get called in this case, but I was able to put the soup-saving code in the parse_index function instead. (preprocess_html is called later, for the articles that parse_index returns.)

pprinting the 'ans' value returned from parse_index shows that it hasn't found any sections or content: [('Front Page', [])]
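For comparison, this is the shape parse_index is supposed to return when it does find articles: a list of (section_title, article_list) tuples, where each article is a dict. The article entry below is a made-up example, not one from the actual feed:

```python
from pprint import pprint

# Expected shape of the value returned by parse_index:
# a list of (section_title, list_of_article_dicts) tuples.
ans = [('Front Page', [
    {'title': 'Example article',
     'url': 'http://lwn.net/Articles/123456/',
     'description': '',
     'date': ''},
])]
pprint(ans)
```

An empty inner list, as in [('Front Page', [])], means the loop that builds the article dicts never matched anything.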

OK, now that I'm actually digging into it rather than just trying ad hoc debugging, I can see several problems. First, the article URLs in the actual content are relative, so re.compile('^http://lwn.net/Articles/') should be re.compile('^(http://lwn.net)?/Articles/'). Second, at the end of the loop the recipe sets content='' in the article dict, which means 'http://lwn.net' has to be prepended to each url. But that in turn means that rather than just using the article text that's already inline, we end up re-downloading each article individually. Yuck.
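The two fixes can be sketched like this (a minimal sketch in modern Python; the absolutize helper is my own name, and urljoin is a stand-in for however the recipe ends up prepending the base):

```python
import re
from urllib.parse import urljoin

# Original pattern only matched absolute links:
#   re.compile('^http://lwn.net/Articles/')
# Relaxed pattern that also accepts site-relative links:
article_pat = re.compile(r'^(http://lwn\.net)?/Articles/')

def absolutize(url, base='http://lwn.net'):
    # Prepend the site base to relative article links so the
    # downloader gets a full URL; absolute links pass through unchanged.
    return urljoin(base, url)
```

With content='' in the article dict, calibre falls back to fetching each of these absolutized URLs, which is what causes the per-article re-download.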

I'll submit an updated recipe when I have something satisfactory.