Quote:
Originally Posted by mwheinz
I think the problem is simply that the American Prospect generates truly awful HTML ..
but any attempt to remove certain tags (like the embedded advertisements) has no effect and telling it to keep certain tags (like the ones with the main articles) causes it to delete everything and generate an empty page.
Malformed HTML can be problematic. You may want to look at the soup that reaches preprocess_html to see what the parser actually made of the page, and then use preprocess_regexps to delete the material you need to get rid of from the downloaded HTML.
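
Here is a rough sketch of what I mean, runnable outside calibre so you can test patterns quickly. It uses the same shape as a recipe's preprocess_regexps: a list of (compiled pattern, function taking the match and returning the replacement) pairs. The ad markup, the sample page, and the apply_regexps helper are all made up for illustration; they are not the American Prospect's real HTML or part of calibre, so you would need to adjust the patterns to whatever the site actually serves.

```python
import re

# Same shape as preprocess_regexps in a calibre recipe:
# (compiled pattern, function(match) -> replacement string).
# Both patterns below are hypothetical examples, not the real site's markup.
preprocess_regexps = [
    # Drop a made-up advertisement block.
    (re.compile(r'<div class="ad".*?</div>', re.DOTALL | re.IGNORECASE),
     lambda match: ''),
    # Drop HTML comments, which often confuse parsers on sloppy pages.
    (re.compile(r'<!--.*?-->', re.DOTALL),
     lambda match: ''),
]


def apply_regexps(raw_html, rules):
    """Illustrative helper: apply each (pattern, function) pair in order."""
    for pattern, func in rules:
        raw_html = pattern.sub(func, raw_html)
    return raw_html


# Invented sample input standing in for the downloaded page.
sample = ('<p>Article text</p><div class="ad">Buy now</div>'
          '<!-- tracker --><p>More text</p>')
cleaned = apply_regexps(sample, preprocess_regexps)
print(cleaned)
```

Once the patterns behave on a saved copy of the page, the preprocess_regexps list can be pasted into the recipe class as an attribute. Regexes on HTML are fragile (a nested div inside the ad block would defeat the first pattern), so treat this as a workaround for when the tag-based options fail, not a replacement for them.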