Quote:
Originally Posted by Starson17
Kovid has stated on several occasions that the "link hasn't been detected!" message isn't an error.
Yeah, I discovered it was unrelated to the message shortly after posting.
I think the problem is simply that the American Prospect generates truly awful HTML. The problems start on the first line of the output, where you find javascript before the <!DOCTYPE> tag, and continue with <meta> tags inside the body, <script> elements inside <tr> elements, and newlines inside URIs. They don't even mark parts of the page with IDs, so there's no easy way to pick out the part with the article in it.
I was able to write a recipe that gets everything:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1273850169(BasicNewsRecipe):
    title = u'American Prospect'
    oldest_article = 7           # days
    max_articles_per_feed = 100
    recursions = 0               # don't follow links inside articles
    no_stylesheets = True
    feeds = [(u'Articles', u'feed://www.prospect.org/articles_rss.jsp')]
but any attempt to remove certain tags (like the embedded advertisements) has no effect, and telling it to keep only certain tags (like the ones with the main articles) causes it to delete everything and generate an empty page.
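If the broken markup really is what defeats the tag filters, one thing that might be worth trying is cleaning up the raw HTML with regexes before it gets parsed. The following is only a sketch using the standard re module. The names CLEANUP_RULES and clean_html are made up for illustration, the sample string is invented, and I haven't run any of this against the actual prospect.org pages. It handles just two of the problems mentioned above: content before the <!DOCTYPE> tag and newlines inside quoted URIs.

```python
import re

# Hypothetical cleanup rules: each entry is (compiled regex, callable taking a match).
CLEANUP_RULES = [
    # Drop anything (e.g. stray javascript) that appears before the doctype.
    (re.compile(r'\A.*?(?=<!DOCTYPE)', re.DOTALL | re.IGNORECASE),
     lambda m: ''),
    # Strip newlines (and any indentation after them) inside quoted href/src values.
    (re.compile(r'(?:href|src)\s*=\s*"[^"]*"', re.IGNORECASE),
     lambda m: re.sub(r'[\r\n]+\s*', '', m.group(0))),
]

def clean_html(raw):
    """Apply each cleanup rule in order and return the cleaned HTML string."""
    for pattern, repl in CLEANUP_RULES:
        raw = pattern.sub(repl, raw)
    return raw

# Invented sample that imitates two of the problems described above.
sample = ('<script>var x = 1;</script>\n<!DOCTYPE html>\n'
          '<html><body><a href="http://example.com/a\n/b.html">link</a></body></html>')
cleaned = clean_html(sample)
```

The rules are written as (compiled regex, callable) pairs because that is the shape calibre's preprocess_regexps recipe attribute expects, so in principle the same list could be assigned to preprocess_regexps in the recipe class. Whether that is enough to make the tag filters behave on this site is a guess.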