Old 05-14-2010, 12:54 PM   #1918
mwheinz
award-winning bozo
Posts: 258
Karma: 172703
Join Date: Sep 2009
Location: Philadelphia
Device: Kobo Libra 2
Quote:
Originally Posted by Starson17
Kovid has stated on several occasions that the "link hasn't been detected!" message isn't an error.
Yeah, I discovered it was unrelated to the message shortly after posting.

I think the problem is simply that the American Prospect generates truly awful HTML. It starts on the first line of the output, where you find JavaScript before the <!DOCTYPE> declaration, and continues with <meta> tags inside the body, <script> elements inside <tr> elements, and newlines inside URIs. They don't even mark parts of the page with IDs, so there's no easy way to pick out the part containing the article.

I was able to write a recipe that gets everything:

Code:
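from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1273850169(BasicNewsRecipe):
    title          = u'American Prospect'
    oldest_article = 7            # days
    max_articles_per_feed = 100
    recursions = 0                # do not follow links inside articles
    no_stylesheets = True         # ignore the site's CSS

    feeds       = [(u'Articles', u'feed://www.prospect.org/articles_rss.jsp')]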
but any attempt to remove certain tags (like the embedded advertisements) has no effect, and telling it to keep certain tags (like the ones with the main article) causes it to delete everything and generate an empty page.
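
One possible workaround, untested against the actual site, would be to clean up the markup before the parser ever sees it. The sketch below is plain Python with only the standard library. The rule list CLEANUP_RULES, the helper clean_html and the regex patterns are all hypothetical, written from the problems described above rather than from the real page source.

```python
import re

# Hypothetical cleanup rules: each entry is (compiled pattern, function
# taking a match object and returning the replacement string).
# The patterns are assumptions, not derived from the real pages.
CLEANUP_RULES = [
    # Drop anything that precedes the DOCTYPE declaration.
    (re.compile(r'\A.*?(?=<!DOCTYPE)', re.DOTALL | re.IGNORECASE),
     lambda m: ''),
    # Remove <script> elements wherever they appear (e.g. inside <tr>).
    (re.compile(r'<script\b.*?</script>', re.DOTALL | re.IGNORECASE),
     lambda m: ''),
    # Strip line breaks (and surrounding whitespace) from quoted
    # href/src attribute values.
    (re.compile(r'((?:href|src)\s*=\s*")([^"]*)(")', re.IGNORECASE),
     lambda m: m.group(1)
               + re.sub(r'\s*[\r\n]+\s*', '', m.group(2))
               + m.group(3)),
]

def clean_html(raw):
    """Apply every cleanup rule, in order, to the raw HTML string."""
    for pattern, repl in CLEANUP_RULES:
        raw = pattern.sub(repl, raw)
    return raw
```

The (compiled pattern, function) pairs have the same shape that calibre's preprocess_regexps recipe attribute expects, so a list like this could in principle be assigned to that attribute in the recipe. Whether that would be enough to make the keep/remove tag filters behave on this site is an assumption.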