03-13-2005, 01:11 PM   #7
Laurens
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Quote:
Originally Posted by hacker
Actually, no. If an XML document is invalid, it MUST be rejected. This is not a guideline, it is a rule. Any RSS (or XML) parser that does not adhere to that, is ignoring the specification.
Who cares about the spec? As long as the RSS parser processes valid feeds correctly, I don't see a problem with attempting to process ill-formed feeds.

Quote:
Originally Posted by hacker
That being said, adding some "massaging" of the content prior to parsing could help the XML validate as well-formed, assuming the broken XML can be easily fixed to validate. Again, the users don't care about broken or invalid feeds, they just want the content "at all costs".
Exactly, which is why your argument doesn't hold.
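
In practice most breakage comes down to a handful of recurring mistakes, so the massaging can be cheap. A minimal sketch, assuming a plain-text cleanup pass that runs before the feed reaches the XML parser (the function name and the exact fixes are illustrative, not how any particular aggregator does it):

Code:
// Illustrative pre-parse cleanup for two common kinds of breakage:
// bare ampersands that aren't part of an entity, and control
// characters that XML 1.0 forbids.
function massageFeed(xml) {
  xml = xml.replace(/&(?!(amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)/g, "&amp;");
  xml = xml.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, "");
  return xml;
}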

Quote:
Originally Posted by hacker
But, like the problem with "HTML soup", if we just fix the problems with invalid XML, we're going to be back in the same boat that we are with HTML, and the whole point of XML is rendered irrelevant. If content authors don't realize that their feeds are broken, there is no motivation to fix it. If we transparently fix it for them, there's no reason for them to correct their end. Its a double-edged sword.
RSS is a lost cause. That's why aggregators like FeedDemon do enforce well-formedness when processing Atom feeds; for RSS it's just too late.

Quote:
Originally Posted by hacker
And this is exactly why JPluck and Sunrise and Sitescooper will consistently fail.. they don't scale and heal as the site changes. You have to maintain templates for each site that describe what links to point to, what content to keep, and what content to strip out or ignore. As the site changes, your template has to change. If you have 5,000 templates for 5,000 websites, its a maintenance nightmare. JPluck had .jxl files, Sunrise has SDL files, Sitescooper has .site files. Its all the same thing.
The NYT and other scripts have worked well for months. Scripts also require almost no maintenance; almost all of them consist of only two or three lines of JavaScript. For example:

Code:
// Point each first-level article link at the NYT's
// printer-friendly version of the page.
if (link.depth == 1) {
  link.uri += "&pagewanted=print";
}
I concede that the existing approach is indeed problematic when something changes at a site: existing scripts have to be updated, and users have to download the new SDLs and copy the documents over manually. That's why I'm working on an "auto-update" mechanism for my commercial product. This feature is, coincidentally, also based on RSS/RDF.
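
Roughly, the idea is that each item in an update feed describes one script, carrying a version number and a download link. A sketch, purely illustrative, since none of these helpers exist in Sunrise today:

Code:
// Purely illustrative: assumes each feed item carries a numeric
// version and a link to the updated SDL.
function checkForUpdates(feed) {
  for (var i = 0; i < feed.items.length; i++) {
    var item = feed.items[i];
    if (item.version > installedVersion(item.title)) {
      download(item.link); // fetch and install the newer script
    }
  }
}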

Quote:
Originally Posted by hacker
Again, not quite. Newsfeeds give you 1 or 2 sentences that provide a teaser that describes some of the article. Clicking on the article link provided in the feed, leads you to the full size content provider's webpage. This is most-certainly NOT useful on a PDA; not without a lot of slicing and dicing of the fluff surrounding the content.
Again, you need link rewriting to make feeds useful for PDAs.
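
As a sketch of what that rewriting looks like, for a hypothetical site whose printer-friendly pages live at a different path (the URL pattern is made up; every site needs its own):

Code:
// Hypothetical URL pattern. Rewriting each first-level feed link to
// the printer-friendly page gives the device the article body
// without the surrounding navigation and ads.
if (link.depth == 1) {
  link.uri = link.uri.replace(/\.html$/, ".print.html");
}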