Quote:
Originally Posted by Laurens
Use a fault-tolerant parser to process the feeds. Usually, feed parsing issues are due to relatively harmless errors such as unknown entity names caused by copying HTML directly to the feed. An RSS parser should be able to process ill-formed content, just like browsers have to deal with all sorts of HTML soup.
|
Actually, no. If an XML document is not well-formed, it MUST be rejected. This is not a guideline, it is a rule. Any RSS (or XML) parser that does not adhere to that is ignoring the specification.
That being said, adding some "massaging" of the content prior to parsing could help the XML parse as well-formed, assuming the broken XML can be easily fixed. Again, the users don't care about broken or invalid feeds, they just want the content "at all costs".
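As a rough illustration of that kind of "massaging", here is a minimal Python sketch, assuming the only breakage is HTML-only entity names (such as &nbsp;) pasted into the feed. The function names replace_html_entities and parse_feed are made up for this example. It tries a strict parse first and only falls back to the repair when that fails, so the caller can still tell that the feed was broken.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML itself predefines; these must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(text):
    """Turn HTML-only named entities (e.g. &nbsp;) into numeric references."""
    def fix(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave unknown or predefined names untouched
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(fix, text)


def parse_feed(text):
    """Return (root, was_repaired). Raises ET.ParseError if still broken."""
    try:
        # Strict parse first: a well-formed feed is never modified.
        return ET.fromstring(text), False
    except ET.ParseError:
        # Hypothetical repair step: only fixes HTML entity names.
        repaired = replace_html_entities(text)
        return ET.fromstring(repaired), True


broken = "<rss><channel><title>Caf&eacute;&nbsp;news &amp; more</title></channel></rss>"
root, was_repaired = parse_feed(broken)
print(was_repaired, repr(root.find("channel/title").text))
```

The was_repaired flag is there so a reader application could still warn that the feed is broken instead of fixing it silently.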
So what do we do? Adhere to the specification, to bring some awareness to broken feeds, or make the users happy, and ignore the specification, bringing us back into the mess that HTML created for us?
But, like the problem with "HTML soup", if we just fix the problems with invalid XML, we're going to be back in the same boat that we are in with HTML, and the whole point of XML is rendered irrelevant. If content authors don't realize that their feeds are broken, there is no motivation to fix them. If we transparently fix it for them, there's no reason for them to correct their end. It's a double-edged sword.
There's a good article on XML.com on this subject titled "XML on the Web Has Failed". It's worth the read.
Quote:
Use link rewriting to make the links point to PDA-friendly "printable" versions of pages. Both Sunrise and JPluck have supported this for a long time already. This way you can make PDA-friendly versions of many sites that don't have a dedicated "mobile" version.
|
And this is exactly why JPluck and Sunrise and Sitescooper will consistently fail: they don't scale and heal as the site changes. You have to maintain templates for each site that describe what links to point to, what content to keep, and what content to strip out or ignore. As the site changes, your template has to change. If you have 5,000 templates for 5,000 websites, it's a maintenance nightmare. JPluck had .jxl files, Sunrise has SDL files, Sitescooper has .site files. It's all the same thing.
This is a major factor in what killed Sitescooper: the user community maintaining those templates found that it was just too much work to keep them up to date. Every time a site added a new nested table tag, or changed the CMS providing the content, or reinvented its layout, the template had to be changed.
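To make the maintenance problem concrete, here is a minimal Python sketch of what a per-site link-rewriting rule boils down to. This is not the actual .jxl, SDL, or .site format; the sites, URL schemes, and the names REWRITE_RULES and rewrite_link are all invented for illustration.

```python
import re

# Hypothetical per-site rules: (pattern matching an article URL, replacement).
# Neither the sites nor the URL schemes are real; each rule hard-codes
# one site's current layout.
REWRITE_RULES = [
    (re.compile(r"^http://news\.example\.com/story/(\d+)$"),
     r"http://news.example.com/print/\1"),
    (re.compile(r"^http://blog\.example\.org/(\d{4})/(\d{2})/([\w-]+)\.html$"),
     r"http://blog.example.org/\1/\2/\3.html?view=print"),
]


def rewrite_link(url):
    """Return the 'printable' URL if a rule matches, else the URL unchanged."""
    for pattern, replacement in REWRITE_RULES:
        if pattern.match(url):
            return pattern.sub(replacement, url)
    return url


# Works while the site keeps its URL scheme...
print(rewrite_link("http://news.example.com/story/123"))
# ...and silently stops working the day the site moves to /articles/.
print(rewrite_link("http://news.example.com/articles/123"))
```

Every rule is tied to one site's URL layout, so each redesign means another rule to rewrite by hand. Multiply that by a few thousand sites and you get the problem described above.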
I've come up with an approach in a tool I've written that tries to be a bit smarter about looking at the upstream links found in the newsfeed's RSS, to render the need for per-site "templates" irrelevant. It's a lot of work though, and I can only code against the 2,000 or so sample feed sites I know are providing "broken" content links. It's definitely not fun.
Quote:
Newsfeeds are especially useful for PDA's because they can cut through the fluff and link directly to articles. Furthermore, they can be presented with a consistent layout, irrespective of the site they originate from.
|
Again, not quite. Newsfeeds give you one or two sentences as a teaser that describes some of the article. Clicking on the article link provided in the feed leads you to the content provider's full-size webpage. This is most certainly NOT useful on a PDA; not without a lot of slicing and dicing of the fluff surrounding the content.
I think once content providers start learning how to use feeds properly, and start building their XML in a way that consistently produces well-formed documents and output, we'll be in a better position. Right now, fewer than 30% of the content authors do (based on the random 2,000-feed test suite I have here). Having 13 incompatible "standard" formats and versions doesn't help either.
Great comments so far... keep them coming.