View Single Post
Old 02-10-2009, 11:11 AM   #51
Valloric
Created Sigil, FlightCrew
Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.Valloric ought to be getting tired of karma fortunes by now.
 
Valloric's Avatar
 
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
Quote:
Originally Posted by llasram View Post
With arbitrarily bad HTML the only possible interpretation of the author's intent is how some renderer renders that content. All contemporary HTML renderers use the same CSS box model for all rendering.
Quote:
Originally Posted by llasram View Post
The Python lxml.html library calibre uses does an excellent job, matching for all practical purposes what most Web browsers produce.
There is no argument here.

I agree that you could very well design an algorithm that converts non-valid HTML into valid XHTML for most HTML people will write. It's what your "lxml.html" library does (although I've never used it) and it's what Tidy does as well.

But you can't do it for all possible arbitrarily bad HTML. You're assuming the user checked how his source displayed in a browser. If he did, then it's not a matter of parsing arbitrarily bad HTML. It's not a non-deterministic rule system anymore: the source follows the deterministic rendering rules of the browser he used to check his work. Converting from a deterministic language to another deterministic language is certainly possible. And while you could say that the vast majority of HTML authors would do just that (check the display in a browser) before importing, you can't categorically state it.

So let's sum this up... you can create an algorithm that can convert most practical non-conforming HTML into valid XHTML, but not all HTML one could write. If one were to say he could, one would be shoving a grave ignorance of computer science theory.
Valloric is offline   Reply With Quote