Quote:
Originally Posted by llasram
With arbitrarily bad HTML the only possible interpretation of the author's intent is how some renderer renders that content. All contemporary HTML renderers use the same CSS box model for all rendering.
Quote:
Originally Posted by llasram
The Python lxml.html library calibre uses does an excellent job, matching for all practical purposes what most Web browsers produce.
There is no argument here.
I agree that you could very well design an algorithm that converts invalid HTML into valid XHTML for most of the HTML people will actually write. That's what your lxml.html library does (although I've never used it), and it's what Tidy does as well.
But you can't do it for all possible arbitrarily bad HTML. You're assuming the user checked how his source displayed in a browser. If he did, then it's no longer a matter of parsing arbitrarily bad HTML, and it's not a non-deterministic rule system anymore: the source follows the deterministic rendering rules of the browser he used to check his work. Converting from one deterministic language to another is certainly possible. And while you could say that the vast majority of HTML authors would do just that (check the display in a browser) before importing, you can't categorically state it.
So let's sum this up... you can create an algorithm that converts most practical non-conforming HTML into valid XHTML, but not all HTML one could write. Anyone claiming he could would be showing a grave ignorance of computer science theory.