Old 02-10-2009, 10:15 AM   #50
llasram
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
Quote:
Originally Posted by Valloric
Again, this would work for some input, but not for all. I also put "the intent of the author" in that prerequisite too. The author of the original file could write relatively complex HTML that does not validate and that you could not convert into standards compliant XHTML which faithfully represents the input file.

There's really no point discussing it, this is computer science 101: conversion of input from one language with non-deterministic rules (that is, non-validating HTML) to another with deterministic rules (standards compliant XHTML) whilst keeping all of the source information. An algorithm to perform this conversion for all input cannot be designed. It is theoretically impossible.
I really don't understand what you're getting at, I'm afraid. I could write "fubby ducky loopy sunbird" and mean "Good morning, how are you?" and there would be no chance of conversion, because the intent is all in my mind. With arbitrarily bad HTML, the only possible interpretation of the author's intent is how some renderer renders that content. All contemporary HTML renderers use the same CSS box model for all rendering. Converting arbitrarily bad HTML into XHTML which displays the same is simply a matter of applying the same rules the browser does in order to produce the box model instance it renders.
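
To make "applying the same rules the browser does" a bit more concrete, here is a minimal sketch of one such rule: implied end tags. It uses only Python's standard html.parser and assumes a deliberately tiny rule set (a new <p> or <li> closes a still-open one of the same kind, and a handful of void elements). The names SoupCloser, close_tags, IMPLIED_END and VOID are made up for illustration; a real renderer applies far more rules than this.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Assumption: an illustrative subset, not the full browser rule set.
IMPLIED_END = {"p", "li"}                    # opening one closes an open sibling
VOID = {"br", "hr", "img", "input", "meta", "link"}


class SoupCloser(HTMLParser):
    """Hypothetical helper: re-emits tag soup with every element closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # output fragments
        self.stack = []    # currently open, non-void elements

    def handle_starttag(self, tag, attrs):
        # A new <p>/<li> implicitly ends a still-open one of the same kind.
        if tag in IMPLIED_END and self.stack and self.stack[-1] == tag:
            self.out.append("</%s>" % self.stack.pop())
        # Valueless attributes (e.g. "checked") get their own name as value.
        attr_text = "".join(
            " %s=%s" % (name, quoteattr(name if value is None else value))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close everything opened inside the element being ended.
        while True:
            top = self.stack.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        # Anything still open at end of input gets closed.
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())


def close_tags(html):
    parser = SoupCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(close_tags("<ul><li>one<li>two</ul><p>a<p>b"))
```

The point is only that the recovery rules are mechanical: whatever a browser decides about which content ends up in which element can be written back out with explicit closing tags.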

XHTML validity is a property of two components: XML well-formedness and adherence to the XHTML schema, yah? Conversion of HTML w/o closing tags to well-formed XML with complete elements can be tricky, but the browser necessarily does essentially the same thing in deciding what content ends up within what boxes. The Python lxml.html library calibre uses does an excellent job, matching for all practical purposes what most Web browsers produce. Producing schema-validating XHTML is where my proposal to strip all semantic tags comes in. CSS-based rendering doesn't care if you have a <div/> within a <p/> or a <sup/> within an <a/>. One just needs to extract the CSS applied to each element, then convert the element tags into ones which validate against the schema.
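
A rough sketch of that last step, assuming the markup is already well-formed XML. It uses the standard-library ElementTree rather than lxml, and DEFAULT_STYLE is a hand-written stand-in for the CSS a real renderer would compute per element; flatten and is_block are likewise hypothetical names. Every element becomes a <div> or <span> carrying its style inline, and anything that ends up containing a <div> is itself promoted to <div>, since the schema doesn't allow a <div> inside a <span>. Other attributes (href included) are simply dropped here, which a real converter would have to handle.

```python
import xml.etree.ElementTree as ET

# Assumption: a stand-in for the computed CSS of each element.
DEFAULT_STYLE = {
    "p": "display: block; margin: 1em 0",
    "div": "display: block",
    "blockquote": "display: block; margin: 1em 40px",
    "em": "display: inline; font-style: italic",
    "sup": "display: inline; vertical-align: super; font-size: smaller",
}


def is_block(style):
    return "display: block" in style


def flatten(elem):
    """Rewrite elem and its descendants in place as <div>/<span>.

    Returns True if elem ended up as a <div>.
    """
    # Children first, so we know whether this element will contain a <div>.
    # (A list, not a generator, so every child is processed.)
    child_is_div = [flatten(child) for child in elem]
    style = elem.get("style") or DEFAULT_STYLE.get(elem.tag, "display: inline")
    # Promote to <div> if block-level or if it holds a <div>; the inline
    # style, not the tag name, still decides how it renders.
    as_div = is_block(style) or any(child_is_div)
    elem.tag = "div" if as_div else "span"
    elem.attrib.clear()  # sketch only: drops every attribute except style
    elem.set("style", style)
    return as_div


# A <div/> within a <p/>, which the schema rejects but renderers accept.
root = ET.fromstring("<p>See <div>note<sup>1</sup></div> here</p>")
flatten(root)
print(ET.tostring(root, encoding="unicode"))
```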