Quote:
Originally Posted by llasram
Well, would depend on what you meant by "represent the content of the original HTML." It would be fairly easy to strip all semantic tag information from source HTML and translate it into nothing but <div/>, <span/>, <a/>, and <img/> tags with appropriate CSS. That would make it trivial to output valid XHTML which retained exactly the same formatting characteristics as specified by the author.
|
Again, this would work for some input, but not for all. I also put "the intent of the author" in that prerequisite. The author of the original file could write relatively complex HTML that does not validate and that you could not convert into standards-compliant XHTML which faithfully represents the input file.
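To make the disagreement concrete, here is a minimal sketch of the flattening approach described in the quote, using Python's standard html.parser. The class name Flattener, the function flatten, and the two tag sets are my own illustrative choices, not anything an existing tool ships with, and a real converter would also need the matching CSS.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical tag sets, chosen only for this illustration.
BLOCK = {"p", "h1", "h2", "blockquote"}   # rewritten as <div class="...">
INLINE = {"b", "i", "em", "strong"}       # rewritten as <span class="...">


class Flattener(HTMLParser):
    """Replace semantic tags with <div>/<span> carrying a class name."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            self.out.append(f'<div class="{tag}">')
        elif tag in INLINE:
            self.out.append(f'<span class="{tag}">')

    def handle_endtag(self, tag):
        # The end tag is mapped by name only; nothing checks that it
        # closes the element that is actually open at this point.
        if tag in BLOCK:
            self.out.append("</div>")
        elif tag in INLINE:
            self.out.append("</span>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def flatten(html):
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For clean input such as `<p>Some <em>text</em></p>` this gives `<div class="p">Some <span class="em">text</span></div>`, which is fine. For misnested input such as `<p><b>bold <i>both</b> italic</i></p>` it gives `<div class="p"><span class="b">bold <span class="i">both</span> italic</span></div>`: the output is well-formed, but the word "italic" now sits inside the bold span and outside the italic one, so the formatting has changed without any notice to the user.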
There's really no point discussing it, this is computer science 101: you are converting input from one language with non-deterministic rules (that is, non-validating HTML) to another with deterministic rules (standards-compliant XHTML) whilst keeping all of the source information. An algorithm to perform this conversion for all input cannot be designed. It is theoretically impossible.
But that doesn't mean the application can't fix some errors and output valid XHTML. I'm just saying you can't guarantee compliance without having to mangle the input in some situations. And even then it wouldn't work for some cases.
Quote:
Originally Posted by Jellby
The program could accept invalid (X)HTML, and issue a warning if the final (X)HTML does not validate.
|
My working idea too. Fix what you can, inform about what you can't, but don't mangle the input in any way or form. It is more important to guarantee to the user that you won't make some tiny change half-way through the novel he's importing than it is to guarantee standards compliance.
You can't piss off your users by trying to twist and turn their HTML into something it can't automatically become.
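A rough sketch of the "inform, don't mangle" half of that policy, assuming Python and its standard xml.etree.ElementTree module. The function name import_markup is hypothetical, and note that this only checks XML well-formedness, which is much weaker than validating against the XHTML DTD; it will also complain about named entities such as &nbsp; because no DTD is loaded.

```python
import xml.etree.ElementTree as ET


def import_markup(source):
    """Return (output, warnings); output is always the untouched input."""
    warnings = []
    try:
        # Parse only to detect problems; the parsed tree is discarded.
        ET.fromstring(source)
    except ET.ParseError as err:
        line, column = err.position
        warnings.append(
            f"not well-formed at line {line}, column {column}: {err}"
        )
    # Hand back the original text byte for byte, whatever was found.
    return source, warnings
```

Calling it on `<p><b>bold <i>both</b> italic</i></p>` returns the same string plus one warning about the mismatched tag, and it is then the user's call whether to fix the file by hand.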