02-10-2009, 08:42 AM   #49
Valloric
Created Sigil, FlightCrew
Quote:
Originally Posted by llasram
Well, that would depend on what you meant by "represent the content of the original HTML." It would be fairly easy to strip all semantic tag information from the source HTML and translate it into nothing but <div/>, <span/>, <a/>, and <img/> tags with appropriate CSS. That would make it trivial to output valid XHTML which retained exactly the same formatting characteristics as specified by the author.
Again, this would work for some input, but not for all. I put "the intent of the author" in that prerequisite too. The author of the original file could write relatively complex HTML that does not validate and that you could not convert into standards-compliant XHTML which faithfully represents the input file.

There's really no point discussing it; this is computer science 101. You are converting input from one language with non-deterministic rules (that is, non-validating HTML) to another with deterministic rules (standards-compliant XHTML) whilst keeping all of the source information. An algorithm to perform this conversion for all input cannot be designed. It is theoretically impossible.
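
To make the ambiguity concrete, here is a minimal Python sketch (my own illustration, not code from any existing tool). The broken fragment and the two candidate repairs are made-up examples, and `is_well_formed` and `visible_text` are hypothetical helper names. It only checks XML well-formedness with the standard library, not validity against the XHTML DTD. The point is that both repairs parse fine and keep the same text, yet they style " italic" differently, so the program would have to guess what the author meant.

```python
import xml.etree.ElementTree as ET

# Made-up example of misnested markup: </b> closes before </i>.
broken = "<p><b>bold <i>both</b> italic</i></p>"

# Two plausible repairs. Both are well-formed, but they disagree on
# whether " italic" should also be bold.
candidates = [
    "<p><b>bold <i>both</i></b><i> italic</i></p>",  # " italic" is italic only
    "<p><b>bold <i>both italic</i></b></p>",         # " italic" is bold and italic
]


def is_well_formed(markup):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def visible_text(markup):
    """Concatenate all text content, ignoring the tags."""
    return "".join(ET.fromstring(markup).itertext())


if __name__ == "__main__":
    print("broken parses:", is_well_formed(broken))
    for candidate in candidates:
        print(is_well_formed(candidate), repr(visible_text(candidate)))
```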

But that doesn't mean the application can't fix some errors and output valid XHTML. I'm just saying you can't guarantee compliance without mangling the input in some situations. And even then it wouldn't work for some cases.

Quote:
Originally Posted by Jellby
The program could accept invalid (X)HTML, and issue a warning if the final (X)HTML does not validate.
My working idea too. Fix what you can, inform about what you can't, but don't mangle the input in any way, shape or form. It is more important to guarantee to the user that you won't make some tiny change half-way through the novel he's importing than it is to guarantee standards compliance.
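
A rough Python sketch of the "inform, don't mangle" half of that policy might look like the following. This is an assumption-laden illustration rather than how any real importer works: `import_markup` is a hypothetical name, and the check is only XML well-formedness via the standard library, which is weaker than real XHTML validation. The input string is handed back byte-for-byte; the only thing the function produces is warnings.

```python
import xml.etree.ElementTree as ET


def import_markup(source):
    """Return the input untouched, plus a list of warning strings.

    Hypothetical helper: it only tests well-formedness, it does not
    validate against the XHTML DTD and it never rewrites the input.
    """
    warnings = []
    try:
        ET.fromstring(source)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        line, column = err.position
        warnings.append(
            f"not well-formed at line {line}, column {column}: {err}"
        )
    return source, warnings


if __name__ == "__main__":
    chapter = "<html><body><p>Hi<br></body></html>"  # made-up input
    text, problems = import_markup(chapter)
    for problem in problems:
        print("WARNING:", problem)
    print("input unchanged:", text == chapter)
```

The "fix what you can" half is deliberately left out here, since which fixes are safe is exactly the judgment call being argued about.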

You can't piss off your users by trying to twist and turn their HTML into something it can't automatically become.