Originally Posted by kovidgoyal
If you accept arbitrary HTML as input and want to output standards-compliant HTML, the only way to do that is to strip the HTML down to a basic internal markup and then re-export it. This is, for example, what BookDesigner does. There is no way you can accept arbitrary HTML input and losslessly convert it to standards-compliant HTML output (and no, htmltidy doesn't do this).
So really what the tool will have to do is:
1) Accept HTML input
2) Parse the HTML input into some simple internal markup
3) Try to auto-identify structural components (or ask the user to provide input to help identify them)
4) Provide an editor interface for the internal markup
5) Export the internal markup to EPUB
If you do that (5), people like Coolmicro will get up and shout again that the resulting EPUB does not conform to the standard and that the HTML code is not "clean". (I really do not care about "clean" or "dirty" code myself, as long as it does what it has to do, like Calibre does, for example.) So I am all for it.
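To make the quoted steps a bit more concrete, here is a rough sketch of steps 1, 2, 3 and 5 in Python, using only the standard library. Everything in it is an assumption made for illustration: the `Block` type, the `simplify` and `export_xhtml` names, and the rule that any `h1` to `h6` tag marks a structural heading are all hypothetical, not how BookDesigner or Calibre actually work. It only produces a single XHTML content document; the editor (step 4) and the real EPUB packaging are left out.

```python
from dataclasses import dataclass
from html import escape
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p", "div", "li", "blockquote", "br"}
SKIP_TAGS = {"script", "style"}


@dataclass
class Block:
    """One unit of the (hypothetical) internal markup."""
    kind: str  # "heading" or "para"
    text: str


class Simplifier(HTMLParser):
    """Step 2: reduce arbitrary HTML to a flat list of text blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buf = []
        self._kind = "para"
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def flush(self):
        # Collapse whitespace and emit the pending block, if any.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(Block(self._kind, text))
        self._buf = []
        self._kind = "para"

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()  # an unclosed <p> is ended by the next block tag
            if tag in HEADING_TAGS:
                self._kind = "heading"  # step 3: naive structure detection

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def simplify(html_text):
    """Steps 1 to 3: accept HTML, return the internal markup."""
    parser = Simplifier()
    parser.feed(html_text)
    parser.close()   # hands over any text still buffered by the parser
    parser.flush()
    return parser.blocks


def export_xhtml(blocks, title="Untitled"):
    """Step 5 (partial): write the internal markup back out as XHTML."""
    body = "\n".join(
        f"<h2>{escape(b.text)}</h2>" if b.kind == "heading"
        else f"<p>{escape(b.text)}</p>"
        for b in blocks
    )
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    messy = "<h1>Title</H1><p>Hello <b>world<p>Second &amp; last"
    print(export_xhtml(simplify(messy)))
```

Note that this is lossy on purpose: inline formatting such as the `<b>` above is thrown away, which is exactly the trade-off the quoted post describes. The output is regenerated from the internal markup rather than patched up from the input.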