Also, this article of mine might be of interest:
http://www.pepak.net/e-books/vycisteni-html-knihy/
It deals with cleaning up HTML source (from FineReader) to the state you see in that Unspeakable People demo, using regular expressions. Unfortunately, it is written in Czech, but you may be OK with Google Translate. A quick look reveals gems such as "Cutting off heads" (= "Remove headers"), but it will give you an idea. You MUST combine it with the Czech version, though, because Google Translate destroys all CODE blocks. Besides, regular expressions and HTML are the same in all languages. Also, I provide ZIPped source files from before and after each cleanup step, which will guide you a bit more.
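To give a rough idea of what regex-based HTML cleanup looks like, here is a minimal sketch in Python. It is not code from the article. The function name `clean_html`, the sample input, and the specific tags and attributes it targets are assumptions about what OCR-exported HTML tends to contain, so treat the patterns as illustrations and not as the actual cleanup steps.

```python
import re

def clean_html(html: str) -> str:
    """Hypothetical cleanup pass; each pattern is an assumed example step."""
    # Drop inline style attributes (assumed to be OCR layout noise).
    html = re.sub(r'\s+style="[^"]*"', '', html)
    # Unwrap <span> and <font> tags but keep the text inside them.
    html = re.sub(r'</?(?:span|font)\b[^>]*>', '', html, flags=re.IGNORECASE)
    # Remove paragraphs that contain nothing but whitespace.
    html = re.sub(r'<p\b[^>]*>\s*</p>', '', html, flags=re.IGNORECASE)
    # Collapse runs of spaces and tabs into a single space.
    html = re.sub(r'[ \t]+', ' ', html)
    return html.strip()

# Made-up input, only for demonstration.
sample = '<p style="margin:0"><span class="font1">Hello</span>   world</p><p> </p>'
print(clean_html(sample))  # <p>Hello world</p>
```

Running the steps one at a time and diffing the result after each one makes it much easier to spot a pattern that removes too much.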
If there is enough interest, I may be willing to translate the article into English eventually.