04-24-2010, 05:06 AM   #4
pepak
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
Also, this article of mine might be of interest:
http://www.pepak.net/e-books/vycisteni-html-knihy/
It deals with using regular expressions to clean up HTML source (from FineReader) to the state you see in that Unspeakable People demo. Unfortunately, it is written in Czech, but you may get by with Google Translate. A quick look reveals gems such as "Cutting off heads" (= "Remove headers"), but it will give you the idea. You MUST read it alongside the Czech version, though, because Google Translate destroys all CODE blocks; besides, regular expressions and HTML are the same in all languages. I also provide ZIPped source files from before and after each cleanup step, which should guide you a bit more.
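
To give a rough idea of what regex-based cleanup of OCR HTML can look like, here is a minimal sketch in Python. The patterns and the function name `clean_ocr_html` are made up for illustration and are not taken from the article; real FineReader output will likely need different and more numerous rules.

```python
import re


def clean_ocr_html(html: str) -> str:
    """Apply a few illustrative cleanup passes to OCR-exported HTML.

    These patterns are assumptions for demonstration only, not the
    steps described in the linked article.
    """
    # Drop inline style/class attributes that OCR tools tend to scatter around.
    html = re.sub(r'\s+(?:style|class)="[^"]*"', "", html, flags=re.IGNORECASE)
    # Unwrap <span> and <font> tags, keeping the text inside them.
    html = re.sub(r"</?(?:span|font)\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Remove paragraphs that contain only whitespace or &nbsp;.
    html = re.sub(r"<p>(?:\s|&nbsp;)*</p>\s*", "", html, flags=re.IGNORECASE)
    # Rejoin words that were hyphenated across a paragraph break.
    html = re.sub(r"(\w)-</p>\s*<p>(\w)", r"\1\2", html)
    return html


sample = (
    '<p class="c1"><span style="font-size:10pt">An exam-</span></p>\n'
    '<p class="c1">ple of text.</p><p>&nbsp;</p>'
)
print(clean_ocr_html(sample))
```

The order of the passes matters here: attributes and wrapper tags are stripped first so that the later patterns only have to deal with bare `<p>` tags. Note that the last pass also joins genuinely hyphenated words that happen to fall on a paragraph break, so its results need checking by eye.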

If there is enough interest, I may be willing to translate the article into English eventually.