I can help you do it. I had written about 500 words explaining how, but I lost them to a bad connection, dammit. Here is enough to give you an idea:
Steps:
1. Eyeball enough files to identify all the formatting styles.
(1a. Write the programs.)
2. Run all the files through an identifier program (I can write it for you). You get back a copy of each file with new text added to the first few lines; the cleaner needs that text to ID the style of a given file.
3. Run all the copies of the files through a cleaner program, which performs the specific actions needed to fix a given style.
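To make steps 2 and 3 concrete, here is a rough sketch in Python rather than jflex, just to show the shape of the two passes. The style names, the detection patterns, and the "#STYLE: " tag line are all made up for illustration; the real ones would come out of step 1, and the tag could span several lines instead of one.

```python
import re

# Hypothetical styles and detection patterns; the real ones would come
# from eyeballing the files in step 1.
STYLE_PATTERNS = {
    "html_residue": re.compile(r"</?[A-Za-z][^>]*>"),
    "hard_wrapped": re.compile(r"[a-z]-\n[a-z]"),
}
TAG_PREFIX = "#STYLE: "  # assumed format of the added first line


def identify(text):
    """Step 2: return a copy of the text with a style tag as its first line."""
    for style, pattern in STYLE_PATTERNS.items():
        if pattern.search(text):
            return f"{TAG_PREFIX}{style}\n{text}"
    return f"{TAG_PREFIX}unknown\n{text}"


def strip_html(body):
    """Fix for the hypothetical 'html_residue' style: drop markup tags."""
    return re.sub(r"</?[A-Za-z][^>]*>", "", body)


def join_hyphenated(body):
    """Fix for the hypothetical 'hard_wrapped' style: rejoin split words."""
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", body)


CLEANERS = {"html_residue": strip_html, "hard_wrapped": join_hyphenated}


def clean(tagged):
    """Step 3: read the tag line, apply the matching fix, drop the tag."""
    header, _, body = tagged.partition("\n")
    style = header[len(TAG_PREFIX):]
    fix = CLEANERS.get(style, lambda b: b)  # unknown styles pass through
    return fix(body)
```

Calling clean(identify(text)) on each file's contents runs both passes. Keeping them separate means you can inspect the tagged copies and catch a misidentified file before any cleanup touches it.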
Preferred tools: jflex (or flex).
If I did it, I would take the information learned in step 1 and write regular expressions to define each detail. I would then use jflex to generate the actual source code for the programs in steps 2 and 3.
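If you have not used a scanner generator before, the idea is a list of "pattern, then action" rules. Below is a hand-rolled Python imitation of that idea, not jflex output and not jflex syntax. The three rules are invented examples. Note one simplification: this sketch takes the first rule that matches at the current position, whereas flex and jflex pick the longest match and only fall back to rule order on a tie.

```python
import re

# Hypothetical rules as (pattern, action) pairs. Each action receives the
# match object and returns the text to emit in its place.
RULES = [
    (re.compile(r"<[^>]+>"), lambda m: ""),      # drop markup tags
    (re.compile(r"&amp;"), lambda m: "&"),       # decode one entity
    (re.compile(r"[ \t]+\n"), lambda m: "\n"),   # trim trailing blanks
]


def scan(text):
    """Walk the text once, applying the first rule that matches at each position."""
    out = []
    pos = 0
    while pos < len(text):
        for pattern, action in RULES:
            m = pattern.match(text, pos)
            if m:
                out.append(action(m))
                pos = m.end()
                break
        else:
            # No rule matched here: copy one character through unchanged,
            # much like the default echo rule in a lex-style scanner.
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(scan("<b>Fish &amp; chips</b>  \n"), end="")  # prints "Fish & chips"
```

Each detail you find in step 1 becomes one more entry in the rule list, which is why it pays to pin the details down as regular expressions first.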
P.S. If you want to be really adventurous, you could combine steps 2 and 3 by writing and running a yacc/jflex parser. It would be a lot more work, though.
P.P.S. I could do this for you. Recently I've been doing something very similar to this. The World Fact eBook I converted required several runs through various cleanup programs in order to remove the excess web formatting.