04-27-2009, 04:38 PM   #2
Nate the great
I can help you do it. I had about 500 words explaining how, but I lost them to a bad connection, dammit. Here is enough to give you the idea:

Steps:
1. Eyeball enough files to identify all the formatting styles.

(1a. Write the programs.)

2. Run all the files through an identifier program (I can write it for you). You get back a copy of each file with new text added to the first few lines, which is needed to ID the style of a given file.

3. Run all the copies of the files through a cleaner program, which performs the specific actions needed to fix a given style.
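
To make steps 2 and 3 concrete, here is a rough sketch of the identify-then-clean flow. It uses Python's re module rather than a jflex-generated scanner, and everything named in it (the two styles, the "#STYLE: " tag line, the function names) is made up for illustration; the real styles would come out of step 1.

```python
import re

# Hypothetical styles and detection patterns; the real list comes from step 1.
STYLE_PATTERNS = {
    "html_residue": re.compile(r"</?[A-Za-z][^>]*>"),
    "hard_wrapped": re.compile(r"^.{60,80}\n(?=\S)", re.MULTILINE),
}
TAG_PREFIX = "#STYLE: "  # hypothetical marker line written by the identifier


def identify(text: str) -> str:
    """Step 2: return a copy of the text with a style tag as its first line."""
    scores = {name: len(p.findall(text)) for name, p in STYLE_PATTERNS.items()}
    best = max(scores, key=scores.get)
    style = best if scores[best] > 0 else "unknown"
    return f"{TAG_PREFIX}{style}\n{text}"


def clean_html_residue(body: str) -> str:
    # Strip anything that looks like an HTML tag.
    return re.sub(r"</?[A-Za-z][^>]*>", "", body)


def clean_hard_wrapped(body: str) -> str:
    # Join single newlines inside a paragraph; keep blank-line paragraph breaks.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", body)


CLEANERS = {
    "html_residue": clean_html_residue,
    "hard_wrapped": clean_hard_wrapped,
}


def clean(tagged: str) -> str:
    """Step 3: read the tag line, drop it, and apply the matching cleaner."""
    header, _, body = tagged.partition("\n")
    if not header.startswith(TAG_PREFIX):
        raise ValueError("file was not run through the identifier first")
    cleaner = CLEANERS.get(header[len(TAG_PREFIX):])
    return cleaner(body) if cleaner else body
```

Keeping the tag in the file itself means the cleaner never has to re-detect the style, and you can open any copy and see how it was classified before anything gets changed.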

Preferred tools: jflex (or flex).

If I did it, I would take the information learned in step 1 and write regular expressions to define each detail. I would then use jflex to generate the actual source code for the programs in steps 2 and 3.
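
For a feel of what "a regular expression per detail" looks like, here is a small sketch in the shape of a lexer spec: one named rule per formatting detail, combined into a single scanner. This is Python's re module standing in for jflex, and the rule names and patterns are assumptions, not taken from any real file set. One difference worth knowing: flex and jflex pick the longest match, while Python's alternation takes the first rule that matches, so rule order matters here.

```python
import re

# Hypothetical formatting details; each one gets a named rule, as in a lexer spec.
RULES = [
    ("PAGE_NUMBER", r"^[ \t]*\d+[ \t]*$"),   # a line holding only a number
    ("HTML_TAG", r"</?[A-Za-z][^>]*>"),
    ("ENTITY", r"&[A-Za-z]+;|&#\d+;"),
    ("TEXT", r"[^<&\n]+"),
    ("NEWLINE", r"\n"),
    ("OTHER", r"."),                         # stray '<' or '&'
]
SCANNER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES),
    re.MULTILINE,
)


def tokenize(text):
    """Yield (rule_name, matched_text) pairs covering the whole input."""
    for match in SCANNER.finditer(text):
        yield match.lastgroup, match.group()


def strip_tokens(text, drop=("PAGE_NUMBER", "HTML_TAG")):
    """Rebuild the text, leaving out every token whose rule is in `drop`."""
    return "".join(value for kind, value in tokenize(text) if kind not in drop)
```

Counting how often each rule fires in a file is one way to drive the identifier, and dropping or rewriting selected token kinds is the cleaner; both reuse the same rule table.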


P.S. If you want to be really adventurous, you could combine the latter two steps by writing and running a yacc/jflex parser. It would be a lot more work, though.

P.P.S. I could do this for you. Recently I've been doing something very similar to this. The World Fact eBook I converted required several runs through various cleanup programs in order to remove the excess web formatting.