FYI, I've written a markup analyzer that greatly reduces the redundant markup Word produces for typical, lightly formatted text documents. For example,
Code:
<p class="block_1"><span class="text_3">Small Felonies</span><span class="text_1">, therefore, is the first single-author collection of exclusively short-short—none is longer than two thousand words—</span><span class="text_3">and</span><span class="text_1"> exclusively criminous stories. I make note of this fact with what I hope is pardonable pride. To have a first of </span><span class="text_3">any</span><span class="text_1"> kind in the mystery field is a rare treat.</span></p>
becomes
Code:
<p class="block_1"><i>Small Felonies</i>, therefore, is the first single-author collection of exclusively short-short—none is longer than two thousand words—<i>and</i> exclusively criminous stories. I make note of this fact with what I hope is pardonable pride. To have a first of <i>any</i> kind in the mystery field is a rare treat.</p>
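To give a rough idea of the kind of rewrite involved, here is a minimal Python sketch. It is not the analyzer's actual code. It assumes the span class that covers the most text is the paragraph's default style (so its wrapper can be dropped once that styling lives on the block), and that a hypothetical `ITALIC_CLASSES` set has already been worked out from the stylesheet. It uses a regex and only handles non-nested spans; a real implementation would work on the parsed tree.

```python
import re
from collections import Counter

# Matches simple, non-nested <span class="..."> runs.
SPAN = re.compile(r'<span class="([^"]+)">(.*?)</span>', re.S)

# Hypothetical: classes whose only effect (per the CSS) is italic text.
ITALIC_CLASSES = {"text_3"}


def simplify(html):
    """Drop the dominant span class and turn italic-only spans into <i>."""
    # Weigh each class by how many characters of text it wraps.
    weights = Counter()
    for cls, text in SPAN.findall(html):
        weights[cls] += len(text)
    if not weights:
        return html  # no spans, nothing to do

    # Assumption: the class wrapping the most text is the default run style,
    # and its styling is moved to the enclosing block's class.
    default = weights.most_common(1)[0][0]

    def repl(match):
        cls, text = match.group(1), match.group(2)
        if cls == default:
            return text                      # unwrap the default style
        if cls in ITALIC_CLASSES:
            return "<i>" + text + "</i>"     # semantic tag instead of a span
        return match.group(0)                # leave anything else untouched

    return SPAN.sub(repl, html)
```

Running `simplify()` on the first snippet above should give the second one, since `text_1` wraps most of the text and `text_3` is the italic class.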
On one test book this reduces 4.8MB of Word markup to 0.5MB of HTML + CSS, roughly an order of magnitude. (Before the analyzer, the HTML + CSS output was 1MB.)
These are pre-compression sizes, so the actual space savings will not be nearly as large; for example, that 4.8MB compresses down to about 300KB.
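The reason compression narrows the gap is that repeated span wrappers are exactly the kind of redundancy a compressor removes on its own. A small sketch with Python's `zlib` on synthetic text shows the effect; the vocabulary, the one-span-per-word markup and the `size_ratios` helper are all made up for illustration and are not measurements from the test book.

```python
import random
import zlib


def size_ratios(n_words=5000, seed=0):
    """Return (raw_ratio, compressed_ratio) of verbose vs. clean markup."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    vocab = ["crime", "story", "short", "mystery",
             "felony", "author", "first", "pride"]
    words = [rng.choice(vocab) for _ in range(n_words)]

    # Verbose: every word wrapped in the same redundant span.
    verbose = "".join(
        '<span class="text_1">%s </span>' % w for w in words
    ).encode("utf-8")
    # Clean: the same text with the wrappers removed.
    clean = "".join(w + " " for w in words).encode("utf-8")

    raw_ratio = len(verbose) / len(clean)
    compressed_ratio = len(zlib.compress(verbose)) / len(zlib.compress(clean))
    return raw_ratio, compressed_ratio


raw, packed = size_ratios()
print("raw size ratio: %.1f, compressed size ratio: %.1f" % (raw, packed))
```

The raw ratio should come out well above the compressed one, which is the same pattern as the 4.8MB file shrinking to about 300KB.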
More important than the space savings, of course, is that the markup becomes much more human-friendly. And all this in just 200 lines of code.
Assuming I don't find any show-stopper bugs while testing, it will be in the next release.