MobileRead Forums - View Single Post - DOCX Conversion Handler

kovidgoyal · 06-12-2013, 07:40 AM

FYI, I've written a markup analyzer that greatly reduces the redundant markup Word produces for a typical, lightly formatted text document. For example,

Code:

<p class="block_1"><span class="text_3">Small Felonies</span><span class="text_1">, therefore, is the first single-author collection of exclusively short-short—none is longer than two thousand words—</span><span class="text_3">and</span><span class="text_1"> exclusively criminous stories. I make note of this fact with what I hope is pardonable pride. To have a first of </span><span class="text_3">any</span><span class="text_1"> kind in the mystery field is a rare treat.</span></p>

becomes

Code:

<p class="block_1"><i>Small Felonies</i>, therefore, is the first single-author collection of exclusively short-short—none is longer than two thousand words—<i>and</i> exclusively criminous stories. I make note of this fact with what I hope is pardonable pride. To have a first of <i>any</i> kind in the mystery field is a rare treat.</p>

On one test book, this reduces 4.8MB of Word markup to 0.5MB of HTML + CSS, roughly an order of magnitude smaller. (Before the analyzer, the HTML + CSS was 1MB.)

These are pre-compression sizes, so the actual space savings will not be nearly as high; for example, that 4.8MB compresses down to about 300KB.

More important than the space savings is, of course, the fact that the markup becomes much more human-friendly. And all this in just 200 lines of code.

Assuming I don't find any show-stopper bugs while testing, it will be in the next release.
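
For anyone curious what this kind of simplification looks like, here is a minimal sketch in Python. It is not the actual calibre code; the function name `simplify_spans`, the `STYLES` table, and the shortened sample paragraph are all made up for illustration. It assumes flat (non-nested) spans with a single class each, and it assumes the dominant span's style is already what the paragraph inherits, so unwrapping it does not change rendering.

```python
import re
from collections import Counter

# Matches a flat <span class="...">text</span>; nested spans are NOT handled.
SPAN_RE = re.compile(r'<span class="([^"]+)">(.*?)</span>', re.DOTALL)

def simplify_spans(html, styles):
    """Unwrap the dominant span class and turn italic-only spans into <i>.

    `styles` is a hypothetical mapping of class name -> dict of CSS properties.
    """
    # Measure how many characters of text each span class covers.
    coverage = Counter()
    for cls, text in SPAN_RE.findall(html):
        coverage[cls] += len(text)
    if not coverage:
        return html
    dominant = coverage.most_common(1)[0][0]
    base = styles.get(dominant, {})

    def replace(match):
        cls, text = match.group(1), match.group(2)
        if cls == dominant:
            # Assumption: the parent block already carries this style.
            return text
        style = styles.get(cls, {})
        # Properties this span changes relative to the dominant style.
        changed = {k: v for k, v in style.items() if base.get(k) != v}
        # Properties the dominant style sets but this span does not.
        missing = [k for k in base if k not in style]
        if changed == {"font-style": "italic"} and not missing:
            return f"<i>{text}</i>"
        return match.group(0)  # leave anything else untouched

    return SPAN_RE.sub(replace, html)

# Hypothetical style table and a shortened version of the paragraph above.
STYLES = {
    "text_1": {"font-family": "serif"},
    "text_3": {"font-family": "serif", "font-style": "italic"},
}
SAMPLE = (
    '<p class="block_1"><span class="text_3">Small Felonies</span>'
    '<span class="text_1">, therefore, is a first of </span>'
    '<span class="text_3">any</span>'
    '<span class="text_1"> kind.</span></p>'
)

result = simplify_spans(SAMPLE, STYLES)
print(result)
```

A real analyzer would also have to move the dominant style onto the block's own class and cope with nested or multi-class spans, which this sketch deliberately skips.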
