MobileRead Forums - View Single Post - DOCX Conversion Handler

BetterRed · 06-08-2013, 03:25 AM

I have only used the new handler to convert DOCX files to EPUB. The EPUB files created by the new internal DOCX handler are ~80% larger than I was getting out of the DOCX_Input plugin on the same input files.

I started using the DOCX_Input plugin a few weeks ago. Initially I had some issues with it and its integration into Calibre conversion, but once I overcame them (thanks to Kovid and SauliusP) I was delighted with the results. The DOCX files were much smaller than the RTF equivalents, and the EPUBs from DOCX_INPUT conversion were also smaller than those created from the RTFs. DOCX->EPUB conversion was also noticeably faster.

The following file sizes are for a 22,000 word no-frills document with no cover. It's typical of my so-called books, which are mainly papers (Law, PPE) from academia, public institutions and the media.

RTF 878KB
EPUB FROM RTF 118KB
DOCX 97KB
EPUB FROM DOCX_INPUT 67KB

Net saving (source file plus EPUB): 996KB - 164KB = 832KB (83%), the bulk of which is due to the DOCX v RTF savings.
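For anyone who wants to check the arithmetic, here is a small Python sketch. It uses only the four sizes quoted above. The variable names are made up for illustration.

```python
# File sizes in KB, exactly as quoted in the post.
sizes_kb = {
    "RTF": 878,
    "EPUB from RTF": 118,
    "DOCX": 97,
    "EPUB from DOCX_INPUT": 67,
}

# Total footprint of each route: source document plus the EPUB made from it.
rtf_route = sizes_kb["RTF"] + sizes_kb["EPUB from RTF"]
docx_route = sizes_kb["DOCX"] + sizes_kb["EPUB from DOCX_INPUT"]

saving = rtf_route - docx_route
saving_pct = 100 * saving / rtf_route

# Share of the saving that comes from the source files alone (DOCX v RTF).
source_saving = sizes_kb["RTF"] - sizes_kb["DOCX"]
source_share_pct = 100 * source_saving / saving

print(f"RTF route: {rtf_route}KB, DOCX route: {docx_route}KB")
print(f"Net saving: {saving}KB ({saving_pct:.1f}%)")
print(f"Of which source files: {source_saving}KB ({source_share_pct:.1f}%)")
```

The source-file difference accounts for well over nine tenths of the saving, which is what "the bulk" refers to.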

However, the new built-in DOCX conversion handler results in an EPUB that is up to 80% larger than I have been getting from DOCX_INPUT. For the above example it's 120KB v 67KB, which is even a tad larger than the conversion from RTF (118KB).

A big plus for the DOCX_INPUT plugin was that I could eyeball the HTML in Tweak Book and Sigil and discern the actual text. This was something I could never do with the conversions from RTF, because there were so many HTML artefacts surrounding the actual text.

With the DOCX_INPUT plugin I get this for a mid-paragraph sentence in the EPUB HTML

Code:

The reputations of bankers were made blah blah. &nbsp;

With the inbuilt conversion of RTF to EPUB I get this for the same sentence

Code:

<span class="none1">The reputations of bankers were made blah blah</span>. <span class="none1">&nbsp;</span>

With the inbuilt conversion of DOCX to EPUB I get this for the same sentence

Code:

<span class="text4">The reputations of bankers were made blah blah</span><span>.</span> <span class="text4"><span class="calibre2">&nbsp;</span></span>

Visually all three look the same. I have been assuming (or maybe I was led to believe) that the <span></span> bloat in the RTF conversions was the result of MS artefacts, but now I have to wonder if that's correct.
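To put a rough number on the span overhead, here is a minimal Python sketch that compares the three snippets quoted above. It uses only the standard library `html.parser`. The labels and helper names are hypothetical, and it measures raw HTML characters only, not the compressed size inside the EPUB or the extra CSS rules the classes need.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(fragment):
    """Return the text a reader would see, with tags stripped."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


# The three versions of the same sentence, copied from the post.
snippets = {
    "DOCX_INPUT plugin": 'The reputations of bankers were made blah blah. &nbsp;',
    "built-in RTF": '<span class="none1">The reputations of bankers were made '
                    'blah blah</span>. <span class="none1">&nbsp;</span>',
    "built-in DOCX": '<span class="text4">The reputations of bankers were made '
                     'blah blah</span><span>.</span> <span class="text4">'
                     '<span class="calibre2">&nbsp;</span></span>',
}

baseline = len(snippets["DOCX_INPUT plugin"])
for label, html in snippets.items():
    ratio = len(html) / baseline
    print(f"{label}: {len(html)} chars of HTML, {ratio:.1f}x the plugin output")
```

All three fragments yield the same visible text, so the difference in length is pure markup.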

The vast bulk of documents I convert have only one font (nominally Times Roman). I use bold, italics, small caps and single underline, and it's all in one colour (nominally black on white).

For one file, size doesn't matter at all. Over thousands of 'books' it can be the difference between a 32GB thumb drive and a 64GB thumb drive, though even then size may not matter much in terms of storage. However, it really matters when you're on the end of a flaky satellite link running at a nominal 64Kb/sec, as some of my colleagues are, and as I sometimes am... we exchange 'books' almost every day...
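As a back-of-the-envelope illustration of what that means on such a link, here is a short Python sketch. It assumes "64Kb/sec" means 64 kilobits per second, treats the "K" in KB and Kb as the same factor, and ignores protocol overhead and dropouts, so real transfers will be slower. The function name is made up for illustration.

```python
def transfer_seconds(size_kb, link_kbit_per_sec=64):
    """Idealised transfer time: kilobytes -> kilobits, divided by link speed."""
    return size_kb * 8 / link_kbit_per_sec


# EPUB sizes quoted earlier in the post for the same 22,000 word document.
for label, size_kb in [("DOCX_INPUT plugin", 67), ("built-in DOCX handler", 120)]:
    print(f"{label}: {size_kb}KB takes about {transfer_seconds(size_kb):.1f}s")
```

Per file the gap is only a few seconds, but it adds up when 'books' are exchanged daily over a link that rarely reaches its nominal speed.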

Is there anything I can do in Calibre to get back to what I was getting out of the DOCX_INPUT plugin: eyeball-readable paragraphs in the HTML, and EPUB files that are up to half as big as they are now? Or is there any prospect of adjustments being made to Calibre to achieve a similar result?

Going back to and sticking with 0.9.33 and DOCX_INPUT would contravene my "keep software up to date" policy. Changing from Word is not an option.

I'm also seeing the H1 page break problems reported by SauliusP, but I am not too bothered by cosmetic issues as I'm sure they'll get fixed in future releases.

BR

Last edited by BetterRed; 06-08-2013 at 03:36 AM. Reason: clarity