DOCX Conversion Handler - Observations

BetterRed · 06-08-2013, 02:25 AM

I have only used the new handler to convert DOCX files to EPUB. The EPUB files created by the new internal DOCX handler are ~80% larger than the ones I was getting out of the DOCX_Input plugin from the same input files.

I started using the DOCX_Input plugin a few weeks ago. Initially I had some issues with it and its integration into Calibre conversion, but once I overcame them (thanks to Kovid and SauliusP) I was delighted with the results. The DOCX files were much smaller than the RTF equivalents, and the EPUBs from DOCX_INPUT conversion were also smaller than those created from the RTFs. And DOCX->EPUB conversion was also noticeably faster.

The following file sizes are for a 22,000 word no-frills document with no cover. It's typical of my so-called books, which are mainly papers (Law, PPE) from academia, public institutions and the media.

RTF 878KB
EPUB FROM RTF 118KB
DOCX 97KB
EPUB FROM DOCX_INPUT 67KB

Net saving: (878 + 118) - (97 + 67) = 996 - 164 = 832KB (83%), the bulk of which is due to the DOCX v RTF savings.
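As a quick check of that arithmetic, here is a small sketch using only the sizes quoted above (in KB). The variable names are mine, and "route" simply means the source file plus the EPUB made from it.

```python
# Sizes in KB, as quoted in the list above.
rtf, epub_from_rtf = 878, 118
docx, epub_from_docx_input = 97, 67

rtf_route = rtf + epub_from_rtf            # RTF source plus its EPUB
docx_route = docx + epub_from_docx_input   # DOCX source plus its EPUB

saving = rtf_route - docx_route
saving_pct = 100 * saving / rtf_route
# Share of the saving that comes purely from the smaller source file.
source_share = 100 * (rtf - docx) / saving

print(f"{rtf_route} - {docx_route} = {saving} KB ({saving_pct:.1f}% of the RTF route)")
print(f"{source_share:.0f}% of the saving is DOCX v RTF")
```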

However, the new built-in DOCX conversion handler results in an EPUB that is up to 80% larger than I have been getting from DOCX_INPUT. For the above example it's 120KB v 67KB, which is a tad larger than the conversion from RTF.

A big plus for the DOCX_INPUT plugin was that I could eyeball the HTML in Tweak Book and Sigil and discern the actual text. This was something I could never do with the conversions from RTF, because there were so many HTML artefacts surrounding the actual text.

With the DOCX_INPUT plugin I get this for a mid-paragraph sentence in the EPUB HTML:

Code:

The reputations of bankers were made blah blah. &nbsp;

With the inbuilt conversion of RTF to EPUB I get this for the same sentence:

Code:

<span class="none1">The reputations of bankers were made blah blah</span>. <span class="none1">&nbsp;</span>

With the inbuilt conversion of DOCX to EPUB I get this for the same sentence:

Code:

<span class="text4">The reputations of bankers were made blah blah</span><span>.</span> <span class="text4"><span class="calibre2">&nbsp;</span></span>

Visually all three look the same. I have been assuming (or maybe I was led to believe) that the <span></span> bloat in the RTF conversions was the result of MS artefacts, but now I have to wonder if that's correct.

The vast bulk of the documents I convert have only one font (nominally Times Roman). I use bold, italics, small caps and single underline, and it's all in one colour (nominally black on white).

For one file, size doesn't matter at all, but across thousands of 'books' it can be the difference between a 32GB thumb drive and a 64GB one, and even then size may not matter much in terms of storage. However, it really matters when you're on the end of a flaky satellite link running at a nominal 64Kb/sec, as some of my colleagues are and as I sometimes am... we exchange 'books' almost every day...

Is there anything I can do in Calibre to get back to what I was getting out of the DOCX_INPUT plugin: eyeball-readable paragraphs in the HTML, and EPUB files that are up to half as big as they are now? Or is there any prospect of adjustments being made to Calibre to achieve a similar result?

Going back to and sticking with 0.9.33 and DOCX_INPUT would contravene my "keep software up to date" policy. Changing from Word is not an option.

I'm also seeing the H1 page break problems reported by SauliusP, but I am not too bothered by cosmetic issues as I'm sure they'll get fixed in future releases.

BR

kovidgoyal · 06-08-2013, 03:54 AM

Do the following little experiments:

1) Unzip a docx file and open document.xml in a text editor; that should tell you whether the conversion is generating extra markup or not. Hint: the answer is that it isn't. The HTML markup is an almost literal translation of the markup in the docx. Every <span> in the HTML (with a couple of exceptions) corresponds to a <w:t> in the docx markup.

2) Try converting this docx file using the docx input plugin: http://calibre-ebook.com/downloads/demos/demo.docx. That will show you just how much formatting it throws away.

That said, optimizing the markup generated by the conversion is on my todo list. As I said, the current markup is an almost literal translation, so there is scope for analyzing and optimizing the generated markup.
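For anyone who would rather script experiment 1 than read the XML by eye, here is a minimal sketch. It assumes the main document part lives at word/document.xml inside the zip (the usual location for files saved by Word), and it builds a tiny stand-in docx in memory so that it runs on its own; the function name count_runs is made up for illustration.

```python
import io
import re
import zipfile

def count_runs(docx_file):
    """Count <w:r> (run) and <w:t> (text) elements in word/document.xml."""
    with zipfile.ZipFile(docx_file) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
    # "[ >]" keeps <w:rPr>, <w:tab/> and <w:tbl> out of the counts.
    runs = len(re.findall(r"<w:r[ >]", xml))
    texts = len(re.findall(r"<w:t[ >]", xml))
    return runs, texts

# A two-run stand-in document: one italic run, one plain run.
sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p>"
    "<w:r><w:rPr><w:i/></w:rPr><w:t>Small Felonies</w:t></w:r>"
    '<w:r><w:t xml:space="preserve">, therefore, is the first</w:t></w:r>'
    "</w:p></w:body></w:document>"
)
sample_docx = io.BytesIO()
with zipfile.ZipFile(sample_docx, "w") as zf:
    zf.writestr("word/document.xml", sample_xml)
sample_docx.seek(0)

run_count, text_count = count_runs(sample_docx)
print(run_count, text_count)
```

To try it on a real file, pass the path of the docx to count_runs instead of the in-memory sample and compare the counts with the number of <span> tags in the converted HTML.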

BetterRed · 06-08-2013, 07:08 AM

Thanks Kovid, fully understood.

It didn't occur to me that the inbuilt handler would produce bigger files; I was hoping they'd be even smaller.

I wasn't aware of the limitations of DOCX_Input (as revealed by demo.docx), because once I overcame those integration issues, the PI satisfied 100% of my needs, most of my wishes, and delivered some nice surprises. The documents we're dealing with are mainly simple text; if there are complexities we retain the original (99% PDF) and simply extract the text for analysis.

Thanks for the changes to the author link in Book Details - now I can set up my pseudonym links.

Request for Comment

Say I set up a 0.9.33 portable with the DOCX_Input PI installed.

Say I create a symlink in its folder called Calibre Library that 'points' to my real library, E:\Calibre Libraries\Main.

Then I could do my DOCX->EPUB conversions from that context and use the installed calibre for other stuff.

Just an idea.

BR

kovidgoyal · 06-08-2013, 08:31 AM

You can certainly keep a portable install around and point it to the same library if you want to use the plugin for docx. calibre's db format has not changed in years, so it should be safe. That said, keep an eye on the changelog to see if there are any db format changes. If you want to be absolutely secure you should probably write a bat script that does the conversions using the portable install and updates the epubs in place in the main library.
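A rough sketch of such a script, written in Python rather than as a bat file. The converter path is hypothetical, and the sketch assumes each EPUB should be written beside its DOCX with the same file stem, which is what "in place" would mean if both formats sit in the same book folder. ebook-convert itself takes the input file followed by the output file.

```python
import subprocess
from pathlib import Path

# Hypothetical location of the portable install's converter; adjust to suit.
PORTABLE_CONVERT = r"D:\Calibre Portable\Calibre\ebook-convert.exe"

def build_command(docx_path, converter=PORTABLE_CONVERT):
    """Return the ebook-convert argument list for one DOCX -> EPUB conversion."""
    docx_path = Path(docx_path)
    return [converter, str(docx_path), str(docx_path.with_suffix(".epub"))]

def convert_all(library_dir, converter=PORTABLE_CONVERT):
    """Convert every .docx under library_dir, writing each EPUB beside its source."""
    for docx_path in sorted(Path(library_dir).rglob("*.docx")):
        # check=True stops the batch at the first failed conversion.
        subprocess.run(build_command(docx_path, converter), check=True)

# Show the command that would be run for one file, without running it.
print(build_command("paper.docx", "ebook-convert"))
```

Calling convert_all on a library folder overwrites any EPUB that already has the same name as its DOCX, so it is worth trying on a copy first.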

BetterRed · 06-08-2013, 10:15 AM

Quote:

Originally Posted by kovidgoyal

You can certainly keep a portable install around and point it to the same library if you want to use the plugin for docx. calibre's db format has not changed in years, so it should be safe. That said, keep an eye on the changelog to see if there are any db format changes. If you want to be absolutely secure you should probably write a bat script that does the conversions using the portable install and updates the epubs in place in the main library.

Excellent - I'm not sure if I already knew that PI's are available from the ebook-convert command - but I do now, just read the manual.

Can, and more importantly should, I run bat files that invoke portable CLI programs whilst the installed GUI is running against the same database?

BR

kovidgoyal · 06-08-2013, 11:36 AM

No, you shouldn't.

BetterRed · 06-08-2013, 10:00 PM

Quote:

Originally Posted by kovidgoyal

No, you shouldn't.

Then I won't

I'm thinking that as part of saving a file in Word, I'll interrogate the path, and if it's in my Calibre Libraries folder then I'll write a line to the 'current DOCX conversion queue', which will be actioned when the GUI exits.

I'll also invoke the Count Pages and Modify PIs on the resultant EPUBs.

I'm hoping of course that this approach will have a limited half life because you'll find ways to optimise the output.
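The queue logic described above could look something like the following. This is only a sketch of the idea in Python; the Word side would presumably be a macro, and the function names and the one-path-per-line queue file are assumptions made for illustration.

```python
from pathlib import Path

def is_in_library(doc_path, library_root):
    """True if doc_path lies somewhere under library_root."""
    doc = Path(doc_path).resolve()
    root = Path(library_root).resolve()
    return root in doc.parents

def enqueue(doc_path, library_root, queue_file):
    """Append doc_path to the queue if it is inside the library; report whether it was queued."""
    if not is_in_library(doc_path, library_root):
        return False
    with open(queue_file, "a", encoding="utf-8") as fh:
        fh.write(str(Path(doc_path).resolve()) + "\n")
    return True

def drain(queue_file):
    """Return the queued paths (duplicates removed, order kept) and empty the queue."""
    queue = Path(queue_file)
    if not queue.exists():
        return []
    lines = queue.read_text(encoding="utf-8").splitlines()
    queue.write_text("", encoding="utf-8")
    return list(dict.fromkeys(line for line in lines if line))
```

The paths returned by drain would then be converted once the GUI has exited, so nothing touches the library while calibre is running.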

Slightly off topic - have you any plans to add a simple task builder to the calibre GUI?
Example - Edit Metadata->Convert->Count Pages->Modify as a chained set of steps. That's a sequence I use a lot, and because of interrupts and attention drift I sometimes miss a step.

BR

kovidgoyal · 06-08-2013, 10:02 PM

No, can't say I have, but patches are always welcome.

This should be doable as a plugin, one that allows the user to simply list and chain actions, somewhat like the favorites plugin.

kovidgoyal · 06-11-2013, 11:53 PM

@BR: Attach one of these docx files that shows a size increase of the epub compared to the docx. I'm guessing the size increase is because of the generated cover, but it would be helpful to have a sample to be sure.

BetterRed · 06-12-2013, 03:19 AM

Quote:

Originally Posted by kovidgoyal

@BR: Attach one of these docx files that shows a size increase of the epub compared to the docx. I'm guessing the size increase is because of the generated cover, but it would be helpful to have a sample to be sure.

I think you're right - I don't keep covers in epubs. When I did the first conversion with the new DOCX handler I think I forgot to do a Modify to delete the cover.

The difference I initially noticed was all the HTML span tags, and that led me to look at file sizes; I added 2+2 and got 5. The compression reduces all the <span class="calibre2"></span> and similar pairs to a fraction of their actual size - I should have realised that.
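The effect is easy to reproduce. The sketch below uses made-up sentences and Python's zlib, which implements the same deflate method that zip-based EPUBs normally use, to compare how much the span wrappers cost before and after compression. The exact numbers depend on the text, so none are quoted here.

```python
import zlib

# 2,000 made-up sentences, each ending in a full stop.
sentences = [f"Sentence number {i} about the reputations of bankers." for i in range(2000)]

# Plain paragraphs versus the same text wrapped in repetitive spans.
plain = "".join(f"<p>{s}</p>" for s in sentences).encode("utf-8")
spanned = "".join(
    f'<p><span class="text4">{s[:-1]}</span><span>.</span></p>' for s in sentences
).encode("utf-8")

def deflated(data):
    """Size in bytes after deflate compression at the highest level."""
    return len(zlib.compress(data, 9))

raw_overhead = len(spanned) - len(plain)
zip_overhead = deflated(spanned) - deflated(plain)

print(f"extra bytes from spans, uncompressed: {raw_overhead}")
print(f"extra bytes from spans, compressed:   {zip_overhead}")
```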

I just did the conversions on that same document again - taking care to remove the cover on BOTH epubs.

The built-in handler produced an epub of 67,506 bytes, a smidgin larger than the DOCX_Input epub (66,458 bytes). And in terms of disk space, they're exactly the same (69,632 bytes).
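The identical on-disk figure is what you would expect if the file system allocates space in 4,096-byte clusters. That cluster size is an assumption here (it is a common NTFS default), but it reproduces the 69,632 figure for both files:

```python
CLUSTER = 4096  # assumed allocation unit, in bytes

def size_on_disk(size_bytes, cluster=CLUSTER):
    """Round a file size up to a whole number of clusters."""
    clusters = -(-size_bytes // cluster)  # ceiling division
    return clusters * cluster

built_in = size_on_disk(67506)    # epub from the built-in handler
docx_input = size_on_disk(66458)  # epub from the DOCX_Input plugin
print(built_in, docx_input)
```

Both files need 17 clusters, so the 1,048-byte difference between them disappears on disk.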

The attachment shows the difference in HTML file sizes (I moved the HTML files into folders) - lots of markup I don't really need.

If I had a choice between reducing the markup from DOCX conversions, and having an option in EPUB Output to NOT include the cover then I would choose the latter.

I run Modify after every conversion, I only use Tweak Book occasionally, and if it's 'too hard' I can use Sigil, and if that's too hard I can go back to Word.

BR

kovidgoyal · 06-12-2013, 03:34 AM

If you don't want a cover, you have to have no cover in the GUI and use the option in EPUB output to not generate a default cover.
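On the command line, the matching EPUB output option is, as far as I know, --no-default-epub-cover; it is worth checking the ebook-convert help for your version before relying on it. A small sketch that only builds the command (the function name is made up):

```python
from pathlib import Path

def no_cover_command(docx_path, converter="ebook-convert"):
    """Build an ebook-convert command that skips the generated default cover."""
    docx_path = Path(docx_path)
    return [
        converter,
        str(docx_path),
        str(docx_path.with_suffix(".epub")),
        "--no-default-epub-cover",  # EPUB output: do not generate a default cover
    ]

print(" ".join(no_cover_command("paper.docx")))
```

This only suppresses the generated cover; a cover that is already set for the book would still be included, which is why the book needs to have no cover as well.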

BetterRed · 06-12-2013, 04:29 AM

I find having a generated cover in the GUI very useful. I use one of nine pictures to provide a visual cue as to the 'book's' overall subject, and the Title and Author are visible at a glance.

I don't want the cover in the book, because in that context it serves no practical or aesthetic purpose - in other words it's a waste of space. Some folks are OCD about having the right/best cover; I guess I'm OCD about having no cover.

BR

kovidgoyal · 06-12-2013, 07:40 AM

FYI, I've written a markup analyzer that greatly reduces the markup redundancy produced by Word for the typical, lightly formatted text document. For example,

Code:

<p class="block_1"><span class="text_3">Small Felonies</span><span class="text_1">, therefore, is the first single-author collection of exclusively short-short—none is longer than two thousand words—</span><span class="text_3">and</span><span class="text_1"> exclusively criminous stories. I make note of this fact with what I hope is pardonable pride. To have a first of </span><span class="text_3">any</span><span class="text_1"> kind in the mystery field is a rare treat.</span></p>

becomes

Code:

<p class="block_1"><i>Small Felonies</i>, therefore, is the first single-author collection of exclusively short-short—none is longer than two thousand words—<i>and</i> exclusively criminous stories. I make note of this fact with what I hope is pardonable pride. To have a first of <i>any</i> kind in the mystery field is a rare treat.</p>

On one test book this reduces 4.8MB of Word markup to 0.5MB of HTML + CSS, which is roughly an order of magnitude. (Before the analyzer, the HTML + CSS was 1MB.)

These are pre-compression sizes, so the actual space savings will not be nearly as high; for example, that 4.8MB compresses down to about 300KB.

More important than the space savings is, of course, the fact that the markup becomes much more human friendly. And all this in just 200 lines of code.

Assuming I don't find any show-stopper bugs while testing, it will be in the next release.
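To give a feel for the kind of transformation shown in the example, here is a toy sketch. It is not the analyzer described above and makes strong assumptions: spans are never nested, one class (text_1 here) is the paragraph's default text style and can simply be unwrapped, and another (text_3) means italic. The class roles are inferred from the before/after example, and the function name simplify is invented.

```python
import re

# Matches one non-nested <span class="...">...</span> pair.
SPAN = re.compile(r'<span class="([^"]+)">(.*?)</span>', re.S)

def simplify(html, default_class, tag_for_class):
    """Unwrap default-styled spans and turn known classes into plain tags."""
    def replace(match):
        cls, inner = match.group(1), match.group(2)
        if cls == default_class:
            return inner                      # styling already implied by the block
        tag = tag_for_class.get(cls)
        if tag:
            return f"<{tag}>{inner}</{tag}>"  # e.g. italic class -> <i>
        return match.group(0)                 # unknown class: leave untouched
    return SPAN.sub(replace, html)

before = (
    '<p class="block_1"><span class="text_3">Small Felonies</span>'
    '<span class="text_1">, therefore, is the first of </span>'
    '<span class="text_3">any</span><span class="text_1"> kind.</span></p>'
)
after = simplify(before, default_class="text_1", tag_for_class={"text_3": "i"})
print(after)
```

A real implementation would have to work out the dominant style per paragraph from the CSS and move it onto the block, rather than being told which class is the default.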

BetterRed · 06-12-2013, 05:53 PM

Quote:

Originally Posted by kovidgoyal

FYI, I've written a markup analyzer that greatly reduces the markup redundancy produced by Word for the typical, lightly formatted text document.

More important than the space savings is, of course, the fact that the markup becomes much more human friendly. And all this in just 200 lines of code.

Assuming I don't find any show-stopper bugs while testing, it will be in the next release.

Brilliant!

Can it be anticipated that some operations will also be faster, because there's less to compress/decompress and less markup? That was my experience when I used DOCX_Input: conversion, opening the viewer, and various plugins like Modify and Count Pages were all noticeably faster.

BR

PeterT · 06-14-2013, 05:26 PM

I tried reconverting your sample document, and see that the ePub output seems a lot larger than with the initial release.



