06-08-2013, 02:25 AM | #1 |
null operator (he/him)
Posts: 20,583
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
DOCX Conversion Handler - Observations
I have only used the new handler to convert DOCX files to EPUB. The EPUB files created by the new internal DOCX handler are ~80% larger than those I was getting out of the DOCX_Input plugin on the same input files.
I started using the DOCX_Input plugin several weeks ago. Initially I had some issues with it and its integration into Calibre conversion; but once I overcame them... thanks to Kovid and SauliusP... I was delighted with the results. The DOCX files were much smaller than the RTF equivalents, and the EPUBs from DOCX_INPUT conversion were also smaller than those created from the RTFs. And DOCX->EPUB conversion was also noticeably faster.
The following file sizes are for a 22,000 word no frills document with no cover. It's typical of my so-called books, they're mainly papers (Law, PPE) from academia, public institutions and the media.
RTF 878KB
EPUB FROM RTF 118KB
DOCX 97KB
EPUB FROM DOCX_INPUT 67KB
Net saving: 996KB - 164KB = 832KB (83%), the bulk of which is due to the DOCX v RTF savings.
However the new built-in DOCX conversion handler results in an EPUB that is up to 80% larger than I have been getting from DOCX_INPUT. For the above example it's 120KB v 67KB, and it's a tad larger than a conversion from RTF.
A big plus for the DOCX_INPUT plug-in was that I could eyeball the HTML in Tweak Book and Sigil and discern the actual text. This was something I could never do with the conversions from RTF, because there were so many HTML artefacts surrounding the actual text. With the DOCX_INPUT plugin I get this for a mid-paragraph sentence in the EPUB HTML
Code:
The reputations of bankers were made blah blah.
Code:
<span class="none1">The reputations of bankers were made blah blah</span>. <span class="none1"> </span>
Code:
<span class="text4">The reputations of bankers were made blah blah</span><span>.</span> <span class="text4"><span class="calibre2"> </span></span>
The vast bulk of documents I convert only have one font (nominally Times Roman) - I use bold, italics, small caps and single underline, and it's all in one colour (nominally black on white).
For one file, size doesn't matter at all - but on thousands of 'books' it can be the difference between a 32GB thumb and a 64GB thumb, but even then size may not matter much in terms of storage. However it really matters when you're on the end of a flaky satellite link running at a nominal 64Kb/sec as some of my colleagues are, and as I sometimes am... we exchange 'books' almost every day...
Is there anything I can do in Calibre to get back to what I was getting out of the DOCX_INPUT plugin - eyeball readable paragraphs in the HTML, and EPUB files that are up to half as big as they are now? Or is there any prospect of adjustments being made to Calibre to achieve a similar result?
Going back to and sticking with 0.9.33 and DOCX_INPUT would contravene my "keep software up to date" policy. Changing from Word is not an option.
I'm also seeing the H1 page break problems reported by SauliusP, but I am not too bothered by cosmetic issues as I'm sure they'll get fixed in future releases.
BR
Last edited by BetterRed; 06-08-2013 at 02:36 AM. Reason: clarity |
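The size figures quoted in the post above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers as posted (in KB); the variable names are made up for the illustration.

```python
# File sizes in KB, exactly as quoted in the post.
rtf, epub_from_rtf = 878, 118
docx, epub_from_plugin = 97, 67
epub_from_builtin = 120

rtf_route = rtf + epub_from_rtf        # source file plus EPUB, via RTF
docx_route = docx + epub_from_plugin   # source file plus EPUB, via DOCX_INPUT
saving = rtf_route - docx_route
saving_pct = 100 * saving / rtf_route

# How much larger the built-in handler's EPUB is than the plugin's EPUB.
growth_pct = 100 * (epub_from_builtin - epub_from_plugin) / epub_from_plugin

print(f"saving: {saving}KB ({saving_pct:.1f}%)")           # saving: 832KB (83.5%)
print(f"built-in vs plugin: {growth_pct:.0f}% larger")    # built-in vs plugin: 79% larger
```

So the "83%" and "up to 80% larger" figures are consistent with the sizes listed.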
06-08-2013, 03:54 AM | #2 |
creator of calibre
Posts: 43,864
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Do the following little experiments:
1) Unzip a docx file and open document.xml in a text editor, that should tell you whether the conversion is generating extra markup or not. Hint, the answer is it isn't. The HTML markup is an almost literal translation of the markup in the docx. Every <span> in the HTML (with a couple of exceptions) corresponds to a <w:t> in the docx markup.
2) Try converting this docx file using the docx input plugin: http://calibre-ebook.com/downloads/demos/demo.docx That will show you just how much formatting it throws away.
That said, optimizing the markup generated by the conversion is on my todo list. As I said, the current markup is an almost literal translation, there is scope for analyzing and optimizing the generated markup. |
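Experiment 1 can also be done without a text editor. The sketch below is not part of calibre; it simply opens the docx as a zip archive, reads `word/document.xml` (the usual location of the main document part) and counts paragraphs, runs and `<w:t>` text nodes. The function name is made up for the illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, i.e. what the w: prefix stands for.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_docx_markup(docx):
    """Count paragraphs, runs and text nodes in a .docx (path or file object)."""
    with zipfile.ZipFile(docx) as zf:
        # The main document part normally lives at word/document.xml.
        root = ET.fromstring(zf.read("word/document.xml"))
    return {
        "paragraphs": sum(1 for _ in root.iter(f"{{{W_NS}}}p")),
        "runs": sum(1 for _ in root.iter(f"{{{W_NS}}}r")),
        "texts": sum(1 for _ in root.iter(f"{{{W_NS}}}t")),
    }
```

Calling `count_docx_markup("some-book.docx")` on one of your own files (the file name here is a placeholder) and comparing the "texts" count with the number of `<span>` elements in the converted HTML should show roughly the one-to-one correspondence described above.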
06-08-2013, 07:08 AM | #3 |
null operator (he/him)
Posts: 20,583
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Thanks Kovid, fully understood
Didn't occur to me that the inbuilt handler would produce bigger files, I was hoping they'd be even smaller.
I wasn't aware of the limitations of DOCX_Input (as revealed by demo.docx), because once I overcame those integration issues, the PI satisfied 100% of my needs, most of my wishes and delivered some nice surprises. The documents we're dealing with are mainly simple text, if there are complexities we retain the original (99% PDF) - and simply extract the text for analysis.
Thanks for the changes to author link in Book Details - now I can set up my pseudonym links.
Request for Comment
say I set up a 0.9.33 portable with the DOCX_Input PI installed
say I create a symlink in its folder called Calibre Library that 'points' to my real library E:\Calibre Libraries\Main
then I could do my DOCX->EPUB Conversions from that context and use installed calibre for other stuff.
Just an idea.
BR
Last edited by BetterRed; 06-08-2013 at 07:38 AM. Reason: add RFC |
06-08-2013, 08:31 AM | #4 |
creator of calibre
Posts: 43,864
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You can certainly keep a portable install around and point it to the same library if you want to use the plugin for docx. calibre's db format has not changed in years, so it should be safe. That said, keep an eye on the changelog to see if there are any db format changes. If you want to be absolutely secure you should probably write a bat script that does the conversions using the portable install and updates the epubs in place in the main library.
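The script idea above could look something like the following. It is a rough sketch in Python rather than a bat file, and it makes several assumptions: the portable install path and the book id are placeholders, the executables are assumed to sit directly in the portable `Calibre` folder, and the `calibredb add_format` call with `--library-path` should be checked against the documentation of the calibre version actually used. It should only be run while the calibre GUI is closed.

```python
import subprocess
from pathlib import Path

# Hypothetical locations; adjust to the real portable install and library.
PORTABLE_DIR = Path(r"C:\CalibrePortable\Calibre")
LIBRARY = Path(r"E:\Calibre Libraries\Main")


def build_commands(docx_path, book_id, portable_dir=PORTABLE_DIR, library=LIBRARY):
    """Return (convert_cmd, add_format_cmd) as argument lists for subprocess."""
    epub_path = Path(docx_path).with_suffix(".epub")
    # ebook-convert takes the input file, then the output file.
    convert = [str(portable_dir / "ebook-convert.exe"), str(docx_path), str(epub_path)]
    # calibredb add_format attaches a file to an existing book record by id.
    add = [
        str(portable_dir / "calibredb.exe"), "add_format",
        "--library-path", str(library), str(book_id), str(epub_path),
    ]
    return convert, add


def convert_and_update(docx_path, book_id):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(docx_path, book_id):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the execution makes it easy to print and inspect the commands before letting the script touch the main library.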
|
06-08-2013, 10:15 AM | #5 | |
null operator (he/him)
Posts: 20,583
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
Can, and more importantly should, I run bat files that invoke portable CLI programs whilst the installed GUI is running against the same database? BR |
|
06-08-2013, 11:36 AM | #6 |
creator of calibre
Posts: 43,864
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
No, you shouldn't.
|
06-08-2013, 10:00 PM | #7 |
null operator (he/him)
Posts: 20,583
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Then I won't
I'm thinking that as part of saving a file in Word, I'll interrogate the path and if it's in my Calibre Libraries folder then I'll write a line to the 'current DOCX conversion queue', which will be actioned when the GUI exits. I'll also invoke the Count Pages and Modify PIs on the resultant EPUBs.
I'm hoping of course that this approach will have a limited half life because you'll find ways to optimise the output.
Slightly off-topic - have you any plans to add a simple task builder to the calibre GUI? Example - Edit Metadata->Convert->Count Pages->Modify as a chained set of steps. That's a sequence I use a lot, and because of interrupts and attention drift I sometimes miss a step.
BR |
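The queue idea in the post above could be sketched roughly as follows. Everything here is hypothetical: the function names, the one-path-per-line queue file and the way the queue is drained are illustrative choices, not anything calibre or Word provides.

```python
from pathlib import Path


def is_inside(path, folder):
    """True if path lies under folder (both resolved to absolute paths)."""
    try:
        Path(path).resolve().relative_to(Path(folder).resolve())
        return True
    except ValueError:
        return False


def enqueue_if_in_library(docx_path, library_root, queue_file):
    """Append docx_path to the queue file when it is inside library_root."""
    if not is_inside(docx_path, library_root):
        return False
    with open(queue_file, "a", encoding="utf-8") as fh:
        fh.write(str(Path(docx_path).resolve()) + "\n")
    return True


def drain_queue(queue_file):
    """Return queued paths (duplicates removed, order kept) and empty the queue."""
    queue = Path(queue_file)
    if not queue.exists():
        return []
    paths = list(dict.fromkeys(queue.read_text(encoding="utf-8").splitlines()))
    queue.write_text("", encoding="utf-8")
    return paths
```

Saving the same document several times only yields one queue entry after draining, so each DOCX would be converted once when the queue is processed after the GUI has exited.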
06-08-2013, 10:02 PM | #8 |
creator of calibre
Posts: 43,864
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
No, can't say I have, but patches are always welcome. This should be doable as a plugin, one that allows the user to simply list and chain actions, somewhat like the favorites plugin.
|
06-11-2013, 11:53 PM | #9 |
creator of calibre
Posts: 43,864
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@BR: Attach one of these docx files that show a size increase of the epub compared to the docx. I'm guessing the size increase is because of the generated cover, but it would be helpful to have a sample to be sure.
|
06-12-2013, 03:19 AM | #10 | |
null operator (he/him)
Posts: 20,583
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
The difference I initially noticed was all the HTML span tags, and that led me to look at file sizes, added 2+2 and got 5. The compression reduces all the <span class="calibre2"></span> and similar pairs to a fraction of actual size - I should have realised that.
I just did the conversions on that same document again - taking care to remove the cover on BOTH epubs. The built-in handler produced an epub of 67,506 bytes, a smidgin larger than the DOCX_Input epub (66,458 bytes). And in terms of disk space, they're exactly the same (69,632 bytes).
Attachment shows the difference in HTML file sizes (I moved HTML files into folders) - lots of markup I don't really need.
If I had a choice between reducing the markup from DOCX conversions, and having an option in EPUB Output to NOT include the cover then I would choose the latter. I run Modify after every conversion, I only use Tweak Book occasionally, and if it's 'too hard' I can use Sigil, and if that's too hard I can go back to Word.
BR
Last edited by BetterRed; 06-12-2013 at 03:22 AM. Reason: forgot attachment |
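Both observations in the post above are easy to reproduce. The sketch below uses `zlib`, which implements the same deflate compression normally used inside EPUB (zip) files, on artificially repetitive markup, and then rounds the two reported file sizes up to whole clusters. The 4096-byte cluster size is an assumption (a common default), and the test data is synthetic, so the compression ratios are far more extreme than for a real book.

```python
import math
import zlib

CLUSTER = 4096  # assumed allocation unit in bytes


def size_on_disk(n_bytes, cluster=CLUSTER):
    """Round a file size up to a whole number of clusters."""
    return math.ceil(n_bytes / cluster) * cluster


sentence = "The reputations of bankers were made blah blah. "
plain = (sentence * 500).encode("utf-8")
wrapped = ('<span class="text4">' + sentence + "</span>").encode("utf-8") * 500

# Raw size versus deflate-compressed size, with and without span wrappers.
for label, data in (("plain", plain), ("wrapped", wrapped)):
    print(label, len(data), "->", len(zlib.compress(data, 9)))

# Both EPUB sizes from the post occupy 17 clusters of 4096 bytes.
print(size_on_disk(67506), size_on_disk(66458))  # 69632 69632
```

The repeated span wrappers add a lot to the raw size but very little after compression, and with 4096-byte clusters both EPUBs round up to the same 69,632 bytes on disk, matching the figure reported.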
|
06-12-2013, 03:34 AM | #11 |
creator of calibre
Posts: 43,864
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If you don't want a cover, you have to have no cover in the GUI and use the option in EPUB output to not generate a default cover.
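For command-line conversions, the equivalent of that EPUB output setting is, to the best of my knowledge, the `--no-default-epub-cover` option of `ebook-convert`; check `ebook-convert --help` for your version before relying on it. The helper below is a hypothetical convenience wrapper that only builds the argument list.

```python
def epub_convert_command(src, dest, default_cover=False, exe="ebook-convert"):
    """Build an ebook-convert argument list, optionally suppressing the default cover."""
    cmd = [exe, str(src), str(dest)]
    if not default_cover:
        # Assumed EPUB output option: do not generate a cover when none is set.
        cmd.append("--no-default-epub-cover")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. As noted above, the book must also have no cover set, otherwise that cover is still included.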
|
06-12-2013, 04:29 AM | #12 |
null operator (he/him)
Posts: 20,583
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
I find having a generated cover in the GUI very useful, I use one of nine pictures to provide a visual cue as to the 'books' overall subject, also the Title and Author are visible at a glance.
I don't want the cover in the book, because in that context it serves no practical or aesthetic purpose - in other words it's a waste of space. Some folks are OCD about having the right/best cover, I guess I'm OCD about having no cover. BR |
06-12-2013, 07:40 AM | #13 |
creator of calibre
Posts: 43,864
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
FYI, I've written a markup analyzer that greatly reduces the markup redundancy produced by Word for the typical, low formatting text document. For example,
Code:
<p class="block_1"><span class="text_3">Small Felonies</span><span class="text_1">, therefore, is the first single-author collection of exclusively short-short—none is longer than two thousand words—</span><span class="text_3">and</span><span class="text_1"> exclusively criminous stories. I make note of this fact with what I hope is pardonable pride. To have a first of </span><span class="text_3">any</span><span class="text_1"> kind in the mystery field is a rare treat.</span></p>
Code:
<p class="block_1"><i>Small Felonies</i>, therefore, is the first single-author collection of exclusively short-short—none is longer than two thousand words—<i>and</i> exclusively criminous stories. I make note of this fact with what I hope is pardonable pride. To have a first of <i>any</i> kind in the mystery field is a rare treat.</p>
These are pre-compression sizes so the actual space savings will not be nearly as high, for example that 4.8 MB compresses down to about 300KB. More important than the space savings, is of course the fact that the markup becomes much more human friendly. And all this in just 200 lines of code.
Assuming I don't find any show-stopper bugs while testing, it will be in the next release. |
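To make the before/after above concrete, here is a toy version of the idea. It is not calibre's analyzer: it assumes the caller already knows which generated class means "italic" and which class just repeats the paragraph's default formatting (both mappings below are hypothetical), and it only handles flat, non-nested spans.

```python
import re

# Hypothetical mapping from generated class names to semantic tags.
CLASS_TO_TAG = {"text_3": "i"}
# Hypothetical classes that carry only the paragraph's default formatting.
DEFAULT_CLASSES = {"text_1"}

SPAN = re.compile(r'<span class="([^"]+)">(.*?)</span>', re.DOTALL)


def simplify_spans(html):
    """Rewrite flat spans: map known classes to tags and unwrap default ones."""
    def replace(match):
        cls, inner = match.group(1), match.group(2)
        if cls in CLASS_TO_TAG:
            tag = CLASS_TO_TAG[cls]
            return f"<{tag}>{inner}</{tag}>"
        if cls in DEFAULT_CLASSES:
            return inner
        return match.group(0)  # leave anything unrecognised untouched
    return SPAN.sub(replace, html)
```

Applied to the first snippet with these two mappings, this produces the second snippet. A real implementation has to derive the mappings from the stylesheet and cope with nested elements, which a regular expression cannot do reliably.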
06-12-2013, 05:53 PM | #14 | |
null operator (he/him)
Posts: 20,583
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
Can it be anticipated that some operations will also be faster because there's less to compress/decompress and less markup? That was my experience when I used DOCX_Input - conversion, viewer open, various plugins like Modify and Count Pages were all noticeably faster. BR Last edited by BetterRed; 06-12-2013 at 06:10 PM. Reason: remove ambiguity |
|
06-14-2013, 05:26 PM | #15 |
Grand Sorcerer
Posts: 12,171
Karma: 73448616
Join Date: Nov 2007
Location: Toronto
Device: Nexus 7, Clara, Touch, Tolino EPOS
|
I tried reconverting your sample document, and see that the ePub output seems a lot larger than with the initial release.
|
|