Old 06-08-2012, 02:45 AM   #1
SauliusP.
Plugin developer
Posts: 108
Karma: 24394
Join Date: Feb 2012
Location: Lithuania
Device: Kindle
[Input Plugin] DOCX Input

Hello,

THIS PLUGIN IS OBSOLETE as of Calibre version 0.9.34, because a native plugin supersedes it. This plugin is no longer supported.

As an article writer I have lots of DOCX files, and I tried to find a good free tool for DOCX to EPUB, AZW3 or MOBI conversion. However, the good EPUB tools are not free, and Amazon's conversion service did not satisfy me: it makes the formatting crappy and "not book-like". So here they are, my own conversion tools. Please feel free to use them for your own purposes. Development will continue, and I will keep adding new features. I was quite surprised that there was no other such plugin for Calibre, as the DOCX format is comparatively simple.

The DOCX Input plugin converts a DOCX file to OEB (if I'm not mistaken, a bunch of HTML files with an OPF file and CSS stylesheets). Calibre then converts that to any format it supports. My main targets are AZW3 (KF8) and MOBI, but no format-specific hacks are included for better support.

Did you know that with this plugin you can view DOCX files in Calibre without opening them in Word?! Just go to Settings > Behavior and tick the DOCX format in the "Use internal viewer for" list!

TTF and OTF (type "OTTO") font embedding is supported.
Note: embedding fonts is your legal responsibility (check the copyright first).
OTF files are supported only if they are of TTF type or of type "OTTO", i.e. a single font/family per file. TrueType Collections are not yet supported.
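For anyone curious how the three kinds of font file can be told apart, here is a minimal sketch based on the first four bytes of the file. The tag values are the standard ones for TrueType, "OTTO" and TrueType Collection files; the function names are made up for this example, and it is not necessarily how the plugin itself does the check.

```python
# Sketch only: classify a font by the 4-byte tag at the start of the file.
# sniff_font_type and is_embeddable are hypothetical names, not plugin code.

def sniff_font_type(header: bytes) -> str:
    """Return a rough font type from the first bytes of a font file."""
    tag = header[:4]
    if tag in (b"\x00\x01\x00\x00", b"true"):
        return "truetype"        # TrueType outlines, one font per file
    if tag == b"OTTO":
        return "opentype-otto"   # OpenType with CFF outlines
    if tag == b"ttcf":
        return "collection"      # TrueType Collection, several fonts per file
    return "unknown"


def is_embeddable(header: bytes) -> bool:
    """Single-font files only, matching the restriction described above."""
    return sniff_font_type(header) in ("truetype", "opentype-otto")
```

In practice you would read the header with something like open(path, "rb").read(4) and pass it in.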

The next post contains features, the user guide and other information, plus a demo DOCX file that shows off the supported features of the plugin.


SUPPORTED FEATURES
Spoiler:

1. Conversion to CSS and filtering of Word styles (only in-use styles are converted).
2. Paragraph properties: left and right indents, first line indent, last rendered page break (which might be a manual page break, a style-based page break, a section break etc.).
3. Image support. Pictures with text wrapped around them are floated to the left or right side. Word itself stores no alignment, so I calculate it like this: if an image is 7 centimeters (or more) off the left page boundary, I assume it is "right-aligned".
4. Tables (including nested tables inside a cell).
5. Everything up to the first rendered page break is considered to be "a cover", because most of the documents I convert include some type of cover followed by a manual page break.
6. Font embedding of DejaVu Serif (bundled with the plugin itself).
7. Footnotes are saved into individual HTML files and superscript links are added.
8. Paragraphs that have TOC level styles applied (like Heading 1, 2 etc., or custom ones) are converted to HTML tags of the appropriate level (h1, h2 etc.).
9. Font sizes are converted to pt (the same value as you see in Word itself).
10. Indents are converted to em (just looks better).
11. Line breaks.
12. Options dialogue (via "Customize plugin") with these settings: force use of the first image in the document as the cover, even if the metadata contains another one (on/off); drop content until the first page break (assuming that the first page is just a cover image followed by a page break); embed fonts (on/off), which is particularly useful when testing output with Calibre's EPUB viewer, as it drops formatting because of a Qt bug.
13. Strike-through (double strike-through is converted as single), subscript and superscript.
14. Underline support.
15. TTF font embedding.
16. Font face support (with embedding).
17. Lists support: numbered, bulleted, nested, continued.
18. OTF type "OTTO" embedding.
19. Paragraphs "before"/"after" setting.
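The left/right guess for floated pictures (item 3 above) can be sketched in a few lines. This assumes the horizontal offset is already measured from the left page boundary and is given in EMUs, the unit DOCX uses for drawing positions (360000 EMU per centimeter). The function and constant names are invented for the example and are not taken from the plugin.

```python
# Sketch only: decide the CSS float side from an image's horizontal offset.
# float_side is a hypothetical helper, not the plugin's real code.

EMU_PER_CM = 360000            # DrawingML: 914400 EMU per inch = 360000 per cm
RIGHT_ALIGN_THRESHOLD_CM = 7.0 # the 7 cm rule described in item 3


def float_side(offset_emu: int) -> str:
    """Return 'right' if the image starts 7 cm or more from the left edge."""
    offset_cm = offset_emu / EMU_PER_CM
    return "right" if offset_cm >= RIGHT_ALIGN_THRESHOLD_CM else "left"


def float_css(offset_emu: int) -> str:
    """Build the CSS declaration for the wrapped image."""
    return "float: %s;" % float_side(offset_emu)
```

The fixed threshold is a heuristic, so a narrow picture in the middle of a wide page may still end up on the "wrong" side.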


NOT SUPPORTED
Spoiler:

1. Table styling. For now, collapsed 1px black borders are hard-coded.
2. Footnotes back-link.
3. Endnotes are not supported and support is not planned. If required, I convert all endnotes to footnotes beforehand.
4. Other fancy things, like vector graphics, OLE objects, effects etc. Not planned either.


PLANNED
Spoiler:

1. Options to switch font-size units: em, pt, px, %.
2. Table styling (if not too difficult).
3. Embedding of fonts that are already embedded in the DOCX itself.


User Guide
Spoiler:

When you open the conversion dialogue, you'll see a "DOCX Input" icon on the left. There you can choose options to adjust the conversion from DOCX.



1. Use first found image as cover. Default: ON. The very first picture in the document will be used as the book's cover during conversion.
2. Skip contents until first page break. Default: ON. This is closely tied to the option above. I usually have a book like this: the first page contains the cover image and is followed immediately by a page/section break. So if I'm using this image as the cover, there is no need to repeat it in the output book.
3. Replace paragraph spacing with empty lines. Default: OFF. If a paragraph has a "before" or "after" setting greater than 0 and the option is ON, an empty paragraph will be inserted before or after it as appropriate. Otherwise "before" or "after" will be set as a margin.
4. Embed fonts: All (all TTF fonts found in the document), DejaVu Serif (if you are not sure whether it is legal to embed other fonts), None (when converting to a font-unaware format, like MOBI).
5. Set "Normal" font family to "Serif". This is particularly useful when converting a book to AZW3 (KF8) format and you want the majority of the text to be displayed in the native Kindle font (Caecilia LT or another one configured by the user), i.e. font family styling is kept only for headers, captions and other types of highlighted text.
6. Scan fonts. This saves some CPU and I/O cycles: fonts are not installed very often, so it is enough to scan for them occasionally. On first use, click this button so the plugin can gather all TTF fonts installed in your OS (tested on Windows and Linux; Mac font directories are also included).
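To make option 3 more concrete, here is a rough sketch of the two output modes for one paragraph. The HTML shape (an empty paragraph holding a non-breaking space, margins in pt) and the function name are assumptions made for illustration; the plugin's real output may differ.

```python
# Sketch only: two ways to express Word's "before"/"after" paragraph spacing.
# render_paragraph is a hypothetical helper, not the plugin's real code.
from html import escape

EMPTY_PARAGRAPH = "<p>&nbsp;</p>"


def render_paragraph(text, before_pt=0.0, after_pt=0.0, use_empty_lines=False):
    """Return HTML for one paragraph, honouring the spacing option."""
    body = escape(text)
    if use_empty_lines:
        # Option ON: spacing becomes extra empty paragraphs.
        parts = []
        if before_pt > 0:
            parts.append(EMPTY_PARAGRAPH)
        parts.append("<p>%s</p>" % body)
        if after_pt > 0:
            parts.append(EMPTY_PARAGRAPH)
        return "\n".join(parts)
    # Option OFF: spacing becomes CSS margins.
    styles = []
    if before_pt > 0:
        styles.append("margin-top: %gpt" % before_pt)
    if after_pt > 0:
        styles.append("margin-bottom: %gpt" % after_pt)
    if styles:
        return '<p style="%s">%s</p>' % ("; ".join(styles), body)
    return "<p>%s</p>" % body
```

Empty paragraphs are the cruder choice, but older readers that ignore margins still show the gap.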

To get the best results, Calibre should also be tuned a bit.
1. To generate a TOC, go to Common Options, Table of Contents and add expressions for HTML headings (use the wizard or enter //h:h1 for Level 1 TOC, //h:h2 for Level 2 and //h:h3 for Level 3).
2. For EPUB conversion go to EPUB output options and tick "No default cover" and "No SVG cover".
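If you want to see what those heading expressions would pick up, you can try them outside Calibre. The sketch below uses Python's standard ElementTree, which supports only a subset of XPath, so //h:h1 is written as .//h:h1; the h prefix is bound to the XHTML namespace, which as far as I know is also what Calibre does. The sample markup and the helper name are invented for the example.

```python
# Sketch only: list the headings that a //h:hN expression would match.
# headings() and SAMPLE are made up for this illustration.
import xml.etree.ElementTree as ET

XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h1>Part One</h1><h2>Chapter 1</h2><p>Text</p><h2>Chapter 2</h2>
</body></html>"""


def headings(xhtml, level):
    """Return the text of every heading of the given level, in document order."""
    root = ET.fromstring(xhtml)
    matches = root.findall(".//h:h%d" % level, XHTML_NS)
    return ["".join(el.itertext()).strip() for el in matches]


if __name__ == "__main__":
    for lvl in (1, 2, 3):
        print(lvl, headings(SAMPLE, lvl))
```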


All critiques, crash reports and suggestions are most welcome, but I will not be quick with responses or with developing new features. At the moment I'm quite satisfied with the plugins.

Version history:
Spoiler:
Version 0.0.22 2013-01-11
Fixed another image processing bug that occurred when the file was exported from other programs.

Version 0.0.21 2012-12-18
Fixed an image processing bug: if there was only one image in the document and it was used as the cover but also occurred in other places, it disappeared from those places.

Version 0.0.20 2012-12-17
Addressed some formatting problems with hanging indents, especially in lists. However, there will still be some inaccuracies with lists. Kindle with KF8 supports them perfectly, older MOBI does not. The internal Calibre viewer shows everything nicely, but the CoolReader application fails on negative first line indents (hanging indents). The demo DOCX sheet is also updated. Found out why the version history was not available in Calibre and fixed it.

Version 0.0.19 2012-12-12
Fixed a bug, reported by the Czech "book brothers", that caused the plugin to crash. The fix covers numbering styles, which were previously not taken into account.

Version 0.0.18 2012-11-21
Long awaited (by some users) change: paragraph background colour ("shading" in Word terms) and characters background colour (a.k.a. "highlight").

Version 0.0.17 2012-11-16
New features:
  • Default font subfamily embedding when the required one is not present. E.g. if one has only the "Regular" font but sets it to "Bold" in Word, the "Regular" font will be included anyway. Supported default subfamilies are: "Regular", "Book", "Normal", "Medium". Rescan your fonts after this update!
  • Vertical paragraph spacing with "Before" and "After" settings. Also included a tick mark to replace "Before" and "After" with empty paragraphs for better e-reader compatibility.
  • Back-links in the footer text.
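The subfamily fallback from the first bullet of this release can be pictured roughly like this. The dictionary layout for scanned fonts, the function name and the order in which the defaults are tried are all assumptions for the sake of the example, not the plugin's actual data structures.

```python
# Sketch only: pick a font file, falling back to a default subfamily.
# pick_font_file and the `installed` layout are hypothetical.

DEFAULT_SUBFAMILIES = ("Regular", "Book", "Normal", "Medium")


def pick_font_file(installed, family, subfamily):
    """Return the path for family/subfamily, or a default face, or None.

    `installed` maps family name -> {subfamily name -> file path}.
    """
    faces = installed.get(family, {})
    if subfamily in faces:
        return faces[subfamily]
    # Requested face is missing: embed a default face so the family still renders.
    for fallback in DEFAULT_SUBFAMILIES:
        if fallback in faces:
            return faces[fallback]
    return None
```

For example, with only {"Regular": "gentium.ttf"} scanned for a family, a request for "Bold" would return "gentium.ttf".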

Version 0.0.16 2012-11-12
Bug Fixes:
  • Styles overriding in Word 2010 document and leaving some formatting behind.
  • Style naming in final CSS, overhead prefix added where not required.

Version 0.0.15 2012-10-08
New features:
  • Word 2010 styling included (stylesWithEffects.xml).
  • Special CSS selector naming when converting Word styles whose names start with a number (invalid in CSS).
  • Few generic speed optimizations (non-functional).
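As an illustration of the selector naming issue from this release: a CSS class name cannot begin with a digit, so a Word style ID such as "1stHeading" needs a prefix. The sketch below adds one only when needed; the prefix "s_", the replacement of other characters by underscores and the function name are assumptions, not the plugin's real naming scheme.

```python
# Sketch only: turn a Word style ID into a valid CSS class name.
# css_class_name and the "s_" prefix are hypothetical.
import re


def css_class_name(word_style_id, prefix="s_"):
    """Return a class name that is safe to use in a CSS selector."""
    # Replace anything outside a conservative identifier alphabet.
    name = re.sub(r"[^A-Za-z0-9_-]", "_", word_style_id)
    # Prefix only when the name does not start with a letter or underscore.
    if not re.match(r"[A-Za-z_]", name):
        name = prefix + name
    return name
```

Adding the prefix only where it is required keeps the other class names readable in the final stylesheet.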

Version 0.0.14 2012-10-04
Bug Fixes:
  • Fonts not embedded, when font-family is set directly, not via styles.
  • "Book" subfamily not used in place of "Regular" when the latter is not present.

Version 0.0.13 2012-10-02
New features:
  • OTF type "OTTO" font embedding.
  • "Normal" style font family substitution with generic "Serif" (mainly for KF8).

Version 0.0.12 2012-09-28
Bug Fixes:
  • Cover not generated.
  • Failure on table styles, if present in document.
New features:
  • TTF font embedding.
  • List conversion, including nested lists, either numbered or bulleted.

Version 0.0.11 2012-09-17
Fixed a bug with non-inline images that caused the plugin to crash.

Version 0.0.10 2012-08-13
Fixed bug with skipping content until first page break when there is no page break in the document.

Version 0.0.9 2012-08-09
Plugin configuration (which used to be very inconvenient to access) finally moved to the normal input options. Thanks to Kovid for enhancing Calibre's code to accept such a feature.

Not a version really 2012-07-18
Real motivation for a new version release, as today I received the first donation. Thanks, Keith!

Version 0.0.8 2012-06-17
Bug Fixes:
  • Intermittent underline of text due to non-standard false-underline handling in text formatting tags.
New features:
  • Underline support.
  • Table width set to 100%.
  • Right-side alignment of pictures.

Version 0.0.7 2012-06-17
Bug Fixes:
  • Some text missing in a paragraph, because the "characters" method in SAX sometimes delivers the text in several chunks.
New features:
  • Strike-through, superscript and subscript support
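For the curious, the SAX bug fixed in this version is a classic one: the parser may call "characters" several times for a single text node, so the chunks have to be collected and joined when the element ends. Below is a minimal sketch with Python's standard xml.sax; the handler class and the sample XML are made up for the example, and the plugin's own handler is of course more involved.

```python
# Sketch only: collect text delivered in several characters() calls.
# ParagraphTextHandler is a hypothetical handler, not the plugin's real code.
import xml.sax

SAMPLE = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/'
    b'wordprocessingml/2006/main"><w:body><w:p><w:r>'
    b"<w:t>Tom &amp; Jerry</w:t></w:r></w:p></w:body></w:document>"
)


class ParagraphTextHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._chunks = []
        self._in_text = False

    def startElement(self, name, attrs):
        if name == "w:p":
            self._chunks = []        # new paragraph, start collecting
        elif name == "w:t":
            self._in_text = True

    def characters(self, content):
        # May be called more than once per text node, so append, never assign.
        if self._in_text:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "w:t":
            self._in_text = False
        elif name == "w:p":
            self.paragraphs.append("".join(self._chunks))


def paragraph_texts(xml_bytes):
    handler = ParagraphTextHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.paragraphs
```

Assigning the text instead of appending it would keep only the last chunk, which is how the missing-text symptom shows up.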

Version 0.0.6 2012-06-12
Bug Fixes:
  • Page cover href pointed to html instead of image file
New features:
  • Customization dialogue added
  • Line-break support
  • A bit more distinguishable footnote link (Atlantis-like)

Version 0.0.5 2012-06-08
Initial release with a few little bugs and initial features 1–10.

Attached Files
File Type: zip Calibre-DOCX-Input.v0.0.22.zip (699.6 KB, 36713 views)

Last edited by SauliusP.; 06-07-2013 at 04:26 AM.