[Plugin] DOCXImport
2 Attachment(s)
DOCXImport: Import DOCX documents into Sigil as epubs.
(based on the Python Mammoth module)

** NOTE: this plugin periodically checks for updated versions by connecting to github (where the source is maintained). ** (this update check can be disabled via the GUI)

Minimum Sigil requirement: v0.9.0 or higher
Python Requirements: Python 3.4+ (Bundled or external)
OS Requirements: Windows/Linux/OS X

*** Linux users will have to make sure that the Tk or PyQt5 graphical python module is present if it's not already. On Debian-based flavors this can be done with "sudo apt-get install python3-tk" (or python3-pyqt5). On Arch distributions it can be done with "pacman -S tk" and/or "pacman -S python-pyqt5". ***

* Note: Do not rename any Sigil plugin zip files before attempting to install them *

Select a pre-existing DOCX file using the file dialog and it will be imported as a single-file epub. The following features are currently supported (provided by Mammoth):
An example of a style map (as well as a sample docx and css file it will work with) is in the samples.zip attached to this post. More info on writing custom style maps can be found in the "Writing Style maps" section of Mammoth's README.

DOCXImport's code is hosted/maintained on GitHub. The very latest version (and all previous versions) of DOCXImport can always be found on its GitHub Releases Page.
|
Wow!
Nicely done! KevinH |
Thanks! Mammoth is truly doing all the heavy lifting at the moment, though. I just made the necessary modules portable and namespaced them (so they could never potentially conflict with the PyPI versions) and slapped a Sigil plugin wrapper around them.
I was pleasantly surprised at how well Mammoth performed out of the box. Now I need to familiarize myself with it more so I can start tweaking. |
Which produces a better ePub, this plugin or Calibre?
|
Well, FWIW it works on my Kubuntu 14.04, 32-bit system (Sigil 0.9.4).

I didn't have a genuine .docx file handy, so I loaded an .odt into LibreOffice and saved it as .docx, which leads to my question...

The .odt document had been styled with several custom paragraph and character styles, but these were not preserved (i.e. not even the class names) in the epub. It did identify headers (all coded as h1), and all paragraphs came through as plain p, regardless of whatever style was used in the original document.

Is this to be expected at this stage, or is it because the .docx via LibreOffice isn't quite legit?

Anyway, quite an interesting plugin!

Albert |
As you are no doubt aware, the Writer2Latex add-on package for LibreOffice/OpenOffice contains, besides the add-ons, a stand-alone Java utility that can be used to produce an epub from an open-document (.odt) file. Since, for my sins, I'm the guy that gets to clean up and normalize the Word .doc files from the authors for conversion to epub or placement in InDesign, I use writer2latex a lot. Seems like Mammoth is a work-alike program. How nice it would be to directly import the .odt or .docx file into Sigil without the manual conversion step!

ETA: Of course the add-in will export an epub directly from LibreOffice, but the stand-alone is much more flexible (IMHO) and it's easy to modify the configuration as needed. |
I did a test with two genuine (and plain) docx files:
- structure (chapter titles as h1) was kept. I only had to regenerate the toc.ncx to get a brand new one.
- paragraphs are all transformed to plain p
- italics are kept
- footnotes with return links were all correctly kept.

This plugin can already save a lot of time.

Edit: I did a test with a long book with h1 and h2 headings and there was no problem.

* I also use writer2latex (however not the standalone Java tool but the writer2xhtml extension for LibreOffice). It's very precise and highly commendable. |
There is a good chance that if a style in use contains italic (or bold, etc.), the italic will be gone once the paragraph is transformed to a plain p.
|
You are right. The italics that were kept were plain words or expressions between em tags. Sorry for this.
|
I know the issue that causes this and it is difficult to avoid. The only way to 'solve' this is to examine the Word style and then apply the italics directly to the words/paragraphs that have that style, before converting them to a standard paragraph. This is not as easy as it sounds though... It is also nothing that Diap can easily solve; this should be part of the Mammoth library.
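To make the "examine the Word style" step concrete, here is a minimal sketch (not how Mammoth or DOCXImport actually work) that lists which styles in a docx switch italics on. It uses only the Python standard library and reads word/styles.xml straight out of the docx zip. The function name italic_style_names is made up for illustration, and the sketch deliberately ignores style inheritance (w:basedOn) and direct formatting, which is a large part of why the real problem is harder than it sounds.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/styles.xml inside a .docx archive.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def italic_style_names(docx_file):
    """Return the names of styles whose own run properties turn italics on.

    Hypothetical helper for illustration only. `docx_file` may be a path
    or a binary file-like object. Inherited (w:basedOn) formatting is NOT
    resolved here.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))

    names = []
    for style in root.iter(W + "style"):
        italic = style.find(W + "rPr/" + W + "i")
        if italic is None:
            continue
        # A bare <w:i/> means "on"; an explicit false/0/off value means "off".
        if italic.get(W + "val", "true") in ("false", "0", "off"):
            continue
        name = style.find(W + "name")
        names.append(name.get(W + "val") if name is not None
                     else style.get(W + "styleId"))
    return names
```

The returned names are the styles that would need special handling (for example a dedicated class in the stylesheet) so their italics are not lost when the paragraph becomes a plain p.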
I have a lot of experience with these kinds of issues due to my work on the add-in. That is part of the reason I ended up using another method of generating html. |
My plan is to leave it entirely up to the user through Mammoth's style mappings and the user's own css templates. ;)

I'm not really envisioning this plugin being used by user A to convert user B, C, D, and E's docx files automagically. I envision it being used by a writer/user who's adopted a standard for styling all their docx documents. That way, they create a custom Mammoth style-map (or a few) and an associated stylesheet. Once they have that in place, they can focus on creating their Word/LibreOffice documents. The style-map will take care of mapping all standard and custom docx styles/headings to specific html/class-names (with associated css).

In other words ... documents will be created that conform with a pre-existing style-map/css, rather than creating a style-map/css to accommodate each particular document (though the latter is still doable provided the user doesn't mind the extra work). |
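As a rough illustration of that workflow, the sketch below builds a style map in the syntax described in Mammoth's README and passes it to mammoth.convert_to_html. The style names ("Chapter Title", "Scene Break", "Thought") and the helper names build_style_map and convert are hypothetical. It also assumes the standalone mammoth package from PyPI; the plugin itself bundles its own namespaced copy and takes the style map through its GUI, so this is not the plugin's code.

```python
# Hypothetical "house style": docx style selectors mapped to html/class targets.
HOUSE_STYLES = [
    ("p[style-name='Chapter Title']", "h1:fresh"),
    ("p[style-name='Scene Break']", "p.scene-break:fresh"),
    ("r[style-name='Thought']", "em.thought"),
]


def build_style_map(mappings):
    """Join (docx selector, html target) pairs into style-map text, one rule per line."""
    return "\n".join("{} => {}".format(src, dst) for src, dst in mappings)


def convert(docx_path, style_map):
    """Convert a docx to an html fragment using the given style map.

    Assumes the third-party `mammoth` package is installed; it is imported
    lazily so the style-map helper above works without it.
    """
    import mammoth

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)
    # result.messages holds warnings, e.g. about styles with no matching rule.
    return result.value, result.messages
```

With a map like this kept alongside a matching stylesheet (rules for p.scene-break and em.thought in this made-up case), any document written with those style names converts the same way every time.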