MobileRead Forums

MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Plugins (https://www.mobileread.com/forums/forumdisplay.php?f=268)
-   -   [Plugin] DOCXImport (https://www.mobileread.com/forums/showthread.php?t=273966)

DiapDealer 05-09-2016 09:36 AM

[Plugin] DOCXImport
 
2 Attachment(s)
DOCXImport: Import DOCX documents into Sigil as epubs.
(based on the Python Mammoth module)

** NOTE: this plugin periodically checks for updated versions by connecting to github (where the source is maintained). **
(this update check can be disabled via the GUI)

Minimum Sigil requirement: v0.9.0 or higher
Python Requirements: Python 3.4+ (Bundled or external)
OS Requirements: Windows/Linux/OS X
*** Linux users will have to make sure that the Tk or PyQt5 graphical Python module is present if it's not already. On Debian-based flavors this can be done with "sudo apt-get install python3-tk" (or python3-pyqt5). On Arch distributions it can be done with "pacman -S tk" and/or "pacman -S python-pyqt5". ***
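A quick way to see whether either GUI module is already present is to try importing it with the same Python interpreter Sigil will use. This is only a minimal sketch: the function name is made up for illustration and is not part of the plugin.

```python
import importlib


def available_gui_toolkits():
    """Report which of the two GUI modules (tkinter, PyQt5) can be imported.

    Hypothetical helper for illustration; it is not part of DOCXImport.
    """
    found = {}
    for name in ("tkinter", "PyQt5"):
        try:
            importlib.import_module(name)
            found[name] = True
        except ImportError:
            # Module is not installed for this interpreter.
            found[name] = False
    return found


if __name__ == "__main__":
    for name, ok in available_gui_toolkits().items():
        print("{}: {}".format(name, "found" if ok else "missing"))
```

If both come back as missing, installing one of the packages named above should be enough.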

*Note: Do not rename any Sigil plugin zip files before attempting to install them *

Select a pre-existing DOCX file using the file dialog and it will be imported as a single-file epub.

The following features are currently supported (provided by Mammoth):
  • Headings.
  • Lists.
  • Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
  • Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
  • Footnotes and endnotes.
  • Images **NOTE: WMF/EMF images are unsupported and will be ignored.**
  • Bold, italics, underlines, strikethrough, superscript and subscript.
  • Links.
  • Line breaks.
  • Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.

An example of a style map (as well as a sample docx and css file it will work with) is in the samples.zip attached to this post. More info on writing custom style maps can be found in the "Writing Style maps" section of Mammoth's README.
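For a rough idea of what such a mapping looks like, here is a minimal sketch of the WarningHeading example from the feature list, written against the stand-alone Mammoth package from PyPI rather than the copy bundled with the plugin. Whether "WarningHeading" is the style's display name in your document is an assumption, and the function name is made up for illustration.

```python
# One mapping per line: docx style on the left, HTML target on the right.
# Assumes the paragraph style is literally named "WarningHeading".
STYLE_MAP = "\n".join([
    "p[style-name='WarningHeading'] => h1.warning:fresh",
])


def convert_docx(docx_path, style_map=STYLE_MAP):
    """Convert a DOCX file to an HTML fragment using a custom style map.

    Hypothetical helper; requires the third-party package: pip install mammoth
    """
    import mammoth  # imported here so the style map can be inspected without it

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)
    # result.value is the generated HTML, result.messages holds any warnings.
    return result.value, result.messages
```

Calling convert_docx() on a document of your own would return the HTML plus any warnings Mammoth raised, for instance about styles it did not recognise.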

DOCXImport's code is hosted/maintained on Github.

The very latest version (and all previous versions) of DOCXImport can always be found on its Github Releases Page.

Changes
Spoiler:

v0.1.0
- Initial release
v0.2.0
- added gui
- added ability to employ custom style maps and custom css files
- dropped Python 2.7.x support
v0.2.1
- fixed some widget clipping situations
- changed icon
v0.2.2
- corrected some non-compliant opf issues when importing as EPUB3
v0.2.3
- use PyQt5 GUI if Sigil is new enough.
- integrate upstream changes to mammoth module
v0.2.4
- Update mammoth/cobble modules to latest upstream
- Tweak mammoth to create a "title" attribute for images if the title property is defined in the docx
- Remove extraneous parsimonious module
v0.2.5
- Add empty alt attribute to images when no alt_text/descr is provided in the DOCX
v0.2.6
- Fix empty paragraph regex bug (thanks @BeckyEbook)
v0.2.7
- Update upstream Mammoth library
- Add support to match Sigil's light/dark theme in Sigil 1.1
- Re-enable translation support (translators wanted)
v0.2.8
- Import to a flat archive structure; Sigil 1.0+ users can restructure as they wish

KevinH 05-09-2016 10:51 AM

Wow!

Nicely done!

KevinH

DiapDealer 05-09-2016 12:22 PM

Thanks! Mammoth is truly doing all the heavy-lifting at the moment, though. I just made the necessary modules portable and namespaced them (so they could never potentially conflict with the PyPI versions) and slapped a Sigil plugin wrapper around them.

I was pleasantly surprised at how well mammoth performed out of the box. Now I need to familiarize myself with it more so I can start tweaking.

JSWolf 05-11-2016 06:39 PM

Which produces a better ePub, this plugin or Calibre?

DiapDealer 05-11-2016 07:06 PM

Quote:

Originally Posted by JSWolf (Post 3315978)
Which produces a better ePub, this plugin or Calibre?

Can we please not turn this into a competition? For the moment, this Sigil plugin does little but create a barebones epub. There's no css being generated (currently) so that will have to be supplied by the user after-the-fact. It does make some pretty-clean html, though (provided the docx was styled relatively competently). But it should certainly be considered a work in progress right now.

st_albert 05-12-2016 08:46 PM

Well, FWIW it works on my Kubuntu 14.04, 32-bit system (sigil 0.9.4).

I didn't have a genuine .docx file handy, so I loaded an .odt into LibreOffice and saved it as .docx -- which leads to my question...

The .odt document had been styled with several custom paragraph and character styles, but these were not preserved (i.e. not even the class names) in the epub. It did identify headers (all coded as h1) and all paragraphs as plain p, regardless of whatever style was used in the original document.

Is this to be expected at this stage, or is it because the .docx via LibreOffice isn't quite legit?

Anyway, quite an interesting plugin!

Albert

DiapDealer 05-12-2016 09:39 PM

Quote:

Originally Posted by st_albert (Post 3316781)
Is this to be expected at this stage, or is it because the .docx via LibreOffice isn't quite legit?

Custom style mappings are an inherent feature of the underlying Mammoth Python Module. I just haven't knocked together a way for users to make/use/save their own custom style maps with the plugin yet. I hope to soon. It shouldn't really matter whether the docx was made with Word or LibreOffice in that regard.

JSWolf 05-13-2016 03:07 PM

Quote:

Originally Posted by DiapDealer (Post 3316002)
Can we please not turn this into a competition? For the moment, this Sigil plugin does little but create a barebones epub. There's no css being generated (currently) so that will have to be supplied by the user after-the-fact. It does make some pretty-clean html, though (provided the docx was styled relatively competently). But it should certainly be considered a work in progress right now.

It's not a competition. It was a valid question to know whether to use the plugin or use Calibre.

DiapDealer 05-13-2016 06:06 PM

Quote:

Originally Posted by JSWolf (Post 3317238)
It's not a competition. It was a valid question to know whether to use the plugin or use Calibre.

It's also sort of an impossible question to answer. What constitutes a "better epub" is entirely subjective: what one person loves--another hates. It was also an unnecessary question since you could easily try both and answer your own question.

st_albert 05-15-2016 02:49 PM

Quote:

Originally Posted by DiapDealer (Post 3316804)
Custom style mappings are an inherent feature of the underlying Mammoth Python Module. I just haven't knocked together a way for users to make/use/save their own custom style maps with the plugin yet. I hope to soon. It shouldn't really matter whether the docx was made with Word or LibreOffice in that regard.

Yes, I see that now that I've read up a little on Mammoth. Perhaps all it would take would be a preference item that passes a pointer to the style_map file; the default of which could contain a simple demonstration of the syntax for style mapping. Looks pretty flexible, btw.

As you are no doubt aware, the Writer2Latex add-on package for Libre/Open office contains, besides the add-ons, a stand-alone java utility that can be used to produce an epub from an open-document (.odt) file. Since, for my sins, I'm the guy that gets to clean up and normalize the word .doc files from the authors for conversion to epub or placement in InDesign, I use writer2latex a lot.

Seems like Mammoth is a work-alike program. How nice it would be to directly import the .odt or .docx file into sigil without the manual conversion step!

ETA: Of course the add-in will export an epub directly from LibreOffice, but the stand-alone is much more flexible (IMHO) and it's easy to modify the configuration as needed.

roger64 05-21-2016 05:18 AM

Quote:

Originally Posted by st_albert (Post 3316781)
.../... It did identify headers (all coded as h1) and all paragraphs as plain p, regardless of whatever style was used in the original document..../...
Anyway, quite an interesting plugin!

Albert

This is to confirm these findings.
I did a test with two genuine - and plain - docx files.

- structure (chapter titles h1) was kept. I only had to recreate a toc.ncx to get a brand new one.
- paragraphs are all transformed to plain p
- italics are kept
- footnotes with return links were all correctly kept.

This plugin can already save a lot of time.

Edit: I did a test with a loong book with h1 and h2 headings and there was no problem.

* I also use writer2latex (not the standalone Java tool, but the writer2xhtml extension for LibreOffice). It's very precise and highly commendable.

Toxaris 05-21-2016 06:26 AM

There is a good chance that if a style in use contains italic (or bold, etc.), the italic will be gone when the paragraph is transformed to plain p.

roger64 05-21-2016 07:13 AM

You are right. The italics that were kept were plain words or expressions between em tags. Sorry for this.

Toxaris 05-21-2016 10:22 AM

I know the problem/issue that causes this and it is difficult to avoid. The only way to 'solve' this is to examine the Word style and then apply the italics directly to the words/paragraphs that have that style before converting them to a standard paragraph. This is not as easy as it sounds, though... It is also nothing that Diap can easily solve; this should be part of the mammoth library.
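To make the idea concrete, here is a toy sketch of that "push the style's italics down onto the text first" step. The data model (dicts with "style" and "runs") is invented for illustration and is neither Mammoth's internal representation nor the add-in's code. It also ignores the real-world case where a run switches italic back off.

```python
import html


def flatten_style_italics(paragraphs, italic_styles):
    """Copy italic from a paragraph style onto each run, then drop the style."""
    flattened = []
    for para in paragraphs:
        inherited = para["style"] in italic_styles
        runs = [
            {"text": run["text"], "italic": run.get("italic", False) or inherited}
            for run in para["runs"]
        ]
        # Every paragraph ends up as a plain one; italics now live on the runs.
        flattened.append({"style": "Normal", "runs": runs})
    return flattened


def to_html(paragraph):
    """Render a flattened paragraph as a plain <p> with <em> where needed."""
    parts = []
    for run in paragraph["runs"]:
        text = html.escape(run["text"])
        parts.append("<em>{}</em>".format(text) if run["italic"] else text)
    return "<p>{}</p>".format("".join(parts))


doc = [
    {"style": "Quote", "runs": [{"text": "All italic"}]},
    {"style": "Normal", "runs": [{"text": "plain "}, {"text": "word", "italic": True}]},
]
flat = flatten_style_italics(doc, italic_styles={"Quote"})
```

With the italic already attached to the text, converting the "Quote" paragraph to a plain p no longer loses it.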

I have a lot of experience in these kinds of issues due to my work on the add-in. That is part of the reason I ended up using another method of generating html.

DiapDealer 05-21-2016 10:58 AM

My plan is to leave it entirely up to the user through Mammoth's style mappings and the users' own css templates. ;)

I'm not really envisioning this plugin being used by user A to convert user B, C, D, and E's docx files automagically. I envision it being used by a writer/user who's adopted a standard for styling all their docx documents. That way, they create a custom Mammoth style-map (or a few) and an associated stylesheet. Once they have that in place, they can focus on creating their Word/LibreOffice documents. The style-map will take care of mapping all standard and custom docx styles/headings to specific html/class-names (with associated css).

In other words ... documents will be created that conform with a pre-existing style-map/css, rather than creating a style-map/css to accommodate each particular document (though the latter is still doable provided the user doesn't mind the extra work).
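Since the style-map and the stylesheet are meant to travel as a pair, one simple sanity check is to confirm that every class name the style-map produces has a rule in the css. The sketch below is a rough, regex-based illustration with made-up helper names and sample styles. It is not something the plugin does, and it only handles simple selectors.

```python
import re

# A class is a dot followed by an identifier, e.g. ".warning".
CLASS_RE = re.compile(r"\.([A-Za-z_][\w-]*)")


def classes_in_style_map(style_map):
    """Collect class names from the HTML side (right of '=>') of each mapping."""
    classes = set()
    for line in style_map.splitlines():
        line = line.strip()
        # Skip blank lines, '#' comments, and anything that is not a mapping.
        if not line or line.startswith("#") or "=>" not in line:
            continue
        target = line.split("=>", 1)[1]
        classes.update(CLASS_RE.findall(target))
    return classes


def classes_in_css(css):
    """Collect class names from selectors, i.e. the text before each '{'."""
    classes = set()
    for selector in re.findall(r"([^{}]+)\{", css):
        classes.update(CLASS_RE.findall(selector))
    return classes


def unstyled_classes(style_map, css):
    """Return classes the style-map emits that the stylesheet never mentions."""
    return classes_in_style_map(style_map) - classes_in_css(css)


# Sample inputs (invented for illustration).
sample_map = (
    "p[style-name='WarningHeading'] => h1.warning:fresh\n"
    "r[style-name='Emphasis Custom'] => span.emph"
)
sample_css = "h1.warning { color: red; }\np { margin: 0; }"
missing = unstyled_classes(sample_map, sample_css)
```

Here the check would flag "emph", because the sample stylesheet has no rule for it.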


All times are GMT -4.