Old 03-05-2021, 11:56 AM   #219
KevinH
Sigil Developer
For those interested ...

I ran a few tests using my father-in-law's memoirs of escaping Poland as a boy immediately after the war to come to Canada. The original was a Word .docx file (some 34 MB, due to lots of photos, maps, and tables). Of course, not one style was used when the Word document was first written. As others have said, using styles effectively in Word is not something many people seem to do.

I tried DOCXImport, pandoc from docx to epub3, and LibreOffice using both "Save a Copy" and "Export as EPUB" (to EPUB3); then I tried pandoc from the odt directly to epub3.

Some of these did not work well, but a combination of them did.

For example, DOCXImport and pandoc both barfed on the images that had originally been JPEGs but somehow became .emf (or something like that?) image files inside the docx.

Pandoc just passed them along as if they were a core image format in epub3 (they are not, and no browser supports them, at least on macOS). DOCXImport just copied in image placeholders, which is mentioned in its docs and so is expected behaviour.
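
For anyone who wants to check a docx for this problem before converting, here is a minimal sketch in Python. It relies only on the fact that a .docx is a zip archive with embedded images stored under word/media/. The function name and the set of "unsupported" extensions are my own assumptions for illustration, not part of DOCXImport or pandoc.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions treated as "not usable in an epub3" for this sketch.
# This set is an illustrative assumption, not an exhaustive list.
UNSUPPORTED = {".emf", ".wmf"}


def find_unsupported_images(docx_path):
    """Return the names of embedded media files with unsupported extensions.

    docx_path may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as docx:
        return sorted(
            name
            for name in docx.namelist()
            # Word stores embedded images under word/media/ inside the zip.
            if name.startswith("word/media/")
            and PurePosixPath(name).suffix.lower() in UNSUPPORTED
        )
```

Calling find_unsupported_images("memoirs.docx") (a hypothetical file name) returns a list of the offending entries, so an empty list means the images should pass through the converters untouched.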

Pandoc messed up again when converting from odt to epub3, generating only a single empty html file and no images at all. An epic fail, with no error messages at all.

But when I used LibreOffice to read the .docx directly and then used "Save a Copy" to export it to html, LibreOffice nicely converted all of the unreadable images to .gif files (which I can easily change to png), so none of the images were lost! That said, the resulting html file was a bit messy, with style tags in places, but ...

The LibreOffice "Export as EPUB" (to EPUB3) dropped the table of contents completely for some reason, and was messier than the "Save a Copy" approach.

So the best overall "input" was obtained by starting from the DOCXImport output and then overwriting its images with those generated by LibreOffice's "Save a Copy".

By combining the two approaches I got a very clean, very nice set of files, ready to be cleaned up, have styles added, etc.

So if there were any way to add an .emf(?) to .gif converter (or better yet, .png) to the DOCXImport plugin, that would make for something very clean and nice.

I will try to see if I can find one.

In the future, when faced with this task again, I will probably try multiple approaches and then grab the best bits and pieces from each of them to get the parts I want, especially for images.

Hope this helps.

Edit:

It seems LibreOffice has a headless mode that can be used to convert from .emf to png painlessly.

So if you have LibreOffice installed on Linux (and in your PATH) or on macOS, the following will do the conversion directly from the command line:

Linux:
Code:
libreoffice --headless --convert-to png image.emf
macOS:
Code:
/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to png image.emf
Run from the folder that contains image.emf, both will place image.png right beside it (the output goes to the current working directory unless --outdir is given). I am sure that LibreOffice for Windows can do something similar.

When I get a few free moments, I will see if I can modify the DOCXImport plugin to check whether LibreOffice is installed and convert the image files on the fly.
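
A rough sketch of what that check-and-convert step might look like in Python, since Sigil plugins are written in Python. This is not code from DOCXImport. The function names are hypothetical, the macOS path is simply the one from the command above, and --outdir is used so the png lands beside the source file no matter where the script is run from.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Default macOS install location, as used in the command line above.
MAC_SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def find_soffice():
    """Return a path to the LibreOffice executable, or None if not found."""
    for name in ("soffice", "libreoffice"):
        found = shutil.which(name)  # searches the PATH
        if found:
            return found
    if sys.platform == "darwin" and Path(MAC_SOFFICE).is_file():
        return MAC_SOFFICE
    return None


def build_convert_command(soffice, image, fmt="png"):
    """Build the headless conversion command for a single image file."""
    image = Path(image)
    return [soffice, "--headless", "--convert-to", fmt,
            "--outdir", str(image.parent), str(image)]


def convert_images(folder, soffice=None):
    """Convert every *.emf file in folder; return the png files produced."""
    soffice = soffice or find_soffice()
    if soffice is None:
        return []  # LibreOffice not available: leave the images untouched
    converted = []
    for image in sorted(Path(folder).glob("*.emf")):
        subprocess.run(build_convert_command(soffice, image), check=True)
        target = image.with_suffix(".png")
        if target.is_file():  # only report files that really appeared
            converted.append(target)
    return converted
```

Returning an empty list when LibreOffice is missing lets a caller fall back to the existing placeholder behaviour instead of failing. Note that the glob only matches a lowercase .emf extension on case-sensitive file systems.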

If it is not easy to modify mammoth, then perhaps this could be done by post-processing the html and image files.
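
As a sketch of the post-processing route (again in Python, and again only an illustration): assuming the generated html refers to the images by their original .emf names, which I have not verified for DOCXImport's placeholders, something like the following could repoint each img tag at a converted .png when one exists. The function name and the simple regular expression are my own, and a real implementation would more likely use a proper html parser.

```python
import re
from pathlib import Path

# Matches src="....emf" inside an <img> tag. Deliberately simple:
# it assumes double-quoted attributes and no ">" inside attribute values.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+?)\.emf(")', re.IGNORECASE)


def retarget_images(html, available):
    """Point <img> tags at .png files when a converted copy exists.

    available is a set of bare file names (no folders) known to exist,
    for example the output of a conversion step.
    """
    def swap(match):
        prefix, stem, quote = match.groups()
        png_name = Path(stem).name + ".png"
        if png_name in available:
            return f"{prefix}{stem}.png{quote}"
        return match.group(0)  # no converted copy: leave the tag alone

    return IMG_SRC.sub(swap, html)
```

Tags whose image was not converted are left unchanged, so the result is never worse than the input.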

Python's PIL is said to work with the older .wmf format as well, and to convert such files to svg (since they may contain both text and images), but I have not tried it and others have reported some issues.

Last edited by KevinH; 03-05-2021 at 12:24 PM. Reason: add part about convert from .emf to png