MobileRead Forums - View Single Post - Sigil as front end for automated XML based processing workflows?

Toxaris · 01-10-2014, 02:52 AM

Quote:

Originally Posted by skreutzer

The HTML output of Microsoft Word can't be valid to any schema, since HTML isn't XML. Furthermore, as far as I know, there is no schema for HTML4, validation is done by DTD. I looked at the so called "XHTML" output of a recent Word version, and it wasn't even well-formed.

I never said HTML is XML, because it isn't. It is a different language altogether. The only thing they have in common is the angle brackets... The two can be combined, and the result is what we call XHTML, more or less.
However, you are making a mistake here. Word does NOT output XHTML, nor does it claim to. It can output HTML (in two flavours), XML (again in two flavours) and DOCX. Of course there are more formats, but let's ignore them for now.
The HTML output is valid HTML 4.01 by default. The problem most people have with it is that it is full of code to make sure the output in a browser resembles the original document AND that Word can understand it on import and turn it back into a Word document. It does that well enough. That it is not practical for subsequent processing is another story; that is also not its purpose.
The XML output is valid XML. The structure used is described in detail on various Microsoft websites. It has the same premise as the HTML output: it must be understood by Word upon importing. That makes it less valuable for semantics. A short example of where issues arise: let's say I make a word italic. In the code, <w:i /> (amongst other things) will be used to mark it as italic. Now, when I create a style that applies italic, that code will not be there, only the code that applies the style. From Word's perspective that makes sense, since the italic is embedded in the style. From a semantic point of view it makes things a whole lot more difficult (the same applies to the HTML output, btw). It also makes it a whole lot harder to map the output to other XML schemas.
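To make the italic example concrete, here is a minimal Python sketch. The two XML fragments are hand-written stand-ins, not real Word output, and they use the DOCX WordprocessingML namespace; the style name "Emphasis" and the function names are made up for illustration. It also simplifies heavily: it ignores style inheritance, paragraph styles and an explicit "off" value on <w:i>, all of which a real converter would have to handle.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace as used inside DOCX files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written stand-in for a styles part: one character style carrying italic.
STYLES_XML = f"""<w:styles xmlns:w="{W}">
  <w:style w:type="character" w:styleId="Emphasis">
    <w:rPr><w:i/></w:rPr>
  </w:style>
</w:styles>"""

# Hand-written stand-in for a document part with three runs:
# direct italic, italic via the style, and no italic at all.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}"><w:body><w:p>
  <w:r><w:rPr><w:i/></w:rPr><w:t>one</w:t></w:r>
  <w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr><w:t>two</w:t></w:r>
  <w:r><w:t>three</w:t></w:r>
</w:p></w:body></w:document>"""


def italic_style_ids(styles_root):
    """Collect the ids of styles whose run properties contain <w:i/>."""
    ids = set()
    for style in styles_root.findall("w:style", NS):
        if style.find("w:rPr/w:i", NS) is not None:
            ids.add(style.get(f"{{{W}}}styleId"))
    return ids


def classify_runs(document_root, styles_root):
    """Return (text, kind) per run; kind is 'direct', 'style' or 'none'."""
    italic_ids = italic_style_ids(styles_root)
    result = []
    for run in document_root.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if run.find("w:rPr/w:i", NS) is not None:
            kind = "direct"  # italic is visible in the run itself
        else:
            ref = run.find("w:rPr/w:rStyle", NS)
            if ref is not None and ref.get(f"{{{W}}}val") in italic_ids:
                kind = "style"  # italic only shows up after a style lookup
            else:
                kind = "none"
        result.append((text, kind))
    return result


if __name__ == "__main__":
    doc = ET.fromstring(DOCUMENT_XML)
    styles = ET.fromstring(STYLES_XML)
    for text, kind in classify_runs(doc, styles):
        print(text, kind)
```

The second run is the awkward one: nothing in the run says "italic", so a mapping to another schema has to resolve the style reference first.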
I mention the DOCX format because it is essentially the same as the XML, only divided into multiple structured files in a container.
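As a rough illustration of the "container" point: a DOCX file is a ZIP archive, so Python's standard zipfile module can list the XML parts inside it. The sketch below builds a tiny in-memory archive with placeholder contents so that it runs on its own; it is not a valid Word document, and the helper names are made up. The part names mirror ones a real DOCX normally contains.

```python
import io
import zipfile


def make_sample_container():
    """Build an in-memory ZIP that mimics the layout of a DOCX file.

    The part contents are placeholders, so Word would not open this.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/styles.xml", "<styles/>")
    buf.seek(0)
    return buf


def list_xml_parts(docx_file):
    """Return the sorted names of all XML parts in a DOCX-style container."""
    with zipfile.ZipFile(docx_file) as archive:
        return sorted(n for n in archive.namelist() if n.endswith(".xml"))


if __name__ == "__main__":
    for name in list_xml_parts(make_sample_container()):
        print(name)
```

Pointing list_xml_parts at the path of a real .docx file should work the same way, since zipfile.ZipFile accepts either a path or a file-like object.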

Quote:

Originally Posted by skreutzer

Well, you're right, but only up to a certain point. I know that some writers prefer to shoot themselves in the foot. Without doubt, I wouldn't even try to convince them to get good output files, with less or no manual work to produce e-books and PDFs, because they really like bad output files, much manual work and crappy e-books as well as crappy print results. For all the other writers, in case of a word processor, I would bring up a setup wizard at first start of the program, inform the user why and how he has to use styles, let him define several styles, and then work on the text. In general, not much would be different, except you couldn't just select a font or a font size. Even if font selection (and similar GUI components) would persist, you could change the font, but would be asked which style you're currently editing or if you want to create a new style, or you would automatically change the font at all portions of the text which are marked with the style that is currently selected. So there would be little difference for the writer (additionally, I would assume a writer writes text, while you assume that a writer does typesetting).

So basically you are suggesting creating YAWP (Yet Another Word Processor) that works semantically, and persuading all writers to use that one instead of the ones they are accustomed to, like Word, OpenOffice, WordPerfect, etc. There is no way you will get those corporations to change their export to your liking, however sane it may be.

Quote:

Originally Posted by skreutzer

From the description of your add-in on the websites linked in your signature, it looks like you're throwing out all direct formatting, but retain semantic style markup. So I wonder why you don't agree that a word processor should encourage semantic style markup and disable direct formatting, since the latter is obviously useless for all other software except the word processor itself.
...

Oh, but I do agree in part. I do not think that disabling direct formatting would be a wise decision. It only makes sense when the document is the first step in a process. If the document is also the end state, there is no reason to disable it.

Quote:

Originally Posted by skreutzer

Well, do you have any needs for your own projects?
...
Currently, I write semantic XHTML myself as input for conversion to EPUB, but as OpenOffice (therefore LibreOffice too, I assume) is already capable of valid, semantic XHTML output, I should probably work on a way to educate the author (video tutorial), a website to provide this education, a list of style names to use in OpenOffice, an upload form for the author to submit OpenOffice XHTML output on the mentioned website, and a schema to check if the uploaded file matches the expected style names, so that the file could then be automatically processed to EPUB, and later to PDF.

No, not really. For my own work the add-in works fine and I made it available for others to use in case they would find it useful. It is a real time-saver for me and the results are much, much cleaner.
I tried to work with OpenOffice, but it is just not for me. I miss several features (not for ePUB creation) and don't like the interface. I also do not like the output, to be honest, and I am not the only one. There is a reason why there is also a program that takes the output from OpenOffice and prepares it for ePUB. I believe it is called ePUBWriter.