MobileRead Forums - View Single Post - Sigil as front end for automated XML based processing workflows?

skreutzer · 01-09-2014, 04:42 PM

Quote:

Originally Posted by Toxaris

Well, not quite. XML is meaningless. It is only a markup language and add structure. I can export whatever I want as XML, as long as I honor the structure. Without the schema however, the XML is useless. In the schema we define what the tags mean and how the structure should look like. XHTML is not a format, it is just XML with a (more or less) strictly defined schema.

Well, not quite. XML is somewhat self-descriptive if good names are used. The absence of a schema doesn't prevent anything: you can still read and interpret an XML file. Even if a schema is available, it doesn't tell you what a tag means, it just defines the structure. And even if you have a format specification document, you still might not know how to implement the tags defined in the schema. I also wonder how you define the word "format".
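
To illustrate the point about reading XML without a schema, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample document and its element names (`book`, `chapter`, and so on) are made up for illustration; no schema or DTD is involved anywhere.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are invented for this example.
SAMPLE = """<book>
  <title>An Example</title>
  <chapter number="1">
    <heading>Beginning</heading>
    <paragraph>First paragraph.</paragraph>
  </chapter>
</book>"""


def outline(xml_text):
    """Return (depth, tag, text) tuples for every element, without any schema."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(element, depth):
        text = (element.text or "").strip()
        rows.append((depth, element.tag, text))
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return rows


if __name__ == "__main__":
    for depth, tag, text in outline(SAMPLE):
        print("  " * depth + tag + (": " + text if text else ""))
```

The parser recovers the structure and the names; what a `chapter` is supposed to mean is still up to the reader or the consuming program, with or without a schema.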

Quote:

Originally Posted by Toxaris

Word XML is just that. It is perfectly valid XML with a schema specifically for Word documents, just as the intention was. In principle it is possible to load the XML in Word and have your original document. The same applies for their HTML output. It is valid, even if it is not what we would like.

The HTML output of Microsoft Word can't be valid against any XML schema, since HTML isn't XML. Furthermore, as far as I know, there is no schema for HTML4; validation is done by DTD. I looked at the so-called "XHTML" output of a recent Word version, and it wasn't even well-formed.
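
Well-formedness is the weaker of the two properties and is easy to test on its own. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` is mine, and the check says nothing about validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if text parses as XML. This does not check validity against a DTD or schema."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # HTML-style unclosed <br> is fine for an HTML parser but not for an XML parser.
    print(is_well_formed("<p>one<br>two</p>"))
    # The XHTML spelling closes the element, so it parses.
    print(is_well_formed("<p>one<br/>two</p>"))
```

One caveat: this parser does not load external DTDs, so a file using named entities such as `&nbsp;` is rejected here even if a validating parser with the XHTML DTD would accept it.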

Quote:

Originally Posted by Toxaris

All XML 'formats' are custom, but some schemas are public and agreed upon by various parties.

By "custom" in the previous posts I meant XML files in a structure you define yourself, with or without a schema. I also meant "uncommon" ones, with or without a public schema; for the purpose of this thread, that includes XML definitions which are common in general but less common than XHTML.

Quote:

Originally Posted by Toxaris

That is also one of the issues. A schema needs to be agreed to correctly identify the semantic value of the tags. You cannot expect all (or any) wordprocessor to honor the schema you would like. So, you would need to map the XML schema from the wordprocessor to your schema. That will not always be possible.

No. Everybody can write his own schema and just validate his own files against it. Why would anybody want to do so? For software it is a very convenient way to check input, so the source code can trust that certain elements are there instead of checking for them all the time or with a lot of code. Also, if other programs or people provide you with their XML files in a custom structure, you could write your own schema for it, according to the elements your software will recognize (and adjust it if you discover new or different elements in files in the future). I would be glad if word processors honored XHTML, hopefully in the most semantic way.
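
The "check input once, then trust it" idea can be sketched as follows. Python's standard library has no XSD or RELAX NG validator, so this uses a hand-rolled rule table as a stand-in for a real schema; the `RULES` format, the element names and `check_structure` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for a self-defined format: for each parent tag,
# the child tags that are allowed and the ones that must appear.
RULES = {
    "book": {"allowed": {"title", "chapter"}, "required": {"title"}},
    "chapter": {"allowed": {"heading", "paragraph"}, "required": {"heading"}},
}


def check_structure(xml_text, rules=RULES, root_tag="book"):
    """Return a list of problems; an empty list means the structure matches the rules."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != root_tag:
        problems.append(f"root element is <{root.tag}>, expected <{root_tag}>")
    for element in root.iter():
        rule = rules.get(element.tag)
        if rule is None:
            continue  # elements without an entry (e.g. <title>) have no child rules here
        children = [child.tag for child in element]
        for tag in children:
            if tag not in rule["allowed"]:
                problems.append(f"<{tag}> is not allowed inside <{element.tag}>")
        for tag in sorted(rule["required"] - set(children)):
            problems.append(f"<{element.tag}> is missing required <{tag}>")
    return problems
```

Once `check_structure` returns an empty list, the processing code can read `book/title` or `chapter/heading` without guarding every access. A real schema language does the same job declaratively and with far more expressive rules.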

Quote:

Originally Posted by Toxaris

You also greatly overestimate the willingness of writers to change their ways and their reaction to being forced to work in a certain way. They would rather use another program or even Wordpad than to change their wow. Only a small amount of writers is willing to do that.

Well, you're right, but only up to a point. I know that some writers prefer to shoot themselves in the foot. I wouldn't even try to convince them to aim for good output files that need little or no manual work to turn into e-books and PDFs, because they evidently like bad output files, lots of manual work, and crappy e-books and print results. For all the other writers, a word processor could bring up a setup wizard at the first start of the program, explain why and how to use styles, let the user define several styles, and then let him work on the text. In general, not much would be different, except that you couldn't just select a font or a font size. Even if font selection (and similar GUI components) were kept, changing the font would either ask which style you're currently editing or whether you want to create a new style, or it would automatically change the font for all portions of the text marked with the currently selected style. So there would be little difference for the writer (additionally, I assume a writer writes text, while you assume a writer does typesetting).
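
The data model behind such a styles-only editor can be sketched in a few lines. This is only an illustration of the idea, not how any existing word processor works; `Style`, `Document` and their methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Style:
    font: str
    size_pt: int


@dataclass
class Document:
    styles: dict = field(default_factory=dict)      # style name -> Style
    paragraphs: list = field(default_factory=list)  # (style name, text) pairs

    def add_paragraph(self, style_name, text):
        # Text can only be added under a style that has been defined.
        if style_name not in self.styles:
            raise KeyError(f"unknown style: {style_name}")
        self.paragraphs.append((style_name, text))

    def set_font(self, style_name, font):
        # Changing the font edits the style, never an individual span of text.
        self.styles[style_name].font = font

    def rendered(self):
        # Every paragraph picks up the current font of its style.
        return [(self.styles[name].font, text) for name, text in self.paragraphs]
```

Because paragraphs store only a style name, a font change made through the GUI applies to every paragraph with that style, and the export only ever has to deal with style names.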

Quote:

Originally Posted by Toxaris

You might take a look at my Word add-in. I create clean HTML output (or XHTML directly in an ePUB) out of Word, but at a price. Styling like margins and fonts will be removed. It would be relatively easy to create an export for another format (e.g. Markdown) in the same way.

From the description of your add-in on the websites linked in your signature, it looks like you throw out all direct formatting but retain semantic style markup. So I wonder why you don't agree that a word processor should encourage semantic style markup and disable direct formatting, since the latter is obviously useless to all software except the word processor itself. Instead of spending time on direct formatting, the author could do something useful by applying style templates, which your add-in could retain. Also, if the output of your add-in is to be used for e-book or print preparation (or as input for an automated processing workflow), the output file has to be extended with semantic markup, using Sigil. So direct formatting wastes not only the author's time but also the time of the Sigil person, who has to do the semantic markup afterwards, in the worst case completely from scratch. In an ideal workflow, the author would do the semantic markup with style templates from the start (everywhere he would otherwise use direct formatting), all of it would be retained by your add-in, and if some markup were still missing for e-book and print creation, the Sigil person would add just that. The key point is that direct formatting is useless to both the writer and the preparation guy in any case (and therefore a waste of time and resources), so a word processor does a bad job by allowing it. The developer of the word processor only gets away with it because the author discovers the consequences when it is far too late, and then blames not the developer of the word processor but the poor formatting guy, because the root of the problem is unknown to the author.
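
What "throw out direct formatting, keep semantic style markup" could look like on the XHTML side is sketched below. This is not the add-in's actual method, which I haven't seen; the attribute list in `DIRECT_FORMATTING` is an assumption and certainly incomplete, and the class name in the usage example is invented.

```python
import xml.etree.ElementTree as ET

# Attributes treated as direct formatting in this sketch (an assumption, not a complete list).
DIRECT_FORMATTING = {"style", "align", "color", "face", "size"}


def strip_direct_formatting(xhtml_fragment):
    """Drop presentational attributes but keep semantic ones such as class."""
    root = ET.fromstring(xhtml_fragment)
    for element in root.iter():
        for name in list(element.attrib):
            if name in DIRECT_FORMATTING:
                del element.attrib[name]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    fragment = (
        '<div><p class="chapter-title" style="font-size:18pt">One</p>'
        '<p style="margin-left:2em">Text</p></div>'
    )
    print(strip_direct_formatting(fragment))
```

The paragraph that carried a style name survives with its `class`; the one that was only formatted directly comes out as a bare `<p>`, which is exactly the markup somebody then has to redo by hand.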

Quote:

Originally Posted by Toxaris

I like the idea, but I think you are too optimistic. However, if I can help to improve things, I probably will.

Well, do you have any needs for your own projects? I'm mostly driven by my own personal needs, currently just small "book" projects. But over time, I hope to provide more and more general-purpose processing tools, which could be used by self-publishers or to set up an (online?) service. On the one hand, it's a lot of work and won't be sufficient for all kinds of uses at first; on the other hand, once a solution is implemented, a lot of texts can be processed with it. The problem of getting good semantic XML will still need to be addressed, and that's exactly why I was wondering whether Sigil could be used for it: let the author do the semantic markup of his text with Sigil if he failed to do it right in the first place, and then take the prepared EPUB (XHTML) file from Sigil as input for an automated processing system. But there are also alternative ways to get a semantic XML/XHTML file from the author; one could be to write a JavaScript-based online/offline text editor for semantic editing. Currently, I write semantic XHTML myself as input for conversion to EPUB, but as OpenOffice (and therefore LibreOffice too, I assume) is already capable of valid, semantic XHTML output, I should probably work on a way to educate the author (video tutorial), a website to provide this education, a list of style names to use in OpenOffice, an upload form on that website for the author to submit OpenOffice XHTML output, and a schema to check whether the uploaded file matches the expected style names, so that the file could then be processed automatically to EPUB, and later to PDF. I know how this description reads, but existing free software would provide shortcuts, the development could be done collectively as free software, and over time the system would expand, so it could become a real option for self-publishers that would reduce manual labor for authors, formatters and developers.
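
The upload check against the expected style names could start out as simply as this. It is a sketch under two assumptions: that the style names end up in `class` attributes of the exported XHTML, and that the list in `EXPECTED_STYLES` (invented here) is what the author was told to use. A real schema would replace it later.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of style names the author is asked to use.
EXPECTED_STYLES = {"chapter-title", "body-text", "quotation"}


def unexpected_styles(xhtml_text, expected=EXPECTED_STYLES):
    """Return the sorted class names found in the file that are not on the expected list."""
    root = ET.fromstring(xhtml_text)
    found = set()
    for element in root.iter():
        # A class attribute may hold several space-separated names.
        found.update(element.get("class", "").split())
    return sorted(found - expected)


if __name__ == "__main__":
    upload = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        '<p class="chapter-title">One</p>'
        '<p class="body-text">Text</p>'
        '<p class="Standard">Oops</p>'
        "</body></html>"
    )
    print(unexpected_styles(upload))
```

An empty result would let the file go on to the EPUB conversion; anything else can be reported back to the author together with the list of style names to use.
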
Maybe it would not be in the scope of the website, but depending on the interfaces, somebody could theoretically distribute the prepared files from there directly to online e-book shops and print-on-demand services. Built as and with free software, the system would not be an online service run by some provider, but could be set up by anybody, online or offline. The free software license would make sure that every improvement is available to everybody else, so essentially a community would work together instead of competing against each other. I don't necessarily need such a large system myself; I'm glad to develop my own little system for my book projects and maybe for the people I work with, and if it grows beyond that because my results are freely licensed, fine. In any case, I'm interested in whether somebody else is doing something similar with free software, and whether there could be a joint effort to provide a common solution for a larger audience.