MobileRead Forums - View Single Post - Sigil as front end for automated XML based processing workflows?

skreutzer · 01-09-2014, 12:17 PM

Quote:

Originally Posted by Hitch

I'm now not really sure if we're saying the same thing, or different things. At this juncture, I don't see any tool, at all, that is assisting in providing clean, properly-formatted XML into Sigil or any other workflow. My comprehension of your posts is that this is what you were considering creating, as it's extremely unlikely that any of the current writing tools on the market, whether Word, Scrivener, etc., are going to go in that direction?

Yes, the main goal is getting well-formed, semantic XML. If that XML is XHTML, it should also be valid. This job could easily be done by the various programs that are used to write, edit and transform texts, but not all programs care about the quality of their output or encourage practices that produce such output.

Quote:

Originally Posted by Hitch

I don't think that's hard; the truth is that you either take a word-processed document, or something that's been through, say, INDD, and you can a) clean it and b) then export it to HTML in order create an ePUB for instant commercial use, and then c) export it into MOBI for commercial use, or b) clean it to create semantic XML in the first place, which then has to be processed again to create an ePUB and/or MOBI. In the former case, you essentially run 1+ processes, in the latter it's 2+ or 3, as creating a mobi from a good ePUB is simplicity itself. I think it's as simple as, XML isn't natively suited for print or faux-print layout, as it's basically Markup. Writers and editors don't write in Markup.

b) only takes two steps if the program used to write the text failed to encourage or enforce the use of style templates, and so did not produce semantic XML output in the first place. A tool that cleans up XML output in 2+ steps is therefore just a workaround (though it won't be uncommon considering the current situation, I fear). I don't see why XML wouldn't be natively suited for print, since it is hierarchically structured, just as print layouts are. Transformation from XML to print is quite easy with XSLT to FO or to LaTeX, and will almost always result in better quality output in less time. If a self-publisher prepares more than one book (and probably even with just one book) for distribution, an automated workflow will save time in processing the input, save time in replicating later edits into all target formats, and allow other or future target formats to be added for all projects that were prepared as input for the workflow. One click could convert an entire collection of texts into a new output format.

Quote:

Originally Posted by Hitch

When Amazon came into the marketplace, they bought Mobipocket creator, and the Kindle ran on HTML 3.2. This drove the bookmaking market. I can't say I've done a boatload of XML cleanup, but the XML I've tried to export from Word, to investigate this idea (XML to XSLT) hasn't looked like a party to clean. Moreover, the retailers change their standards and their devices every 5 minutes. No major reader runs on XML; so...I think it was, quite simply, creating a process that would be able to reuse a file, to create other outputs, in a market that is primarily driven by entertainment books, seemed like extra work and an extra step that's unnecessary. PLUS, even if you assume arguendo that it's a good idea, then you have the problem of (say, with Textbooks), trying to export the initial content into a usable form for the author/editor to do an UPDATED version...whereas, with HTML, you can reimport the content easily back into Word or another word-processor for an author and editor to work collaboratively to update the material for a next Edition or updated textbook. Trust me: they are NOT going to sit there over something that looks like an RSS feed or XML and edit it. I think that's a major hurdle, too.

Technically, XHTML, EPUB content documents and RSS are all XML formats. So if a word processor is used to edit XHTML, its output is already usable for XSLT, provided the output isn't crappy (as with Microsoft Word) and semantic markup is ensured by the word processor GUI.

Quote:

Originally Posted by Hitch

If you say so. I admit, I've not seen anything that looks remotely user-friendly to which I could point my clients. And as I said somewhere in this thread, a major converter of books in India just invested a ton of money to invent/develop a system by which XML could be displayed in a Word-like, browser interface in order to provide a collaborative environment for textbook revisers/editors to work in. I'd have thought that if the environment existed, they wouldn't have spent all that money to create it, specifically for one client. I know someone else on this very forum considering creating a markup editor at one point in time; I don't know what happened with that.

The only two things that require a substantial amount of work are implementing true WYSIWYG for print (if not done by a PostScript rendering canvas) and implementing high-end collaborative editing features.

Quote:

Originally Posted by Hitch

Yes, but again: all of those, every single one, all depend on the cleaned, ready-to-go XML being prepared and ready. I see that as the huge stumbling block, myself. For commercial users, it would have to be as simple, and as easy, as "simply" exporting and cleaning to HTML/XHTML, and it would have to be something that we could convince our users that they want, and are willing to pay for. THAT would also be a fairly big block; convincing them that they want a cleaned XML file that they themselves likely won't ever open or use, or even foresee a need for. But, I could be wrong.

Again: the word processor should take care of this in the first place, so that clean, semantic XHTML is always produced. At the moment, word processor developers fail to provide such a feature (well, they already have style templates... just disable direct formatting). But even for word processors that write Pseudo-HTML, an editor to clean such output up into XHTML would be useful, both for the writer himself and for a person who does formatting for writers.

Quote:

Originally Posted by Toxaris

My biggest questionmark would actually be the first step, from a wordprocessor to 'clean' XML. Clean XML is important, because if it is clean, it is relatively easy to go to anything else again.

There is no valid excuse for a word processor in the 21st century to export HTML (or even Pseudo-HTML) instead of XHTML. The output should also be semantic if automated processing is to be possible. I agree with the list of problems you identified. Some could be solved (enforce the use of styles, offer no direct formatting controls in the GUI), others not (use of whitespace for optical positioning, although a word processor could probably at least help with it, as it already does spell checking and all kinds of more advanced stuff).

Quote:

Originally Posted by Toxaris

It will be an almost impossible task to be able to filter/convert the output of all these programs to XML/XHTML while maintaining all the markup and taking the bizar things writers do in their documents into account. I only do it for Word and that is already a nightmare sometimes. Writers still surprise me with their workmethod and output.

Indeed, so if the output was bad to begin with, it would be great to have a tool that strips away all direct formatting and then allows style templates to be applied to the text. The Pseudo-HTML to XHTML conversion can't be done in all cases, but that's what the word processor absolutely has to fix, not some other program. I wouldn't start to parse all kinds of crappy Pseudo-HTML output. That Microsoft Word is incredibly bad at outputting valid XHTML could be either deliberate protectionism for their proprietary software, or a result of their incompetence in the field of web technology (just think about Microsoft Internet Explorer).

Quote:

Originally Posted by Toxaris

The ambition is good, but the number of writers that wants to be bothered with this is very slim, especially for novelists.

I found out that at least self-publishers spend lots of hours or lots of money to do formatting for e-book and print, which could be replaced by an integrated processing software or an (online) service to do it for them.

Quote:

Originally Posted by Toxaris

However, the second part of converting the clean XML to other outputs could be very useful. That being said, XML itself is meaningless without the structure. What structure should be used? XHTML? A kind of LaTeX perhaps?

As XML can be used to specify all kinds of custom formats, it is too generic to be supported as the main input format (that would mean supporting all kinds of custom XML formats), unless one develops a "mapping tool" to map custom XML to XHTML elements (for instance). Since XHTML is in most cases the standard for representing structured documents on the web and in e-books, it would already be pretty usable, so the wheel wouldn't have to be reinvented, and it's also quite common for word processors and other software (including websites via browser) to output XHTML. A custom XML definition would be an advantage if a processing system provider wants to support specific features, which would be triggered by the custom XML elements. If compatible, XHTML could be transformed into the custom XML format. Also, specialized programs that are intended as a front end for the processing workflow could output the custom XML format directly. This way, an (online) service could provide the software for its writers to write in, and they would get, for instance, output in all formats and automated distribution into online shops. Alternatively, an (online) service would accept manuscripts in all kinds of formats, strip them down to the basic text, and then prepare that text for the custom XML format that corresponds to the processing workflow the service is going to use.
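To illustrate the "mapping tool" idea, here is a minimal sketch in Python using only the standard library. The custom element names (chapter, title, para, dialogue) and the MAPPING table are made up for this example; a real tool would load the mapping from the service's own format definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: custom element name -> (XHTML tag, class attribute or None).
MAPPING = {
    "chapter": ("div", "chapter"),
    "title": ("h1", None),
    "para": ("p", None),
    "dialogue": ("q", "dialogue"),
}

def map_to_xhtml(element):
    """Recursively copy `element`, renaming tags according to MAPPING.

    Unknown elements become <span class="original-name"> so that no
    semantic information is silently dropped.
    """
    tag, css_class = MAPPING.get(element.tag, ("span", element.tag))
    mapped = ET.Element(tag)
    if css_class:
        mapped.set("class", css_class)
    mapped.text = element.text
    mapped.tail = element.tail
    for child in element:
        mapped.append(map_to_xhtml(child))
    return mapped

custom = (
    "<chapter><title>One</title>"
    "<para>He said <dialogue>hello</dialogue>.</para></chapter>"
)
result = ET.tostring(map_to_xhtml(ET.fromstring(custom)), encoding="unicode")
print(result)
```

The reverse direction (XHTML with known class names into the custom format) would work the same way with an inverted table.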

Quote:

Originally Posted by Hitch

So...I'm with Tox. Getting the clean XML is the major hurdle. I just don't know how to get there from here. And I test-exported a *clean* Word file to XML last night...and, ayup, good luck with THAT.

Oh, Microsoft Word has fooled you ;-) The term "XML export" is technically nonsense, because XML isn't a format in itself, but a way to define all kinds of formats. So there is no "XML format" per se; one would have to ask "which XML format?" (because there are lots of them, XHTML included). So what Word calls ".xml" is their "Word XML format", and yes, of course, such a thing is useless if they're not even capable of outputting valid XHTML in the first place. I'm not talking about stupid custom XML formats, but about reasonable ones.

Quote:

Originally Posted by Hitch

My personal favorite? The "every paragraph is aligned differently" approach. I don't know what the hell is going on out there, educationally, but we've had a number of manuscripts in which dialogue paragraphs are unindented, and narrative are indented, or vice-versa. No, these aren't the James Joyce's of the future; they're illiterate (literally. I'm not being mean. The books are usually hardly readable). There appears to be someone out there "teaching" aspiring authors that this is the correct way to write.

Well, a word processor could refuse to show a visual difference between paragraphs. All paragraphs would be shown equally, so the use of a style template would just be semantic markup for the processing that follows later. Indentation by whitespace only comes to mind because a writing tool is abused to create layout. Whitespace isn't part of the text, and a writer is supposed to write text, not other stuff. A whitespace character directly followed by another whitespace character could be marked as a spelling error. Combining writing and typesetting is a very bad idea in the first place.
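The "whitespace followed by whitespace" check is easy to sketch. The following Python snippet is only an illustration of the idea; the function name and the choice of which characters count as whitespace (space, tab, no-break space) are my assumptions.

```python
import re

# Two or more consecutive spaces, tabs or no-break spaces.
REPEATED_WHITESPACE = re.compile(r"[ \t\u00a0]{2,}")

def find_repeated_whitespace(text):
    """Return (offset, length) for every run of repeated whitespace."""
    return [
        (match.start(), match.end() - match.start())
        for match in REPEATED_WHITESPACE.finditer(text)
    ]

sample = "    Indented by spaces.  Two spaces after the full stop."
issues = find_repeated_whitespace(sample)
print(issues)
```

An editor could underline each reported range the same way it underlines a misspelled word.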

Quote:

Originally Posted by Hitch

So: how does a front-end piece of software fix THAT and produce clean XML?

I don't know your actual situation, but you could do the following things, if you have control over it:

Encourage manuscript senders to provide semantic, valid XHTML by educating them on how to do so. If semantic, valid XHTML is provided, you could charge nothing, or less than your usual prices, for print and e-book preparation, since you won't have to put any manual work into it.
Require manuscript submission by online form. Authors should paste their text into it, and the form will lose all direct formatting, since the form is plain text. If they paste into a WYSIWYG editor, strip all direct formatting programmatically. If desired, you could provide style templates the author can apply to the text in order to prepare it semantically for your processing system.
Allow manuscript submission in all kinds of formats, and copy the plain text from it (maybe by plain text export, maybe by copy and paste). Provide the plain text for the author to do the markup for you, and just let him apply the style templates that your processing system supports. Alternatively, you may do this task yourself as part of your service, with a tool like the one I initially thought Sigil could be. You have to fix the formatting of the text anyway, so why do e-book and print preparation separately and by hand, instead of applying semantic markup and producing e-book and print (and website from a database and whatever) from it?
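As a rough idea of what "strip all direct formatting programmatically" could look like for pasted HTML, here is a minimal Python sketch based on the standard library's html.parser. Which tags and attributes count as direct formatting is an assumption on my part (the two sets below), and void elements such as br are not treated specially.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these attributes and tags carry only visual formatting.
PRESENTATIONAL_ATTRS = {"style", "color", "face", "size", "align", "bgcolor"}
PRESENTATIONAL_TAGS = {"font", "center", "u"}

class DirectFormattingStripper(HTMLParser):
    """Rebuild the markup without presentational tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in PRESENTATIONAL_TAGS:
            return  # drop the tag itself, keep its text content
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name not in PRESENTATIONAL_ATTRS
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag not in PRESENTATIONAL_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))

def strip_direct_formatting(markup):
    parser = DirectFormattingStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

pasted = (
    '<p style="text-align: center" class="alternative">'
    '<font color="red">use 2 egg whites</font> instead</p>'
)
cleaned = strip_direct_formatting(pasted)
print(cleaned)
```

Semantic attributes such as class survive, so any style template the author has already applied is kept.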

Quote:

Originally Posted by Toxaris

Will not happen. There is only so much you can clean up automatically. As always GIGO.

Yes, as with "garbage in, garbage out": I would limit any such attempt to throwing the garbage in the input away and writing highly usable files out. This way, visually encoded information and implied information gets lost, and there's no way to automatically convert it into semantic encoding. I would make it as easy as possible to encode this lost information manually in a semantic way, which could be done by the person who provided the garbage or by the person to whom (for whatever reason) the task is given of making something beautiful out of the garbage.

Quote:

Originally Posted by Hitch

I tol him to leave the room "I'll be back, "David will make sure of that.' he smiled.

Now, obviously, no formatter can "save" that. It's just bugger-all bad. But I've given some contemplation to the idea of playing with your broken dialogues, Tox, to create a pass that marks up--identifies--all the broken dialogues, and then hand it BACK to the author to review, for preliminary clean up. Just toying with it. Not solidified in ye olden brain yet. Just pushing around in "the little grey cells." ;-)

Yes, nobody can fix this automatically. The semantics of this text markup lie. Additionally, the markup itself is broken. "I'll be back, " would be identified as one part, and 'll be back, "David will make sure of that.' could be considered another, since ' is used as delimiter and apostrophe at the same time.

But as with your idea to identify broken dialogues, that's exactly what I'm proposing: keep the stuff that is already of good quality, and throw away what is not (I myself am mostly concerned with this for XML; text tools for this purpose would be a different topic, but why not work in that field as well, since a solution is needed for self-publishers too?). In this specific case, either you or the author has to re-apply apostrophes and quotation marks. If the author has to do it, make it as easy as possible for him. If you have to do it, make it as easy as possible for you. As an advanced solution, don't allow quotation marks in your (online) editing software at all, but let the author mark quotations and direct speech semantically (my initial question in this thread was whether Sigil could be that software) by something like

Code:

I tol him to leave the room <dialogue>I'll be back, David will make sure of that.</dialogue> he smiled.

which could be automatically translated to

Code:

I tol him to leave the room “I'll be back, David will make sure of that.” he smiled.

(while even taking care of typographical quotation marks and other things like that - all the formatting, essentially). You could then easily output all dialogue text, or all text but dialogue text. Since authors don't do XML markup in a text editor, you need a tool to enable yourself or the author to apply the semantic style template "dialogue" to the selected text "I'll be back, David will make sure of that."
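The translation step can be sketched in a few lines of Python with the standard library. This is only an illustration: the wrapping p element is my addition (XML needs a single root element), and the function names are made up.

```python
import xml.etree.ElementTree as ET

def render_quotes(paragraph_xml, open_quote="\u201c", close_quote="\u201d"):
    """Replace each <dialogue> child with its text in typographic quotes."""
    root = ET.fromstring(paragraph_xml)
    parts = [root.text or ""]
    for child in root:
        text = "".join(child.itertext())
        if child.tag == "dialogue":
            text = f"{open_quote}{text}{close_quote}"
        parts.append(text)
        parts.append(child.tail or "")
    return "".join(parts)

def dialogue_only(paragraph_xml):
    """Return just the dialogue passages, without any quotation marks."""
    root = ET.fromstring(paragraph_xml)
    return ["".join(d.itertext()) for d in root.iter("dialogue")]

source = (
    "<p>I tol him to leave the room "
    "<dialogue>I'll be back, David will make sure of that.</dialogue>"
    " he smiled.</p>"
)
rendered = render_quotes(source)
spoken = dialogue_only(source)
print(rendered)
print(spoken)
```

Because the quotation marks are parameters, the same input could be rendered with other quotation styles (for example for a German edition) without touching the text.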

Quote:

Originally Posted by At_Libitum

In what way?

In this way:

The crucial part about the "feature" of direct formatting shown in the screenshot is that the text "use 2 egg whites instead for healthier version" gets marked as

Code:

style="color: rgb(255, 0, 0)"

instead of semantic encoding like

Code:

class="alternative"

This kind of direct formatting makes things very difficult for processing software: what is the color red, rgb(255, 0, 0), meant to represent? Is rgb(254, 0, 0) intended to mark the same thing or something different? Even with two occurrences of rgb(255, 0, 0) in different portions of the text, red could mark alternatives in one place and important warnings in another, and to software the two would look identical. The information about what "use 2 egg whites instead for healthier version" means is encoded visually with red color in order to be understood implicitly by the reader. In software, there's absolutely no way to get to an implicit understanding (well, not for humans either: you don't know what you don't know, otherwise you would know it, right?). The only way to solve this problem is to provide the information explicitly, either by Calibre, or by me as the developer of a processing software. I could hard-code which red text is of which type, but then my software wouldn't be general purpose anymore; it would be specific to one single book.

So if such direct formatting gets introduced in Calibre (or is already present as a "feature" in the software), it downgrades the software by making its output less usable, even if it might look like a good new feature to the user. If text gets formatted by Calibre with this feature, it excludes authors from the benefits of automated text processing and from the ability to change the layout within Calibre quickly instead of through time-consuming manual work. Furthermore, software would have to implement a CSS parser if it wants to read Calibre output. Note that with a semantic approach, with style templates, there would be no noticeable difference for the user in the task of formatting text as red, except that a style has to be defined first.

There's still the risk that style templates get abused for visual markup (let's say a style template "red"), but if red + bold is ever needed, a new style template would be required, so both markups would be distinguishable. I don't know of any solution to prevent such abuse, but at least if style templates get imported from an (online) service, one can make sure that the output file will be usable by the service without any further conversion problems.
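To show the difference from the point of view of a processing program, here is a small Python sketch. The sample document, the class names "alternative" and "warning", and the function name are made up for illustration; XHTML namespaces are left out to keep it short.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div>
<p>Use 3 whole eggs. <span class="alternative">use 2 egg whites instead for healthier version</span></p>
<p><span class="warning">The pan will be hot.</span></p>
<p><span style="color: rgb(255, 0, 0)">red, but what does it mean?</span></p>
</div>"""

def collect_by_class(xhtml):
    """Group span texts by their class; spans without a class stay unlabeled."""
    by_class = {}
    unlabeled = []
    for span in ET.fromstring(xhtml).iter("span"):
        text = "".join(span.itertext())
        css_class = span.get("class")
        if css_class:
            by_class.setdefault(css_class, []).append(text)
        else:
            # Only a style attribute is present: the program can see that the
            # text is red, but not whether it is an alternative or a warning.
            unlabeled.append(text)
    return by_class, unlabeled

by_class, unlabeled = collect_by_class(SAMPLE)
print(by_class)
print(unlabeled)
```

With class attributes, listing all alternatives or all warnings is a dictionary lookup. For the third span, the program has nothing to go on without book-specific rules.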