MobileRead Forums - View Single Post - Automated Processing Workflows as and with Free Software

skreutzer · 01-29-2014, 12:08 PM

I'm working on implementing automated processing workflows based on XML, as and with free software. The concept is to develop several XML transformation tools that produce various output formats from the input XML. XML is easy to read and write, and most programming languages provide interfaces for accessing data stored in XML form (which is why so many tools for working with XML-based files already exist). These transformation tools can be combined into automated workflows, but should also be usable independently. The most common use case would be to take a semantic XHTML file and produce EPUB, PDF and other target formats from it, with the entire process being both highly flexible and customizable. XHTML could be considered the default input format because it is a good and widespread way to represent documents in XML, and a lot of text editing applications export XHTML. Other standardized or custom XML input formats could be supported as well. The input has to be semantic, because mapping every piece of direct formatting to the formatting of the target format would add incredible complexity to the processing, and the result still wouldn't reproduce the visual appearance of the source file anyway. Additionally, semantic markup is beneficial when applying layout modifications automatically.
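
To illustrate the idea of small tools that work independently but can also be chained, here is a minimal sketch in Python. It is not the actual implementation: the function names (`build_toc`, `to_plain_text`, `run_workflow`) and the sample document are made up for this example, and only the standard library XML parser is used.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Map the namespaced tags h1..h6 to their heading level.
HEADING_LEVELS = {"{%s}h%d" % (XHTML_NS, n): n for n in range(1, 7)}

# Hypothetical semantic input: headings and paragraphs, no direct formatting.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <h1>Chapter One</h1>
    <p>First paragraph with <em>emphasis</em>.</p>
    <h1>Chapter Two</h1>
    <p>Second paragraph.</p>
  </body>
</html>"""


def build_toc(root):
    """Tool 1: list (level, title) for every heading, in document order."""
    toc = []
    for element in root.iter():
        level = HEADING_LEVELS.get(element.tag)
        if level is not None:
            toc.append((level, "".join(element.itertext()).strip()))
    return toc


def to_plain_text(root):
    """Tool 2: one text block per direct child of <body>, whitespace collapsed."""
    body = root.find("{%s}body" % XHTML_NS)
    blocks = []
    for element in body:
        text = " ".join("".join(element.itertext()).split())
        if text:
            blocks.append(text)
    return "\n\n".join(blocks)


def run_workflow(xhtml_source, steps):
    """Parse the input once and run each named tool on the same tree."""
    root = ET.fromstring(xhtml_source)
    return {name: step(root) for name, step in steps.items()}


results = run_workflow(SAMPLE, {"toc": build_toc, "text": to_plain_text})
print(results["toc"])
print(results["text"])
```

Each function can also be called on its own with a parsed tree. Both tools only look at element names, never at fonts or sizes, which is only possible because the input is semantic.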

For now, I provide support for workflows on the 100% free operating system distribution gNewSense 3.0. With a Java 1.6 VM, an XHTML input file can be processed to EPUB2. With the writer2latex package, an ODT file can be processed to XHTML. Therefore, OpenOffice/LibreOffice can be used as a front end for applying semantic markup to raw text. At the moment, I'm working on automating the workflow, which includes GUI tools for editing the workflow settings when it isn't used as a pure processing system from the command line. In the future, the goal is to develop an XHTML to LaTeX converter in order to generate PDF automatically by applying pre-defined layout templates that are matched to the semantic markup of the input (I have already done this for a custom XML input format; the task now is to generalize it for all kinds of documents by porting the existing tool to XHTML). An additional goal for the current setup could be to remove the OpenOffice/LibreOffice API dependencies from the writer2latex package: since ODT is an XML-based format, all conversions could probably be done as pure XML transformations (the writer2latex package was designed as an OpenOffice add-on and therefore relies on some OpenOffice-specific code). Note that EPUB to XHTML to ODT or LaTeX conversions could be considered too at a later stage.
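
The template matching for the planned XHTML to LaTeX step could look roughly like the following sketch. This is only an illustration of the principle in Python, not the existing tool: the template tables, the function names and the choice of `\chapter`/`\section` for `h1`/`h2` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical layout template: semantic element name -> LaTeX pattern.
# A different template could map the same input to a different layout.
BLOCK_TEMPLATES = {
    "h1": "\\chapter{%s}\n\n",
    "h2": "\\section{%s}\n\n",
    "p": "%s\n\n",
}
INLINE_TEMPLATES = {
    "em": "\\emph{%s}",
    "strong": "\\textbf{%s}",
}

# Characters with a special meaning in LaTeX and their escaped form.
LATEX_ESCAPES = {
    "\\": "\\textbackslash{}",
    "&": "\\&", "%": "\\%", "$": "\\$", "#": "\\#", "_": "\\_",
    "{": "\\{", "}": "\\}",
    "~": "\\textasciitilde{}", "^": "\\textasciicircum{}",
}


def escape_latex(text):
    """Escape character by character so nothing gets escaped twice."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def render_inline(element):
    """Render the text content of an element, applying inline templates."""
    parts = [escape_latex(element.text or "")]
    for child in element:
        template = INLINE_TEMPLATES.get(local_name(child.tag), "%s")
        parts.append(template % render_inline(child))
        parts.append(escape_latex(child.tail or ""))
    return "".join(parts)


def render_block(element):
    """Render one block-level element; unknown blocks become paragraphs."""
    template = BLOCK_TEMPLATES.get(local_name(element.tag), "%s\n\n")
    content = " ".join(render_inline(element).split())
    return template % content


def xhtml_to_latex(xhtml_source):
    """Convert the direct children of <body> to a LaTeX fragment."""
    root = ET.fromstring(xhtml_source)
    body = root.find("{%s}body" % XHTML_NS)
    return "".join(render_block(child) for child in body)
```

The result is only a document fragment; a preamble and `\begin{document}` would still have to come from the layout template. Nested block structures such as lists or tables are not handled here.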

If you're interested in such automated workflows, please feel free to discuss questions, your requirements, possible uses and potential solutions. See the already existing discussion about this topic in the context of Sigil as a front end for applying semantic markup to plain text. Please keep in mind that the automated processing workflow I'm developing as and with free software started out very primitive, and will improve and expand over time.

Theoretical background and actual implementations in commercial and scientific contexts, which we would probably also want as free software for self-publishers and new online or offline publishing services (list ordered by relevance):

Hodder Education by Alyssum Ross: describes how an automated XML processing workflow is used by a publisher to produce output in various formats from the same input file. For complex designs (little or no page structure, almost more DTP style), they use Microsoft Word as a front end for semantic encoding (style templates) and produce XML from it at the very last stage, so that the XML can still be processed to EPUB or sold independently to websites that import the data from the XML. Less manual work involved. No direct formatting; the layout is applied automatically to the styles used. I assume the term "standardized" for Microsoft Word output is based solely on the style definitions of their Word template.

This list will get updated!

Last edited by skreutzer; 01-31-2014 at 09:40 AM.