MobileRead Forums - View Single Post - Automated Processing Workflows as and with Free Software

skreutzer · 02-03-2014, 08:40 PM

@dickloraine:

The company put great effort into implementing an automated processing workflow, which is no different from what everybody in the field does. However, they saved themselves the much greater effort of hand-formatting all their titles, so in digital publishing (especially at volume) there aren't many alternatives to changing the old publishing process to an automated one. Their 250 styles represent 250 print layouts.

Target: first of all myself, for my own book projects, but other authors/publishers as well. Whoever wants to save the time or costs caused by manual formatting.

Editing and writing are completely different tasks from formatting and typesetting, so editors and authors writing their unformatted manuscript are not the target users, though they may still benefit from an automated processing workflow if they do the formatting themselves after they've finished writing, or if they have somebody to do the formatting for them. If authors apply formatting while writing, they should use style templates instead of direct formatting. Whether a file is usable for an automated processing workflow depends on the author or formatter. If it isn't, they won't benefit.
There's no need to reinvent the wheel. XHTML is a common export format for many writing applications and word processors, and it is omnipresent in the web and e-book context, so most backends will take XHTML as their input format, while other input formats with corresponding backends may be supported as well.

If somebody's text is in a custom XML format (with or without a schema), they can transform it to the input format of their preferred backend. If they want or need special features, they could adjust an existing backend to accept the custom XML format as input.
Yes, EPUB and XHTML, back and forth and also in many other directions.
Schema validation can be integrated into the workflow, but it can also be left out (by simply skipping unexpected constructs in the input file).
If your XML format is in common use, converters to the input XML formats most likely already exist (and if not, they may well get developed). If it is your own custom XML format, you could implement a converter yourself or get somebody to help you with it, since I guess you already have your own custom software to read and write that format.
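
A minimal sketch of such a converter, in Python with only the standard library. The custom element names (`chapter`, `title`, `para`, `emph`) and the mapping table are hypothetical, made up for illustration; a real format would need its own table. Elements that are not in the table are skipped rather than reported as errors, which is the "no schema validation" variant described above.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical mapping from custom element names to XHTML element names.
TAG_MAP = {"chapter": "div", "title": "h1", "para": "p", "emph": "em"}


def _append_text(parent, text):
    """Attach text after the last child of parent, or to parent itself."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def convert(node):
    """Return the XHTML element for a known custom element, else None."""
    target = TAG_MAP.get(node.tag)
    if target is None:
        return None  # unexpected construct: skipped, not treated as an error
    out = ET.Element(target)
    out.text = node.text
    for child in node:
        converted = convert(child)
        if converted is not None:
            out.append(converted)
        # Text following the child is kept even if the child was skipped.
        _append_text(out, child.tail)
    return out


def to_xhtml(source):
    """Convert a custom XML string into a small XHTML document string."""
    html = ET.Element("html", {"xmlns": XHTML_NS})
    body = ET.SubElement(html, "body")
    converted = convert(ET.fromstring(source))
    if converted is not None:
        body.append(converted)
    return ET.tostring(html, encoding="unicode")
```

Calling `to_xhtml()` on a document that contains an unknown `<note>` or `<sidebar>` element drops that element but keeps the surrounding text. A stricter workflow would validate against a schema first and stop on the first unknown construct.
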
Maybe you can't generate your desired output that way, and customizing a complex piece of software which isn't implemented as a series of individual small tools might be hard or time-consuming. If you can already generate your desired output under conditions acceptable to you, there's no need to change your system just for the sake of change, is there? As for my own book projects, for instance, manual work isn't an option.

Yes, I have to invest more time in developing an automated processing workflow than I would in doing the manual work for one project, but as there are several projects of different sizes, a break-even point will certainly be reached at some time: once solved, always solved. Additionally, I'm not the only one who needs automation; other people might need it as well, and sharing solutions with them and collaborating on their development saves time for everybody compared with doing manual work or reinventing the wheel and keeping it unshared, proprietary, non-free.
It seems that a lot of existing word processors can already export XML, since XHTML is XML. I thought about providing ordinary writers with a JavaScript-based markup tool with XHTML output, but such a thing very probably already exists, and at the moment I guess predefined template documents for OpenOffice/LibreOffice are quite a good way to encourage ordinary authors to apply semantic markup, so their output can feed directly into an automated processing workflow.

I would absolutely go into the discussion about free software, because there are already lots and lots of proprietary, non-free "solutions" around, which you are neither able/allowed to access nor technically able to use because of their dependency on a proprietary environment and proprietary tools. As for self-publishers, it seems that at the moment no solution is available at all which can be used independently.

You definitely should learn more about free software. Proprietary software might prevent you from doing the job at any time. The "open" approach risks becoming proprietary at any time. You aren't forced to switch your operating system; on the contrary, there are so many ports of free software to proprietary operating systems available that one gets almost all the benefits of the free world even in a proprietary environment (since free software isn't discriminatory and can freely be used and modified, no matter what the operating system environment might be), while proprietary software is completely unusable on several levels in the free software world due to artificial restrictions.

It is very likely that all software used within the automated processing workflows is also available on proprietary systems and that the workflow implementation itself can easily be ported to them, but you should not expect me to have any interest in doing so myself and wasting time on things which are quite contrary to the initial goal of the project, which is developing automated processing workflows with and as free software. If such a workflow depends on proprietary software, there's absolutely no way to run it while still respecting the digital freedoms of the user, since the user is forced to use non-freedom-respecting software in order to run the free software.

@Arios:

I don't think it is too ambitious, since I've already implemented such an automated processing workflow in the past for one of my own projects (in case you haven't found the link in the Sigil thread): http://www.freie-bibel.de/official/p...lisierung.html (unfortunately, the text itself is in German, but you might still look at the images). I'm now only trying to generalize it while making it as easy to use as possible, which will take some time and effort, but I'm confident that some progress can be made over time.

You are right, I should try harder to communicate the concept and idea behind the kind of solution I have in mind, since the discussion of what the details of such an implementation would be is starting to look almost repetitive. Usually, I would prefer to just do the programming and demonstrate the results as a real-world application, but on the other hand I started this thread to ask some questions, give some updates, etc., so some kind of project description could be quite useful for people who just want to find out whether they could make use of an automated processing workflow or not.

I'm actually planning to provide support for writer2xhtml standalone. Writer2latex doesn't seem that usable to me at the moment, since it requires ODT (which is XML) as input, so in order to make use of it as a backend, an application would be required to write ODT or to convert to it. I would consider LaTeX an "irreversible" output format since it isn't XML and can't directly be used in an XML-based workflow (while LaTeX to XHTML or LaTeX to ODT doesn't seem to be high-priority). At the moment, I'm aiming for ODT to XHTML, and XHTML to EPUB and LaTeX (or XSL-FO), so the latter part can even be used with XHTML input from any application without requiring ODT, while EPUB and LaTeX output would be available for both ODT and XHTML sources.
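
To illustrate the XHTML to LaTeX stage, here is a minimal Python sketch using only the standard library. It is not the actual backend. It assumes well-formed XHTML input and handles only `h1`, `p`, `em` and `strong`; mapping `h1` to `\chapter` is an arbitrary choice made for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Characters with a special meaning in LaTeX and their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape(text):
    """Escape LaTeX special characters one by one (no double escaping)."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text or "")


def inline(node):
    """Render the content of a node, including nested inline markup."""
    parts = [escape(node.text)]
    for child in node:
        inner = inline(child)
        if child.tag == XHTML + "em":
            inner = r"\emph{" + inner + "}"
        elif child.tag == XHTML + "strong":
            inner = r"\textbf{" + inner + "}"
        parts.append(inner)
        parts.append(escape(child.tail))
    return "".join(parts)


def xhtml_to_latex(source):
    """Convert headings and paragraphs of an XHTML string to LaTeX."""
    root = ET.fromstring(source)
    blocks = []
    for node in root.iter():  # document order
        if node.tag == XHTML + "h1":
            blocks.append(r"\chapter{" + inline(node) + "}")
        elif node.tag == XHTML + "p":
            blocks.append(inline(node))
    return "\n\n".join(blocks)
```

Because the function only needs XHTML, it does not matter whether the input came from an ODT conversion or from any other application, which is the point of splitting the workflow into separate stages.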

Free software isn't merely a technical issue, but to look at the technical aspect: as you know, LibreOffice relies on Java for some of its components, so a Java VM has to be implemented for each proprietary operating system, where it competes with the VMs of those operating systems, so the operating system environments might make things artificially difficult for a Java VM implementation.

If Oracle decides to discontinue its proprietary VM implementations for the proprietary operating systems, all freely licensed Java software will effectively become unusable on proprietary operating systems, leaving only the free implementation of a Java VM (OpenJDK) intact - and I guess there wouldn't be much interest in porting it to the proprietary operating systems for obvious reasons (lots of work, no gain at all, contrary to the goals of free software). So the LibreOffice ports to proprietary systems are in constant danger of losing their technical foundation, and if LibreOffice can't be supported any longer on proprietary operating systems, just guess what its users will do then: they'll switch to a proprietary word processor. Since Java bytecode is portable, Java programs are automatically cross-platform, but whether you are allowed to use, modify and distribute a program or not still depends on licensing.

To demonstrate the benefits of semantic encoding, especially when combined with automated processing, I would provide a set of default layouts with the automated processing workflow, including ODT document templates with predefined styles. Then I would show how text in the ODT documents can easily be rendered into PDF and EPUB without manual adjustments, as long as the template styles are supported by the backend and are applied semantically in the ODT document (additionally showing that the visual appearance of the generated files can easily be changed, without any manual adjustment to the ODT, just by changing the style implementation in the backend).
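
The idea of swapping the style implementation without touching the document can be sketched in a few lines of Python. The style names (`chapter-title`, `body-text`, `quotation`) and both layout tables are hypothetical examples, not the styles of any existing template, and text escaping is left out to keep the sketch short.

```python
# Document content as (semantic style name, text) pairs, roughly what one
# might extract from a template-based ODT file. Style names are made up.
DOCUMENT = [
    ("chapter-title", "The Beginning"),
    ("body-text", "It was a quiet morning."),
    ("quotation", "Nothing is ever lost."),
]

# Two interchangeable style implementations. The document above stays the
# same; only the table handed to render() changes.
LAYOUT_PRINT = {
    "chapter-title": r"\chapter{%s}",
    "body-text": "%s",
    "quotation": r"\begin{quote}%s\end{quote}",
}

LAYOUT_EBOOK = {
    "chapter-title": "<h1>%s</h1>",
    "body-text": "<p>%s</p>",
    "quotation": "<blockquote><p>%s</p></blockquote>",
}


def render(document, layout):
    """Render every block with the template its style has in the layout."""
    output = []
    for style, text in document:
        if style not in layout:
            # A style the backend does not support cannot be rendered.
            raise KeyError(f"style {style!r} is not supported by this layout")
        output.append(layout[style] % (text,))
    return "\n".join(output)
```

`render(DOCUMENT, LAYOUT_PRINT)` and `render(DOCUMENT, LAYOUT_EBOOK)` produce LaTeX and XHTML from the same source. Directly formatted text carries no style name to look up, so it cannot take part in this kind of processing.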

I observed that some self-publishers use predefined Microsoft Word templates for print formatting, but they have to do the e-book formatting separately (and keep both in sync), or they do both by manual direct formatting. I haven't got a better idea for showing the benefits of semantic encoding than demonstrating that this manual effort isn't necessary if semantic formatting is applied in the first place.

@Tex2002ans:

I know a pretty good piece of EPUB3 reading software: it's called "web browser" ;-) No, seriously, just display the TOC in a side pane of a browser window, that's it.

"Infogrid: Digital Publisher": yes, exactly the kind of automated processing workflow they have, but for everyone and as free software.
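
Regarding the browser-as-reader remark: a rough Python sketch of how the TOC entries for such a side pane could be pulled out of an EPUB3 file with the standard library alone. It assumes a conforming EPUB3 package (a `META-INF/container.xml` pointing to the package document, and a manifest item carrying the `nav` property); the function name `read_toc` is made up for this example.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}
XHTML = "{http://www.w3.org/1999/xhtml}"
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"


def read_toc(epub_file):
    """Return (title, href) pairs from the EPUB3 navigation document."""
    with zipfile.ZipFile(epub_file) as archive:
        # container.xml names the package document (the OPF file).
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(archive.read(opf_path))
        # The navigation document is the manifest item with property "nav".
        for item in package.findall("opf:manifest/opf:item", NS):
            if "nav" in item.get("properties", "").split():
                nav_path = posixpath.join(posixpath.dirname(opf_path),
                                          item.get("href"))
                break
        else:
            raise ValueError("no navigation document declared in the manifest")
        nav_doc = ET.fromstring(archive.read(nav_path))
    # Inside it, the TOC is the <nav> element with epub:type="toc".
    for nav in nav_doc.iter(XHTML + "nav"):
        if "toc" in nav.get(EPUB_TYPE, "").split():
            return [("".join(a.itertext()).strip(), a.get("href"))
                    for a in nav.iter(XHTML + "a")]
    return []
```

The returned hrefs are relative to the navigation document, so a small wrapper page could list them as links in one pane and load the targets in another.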

Regarding your other observations: for the input, at some point in time, somebody has to apply formatting to a text in order to prepare it for output generation. At this point, whoever and whenever it is, one has to decide between direct formatting and semantic markup. If the decision is made in favor of semantic markup, it will be beneficial for the person who made the decision, and if not, this person will exclude themselves from the benefits of automated processing. It can be the author, it can be an intermediary. For the output, I don't think a "one size fits all" approach would be the best way to handle it; I would rather develop several individual processors for different features and layouts. Initially, I would aim for basic book features, and maybe there might be more complex document representation as well (probably by integrating more sophisticated software that already exists for such tasks?).

However, as you mentioned the specific example of referencing, such things aren't difficult at all for automated processing, and as EPUB->LaTeX->PDF is basically XHTML->LaTeX->PDF, I already do something very similar quite often (however, not with the goal of reproducing the visual appearance of the XHTML in a browser in the PDF). As for your question about an intermediate format: there might be lots of intermediate formats, some probably more complex, some simpler. If somebody wants to implement complex output, they also define what the input should look like (which markup is supported by the backend), so input files can be transformed to this "publicly specified interface" of the backend in order to produce the output, be it in one intermediate step or in several intermediate steps as part of a larger workflow. Just combine some scripts that do the transformations or add data/structure. However, I've started pretty simple with only basic features, and will expand from there as needed.
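
Combining small steps into a larger workflow could look like the following Python sketch. The three steps are hypothetical stand-ins, invented for the example; the only point is that each step accepts the output of the previous one, so steps can be added, removed or replaced independently.

```python
from functools import reduce
from html import escape


def strip_blank_lines(text):
    """Step 1: drop empty lines from the raw input."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def paragraphs_to_xhtml(text):
    """Step 2: turn every remaining line into an XHTML paragraph."""
    return "\n".join(f"<p>{escape(line.strip())}</p>"
                     for line in text.splitlines())


def wrap_body(text):
    """Step 3: add the surrounding structure the next stage expects."""
    return "<body>\n" + text + "\n</body>"


def run_pipeline(data, steps):
    """Feed the data through all steps in order and return the result."""
    return reduce(lambda current, step: step(current), steps, data)


if __name__ == "__main__":
    raw = "First line\n\n  Second & last  \n"
    print(run_pipeline(raw, [strip_blank_lines, paragraphs_to_xhtml, wrap_body]))
```

In a real workflow the steps would more likely be separate programs exchanging files, but the contract is the same: the output format of one step is the documented input format of the next.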

Consistency across all processed books will automatically be guaranteed, since the processing workflow won't change its internals by itself ;-) If the initial layout definition is done right in the first place, I also think this is an advantage for the ordinary writer, who gets a quality layout without the risk of accidentally introducing errors into the design.

Post #18. Last edited by skreutzer; 02-09-2014 at 09:03 AM.