|  01-29-2014, 12:08 PM | #1 | 
| Software Developer            Posts: 190 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3 | 
				
				Automated Processing Workflows as and with Free Software
			 
			
			I'm working on implementing automated processing workflows based on XML as and with free software. The concept is to develop several XML transformation tools that produce various output formats from the input XML. XML is easily readable and writable, and most programming languages provide interfaces to access data stored in XML form (which is why there are already so many tools for working with XML-based files). These XML transformation tools can be combined into automated workflows, but should also be usable independently. The most common use case would be to take semantic XHTML and produce EPUB, PDF and other target formats from it, while the entire process should be both highly flexible and customizable. XHTML could be considered the default input format because it is a good and widespread way to represent documents in XML, and a lot of text editing applications export XHTML. Other standardized or custom XML input formats could be supported as well.

			The input has to be semantic, because processing would become incredibly complex if every instance of direct formatting had to be mapped to the formatting of the target format, and the result still wouldn't reproduce the visual appearance of the source file anyway. Additionally, semantic markup is beneficial when applying layout modifications automatically.

			For now, I provide support for workflows on the 100% free operating system distribution gNewSense 3.0. With a Java 1.6 VM, an XHTML input file can be processed to EPUB2. With the writer2latex package, an ODT can be processed to XHTML. Therefore, OpenOffice/LibreOffice can be used as a front end to apply semantic markup to raw text. At the moment, I'm working on automating the workflow, which includes GUI tools to edit the workflow settings when it isn't used as a pure processing system from the command line. 
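To make the XHTML to EPUB2 step concrete, here is a minimal sketch in Python of what such a packaging tool has to do. It is not the Java tool mentioned above; the function names (`build_opf`, `build_ncx`, `write_epub`) and the single-file layout under `OEBPS/` are assumptions made for illustration. It only wraps one existing XHTML file into an EPUB2 container and does not validate or restructure the input.

```python
import zipfile
from xml.sax.saxutils import escape, quoteattr

# Fixed container file that points the reading system at the OPF package file.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_opf(title, book_id, language="en"):
    """Return a minimal OPF 2.0 package document for one XHTML file (hypothetical layout)."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{escape(title)}</dc:title>
    <dc:language>{escape(language)}</dc:language>
    <dc:identifier id="bookid">{escape(book_id)}</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="text" href="text.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="text"/>
  </spine>
</package>
"""


def build_ncx(title, book_id):
    """Return a minimal NCX table of contents with a single entry."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content={quoteattr(book_id)}/></head>
  <docTitle><text>{escape(title)}</text></docTitle>
  <navMap>
    <navPoint id="np1" playOrder="1">
      <navLabel><text>{escape(title)}</text></navLabel>
      <content src="text.xhtml"/>
    </navPoint>
  </navMap>
</ncx>
"""


def write_epub(xhtml_bytes, out, title, book_id):
    """Package one XHTML document into an EPUB2 container at `out` (path or file object)."""
    with zipfile.ZipFile(out, "w") as zf:
        # The mimetype entry must come first and must be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        deflated = zipfile.ZIP_DEFLATED
        zf.writestr("META-INF/container.xml", CONTAINER_XML, compress_type=deflated)
        zf.writestr("OEBPS/content.opf", build_opf(title, book_id), compress_type=deflated)
        zf.writestr("OEBPS/toc.ncx", build_ncx(title, book_id), compress_type=deflated)
        zf.writestr("OEBPS/text.xhtml", xhtml_bytes, compress_type=deflated)
```

Because the packaging step only needs the XHTML bytes, it does not care whether they came from writer2latex, another exporter, or a website, which is the reason for keeping it a separate tool.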
In the future, the goal is to develop an XHTML to LaTeX converter in order to automatically generate PDF by applying pre-defined layout templates, which will be matched to the semantic markup of the input (I already did so for a custom XML input format; now the task is to generalize it for all kinds of documents, namely by porting the existing tool to XHTML). An additional goal for the current setup could be to remove OpenOffice/LibreOffice API dependencies from the writer2latex package: as ODT is an XML-based format, all conversions could probably be done as pure XML transformations (the writer2latex package was designed to be an OpenOffice add-on and therefore relies on some OpenOffice-specific code). Note that EPUB to XHTML to ODT or LaTeX conversions could be considered too at a later stage.

In case you're interested in such automated workflows, please feel free to discuss questions, your demands, possible usage and potential solutions. See the existing discussion about this topic in the context of Sigil as a front end for applying semantic markup to plain text. Please keep in mind that the automated processing workflow as and with free software I'm developing started out very primitively, and will improve and expand over time.

Theoretical background and actual implementations in commercial and scientific contexts, which we might also want as free software for self-publishers and new online or offline publishing services (list ordered by relevance): 
 This list will get updated! Last edited by skreutzer; 01-31-2014 at 09:40 AM. | 
|   |   | 
|  01-29-2014, 01:15 PM | #2 | 
| Grand Sorcerer            Posts: 11,470 Karma: 13095790 Join Date: Aug 2007 Location: Grass Valley, CA Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7 | 
			
			Having followed this discussion for a while now, I think this new thread is better placed in the Workshop forum than in the Other Formats forum, since it is not about other formats but about methods to automate the creation of eBooks. Thus I have moved the thread. Dale | 
|   |   | 
|  01-29-2014, 07:49 PM | #3 | |
| Wizard            Posts: 2,625 Karma: 3120635 Join Date: Jan 2009 Device: Kindle PW3 (wifi) | 
			
			@DaleDe Thank you. Just sitting in the front row now. So as not to be totally useless, I'll quote the information that St_Albert just gave me on how to work with writer2latex as a standalone tool: Quote: 
 Last edited by roger64; 01-29-2014 at 07:55 PM. | |
|   |   | 
|  01-29-2014, 11:15 PM | #4 | 
| Guru            Posts: 631 Karma: 7544528 Join Date: Apr 2013 Location: Berlin Device: PRS 350, Kobo Aura | 
			
			I don't know if this helps, but there are similar tools available that use some form of Markdown instead of XML. First there is Pandoc, which is mainly a converter. And then there is AsciiDoc, which is an advanced Markdown-like syntax. After a quick look, this could be very interesting. But it is not (or does not look) that simple. And as it is plain-text markup, there is no word processor involved, only simple editors. So no WYSIWYG, no GUI specialized for the task etc. At first sight, it seems to suffer from something that unfortunately affects a lot of open/free software: it is not user friendly. All you get is a command line tool. To convert to PDF, for example, you have to manually install LaTeX, DocBook etc. The setup alone is not that easy. And the lack of a GUI or a customized Notepad-style app to write in is another difficulty. Not to mention a revision system, spell-checking, comments etc. that a word processor provides. But maybe you could build on something these projects do. | 
|   |   | 
|  01-30-2014, 07:25 AM | #5 | 
| Software Developer            Posts: 190 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3 | 
			
			@DaleDe: Thanks for moving it, I was quite unsure where this topic belongs.

			@roger64: Yes, I do exactly the same as st_albert described (using writer2latex as a standalone tool), but not as a direct ODT to EPUB conversion. Instead, I convert ODT to XHTML to EPUB, since there are a lot of things one might want to do to the XHTML, including XML validation, adding, removing or moving parts of the document, writing a log of spelling and grammar mistakes, or inserting soft hyphens. Further, with XHTML to EPUB as a separate, independent step, an ODT source file isn't required; other XHTML-exporting software or websites can be used as well. Furthermore, instead of coupling ODT with EPUB, LaTeX and every other future output format (a consequence of the writer2latex package's goal of reproducing the ODT as displayed in the OpenOffice/LibreOffice WYSIWYG view), other front ends and back ends could be added and customized at any time without the need to adjust all of the other parts of the processing workflow. I ensure that ODT to XHTML to EPUB works at present, and I hopefully will be able to ensure this in the future, too, while helping to set up the environment needed for such conversions. Currently I'm working on a shell script to automate the conversion process from ODT to XHTML to EPUB. After this I'm planning to build a tool to manage book projects for this shell script (adding projects, managing metadata, running conversions in bulk), and after that a tool to manage the setup of the workflow environment. Things like converting multiple ODTs to a single EPUB could then be added relatively easily.

			@dickloraine: Other tools can be integrated at any time. 
However, I personally won't put time into integrating a tool which isn't free software or adds unnecessary dependencies. You can still do so for yourself, and other people can provide support and help for proprietary, restrictive, non-free tools if they like - even if it is a pretty bad idea anyway. Some time ago I already experimented with pandoc, which is quite usable, but written in Haskell, so my fear was (and is) that this could make it hard for other developers and the community to take full advantage of it. But it is definitely an option, yes, and it can be integrated as long as there aren't more advanced converters available to the processing workflow. AsciiDoc isn't an option as a processing format, because nowadays it is technically not reasonable to support plain text formats other than XML for processing automation: XML is widely supported in most programming languages, while every custom plain text format has to be parsed separately. So custom plain text formats only play a role when converted to XML, or as target formats which are not intended to be processed any further (like *.tex). AsciiDoc therefore could be a front end for text editing, because AsciiDoc files can be converted to XML/XHTML. As for command line tools not being user friendly, keep in mind that command line tools are in most cases the only tools which can be automated, while GUI tools often cannot. So in general it is a good idea to maintain a tool as a command line tool in order to keep it automatable, and to develop a GUI on top of it to make it user friendly, just as LyX does for LaTeX. For your LaTeX setup example, consider that installing LaTeX on a free operating system is just one click, because software package management is an integral part of free software practices. On non-free, proprietary operating systems there are LaTeX ports available as well; their setup needs a few clicks instead of just one, but LaTeX is easy to install there, too. 
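As an illustration of the point that XML can be processed with what a language's standard library already provides, here is a minimal Python sketch of two independent intermediate steps on the XHTML: a well-formedness check and the removal of parts of the document. The function names and the idea of selecting elements by `class` are assumptions for this example, not part of the actual workflow tools.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)


def check_well_formed(xhtml_text):
    """Parse the input and return its root element; raise ValueError if it is not well-formed XML."""
    try:
        return ET.fromstring(xhtml_text)
    except ET.ParseError as err:
        raise ValueError(f"input is not well-formed XML: {err}") from err


def remove_by_class(parent, class_name):
    """Remove all descendants whose class attribute lists class_name; return how many were removed.

    Simplification: text that directly follows a removed element (its "tail")
    is removed together with it.
    """
    removed = 0
    for child in list(parent):
        if class_name in child.get("class", "").split():
            parent.remove(child)
            removed += 1
        else:
            removed += remove_by_class(child, class_name)
    return removed
```

Each function takes and returns plain data, so a shell script or a GUI could call either one without knowing about the other steps.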
Some time ago I discovered river-valley.tv and I'm revisiting it to look through the videos. I'll pick those that describe the theoretical background and actual implementations of automated processing workflows, which are quite common in scientific and commercial contexts, so we might want something similar as free software for self-publishers or new online publishing services. I could come up with a lot of ideas for what could be implemented, but just as a hint: an automated processing workflow could take the mobileread.com RSS feed URL (or any URL that links to websites or parts of websites) as input and convert the latest posts you haven't read yet to EPUB, sorted by date, by topic or alphabetically, with or without an option to subscribe to specific topics or forum categories. Or an entire thread could be converted to a beautiful PDF, so that conversations can be preserved physically as a bound hardcover book via print-on-demand technology. One can already do so with wget/cURL, and websites or browsers may already provide EPUB and PDF download/export. Still, I would like to have universal processing workflows available that aren't tied to a browser as PDF creator, to a website's EPUB export, or to websites from the internet at all, because the very same code would also process XHTML exported from a word processor and other sources. They would thus be beneficial for a lot more people and not limited to specialized contexts. Last edited by skreutzer; 01-30-2014 at 10:34 AM. | 
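The feed idea from the post above can be sketched as a front end that turns an RSS 2.0 document into XHTML for the rest of the workflow. This is a rough Python illustration only: `feed_to_xhtml` is a hypothetical name, downloading the feed is left out, every item is assumed to carry a `pubDate` with a time zone, and item descriptions are inserted as escaped plain text rather than parsed as HTML.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from xml.sax.saxutils import escape


def feed_to_xhtml(rss_text, title="Unread posts"):
    """Convert the items of an RSS 2.0 feed into one XHTML document, oldest item first."""
    channel = ET.fromstring(rss_text).find("channel")
    items = []
    for item in channel.findall("item"):
        items.append((
            parsedate_to_datetime(item.findtext("pubDate")),  # RFC 822 date string
            item.findtext("title", default=""),
            item.findtext("description", default=""),
        ))
    # Sort by publication date; sorting by title would only need a different key.
    items.sort(key=lambda entry: entry[0])
    sections = "\n".join(
        f'<div class="post"><h2>{escape(post_title)}</h2><p>{escape(text)}</p></div>'
        for _, post_title, text in items
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>\n{sections}\n</body>\n</html>\n"
    )
```

The resulting XHTML is ordinary input for an XHTML to EPUB or XHTML to LaTeX step, so nothing downstream has to know the content came from a feed.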
|   |   | 
|  01-30-2014, 11:35 AM | #6 | 
| Guru            Posts: 631 Karma: 7544528 Join Date: Apr 2013 Location: Berlin Device: PRS 350, Kobo Aura | 
			
			Yeah, but as I understand it, the thought behind Markdown/AsciiDoc is: hardly anybody writes books in XML. To be automatically processed, something needs to be in a form that is automatically processable. And at the beginning of the chain there is just an author who writes. So to be able to automatically process what the writer writes, you need the author to tell the program what he means. So you need some form of semantic styles. AsciiDoc is nothing more than such a semantic language. And you will need something like that. You need to ensure that an author uses semantics. Of course, you could require each author to define his own semantics and tell your program how it should handle them. But it is easier to provide the basic semantics.
		 | 
|   |   | 
|  01-30-2014, 12:03 PM | #7 | 
| Software Developer            Posts: 190 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3 | 
			
			You are right, nobody writes a book directly in XML. However, no processing workflow will be based on AsciiDoc. So the way to go is to convert AsciiDoc format to XML or XHTML in order to feed the text (retaining its semantic markup) into the automated processing workflow. I'm perfectly fine with it, and as AsciiDoc provides (hopefully well-formed) XML and (hopefully valid) XHTML out of the box, it could be quite usable as front end for semantic writing, especially because it is a valuable goal to encourage semantic markup instead of direct formatting right from the beginning. As AsciiDoc is free software, I might look into supporting it as front end. Do you use AsciiDoc or do you know somebody who writes files in AsciiDoc and wants automated output creation?
		 | 
|   |   | 
|  01-30-2014, 01:34 PM | #8 | 
| Guru            Posts: 631 Karma: 7544528 Join Date: Apr 2013 Location: Berlin Device: PRS 350, Kobo Aura | 
			
			No, I don't use it. I just mentioned it because I think that a front end, or some rules about which semantic styles should be used, is needed. Personally I think the best option would be a set of predefined styles. Of course these could be provided by others. But some default styles, plus templates for how they would be rendered in the destination formats, could be very helpful. So someone could make a Word .dot template with the right styles, an author could use it to write, and the result could then be converted to XML. It would thus be possible to write in nearly any word processor or even editor (with some form of Markdown) if someone writes a module which converts the output to this standard XML. But maybe this is nothing that you need to do. As I understand it, everyone could use what you build to make such environments. Therefore such rules do not need to be part of what you do. But I think such a front end is needed in addition. After rereading what you said about OpenOffice, that could be an example of such an environment: make an OpenOffice template and add a plugin to OpenOffice which converts using the styles defined in the template, and it could be an easy-to-use solution. So in short: I think I misread what you want to do, or I didn't see immediately how it would be used by an author. | 
|   |   | 
|  01-30-2014, 01:37 PM | #9 | 
| Grand Sorcerer            Posts: 11,470 Karma: 13095790 Join Date: Aug 2007 Location: Grass Valley, CA Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7 | |
|   |   | 
|  01-30-2014, 05:09 PM | #10 | 
| Software Developer            Posts: 190 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3 | 
			
			@dickloraine: Yes, predefined styles as default and example, while front end styles could also be matched to styles the back end recognizes. So one could have predefined or custom styles at the front end or the back end or both, and if needed, a matching would be configured. Ideally, word processors would provide an import and export for style definitions, but even without that, a template file could be provided for people to write in, and its processability by the corresponding processing workflow could be guaranteed. The main use case is based on the observation that the ordinary author and even some professionals tend to apply direct formatting separately for typesetting and for e-book generation (since both are incompatible and therefore require duplicate manual work), so it costs either time, or money (spent on somebody else's time), or quality. Since some tools which are used for writing don't encourage semantic markup, and automated processing workflows aren't accessible enough yet, solving both issues could provide a better alternative, at least for some, with every new book project. Also, processing texts in bulk and specialized workflows to produce custom output seem to be very attractive for more advanced users, who in most cases need automation in one way or another anyway.
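			The matching between front end styles and back end styles described here could be sketched like this in Python. Everything named in it is hypothetical: the style names in `STYLE_MAP`, the templates in `LATEX_TEMPLATES`, and the assumption that the front end writes its style name into the `class` attribute of each top-level element in the XHTML body. Inline markup inside an element is flattened to plain text to keep the example short.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical configuration: front end style name -> back end style name.
STYLE_MAP = {"Title1": "chapter", "BodyText": "paragraph", "Citation": "quote"}

# Hypothetical back end templates, one per back end style.
LATEX_TEMPLATES = {
    "chapter": "\\chapter{{{text}}}",
    "paragraph": "{text}\n",
    "quote": "\\begin{{quote}}\n{text}\n\\end{{quote}}",
}

LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}", "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(text):
    """Escape characters that have a special meaning in LaTeX, one character at a time."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def xhtml_to_latex(xhtml_text, style_map=STYLE_MAP, templates=LATEX_TEMPLATES):
    """Render each top-level body element through the template its style maps to.

    Returns the LaTeX text and the list of front end styles that had no mapping,
    so that missing mappings can be reported instead of silently guessed.
    """
    body = ET.fromstring(xhtml_text).find(XHTML + "body")
    rendered, unmapped = [], []
    for element in body:
        front_style = element.get("class", "")
        back_style = style_map.get(front_style)
        if back_style is None:
            unmapped.append(front_style)
            continue
        text = latex_escape("".join(element.itertext()).strip())
        rendered.append(templates[back_style].format(text=text))
    return "\n".join(rendered), unmapped
```

Swapping the word processor template only means editing `STYLE_MAP`, and changing the layout only means editing `LATEX_TEMPLATES`, which is the separation of front end and back end the post argues for.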
		 | 
|   |   | 
|  01-31-2014, 03:02 AM | #11 | |
| Bookmaker & Cat Slave            Posts: 11,503 Karma: 158448243 Join Date: Apr 2010 Location: Phoenix, AZ Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2 | Quote: 
 You might want to read the original, rather lengthy thread over at the Sigil forums: https://www.mobileread.com/forums/sho...80&postcount=1 before you go further down this path of discussion. Just sayin'. Good luck. Hitch | |
|   |   | 
|  01-31-2014, 09:40 AM | #12 | 
| Software Developer            Posts: 190 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3 | 
			
			Hitch, I already linked the thread in my initial post.
		 | 
|   |   | 
|  01-31-2014, 04:49 PM | #13 | 
| Bookmaker & Cat Slave            Posts: 11,503 Karma: 158448243 Join Date: Apr 2010 Location: Phoenix, AZ Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2 | 
			
			Yes, I know, but I'm not sure Dickloraine, based on his comments, realized that a lot of this ground had been previously and, from his perspective, unfruitfully trodden.  Just trying to save him the effort of saying the same things that everyone else has already said to you. Hitch | 
|   |   | 
|  02-01-2014, 08:04 PM | #14 | 
| Software Developer            Posts: 190 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3 | 
			
			Saving him the time of suggesting to me that such a workflow can't work?
		 Last edited by skreutzer; 02-02-2014 at 10:15 AM. | 
|   |   | 
|  02-02-2014, 12:19 PM | #15 | 
| Guru            Posts: 631 Karma: 7544528 Join Date: Apr 2013 Location: Berlin Device: PRS 350, Kobo Aura | 
			
			I just skimmed the other thread. And yes, after reading that and rethinking your approach, I have many questions. Have you actually watched the video you posted in your first post? I mean, this company has made a great effort to create a standard for their publications, resulting in roughly 250 styles. So let me ask you a few questions that I think you should answer for yourself to see where this project should lead:
			- Who is the target of this project? Authors, publishers, editors?
			- Do you want to define your own new XML standard for this? Why should anybody use your standard? Should I define my own? Again, who is the target? For a publisher it may make sense to develop such a schema. But it is very difficult.
			- Why not use some existing format and work from there? EPUB or XHTML?
			- Making a schema for your XML is like writing a markup language, hence my mentioning of AsciiDoc. How difficult is this, and how difficult is it to make the user learn this language?
			- Even if all your processes work, how am I supposed to get my source format into your XML?
			- And a step further: wouldn't it make more sense for me if I could just skip this step? Why not just convert from my source to the desired output?
			- Or do you want to make an XML word processor? And again: who is the target?
			It is very difficult to talk about such a project if the goals aren't really clear. Who should do what with it, and how? We only know a bit about the what. I won't go deeply into the discussion about free software (I don't know much about it). But to be honest: I don't think users go into the details of the difference between free, open, proprietary etc. I use whatever is best able to do the job. And feeling forced to switch my operating system or my software just to use "free" software has exactly the opposite effect on me. In my opinion, this does more harm to your cause than "supporting" the enemy. | 
|   |   | 
|  | 
| Tags | 
| automated processing, epub, pdf, xhtml, xml | 