Automated Processing Workflows as and with Free Software - Page 2

Arios · 02-03-2014, 12:23 PM

skreutzer, I do not know, but I think your project is very ambitious. I have no problem with that, but this could complicate things, at least initially.

By the way could you summarize point by point, in list form, the main aspects and goals of your project? (dickloraine has worded some good questions about it). You know, sometimes trees hide the forest!

I think the idea suggested by roger64 is very good, and if that is the case why not start with the 'easy' stuff?

Let me explain. Writer2LaTeX/Writer2xhtml has not been updated for some time, so why not start with this update if you can do it? The source code is available and can be modified under the GNU LGPL if I'm not mistaken. I am convinced that hardcore Writer2xhtml users will be happy to suggest what should be improved.

Another positive aspect of this suggestion is that it allows you to put aside the issue of free and non-free os/apps since LibreOffice or AOO are cross-platform.

Finally, another interesting thing to do would be to create, as you suggested yourself, a kind of wizard that guides users toward a better understanding of the enormous value of semantic encoding. With suitable templates and some macros, for example, the end user would be invited to create a project (within LO or AOO) step by step, and be shown what to do to properly encode his or her documents.

Tex2002ans · 02-03-2014, 06:03 PM

I have been a fan of this idea ever since you posted the first topic.. just have been a little busy with a few large projects.. so I haven't gotten the time to sit and write my usual in-depth tomes.

I stumbled upon this many months ago when looking up some EPUB3 information. This company called Infogrid Pacific works on one of the few EPUB3 reading programs, AZARDI. They also have a program out there called "Infogrid: Digital Publisher", which claims to do exactly this:

http://www.infogridpacific.com/DigitalPublisher.html

You use their program to create their intermediary, and then it allows you to output that one file into a wide range of output formats:

[Attached image: fod2012_online.png]

Their blog had some very useful information on EPUB3, and there are some good nuggets of information there on OCR/different formats (I haven't taken a look in about a year though.... I recall most of the posts being self-promotional):

http://www.infogridpacific.com/blog/

Maybe you might be able to gather some good ideas from their documentation/manuals/blogs/posts.

Quote:

Originally Posted by skreutzer

So in general it is a good idea to maintain a tool as command line tool in order to keep automatibility, and develop a GUI on top of it to make it user friendly, just as LyX does for LaTeX.

I have completed ~200 PDF (scans)->OCR->EPUB non-fiction economics books. All are coded with consistent CSS (only a few very minor variations per EPUB).

I actually spent a nice chunk of time in December looking for a way to go backwards from my consistent XHTML -> LaTeX -> PDF.

Jellby pointed me towards:

PrinceXML: http://www.princexml.com/
- XHTML + CSS = PDF
Not so Short Introduction to LaTeX: http://tug.ctan.org/tex-archive/info/lshort/english/
I stumbled upon LyX: http://www.lyx.org/
- GUI for creating LaTeX documents. Their documentation was also FANTASTIC... while testing it out, I was reminded a lot of Sigil.

(This is an ongoing project. At the pace I am going, getting distracted with more and more book conversions, this EPUB -> PDF research will probably take me years!)

Now, I see a few large problems:

Initial
- Getting consistent input.
  - As was mentioned... most of the documents out there are HORRORS.
    - Something like Toxaris's Macro for Microsoft Word is extremely helpful.
    - Something like Writer2EPUB is helpful.
  - I agree with your idea of having a LyX-type WYSIWYM editor. That would be ideal.
    - BUT, the thing is, getting people to use it. As Hitch mentioned........ there is just no damn way that authors are going to use it.
      - They are stuck in using Microsoft Word (horribly).
      - An extremely small minority might read and learn how to use Styles properly.
      - An even smaller minority might jump ship to an open source alternative like Libre/Open Office.
    - I can see this maybe being aimed as a tool for intermediaries, who can use the tool themselves to quickly clean/markup input... which will make their lives easier/faster.
- Output
  - What do you include? (Those marked with (?)... how are you going to mark these up?)
    - Headers
    - Paragraphs
    - Blockquotes
    - Left/Center/Right/Justified
    - Tables
    - Footnotes (?)
    - Poetry (?)
    - Pullquotes (?)
    - Indexes (?)
    - Figures (?)
    - Images (?)
      - Floating images (?)
      - SVG (?)
    - Captions (?)
    - Boxed text (?)
    - Math (?)
      - Formulas (?)
      - Fractions (?)
  - In non-fiction books, page numbers are a HUGE problem. "See Footnote 3 on page 5".
    - I see how LaTeX/LyX handles it, by placing tags/ids/references, but from what I gathered (I haven't tackled a proper EPUB->LaTeX->PDF conversion yet)... this will take a while to mark up properly, and make sure everything is correct.
Intermediate
- How in-depth do you want this intermediary to go?
  - Do you mark titles of books as a different class?
    - Title of Book
  - Do you mark foreign languages (which might need a different font/treatment... depending on the output format?)
    - Greek words
  - Do you mark down people's names?
    - First Last
  - This much in-depth markup will be extremely useful in an output format (let us say I wanted to use LaTeX to auto-generate an Index for me. Having a list of names/titles of books might be extremely helpful to have marked as different classes).
  - Going so in-depth, while it might be FANTASTIC in the long-run, will be a complete pain to initially mark everything up. (Which is why I avoid it).
    - The cost will go up prohibitively (As Hitch has mentioned, these conversions are expected to be done for pennies.)
  - If everything is marked up properly the first time, it will be a "one button press" conversion.... Although we understand this.... most of these authors just run it through Calibre, run it through some horrible automated system like Smashwords, or pay for a cheap crappy conversion.
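
To make the index idea concrete, here is a minimal Python sketch of what class-based markup would buy you. The class names "person" and "booktitle" and the sample names are made up for illustration, and it assumes well-formed XHTML; it is not how any existing tool does it.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def collect_index_terms(xhtml_text, class_name="person"):
    """Return the sorted, de-duplicated text of every <span> carrying class_name."""
    root = ET.fromstring(xhtml_text)
    terms = set()
    for span in root.iter(f"{{{XHTML_NS}}}span"):
        if class_name in span.get("class", "").split():
            terms.add("".join(span.itertext()).strip())
    return sorted(terms)

# Hypothetical input: the class names are assumptions, not an existing convention.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><span class="person">Jane Doe</span> reviewed '
    '<span class="booktitle">Title of Book</span>, and '
    '<span class="person">John Roe</span> replied to '
    '<span class="person">Jane Doe</span>.</p>'
    '</body></html>'
)

# Emit one LaTeX \index entry per distinct person.
for name in collect_index_terms(sample):
    print(f"\\index{{{name}}}")
```

Once the names are marked, the list comes for free; the expensive part is still the marking itself.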

Let me just reiterate, I am an extremely small minority of the users. (I am one of the few here who is paid to convert (most here do it for personal usage or as a hobby)).

Non-fiction is much harder/more complex than just handling your simple fictional work (which is probably the vast majority of writers getting books converted).

I try to push consistency across all of my books, so that it will make it way easier to swap things around if needed. For example, we had a ton of discussion in this topic about footnotes: https://www.mobileread.com/forums/sho...d.php?t=225045

I treat them the same across all my books, so I can easily just regex them if needed (early on I used to have superscript footnotes, now I have them in the [##] format).
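
As a rough illustration of that kind of regex swap, here is a Python sketch. The footnote markup shown (a link wrapped in <sup>) is only an assumed example, not the exact markup used in those books.

```python
import re

# Assumed (hypothetical) markup for an old-style superscript footnote reference:
# <sup><a href="#fn12" id="fnref12">12</a></sup>
SUP_NOTE = re.compile(r'<sup>(<a\b[^>]*>)(\d+)(</a>)</sup>')

def bracket_footnotes(xhtml):
    """Rewrite <sup><a ...>12</a></sup> as <a ...>[12]</a>."""
    return SUP_NOTE.sub(r'\1[\2]\3', xhtml)

old = '<p>Prices rose.<sup><a href="#fn12" id="fnref12">12</a></sup></p>'
print(bracket_footnotes(old))
```

This only works because the markup is identical in every book; a superscript without a link inside, such as an exponent, is left untouched.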

skreutzer · 02-03-2014, 08:40 PM

@dickloraine:

The company made a great effort in order to implement an automated processing workflow, which isn't different from what everybody in the field does. However, they saved themselves the much greater effort of hand-formatting all their titles, so in digital publishing (especially in volume) there aren't many alternatives to changing the old publishing process to an automated one. Their 250 styles represent 250 print layouts.

Target: At first myself, for my own book projects, and other authors/publishers as well. Whoever wants to save the time or costs caused by manual formatting.

Editing and writing are completely different tasks than the task of formatting and typesetting, so authors writing their unformatted manuscript and editors are not the target users, while they still may benefit from an automated processing workflow if they do formatting by themselves after they've finished writing, or if they have somebody to do the formatting for them. If authors apply formatting while writing, they should use style templates instead of direct formatting. It depends on the author or formatter, if a file is usable for an automated processing workflow. If it isn't, they won't benefit.
There's no need to reinvent the wheel. XHTML is a common export format for many writing applications and word processors, also it is omnipresent in web and e-book context, so most backends will take XHTML as input format, while other input formats with corresponding backends may be supported as well.

If somebody's text is in a custom XML format (with or without schema), he may do a transformation to the input format of the preferred backend. In case one wants or needs some special features, he could adjust an existing backend to support his custom XML format as input format.
Yes, EPUB and XHTML, back and forth and also in many other directions.
Schema validation can be integrated into the workflow, but it can also be left out (by simply omitting unexpected constructs of the input file).
If your XML format is in common use, it is most likely that converters to input XML formats already exist (and if not, they might be developed then). If it is your own custom XML format, you might implement a transformer yourself or get somebody to help you with it, since I guess you would also have your own custom software to read and write your custom XML format already.
Maybe you can't generate your desired output that way, and customizing complex software which isn't implemented as a series of individual small tools might be hard or time-consuming. If you can already generate your desired output under conditions acceptable to you, there's no need to change your system just for the sake of changing, is there? For instance, as for my book projects, manual work isn't an option.
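
To illustrate the point about leaving out schema validation by simply omitting unexpected constructs, here is a minimal Python sketch. The element names in ALLOWED are a hypothetical input vocabulary, not a real format, and the function is an assumption about how such a filter could look rather than a description of my implementation.

```python
import xml.etree.ElementTree as ET

ALLOWED = {"book", "chapter", "title", "p"}  # hypothetical input vocabulary

def drop_unexpected(elem):
    """Remove, in place, every descendant whose tag is not in ALLOWED.

    The text that follows a removed element (its tail) is kept, so the
    surrounding sentence stays intact.
    """
    previous = None
    for child in list(elem):
        if child.tag in ALLOWED:
            drop_unexpected(child)
            previous = child
            continue
        tail = child.tail or ""
        if previous is None:
            elem.text = (elem.text or "") + tail
        else:
            previous.tail = (previous.tail or "") + tail
        elem.remove(child)
    return elem

source = ("<book><title>T</title><chapter><p>Hello <blink>x</blink>world</p>"
          "<sidebar>s</sidebar></chapter></book>")
cleaned = drop_unexpected(ET.fromstring(source))
print(ET.tostring(cleaned, encoding="unicode"))
```

The trade-off is that content inside unknown elements is lost silently, so a real workflow would probably want to log what was dropped.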

Yes, I have to invest more time in developing an automated processing workflow than I would have to invest into doing the manual work for one project, but as there are several projects of different sizes, certainly at some time there would be a point of break-even, so once solved, always solved. Additionally, not only I myself need automation, other people might need it as well, and sharing solutions with them and collaborating in developing solutions saves time for everybody in comparison with manual work or reinventing the wheel and keeping it unshared, proprietary, non-free.
Seems like a lot of existing word processors can already be used to export XML, since XHTML is XML. I thought about providing ordinary writers a JavaScript based markup tool with XHTML output, but such a thing has a high probability to already exist, and at the moment I guess predefined template documents for OpenOffice/LibreOffice are quite a good way to encourage ordinary authors to apply semantic markup, so their output can feed directly into an automated processing workflow.

I would absolutely go into the discussion about free software, because there are already lots and lots of proprietary, non-free "solutions" around, which you are neither able/allowed to access nor technically able to use because of the dependency on a proprietary environment and proprietary tools. As for self-publishers, it seems there is no solution available at all at the moment which can be used independently.

You definitely should learn more about free software. Proprietary software might at any time prevent you from doing the job. The "open" approach risks becoming proprietary at any time. You aren't forced to switch your operating system, to the contrary - there are so many ports of free software to proprietary operating systems available, that one gets almost all benefits from the free world even in a proprietary environment (since free software isn't discriminatory and can freely be used and modified, no matter what the operating system environment might be), while proprietary software is completely unusable on several levels in the free software world due to artificial restrictions.

As it is very likely that all software used within the automated processing workflows is also available on proprietary systems and the workflow implementation itself can be easily ported to such, you should not expect that I myself have any interest in doing so and wasting time on things which are quite contrary to the initial goal of the project, which is developing automated processing workflows with and as free software. If such a workflow is dependent on proprietary software, there's absolutely no way to run it while still respecting the digital freedoms of the user, since the user is forced to use non-freedom-respecting software in order to run the free software.

@Arios:

I don't think it is too ambitious, since I've already implemented such an automated processing workflow in the past for one of my own projects (in case you haven't found the link in the Sigil thread): http://www.freie-bibel.de/official/p...lisierung.html (unfortunately, the text itself is in German, but you might still look at the images). I'm now only trying to generalize it while making it as easy as possible to use, which will take some time and some effort, but I'm confident that over time some progress can be made.

You are right, I should try harder to communicate the concept and idea behind what kind of solution I have in mind, since it almost looks repetitive in this discussion what the details of such an implementation would be. Usually, I would like best to just do the programming and demonstrate the results as a real-world application, but on the other hand I started this thread to ask some questions, give some updates, etc., so some kind of project description could be quite useful for people who just want to look into whether they could make use of an automated processing workflow or not.

I'm actually planning to provide support for writer2xhtml standalone. Writer2latex doesn't seem that usable to me at the moment, since it requires ODT (which is XML) as input, so in order to make use of a backend, an application would be required to write ODT or to convert to it. I would consider LaTeX as "irreversible" output format since it isn't XML and can't directly be used in an XML based workflow (while LaTeX to XHTML or LaTeX to ODT doesn't seem to be high-priority). At the moment, I'm aiming for ODT to XHTML and XHTML to EPUB and LaTeX (or XSL-FO), so the latter part can even be used with XHTML input from any application and without requiring ODT, while EPUB and LaTeX output would be available for ODT and XHTML sources.
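
To give an idea of what the XHTML to LaTeX part could look like in principle, here is a minimal Python sketch. It is not my actual implementation: the TEMPLATES table, the function names and the tiny set of supported elements are assumptions for illustration, and it expects XHTML without whitespace between block elements.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical style implementation: edit a template here to restyle the output
# without touching the source document.
TEMPLATES = {
    "h1": "\\chapter{%s}\n\n",
    "p": "%s\n\n",
    "em": "\\emph{%s}",
    "blockquote": "\\begin{quote}\n%s\\end{quote}\n\n",
}

LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

def escape(text):
    """Escape characters that have a special meaning in LaTeX."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text or "")

def to_latex(elem):
    """Recursively render an element; unknown elements just pass their content through."""
    inner = escape(elem.text) + "".join(
        to_latex(child) + escape(child.tail) for child in elem
    )
    tag = elem.tag.replace(XHTML, "")
    return TEMPLATES.get(tag, "%s") % inner

doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h1>Costs &amp; Prices</h1><p>A <em>very</em> short text.</p>"
    "</body></html>"
)
print(to_latex(doc))
```

Because the input is only XHTML, the same step works no matter whether the file came from an ODT conversion or from any other application.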

Free software isn't a mere technical issue, but to look at the technical aspect: as you know, LibreOffice is written in Java, so there's a need to have a JavaVM implemented for a proprietary operating system, which is a competing product to the VMs of proprietary operating systems, so the operating system environments might make it artificially difficult for a Java VM implementation.

If Oracle decides to discontinue its proprietary VM implementations for the proprietary operating systems, all freely licensed Java software will effectively become unusable on proprietary operating systems, leaving only the free implementation of a Java VM (OpenJDK) intact - and I guess there wouldn't be much interest in porting it to the proprietary operating systems for obvious reasons (lots of work, no gain at all, contrary to the goals of free software). So the LibreOffice ports to proprietary systems are in constant danger of losing their technical foundation, and if LibreOffice can't be supported any longer on proprietary operating systems, just guess what its users will do then: they'll switch to a proprietary word processor. Since Java bytecode is portable, Java programs are automatically cross-platform, but it still depends on licensing whether you are allowed to use, modify, and distribute a program or not.

To demonstrate the benefits of semantic encoding, especially if combined with automated processing, I would provide a set of default layouts with the automated processing workflow, including ODT document templates with predefined styles. Then I would show how text in the ODT documents could be easily rendered into PDF and EPUB without manual adjustments, as long as the template styles are supported by the backend and the template styles get applied semantically in the ODT document (additionally showing the opportunity that the visual appearance in the generated files can easily be changed without the need of manual adjustments to the ODT by just changing the style implementation in the backend).

I observed that some of the self-publishers use predefined Microsoft Word templates for print formatting, but they have to do e-book formatting separately (and keep both in sync), or they do both by manual direct formatting. I haven't got a better idea to show the benefits of semantic encoding than demonstrating that these manual efforts aren't necessary if semantic formatting is applied in the first place.

@Tex2002ans:

I know a pretty good EPUB3 reading software: it's called "webbrowser" ;-) No, seriously, just display the TOC in a side pane of a browser window, that's it.

"Infogrid: Digital Publisher": yes, exactly such an automated processing workflow like they have, but for everyone as free software.

Regarding your other observations: for the input, at some point in time, somebody has to apply formatting to a text in order to prepare it for output generation. At this point, whoever and whenever it is, one has to decide between direct formatting and semantic markup. If the decision is made in favor of semantic markup, it will be beneficial for the person who made the decision, and if not, this person will exclude himself from the benefits of automated processing. It can be the author, it can be an intermediary. For the output, I don't think a "one size fits all" approach would be the best way to handle it, I would rather develop several individual processors for different features and layouts. Initially, I would aim for basic book features, and maybe there might be more complex document representation as well (probably by integration of more sophisticated software that already exists for such tasks?).

However, as you mentioned the specific example of referencing, such things aren't difficult at all for automated processing, and as EPUB->LaTeX->PDF is basically XHTML->LaTeX->PDF, I already do something very similar quite often (however, not with the goal of representing the XHTML visual appearance of a browser in the PDF). For your question on an intermediate format: there might be lots of intermediate formats, some probably more complex, some more simple. If somebody wants to implement complex output, he also defines what the input should look like (which markup is supported by the backend), so input files can be transformed to this "publicly specified interface" of the backend in order to produce the output, be it by one intermediate step or several intermediate steps as part of a larger workflow. Just combine some scripts that do the transformations or add data/structure. However, I've started pretty simple with only basic features, and will expand from there as needed.
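
For the referencing case, the general idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name is made up, the input is namespace-free for brevity, and LaTeX escaping of the text is left out.

```python
import xml.etree.ElementTree as ET

def refs_to_latex(elem):
    """Render inline content, mapping ids to \\label and internal links
    to \\pageref, so that LaTeX fills in the page numbers by itself."""
    out = []
    if elem.get("id"):
        out.append("\\label{%s}" % elem.get("id"))
    out.append(elem.text or "")
    for child in elem:
        href = child.get("href", "")
        if child.tag == "a" and href.startswith("#"):
            # Internal link: keep the link text and add the target's page number.
            out.append("%s (page~\\pageref{%s})" % (refs_to_latex(child), href[1:]))
        else:
            out.append(refs_to_latex(child))
        out.append(child.tail or "")
    return "".join(out)

note = ET.fromstring('<p id="fn3">The third note.</p>')
para = ET.fromstring('<p>See <a href="#fn3">footnote 3</a> for details.</p>')
print(refs_to_latex(note))
print(refs_to_latex(para))
```

The ids and links that already exist in the XHTML carry all the information needed, so no page number ever has to be typed by hand.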

Consistency across all processed books will automatically be guaranteed, since the processing workflow won't change its internals by itself ;-) If the initial layout definition is done right in the first place, I also think this is an advantage for the ordinary writer to get quality layout, eliminating the risk of accidentally inserting errors into the design.

PeterT · 02-03-2014, 10:33 PM

One minor suggestion. I realize English is not your first language but you might want to consider shorter paragraphs in your posts.

While I might be interested in what you are saying, I do find my eyes glazing over at the walls of text I see in your posts.

skreutzer · 02-09-2014, 09:05 AM

Thanks for your hint! I'm used to reading long texts, so I tend to write long texts. In any case, I'm glad to improve any of my posts, please just notify me (probably by PM, if you don't want to use a discussion thread for it).

roger64 · 02-09-2014, 10:01 PM

Quote:

Originally Posted by skreutzer

.../...
There's no need to reinvent the wheel. XHTML is a common export format for many writing applications and word processors, also it is omnipresent in web and e-book context, so most backends will take XHTML as input format, while other input formats with corresponding backends may be supported as well.
.../...
I'm actually planning to provide support for writer2xhtml standalone.
.../...

This is great!

Tex2002ans · 02-09-2014, 11:13 PM

Quote:

Originally Posted by skreutzer

Thanks for your hint! I'm used to reading long texts, so I tend to write long texts. In any case, I'm glad to improve any of my posts, please just notify me (probably by PM, if you don't want to use a discussion thread for it).

I don't mind the long tomes. (I write them myself!)

But it does make it look less daunting if you actually use the quote boxes and answer each thing in chunks!

Quote:

Originally Posted by skreutzer

However, as you mentioned the specific example of referencing, such things aren't difficult at all for automated processing, and as EPUB->LaTeX->PDF is basically XHTML->LaTeX->PDF, I already do something very similar quite often (however, not with the goal of representing the XHTML visual appearance of a browser in the PDF).

Hmmm... I would be interested in how you handle this in the automated PDF stage (particularly hyphenation/linebreaks/ragged edges/widows/orphans). Let me know more via PM.

SBT · 02-11-2014, 07:12 AM

What about TEI? Popular, loads of tools, XSLT for EPUB Conversion?

Arios · 02-13-2014, 03:39 PM

skreutzer,

Sorry to be so late and thanks for your reply (post # 18).

I don't know you, skreutzer, but I'm sure you can do it. So "ambitious" in my previous post was not the right term. Sorry about that. I'd say now: "too many details". And, by the way, your project is very interesting.

So my request is just to ask you to briefly point out the main elements of your project:

Background: I'm (you in fact!) a C++ programmer and
I plan to work exclusively within open source OS and apps.
The main goal of my project is to...
...
I have submitted my project here on mobileread to know if there is some interest in it; I hope to get feedback and testing from you guys.
...

Do you see what I ask for?

Cheers!

Tex2002ans · 02-20-2014, 02:41 PM

While looking through some articles, I stumbled upon these two, which I thought might be of interest in the case of automating book workflows. Perhaps you might be able to gather some gems from this discussion/articles:

http://programming.oreilly.com/2013/...uthorship.html
http://www.balisage.net/Proceedings/...einfeld01.html

Glad to see that other great minds think alike, with the "digital first" and then work backwards to print... instead of the dreadful waste that is happening currently by going the other way around.

skreutzer · 02-22-2014, 10:10 AM

Quote:

Originally Posted by SBT

What about TEI? Popular, loads of tools, XSLT for EPUB Conversion?

Yes, TEI is fine for automated processing workflows, but neither is TEI an input format (I never received TEI from anybody, also word processors don't support it as output format), nor an output format (no TEI readers). But TEI could be used as intermediate format, so I'll certainly look into the TEI processing tools. It might save the time to implement PDF and EPUB generators, and if TEI tools can be integrated into an automated workflow which is easy to set up and to use, I guess it could be quite advantageous.

Quote:

Originally Posted by Arios

So my request is just to ask you to briefly point out the main elements of your project.

Background: C/C++, Java, Web (PHP, JavaScript) and other stuff.
I plan to license exclusively under free software licenses (GNU AGPL3 or later for software, CC BY-SA 4 for non-software which I write initially), and to use and contribute to software under FSF-approved free software licenses.
The main goal of my project is to develop an automated digital publishing workflow, mostly based on XML processing. It is intended to eliminate the current practice of setting type by hand in the digital age, which is caused by direct formatting for each desired output format.
I haven't submitted my project here on mobileread, but requested some hints in the forum regarding tools which can be used for such a workflow. I started to generalize the methods of my Bible digitalization and reproduction project 1 2 3, with some results published in my public personal git repository.

Note: as mentioned, this freely licensed software can also be used (with no or little adjustments) in non-free, proprietary environments (which is part of the digital freedoms a user deserves), while it could be hard or even impossible to integrate non-free, proprietary tools into the free workflow or free environment, if it doesn't at least support some open formats, protocols, etc.

Quote:

Originally Posted by Tex2002ans

While looking through some articles, I stumbled upon these two, which I thought might be of interest in the case of automating book workflows.

Thank you very much, I'll read them and maybe comment. There's a lot going on in the field of digital publishing, so I'm glad that you've pointed me to those articles, which I otherwise certainly would have missed.

In the meantime I've started a little Java GUI programming, so there's now a metadata editor for the configuration file of my EPUB generator. I plan to add another GUI helper for the configuration file, so that the entire EPUB conversion can be managed not only by an automated processing workflow, but also by ordinary users. I would like to set up a basic processing workflow, both usable automatically and by GUI, so that real texts can be processed with it.

I found out that OpenOffice/LibreOffice can be configured in a way that semantic markup can be applied quite conveniently. Further, I'm involved in automatically processing output generated by a tool which reads a Wiki software, so Wikis as online front-ends seem to be pretty convenient, probably semantic, text editors for special kinds of writing activity.

I also found a freely licensed automated processing workflow called Booktype, which I will investigate as well as the TEI processing tools. Probably it is easier than expected to provide a freely licensed automated processing workflow, by just combining what's already there into some kind of a "single installer", and by making it easy to use for everybody (both ordinary users as well as professional formatters/typesetters). And sorry, no new River Valley TV video link, my progress is too slow and the videos are too interesting for me personally, even when not directly related to the topic of automated digital publishing.

skreutzer · 03-13-2014, 01:26 PM

I haven't made any progress in the fields mentioned above lately because I was busy with implementing GUI helpers for a first primitive automated processing workflow, see http://vimeo.com/89003773 (only the pictures if you don't speak German). This should demonstrate how the automated workflow could be used manually. From there, I might extend it by PDF generation, ODT to XHTML integration, layout definition etc.

Arios · 03-13-2014, 04:15 PM

skreutzer,

Thanks for your reply # 26 (I'm so slow sometimes!).

Now things are clearer.

Tex2002ans · 03-13-2014, 06:40 PM

Quote:

Originally Posted by skreutzer

I haven't made any progress in the fields mentioned above lately because I was busy with implementing GUI helpers for a first primitive automated processing workflow, see http://vimeo.com/89003773 (only the pictures if you don't speak German). This should demonstrate how the automated workflow could be used manually.

Fantastic start, keep up the good work!

skreutzer · 04-03-2014, 06:37 PM

Some pretty basic XHTML to LaTeX to PDF conversion added and demonstrated by a workflow based upon the principle of Single Source Publishing:

https://vimeo.com/90901780

Sorry, only in German language :-(

02-03-2014, 08:40 PM	#18
skreutzer Software Developer Posts: 189 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3	@dickloraine: The company made a great effort in order to implement an automated processing workflow, which isn't different from what everybody in the field does. However, they saved themselves much more effort of hand formatting all their titles, so in digital publishing (especially in volume) there aren't that much alternatives than to change the old publishing process to an automated one. Their 250 styles represent 250 print layouts. Target: At first I myself for my own book projects, and other autors/publishers as well. Whoever wants to save time or costs caused by manual formatting. Editing and writing are completely different tasks than the task of formatting and typesetting, so authors writing their unformatted manuscript and editors are not the target users, while they still may benefit from an automated processing workflow if they do formatting by themselves after they've finished writing, or if they have somebody to do the formatting for them. If authors apply formatting while writing, they should use style templates instead of direct formatting. It depends on the author or formatter, if a file is usable for an automated processing workflow. If it isn't, they won't benefit. There's no need to reinvent the wheel. XHTML is a common export format for many writing applications and word processors, also it is omnipresent in web and e-book context, so most backends will take XHTML as input format, while other input formats with corresponding backends may be supported as well. If somebody's text is in a custom XML format (with or without schema), he may do a transformation to the input format of the preferred backend. In case one wants or needs some special features, he could adjust an existing backend to support his custom XML format as input format. Yes, EPUB and XHTML, back and forth and also in many other directions. 
Schema validation can be integrated into the workflow, but it can also be left out (by simply omitting unexpected constructs of the input file). If your XML format is in common use, it is most likely that converters to input XML formats already exist (and if not, they might be developed then). If it is your own custom XML format, you might implement a transformator yourself or get somebody to help you with it, since I guess you would also have your own custom software to read and write your custom XML format already. Maybe you can't generate your desired output that way, and customizing a complex software which isn't implemented as a series of individual small tools might be hard or time-consuming. If you already can generate your desired output at conditions acceptable for you, there's no need to change your system just for the sake of changing, is it? For instance, as for my book projects, manual work isn't an option. Yes, I have to invest more time in developing an automated processing workflow than I would have to invest into doing the manual work for one project, but as there are several projects of different sizes, certainly at some time there would be a point of break-even, so once solved, always solved. Additionally, not only I myself need automation, other people might need it as well, and sharing solutions with them and collaborate in developing solutions saves time for everybody in comparison with manual work or reinventing the wheel and keep it unshared, proprietary, non-free. Seems like a lot of existing word processors can already be used to export XML, since XHTML is XML. I thought about providing ordinary writers a JavaScript based markup tool with XHTML output, but such a thing has a high probability to already exist, and at the moment I guess predefined template documents for OpenOffice/LibreOffice are quite a good way to encourage ordinary authors to apply semantic markup, so their output can feed directly into an automated processing workflow. 
I would absolutely go into the discussion about free software, because there are already lots and lots of proprietary, non-free "solutions" around, which you are neither able/allowed to access nor technically able to use, because they depend on a proprietary environment and proprietary tools. As for self-publishers, it seems there is no solution available at all at the moment which can be used independently.

You should definitely learn more about free software. Proprietary software might at any time prevent you from doing the job. The "open" approach risks becoming proprietary at any time. You aren't forced to switch your operating system; on the contrary, there are so many ports of free software to proprietary operating systems available that one gets almost all the benefits of the free world even in a proprietary environment (since free software isn't discriminatory and can freely be used and modified, no matter what the operating system environment might be), while proprietary software is completely unusable on several levels in the free software world due to artificial restrictions.

Although it is very likely that all software used within the automated processing workflows is also available on proprietary systems and that the workflow implementation itself can easily be ported to them, you should not expect me to have any interest in doing so and wasting time on things which are quite contrary to the initial goal of the project, which is developing automated processing workflows with and as free software. If such a workflow depends on proprietary software, there's absolutely no way to run it while still respecting the digital freedoms of the user, since the user is forced to use non-freedom-respecting software in order to run the free software.
@Arios: I don't think it is too ambitious, since I've already implemented such an automated processing workflow in the past for one of my own projects (in case you haven't found the link in the Sigil thread): http://www.freie-bibel.de/official/p...lisierung.html (unfortunately, the text itself is in German, but you might still look at the images). I'm now only trying to generalize it while making it as easy as possible to use, which will take some time and effort, but I'm confident that over time some progress can be made.

You are right, I should try harder to communicate the concept and idea behind the kind of solution I have in mind, since the details of such an implementation almost seem to get repeated in this discussion. Usually, I would prefer to just do the programming and demonstrate the results as a real-world application, but on the other hand I started this thread to ask some questions, give some updates, etc., so some kind of project description could be quite useful for people who just want to find out whether they could make use of an automated processing workflow or not.

I'm actually planning to provide support for writer2xhtml standalone. Writer2latex doesn't seem that usable to me at the moment, since it requires ODT (which is XML) as input, so in order to make use of a backend, an application would be required to write ODT or to convert to it. I would consider LaTeX an "irreversible" output format since it isn't XML and can't be used directly in an XML-based workflow (while LaTeX to XHTML or LaTeX to ODT doesn't seem to be high-priority). At the moment, I'm aiming for ODT to XHTML and XHTML to EPUB and LaTeX (or XSL-FO), so the latter part can even be used with XHTML input from any application and without requiring ODT, while EPUB and LaTeX output would be available for ODT and XHTML sources.
Free software isn't a mere technical issue, but to look at the technical aspect: as you know, LibreOffice relies on Java for parts of its functionality, so a Java VM has to be implemented for a proprietary operating system. That is a competing product to the VMs of the proprietary operating systems, so those operating system environments might make things artificially difficult for a Java VM implementation. If Oracle decides to discontinue its proprietary VM implementations for the proprietary operating systems, all freely licensed Java software will effectively become unusable on proprietary operating systems, leaving only the free implementation of a Java VM (OpenJDK) intact, and I guess there wouldn't be much interest in porting it to the proprietary operating systems for obvious reasons (lots of work, no gain at all, contrary to the goals of free software). So the LibreOffice ports to proprietary systems are in constant danger of losing their technical foundation, and if LibreOffice can't be supported any longer on proprietary operating systems, just guess what its users will do then: they'll switch to a proprietary word processor. Since Java bytecode is portable, Java programs are automatically cross-platform, but whether you are allowed to use, modify, and distribute a program or not still depends on licensing.

To demonstrate the benefits of semantic encoding, especially if combined with automated processing, I would provide a set of default layouts with the automated processing workflow, including ODT document templates with predefined styles.
Then I would show how text in the ODT documents could easily be rendered into PDF and EPUB without manual adjustments, as long as the template styles are supported by the backend and are applied semantically in the ODT document (additionally showing that the visual appearance of the generated files can easily be changed, without manual adjustments to the ODT, just by changing the style implementation in the backend). I observed that some self-publishers use predefined Microsoft Word templates for print formatting, but they have to do e-book formatting separately (and keep both in sync), or they do both by manual direct formatting. I haven't got a better idea for showing the benefits of semantic encoding than demonstrating that these manual efforts aren't necessary if semantic formatting is applied in the first place.

@Tex2002ans: I know a pretty good EPUB3 reading software: it's called "webbrowser" ;-) No, seriously, just display the TOC in a side pane of a browser window, that's it. "Infogrid: Digital Publisher": yes, exactly such an automated processing workflow as they have, but for everyone as free software.

Regarding your other observations: for the input, at some point in time, somebody has to apply formatting to a text in order to prepare it for output generation. At this point, whoever and whenever it is, one has to decide between direct formatting and semantic markup. If the decision is made in favor of semantic markup, it will be beneficial for the person who made the decision, and if not, this person will exclude himself from the benefits of automated processing. It can be the author, it can be an intermediary.

For the output, I don't think a "one size fits all" approach would be the best way to handle it; I would rather develop several individual processors for different features and layouts.
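The point made in this post, that semantically styled text can be given a different appearance just by swapping the style implementation in the backend, can be sketched roughly as follows. This is only an illustration under stated assumptions: the style names (`Heading`, `Body`, `Quote`), the two layouts, and the `render` function are hypothetical and are not taken from any ODT template or existing backend.

```python
from html import escape
from string import Template

# Hypothetical style names, standing in for paragraph styles of an ODT template.
PRINT_LIKE = {
    "Heading": Template('<h1 class="centered">$text</h1>'),
    "Body": Template('<p class="indented">$text</p>'),
    "Quote": Template("<blockquote><p>$text</p></blockquote>"),
}

SCREEN_LIKE = {
    "Heading": Template("<h2>$text</h2>"),
    "Body": Template("<p>$text</p>"),
    "Quote": Template('<p class="quote">$text</p>'),
}


def render(paragraphs, layout):
    """Render (style, text) pairs with the given layout definition."""
    parts = []
    for style, text in paragraphs:
        if style not in layout:
            # Only styles known to the backend can be processed automatically.
            raise ValueError("style %r is not supported by this layout" % style)
        parts.append(layout[style].substitute(text=escape(text)))
    return "\n".join(parts)


# The manuscript carries only semantic style names, no direct formatting.
MANUSCRIPT = [
    ("Heading", "Chapter 1"),
    ("Body", "Tom & Jerry arrive."),
    ("Quote", "Once solved, always solved."),
]

if __name__ == "__main__":
    print(render(MANUSCRIPT, PRINT_LIKE))
    print(render(MANUSCRIPT, SCREEN_LIKE))
```

The manuscript stays untouched between the two calls; only the layout dictionary changes. Direct formatting would have no style name to look up, which is why it cannot take part in this kind of processing.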
Initially, I would aim for basic book features, and maybe there will be more complex document representations as well (probably by integrating more sophisticated software that already exists for such tasks?). However, as you mentioned the specific example of referencing, such things aren't difficult at all for automated processing, and as EPUB->LaTeX->PDF is basically XHTML->LaTeX->PDF, I already do something very similar quite often (however, not with the goal of reproducing in the PDF the visual appearance the XHTML has in a browser).

For your question on an intermediate format: there might be lots of intermediate formats, some probably more complex, some simpler. If somebody wants to implement complex output, he also defines what the input should look like (which markup is supported by the backend), so input files can be transformed to this "publicly specified interface" of the backend in order to produce the output, be it by one intermediate step or several intermediate steps as part of a larger workflow. Just combine some scripts that do the transformations or add data/structure. However, I've started pretty simple with only basic features, and will expand from there as needed.

Consistency across all processed books will automatically be guaranteed, since the processing workflow won't change its internals by itself ;-) If the initial layout definition is done right in the first place, I also think it is an advantage for the ordinary writer to get a quality layout, eliminating the risk of accidentally inserting errors into the design.

Last edited by skreutzer; 02-09-2014 at 09:03 AM.
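The suggestion in the post above to "just combine some scripts that do the transformations" amounts to chaining small, independent steps. A minimal sketch of that idea follows. The three steps and the name `run_workflow` are invented for the illustration, and the regular expressions are a deliberately crude stand-in for real XML processing.

```python
import re
from functools import reduce

# Each step takes a document as a string and returns the transformed string.


def strip_comments(doc):
    """Remove XML comments (crude, regex-based, for illustration only)."""
    return re.sub(r"<!--.*?-->", "", doc, flags=re.DOTALL)


def collapse_whitespace(doc):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", doc).strip()


def wrap_in_body(doc):
    """Add the structure the next stage is assumed to expect."""
    return "<body>" + doc + "</body>"


def run_workflow(document, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), steps, document)


WORKFLOW = [strip_comments, collapse_whitespace, wrap_in_body]

if __name__ == "__main__":
    source = "<p>Hi</p> <!-- note -->\n\n<p>there</p>"
    print(run_workflow(source, WORKFLOW))
```

Because every step is a plain function of its input, running the same workflow on the same file gives the same result each time, and a step can be replaced or inserted without touching the others.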

02-13-2014, 03:39 PM	#24
Arios

skreutzer,

Sorry to be so late, and thanks for your reply (post # 18).

I don't know you, skreutzer, but I'm sure you can do it. So "ambitious" in my previous post was not the right term. Sorry about that. I'd say now: "too many details". And, by the way, your project is very interesting.

So my request is just to ask you to briefly point out the main elements of your project:

Background: I'm (you in fact!) a C++ programmer and I plan to work exclusively within open source OS and apps.

The main goals of my project are to...

...

I have submitted my project here on mobileread to find out whether there is some interest in it; I hope to get feedback and testing from you guys.

...

Do you see what I'm asking for?

Cheers!



02-03-2014, 10:33 PM	#19
PeterT

One minor suggestion. I realize English is not your first language, but you might want to consider shorter paragraphs in your posts. While I might be interested in what you are saying, I do find my eyes glazing over at the walls of text I see in your posts.

02-09-2014, 09:05 AM	#20
skreutzer

Thanks for your hint! I'm used to reading long texts, so I tend to write long texts. In any case, I'm glad to improve any of my posts; please just notify me (probably by PM, if you don't want to use a discussion thread for it).

02-11-2014, 07:12 AM	#23
SBT

What about TEI? Popular, loads of tools, XSLT for EPUB conversion?

02-20-2014, 02:41 PM	#25
Tex2002ans

While looking through some articles, I stumbled upon these two, which I thought might be of interest for automating book workflows. Perhaps you might be able to gather some gems from these discussions/articles:

http://programming.oreilly.com/2013/...uthorship.html
http://www.balisage.net/Proceedings/...einfeld01.html

Glad to see that other great minds think alike, with the "digital first" approach and then working backwards to print... instead of the dreadful waste that is happening currently by going the other way around.

03-13-2014, 01:26 PM	#27
skreutzer

I haven't made any progress in the fields mentioned above lately because I was busy implementing GUI helpers for a first primitive automated processing workflow; see http://vimeo.com/89003773 (just watch the pictures if you don't speak German). This should demonstrate how the automated workflow could be used manually. From there, I might extend it with PDF generation, ODT to XHTML integration, layout definition, etc.

03-13-2014, 04:15 PM	#28
Arios

skreutzer, Thanks for your reply # 26 (I'm so slow sometimes!). Now things are clearer.

04-03-2014, 06:37 PM	#30
skreutzer

Some pretty basic XHTML to LaTeX to PDF conversion has been added and is demonstrated by a workflow based on the principle of Single Source Publishing: https://vimeo.com/90901780 Sorry, only in German :-(
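For readers who cannot follow the German video, here is a very rough sketch of what the XHTML to LaTeX step of such a single-source workflow might look like. It is not the code shown in the video: the element mappings, the `book` document class, and the function names are assumptions chosen for the example, and only a handful of elements are handled.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Characters that have a special meaning in LaTeX and need escaping.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

# Assumed mappings; a real backend would define its supported markup itself.
INLINE = {"em": r"\emph{%s}", "strong": r"\textbf{%s}"}
BLOCK = {"h1": "\\chapter{%s}\n\n", "h2": "\\section{%s}\n\n", "p": "%s\n\n"}


def escape(text):
    """Escape LaTeX special characters; None becomes an empty string."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text or "")


def convert(node):
    """Convert one XHTML element and its children to LaTeX source."""
    name = node.tag.replace(XHTML, "")
    if name == "head":
        return ""  # metadata is ignored in this sketch
    inner = escape(node.text) + "".join(
        convert(child) + escape(child.tail) for child in node
    )
    if name in INLINE:
        return INLINE[name] % inner
    if name in BLOCK:
        return BLOCK[name] % inner.strip()
    return inner  # containers and unknown elements: pass the content through


def xhtml_to_latex(source):
    """Return a complete LaTeX document for an XHTML string."""
    body = convert(ET.fromstring(source))
    return "\\documentclass{book}\n\\begin{document}\n" + body + "\\end{document}\n"


if __name__ == "__main__":
    sample = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        "<h1>Costs &amp; Time</h1><p>Save <em>100%</em> of it.</p>"
        "</body></html>"
    )
    print(xhtml_to_latex(sample))
```

The resulting `.tex` file would then be handed to a LaTeX engine to produce the PDF, and the same XHTML source could go to an EPUB packager, which is the single-source idea.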