Automated Processing Workflows as and with Free Software - Page 3

roger64 · 04-03-2014, 11:27 PM

Hi

Thank you for this. German is not a problem. But Vimeo seems to be a real one in China...

skreutzer · 04-05-2014, 10:54 AM

Thank you very much for pointing out this issue! One of the most important goals of the stuff I do is to make sure that nobody ever gets excluded, because the basic freedom of unrestricted access plays an important role in the foundation of a free (digital) society. Since both Vimeo and YouTube are censored in China, here are copies on my own personal homepage, which hopefully isn't:

http://www.skreutzer.de/digitales_pu...1_workflow.ogv
http://www.skreutzer.de/digitales_pu...1_workflow.ogv

I use Vimeo and YouTube primarily for the outsourcing of video files and for SEO. I could host the files myself, but I don't want to waste the resources if chances are low that people would find them there, so I only do it on request, if it makes sense.

Please note that the current links don't provide any sort of anonymity or encryption. Since I don't convert the video files into a proprietary video format, you need a video player that can handle Ogg Theora, the most open video format, for which freely licensed player software is commonly available.

Another option is that I write short articles about the software, as blog posts or as tutorial/description, only containing text and screenshots, maybe even in English. This way, much less bandwidth and webspace would be wasted to convey the same information.

roger64 · 04-06-2014, 06:15 AM

Hi

Thank you very much for providing an alternative link which did work from China (all in all, I needed two hours to download your video, i.e. 100 minutes for the first three minutes, but the remaining part in 20 minutes).

Your 142 MB video file clearly demonstrates how, from a clean HTML file (book), your bash script can produce both a clean EPUB and a clean PDF, including metadata thanks to a second bash script. You also show how you can add some text and a cover to the original file and again produce a complete EPUB and PDF.

So congratulations on this first but quite important step.

What will be the next step? When do you plan to release your script(s)?

skreutzer · 04-07-2014, 02:08 PM

Actually the toolset consists of several individual standalone programs written in Java (in order to be portable over computer architecture and operating system environments), and the non-portable shell scripts are currently used to glue them together.

Those tools are already released, because I do open development in a GitHub repository (hopefully not blocked in China):

https://github.com/skreutzer/automat...tal_publishing

I don't provide pre-built packages yet. If you're interested in trying those tools out, you have the option to build the executables yourself (an installed JavaSDK would be required), or I could release pre-built ones (an installed JavaVM would be required). Further, the shell scripts might not be portable across all kinds of operating system environments, but you could still use the tools manually. For PDF generation, a LaTeX installation is required, since LaTeX is the only supported backend up to now (XSL-FO and iText may follow).

The software is licensed under the GNU Affero General Public License version 3 or any later version, so you're permitted to do:

- Run the program without any restrictions.
- Modify and customize the program, as long as nobody else uses it except you.
- Distribute the program to other people, commercially or non-commercially, as long as you ensure that receivers can obtain the source code.
- Modify the program and distribute the modified version to other people, commercially or non-commercially, as long as you ensure that receivers can obtain the source code, including all modifications.
- Use the program, with or without modifications you made to it, together with extensions you wrote for yourself, which you can release under a license of your choice, but I want to strongly encourage you to use GNU AGPL 3 or later.
- Run the program on a server which is accessible through a network in order to provide an online service, as long as you ensure that receivers can obtain the source code of the program you're running (with or without modifications).

You are prohibited from:

- Distributing executables without providing access to the corresponding source code.
- Removing copyright notices.
- Sublicensing files which were covered by the GNU AGPL 3 or later to other people under a different license than GNU AGPL 3 or later.
- Running the program on a server which is accessible through a network without providing access to the corresponding source code you're running on that server.
- Distributing the program in devices which are set up in a way that users can't run their modified version of the program on them, while the device is generally able to load and execute custom programs.

This way, the license ensures that the software can be used to the greatest extent possible, only preventing you from actions which would violate the digital rights of other users. The license is intended to promote open collaboration and contribution to a free society, and I would really appreciate it if you would improve the software, as each contribution, be it a small one or a big one, would be to the benefit of all users and developers.

Please note that the current features of the software aren't special at all. They're quite common for publishing houses and institutes that use proprietary software, and there are already much more sophisticated solutions available under a free license:

- Booktype by Sourcefabric
- Softcover

Maybe it could be possible to interoperate with those systems, while my concept is a little different: the processing tools should also be usable outside a large and complex processing workflow, so that they can be used for small tasks as well and re-arranged with a higher flexibility for all kinds of custom and general tasks.

skreutzer · 04-11-2014, 05:11 PM

I've just started to provide pre-packaged downloads for the software:

http://www.skreutzer.de/digitales_pu...ren/index.html

Choose the 1.6 or 1.7 package depending on the version of the installed JavaVM, and the GNU (Unix-like) or Windows package depending on the operating system environment. After extracting the Zip archive, you will find run.sh or run.bat files, which can usually be executed with a double click. In case of questions, difficulties or suggestions, just post them here!

skreutzer · 05-11-2014, 02:22 PM

And now there's support for OpenOffice/LibreOffice as a front-end for the publishing system, as ODT to HTML conversion has been made compatible and combined with EPUB and PDF generation. A document template is used to provide a set of pre-defined styles for semantic markup, which serves as some kind of “contract” or “agreement” between front-end and back-end, so that the back-end knows what to expect and the formatter at the front-end knows which styles to apply to the text. Demonstration video (again, only in German):

http://vimeo.com/94766226

And for German speakers in China who are restricted from accessing Vimeo or YouTube:

http://www.skreutzer.de/digitales_pu...l_workflow.ogv

roger64 · 05-11-2014, 04:46 PM

Hi

Thank you very much for providing these tools and videos. I'm very keen to test your EPUB and PDF output from ODT files. On your site I have only seen the 1.7 GNU version for HTML files so far, and I suppose you'll upload the new ones soon.

You are progressing very fast, and it's a pleasure to see somebody running again along paths that remind me of writer2xhtml, whose very promising development stopped two years ago for unexplained reasons.

skreutzer · 05-12-2014, 04:54 PM

I've just uploaded the 1.7 GNU package of the #46 commit. Please note that the odt2html tool will only convert the raw, plain text and semantic structure information (style names) to HTML. This is quite a different approach than writer2xhtml, which tries to do a “visually lossless” conversion from ODT to HTML by resembling and translating ODT formatting to HTML formatting. My tools just ignore all ODT formatting and apply other styles to it. OTT (OpenOffice/LibreOffice document templates) can be used to pre-define such styles, and their actual implementation for HTML, EPUB or PDF can be changed by modifying the corresponding XSLTs (maybe later there could be helper tools for doing so).

Regarding execution performance, I guess there are some factors that slow it down and can be optimized. For instance, the first invocation of the JavaVM takes a very long time if it isn't already running. Further, shell output is comparatively slow, even though Unix shells are pretty fast in general, so if output is dumped to a file, it could probably be even faster (pdflatex especially is very likely much faster if it doesn't have to wait for writing to the terminal buffer). Additionally, for processing HTML, there are the DTDs involved, which get loaded into memory I guess (as DOM?), and with some adjustments, it might be changed to plain XML processing, so it might be faster, too. I myself usually don't optimize for performance in my own programming, because if a program is slow, one can still wait a little longer, while the hardware of the ordinary user of today is incredibly fast. On the other hand, if an environment has tight memory limits, there's no way to do anything about it. So a slower program may still produce results where a memory-greedy one doesn't, but I guess I'm far from being affected by such considerations, because there's neither any overhead of any kind nor excessive use of memory space involved in those small tools. In any case, it would be interesting to test the performance with the automatic generation of, let's say, a hundred book projects all at once, when some kind of book management facility allows such runs on an entire collection of titles.

After you mentioned writer2xhtml (which is part of the larger writer2latex package) to me, I looked into it a little and even built it from sources if I remember correctly (I might be wrong about that), but there are several reasons why I preferred to write a new odt2html converter tool. First, translating ODT visual appearance to HTML visual appearance would have been a huge overhead, if all that actually matters for automation is the semantic markup. Second, writer2xhtml is largely based upon OpenOffice/LibreOffice code since it is intended as a plugin for it, so dependencies on the OpenOffice/LibreOffice code base would have needed to be maintained. Third, writer2xhtml is written as one large monolithic program, so it is hard to adjust for custom needs, requires a certain level of programming skills in order to be maintained and can't be used for other, similar tasks. My approach is more or less a primitive one compared to writer2xhtml, because every step from the source file to the target format is done by a small tool which does nothing else than just one single, defined and limited job, but these individual tools can be chained in all kinds of ways, are easily adjustable and are reusable for other source/target formats. At least to the extent of the currently implemented automated workflow ;-)

However, I would still like to see the work put into writer2xhtml be of further use, but it might require a huge commitment to investigate the current code and change it in a way that makes it less dependent and more flexible. Anyway, writer2xhtml could be integrated into such automated processing workflows, and I even experimented with the writer2xhtml output initially. But writer2xhtml in itself doesn't change much in terms of the original problem, which is the lack of semantic markup in the source document, so it would still translate to “garbage in, garbage out”, and all direct formatting would still need to be removed from the writer2xhtml output so that only the raw text and semantic, structural information remains.

roger64 · 05-13-2014, 04:11 AM

Quote:

Originally Posted by skreutzer

.../... However, I would still like to see the work put into writer2xhtml be of further use, but it might require a huge commitment to investigate the current code and change it in a way that makes it less dependent and more flexible. Anyway, writer2xhtml could be integrated into such automated processing workflows, and I even experimented with the writer2xhtml output initially. But writer2xhtml in itself doesn't change much in terms of the original problem, which is the lack of semantic markup in the source document, so it would still translate to “garbage in, garbage out”, and all direct formatting would still need to be removed from the writer2xhtml output so that only the raw text and semantic, structural information remains.

If you can do this work, I think it will definitely be worth it. I had the opportunity to talk with Henrik Just, the writer2xhtml author. He planned to design a new GUI when all work suddenly stopped. He was also aware of the “garbage in, garbage out” possibility. But among the many options writer2xhtml provided, there was one that excluded "hard formatting" (see screenshot), which is what you call "direct formatting". I think it could still be of further use, because what's left for processing looks akin to your "semantic markup". This is the only option I still use today.

I also have a messy, incomplete, but for me useful, .ott file that I use consistently. I also export a custom stylesheet which contains the font-face declarations and some other style definitions. I either use them when I fine-tune my EPUB or just discard them.

roger64 · 05-13-2014, 11:30 AM

Hi

Forgive me, I did a very unorthodox trial using the 1.7 GNU package of your latest commit (#46).

I took a big ODT file (a whole book, a little under 300k) which I previously used as a source for an EPUB. I used your odt2html and got a sizable output.html file. I guessed I could open it after about 30 seconds, but I got no indication of when exactly the processing ended.

Since I did not know how to proceed, I created a new EPUB with the Calibre editor out of this file. Then I imported the two stylesheets from my original EPUB. Within the EPUB, I linked this output.html file to these two stylesheets.

I checked with Calibre. It only complained that the text file was too big. I split it in two and I had a working EPUB which I could certainly read anywhere.

I noticed of course some defects in the display.
- the paragraph styles were carried over almost correctly, though some style names were interspersed with _20_, like Text_20_body instead of Textbody, or Ital_20_droite instead of Italdroite. This was easily corrected. Other paragraph names were properly transcribed (Quotation, Centrage,...).

The main missing points are these:
- the titles were treated as plain paragraphs (p class="Heading"). Intermediate h2 tags (chapters) disappeared. So I could not produce a usable toc.ncx
- the small fry (I mean the i, sup, br, ... tags) were all treated like a common span without any parameter, which means there is of course some transcription work to do in this area.

All in all, I did not expect to get such a quick and workable result with an odt file of this size.

Congratulations!

skreutzer · 05-14-2014, 07:06 PM

It seems this GitHub repository is an attempt to continue development of writer2latex. I just sent another e-mail to Henrik Just, hopefully this time with a response, because at least the SourceForge repository would benefit from a status update regarding the current situation.

Yes, the “Ignore hard formatting” option of writer2xhtml might only retain the raw text and structural markup (with corresponding style names attached to it), which is quite advantageous for automatic processing based on the concept of semantic markup, and which is most likely exactly what my odt2html does.

For your 300k ODT input file, what exactly happened? How big was the output.html file? Couldn't you open it in an editor or browser, or didn't odt2html1 quit (or did it just disappear after execution, if you clicked on run.sh)? In any case, it sounds like you invoked $/odt2html/odt2html1 yourself on the terminal or via the $/odt2html/odt2html1/run.sh helper as a standalone tool instead of $/workflows/odt2epub1.sh. That makes sense, since your document isn't based upon the style names defined in $/odt2html/templates/template1/template.ott, and therefore $/workflows/odt2epub1.sh wouldn't know how to automatically transform the flat structure from ODT into a hierarchical structure, how to split the input into several smaller HTMLs and how to package it to EPUB by itself. If run.sh was used, you could also look into the log.txt file which gets written to the same directory with each new execution. The names of the style classes correspond to the internal identifiers of the ODT, and since characters like space aren't in common use for identifiers, special characters are represented by their ASCII code in hex, enclosed in underscores, so it's quite easy to match the display names to their internal identifiers. One could translate them automatically back to their original display names, except for cases where a space was involved, because space is used as the separator between individual CSS classes in an HTML class attribute. If this textual description isn't clear enough, I might just show you an actual example of it.
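To make the style name encoding above a bit more concrete, here is a minimal Python sketch of translating internal identifiers back to display names. It is not part of the toolset (which is written in Java); the function name `decode_style_name` is made up for this illustration, and the sketch assumes that every escaped character appears as exactly two hex digits enclosed in underscores, as in the `_20_` examples from the thread.

```python
import re

# Assumption: an escaped character looks like "_20_", i.e. two hex digits
# enclosed in underscores, as in "Text_20_body".
_ESCAPE = re.compile(r"_([0-9A-Fa-f]{2})_")


def decode_style_name(internal_name: str) -> str:
    """Translate an internal style identifier back to its display name."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), internal_name)


print(decode_style_name("Text_20_body"))    # prints "Text body"
print(decode_style_name("Ital_20_droite"))  # prints "Ital droite"
print(decode_style_name("Quotation"))       # prints "Quotation"
```

The decoded names are fine for showing to a user, but "Text body" can't be written back into an HTML class attribute as it is, because it would be read there as the two classes "Text" and "body".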

If you use odt2html1 standalone, it's indeed quite primitive and does nothing other than convert the raw text and structural information from ODT to HTML, while no other transformation is performed at all. Therefore additional tools like html_flat2hierarchical1 or html2epub1_html_chapter.xsl are needed in order to get a more usable result. The reason that titles get treated as plain paragraphs (p class="Heading") is how ODT treats them: there's no actual info about the order of headings within the document body itself (maybe in the style definition, I haven't looked into it yet). I wonder about the mention of h2 tags, they're probably technically not in the ODT itself, are they? Or is it the use of a "chapter" style? In any case, the processing back-end needs to make sense of styles like "Heading", "Text_20_body" or "Ital_20_droite" and translate them to something meaningful. For the styles you've used in your own document, such replacements for EPUB generation would be similar to what prepare4hierarchical.xsl, html2epub1_html_part.xsl and html2epub1_html_chapter.xsl are for template1.ott of the $/odt2html/templates/template1 directory, as long as there's neither a tool nor a GUI to do style matching between front-end and back-end yet. I think i, sup and br get ignored completely, and the remaining common span without any parameter is left over from what OpenOffice/LibreOffice puts as “extra” into the ODT, if I remember correctly. Additionally, if OpenOffice/LibreOffice wasn't just used to apply formatting to an already existing raw text but instead to write the text in it initially, one can observe an incredible fragmentation of spans all over the place. I don't know the reason for that yet, and I don't tidy it up yet in order to get a clean output.

i, sup and br in almost any case imply visual markup for italic, superscript and line break, regardless of whether they were (are they?) in the actual ODT, in HTML or in the EPUB (if allowed at all). Basically, for automated processing based upon semantic markup, you're not supposed to mark something as “italic” or “superscript” within your text. You should rather define styles like “emphasis” and “footnotemark” and then later define how they should be represented visually (OpenOffice/LibreOffice still allows WYSIWYG text editing, even if you later get something that might look a little different, since the back-end applied other layouts to the input that was fed to the system). Even if you're not supposed to click the “italic” button, I haven't investigated yet which markup goes into the ODT file, and I'm pretty confident that even a hard i, sup and br could be translated to a style name, as style names would improve the quality of the output file. But again, if that button gets clicked and the goal is quality output that can be used for automated processing, “italic” (even when it is used consistently) isn't associated with any particular meaning and also isn't distinguishable from other italic text of a completely different sort. Both uses would only share the same visual appearance, which of course isn't machine readable and therefore can't be recognized as separate by an automated processing system.
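As an illustration of what "making sense of the styles" could mean, here is a small Python sketch that turns paragraphs with a given style class into a proper HTML element. It is only an illustration and not how the workflow does it (the workflow uses the XSLT files named above); the mapping of "Heading" to h2 and the function name `promote_styles` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from ODT style classes to HTML element names.
# Which class should become which element depends on the document template.
STYLE_TO_ELEMENT = {"Heading": "h2"}


def promote_styles(xhtml: str) -> str:
    """Replace <p class="..."> by the element the style class is mapped to."""
    root = ET.fromstring(xhtml)
    # Collect the paragraphs first, because their tags are changed afterwards.
    for paragraph in list(root.iter("p")):
        target = STYLE_TO_ELEMENT.get(paragraph.get("class", ""))
        if target is not None:
            paragraph.tag = target
            del paragraph.attrib["class"]
    return ET.tostring(root, encoding="unicode")


flat = (
    '<body><p class="Heading">Chapter 1</p>'
    '<p class="Text_20_body">Some text.</p></body>'
)
print(promote_styles(flat))
```

With real h2 elements in the HTML, a later step has something to build a table of contents (such as toc.ncx) from.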

Hopefully that's not too much text I've written now, but maybe we could experiment a little with use case examples and learn about the issues which aren't solved yet. In any case, thank you very much for your feedback, I'll try odt2html1 with larger documents within the next few days and look at the details a little more deeply, because up to the current version I was more concerned about the whole odt2epub workflow, in which odt2html is just the first of a set of steps, so more careful investigation is needed without doubt ;-)

roger64 · 05-15-2014, 02:36 AM

hi

Thanks for your reply and your interest.

I am afraid you'll get no reply from Henrik Just. He just disappeared from the radar screen. I tried to contact him several times to no avail, including through OpenOffice, where he had taken up work about 30 months ago, and even with the help of a Danish friend, by looking for him in the Danish phone directory.

If you think it could be useful, I will provide you with a link to my source ODT file, your converted output.html file, and the final "hybrid" EPUB. I was just curious to see how a "real life" ODT file of book size could be transcribed and, of course, I did not follow your recommended workflow for this.

writer2xhtml provides a very useful "style mapping" for some formatting attributes like superscript, bold, italic. The most frequently used for me are: i, em, b, sup. There are also 14 others. Ignoring them would be quite inconvenient because I would have to invoke endless character style occurrences with OpenOffice to qualify them. That would be a real hassle and a disincentive.

skreutzer · 05-17-2014, 04:48 AM

It would be helpful to investigate the problem with the ODT file you've used, but on the other hand maybe there's a way to find out what went wrong without you disclosing your document to me. For instance, I could try odt2html1 on a large document I make up myself, because I've only tried short ones so far. Of course you're free to use whatever workflow you like ;-) That's one of the goals of the project: users can build their own workflows, so all tools are modular building blocks and can be invoked separately.

I just looked it up: ODT itself doesn't use i or b (the non-semantic, meaningless “bold” and “italic” buttons in OpenOffice/LibreOffice). Internally, those direct formatting visual appearances on character level are implemented as styles. The names of those styles are non-static, they change if the document gets edited. At least the same internal style name gets used for portions of the text which have the same direct formatting (so it seems). Therefore, if you've used the “bold” and “italic” buttons consistently, you might want a tool which identifies “bold” and “italic” in the ODT and replaces them with a style name of your choice. “bold” and “italic” would be a pretty bad choice for that name, and there's still the disadvantage that different meanings behind the use of “bold” or “italic” can't be identified automatically, so additional manual effort could be required to make it a quality ODT file. In HTML, I wouldn't introduce i or b at all, because they're contrary to quality HTML and bad for automatic processing.

So there are three options. The first is to improve the ODT manually (since your document is a result of bad formatting habits, which are promoted by OpenOffice/LibreOffice just as by most other word processors). The second is to use a helper which replaces the direct formatting with styles, if the direct formatting was used consistently and corresponds to a meaning. The third is to introduce i and b into the output and make it bad quality, for which the ODT would then need to be interpreted, whereas at the moment it only gets transformed. My favorite is of course the first one, to get rid of the bad formatting habit, and probably getting front-ends like OpenOffice/LibreOffice to stop promoting it by replacing it with a style-based approach. There are other front-ends than OpenOffice/LibreOffice which don't allow direct formatting at all, and OpenOffice/LibreOffice can be configured in a way that direct formatting is hidden from the user.

Sorry about your document, but in order to benefit from automatic processing, documents at least need to be converted, and if direct formatting got applied indiscriminately, manual work is inevitable. I guess writer2xhtml has to spend quite some code on fixing the wrong usage of OpenOffice/LibreOffice, whereas odt2html1 is based upon the right usage of OpenOffice/LibreOffice.
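Such a helper doesn't exist in the toolset; the following Python sketch only illustrates the idea. It assumes that direct formatting ends up as automatic styles (with generated names like "T1") in the content.xml of the ODT, carrying `fo:font-style="italic"` or `fo:font-weight="bold"`. The sample document, the function names and the replacement style name "emphasis" are made up for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
}

# Made-up, heavily shortened content.xml of an ODT file.
SAMPLE = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"'
    ' xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">'
    '<office:automatic-styles>'
    '<style:style style:name="T1" style:family="text">'
    '<style:text-properties fo:font-style="italic"/>'
    '</style:style>'
    '</office:automatic-styles>'
    '<office:body><office:text>'
    '<text:p text:style-name="Text_20_body">A '
    '<text:span text:style-name="T1">word</text:span>.</text:p>'
    '</office:text></office:body>'
    '</office:document-content>'
)


def q(prefix: str, local: str) -> str:
    """Build the qualified name ElementTree uses for namespaced names."""
    return f"{{{NS[prefix]}}}{local}"


def find_direct_formatting(root: ET.Element) -> dict:
    """Map automatic style names (like "T1") to "italic" or "bold"."""
    found = {}
    for style in root.iterfind(".//office:automatic-styles/style:style", NS):
        properties = style.find("style:text-properties", NS)
        if properties is None:
            continue
        if properties.get(q("fo", "font-style")) == "italic":
            found[style.get(q("style", "name"))] = "italic"
        elif properties.get(q("fo", "font-weight")) == "bold":
            found[style.get(q("style", "name"))] = "bold"
    return found


def replace_direct_formatting(root: ET.Element, replacements: dict) -> int:
    """Point spans with direct formatting at a named style; return the count."""
    automatic = find_direct_formatting(root)
    changed = 0
    for span in root.iter(q("text", "span")):
        kind = automatic.get(span.get(q("text", "style-name")))
        if kind in replacements:
            span.set(q("text", "style-name"), replacements[kind])
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
print(find_direct_formatting(root))
print(replace_direct_formatting(root, {"italic": "emphasis"}))
```

A real helper would additionally have to read content.xml out of the ODT (which is a Zip archive), make sure the named style exists in the document, and write the archive back. And it still can't tell an emphasized word from a foreign word if both were just marked italic.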

roger64 · 05-18-2014, 12:59 PM

Hi

Don't be sorry about my document, it was just a trial. But to tell you the truth, I feel a little confused: I did not think I could be doing something wrong by doing it this way.

To clear OpenOffice of any blame, I have attached a very basic excerpt from a written essay. We find here:
- a paragraph style (1)
- three "character" styles, i.e. italic (2), small-caps (3) and superscript (4). I suppose the last three will be automatically processed as spans, with a class linking to their respective CSS characteristics.

For example: the italic span will have to respect all the properties of the enclosing paragraph style (among them the font-size) and change the font style from normal to italic.

What am I supposed to do if I have to deal with this kind of text?
What is wrong if there is a straight conversion from the "italic" button to the character style "italic"? Or from the "superscript" one to the character style "superscript"?

skreutzer · 05-18-2014, 04:39 PM

No, you didn't do anything wrong at all ;-) Of course not, but as odt2html1 is only a small portion of a larger processing workflow, it might not lead to the results one expects from it, since its main purpose is to extract the raw text and structural information from an input ODT, while the processing back-end is supposed to take care of the visual representation in target formats. Yes, HTML is a target format, but the resulting HTML might get used as input for automated processing at any later time, and modern HTML separates structure from visual appearance as well. So the input file is expected to mark text by meaning, for which CSS classes can be used to define the actual visual appearance. This way, the visual appearance can be easily changed, or processing back-ends may react to it (building lists, filtering stuff, extending marked portions with additional material etc.).

Regarding your example: processing software isn't able to differentiate between all the superscript text in the entire document, which may indeed be of different types and therefore should be handled differently. Superscript might be used for footnote markers, superscript might be used in mathematical formulas, superscript might be used in measurements, and it would be impossible for software to recognize which superscript text portions are supposed to be the footnote markers, if the software should generate one version with footnotes and one without. There might be paragraphs which are part of the ordinary text and there might be paragraphs which are reminder boxes. Without semantic markup, it could be difficult to identify them, especially if other boxes are used as well. Even in common use cases, not all paragraphs will necessarily look the same, so if each of them gets its corresponding type attached, they can easily be translated to whatever target format, while at the same time reducing layout mistakes by the author/formatter. Italic might be used for all kinds of things that need to be highlighted, which are of a completely different sort, be it emphasis, words in a foreign language or special names. Maybe some uses of italic should be changed to bold or should make up an automatically generated list, and if there's no other clue to determine which type those uses belong to except their italic visual representation, which is equally true for all other types, there won't be a way to identify the actual type.

Even if none of those benefits are of relevance, semantic styles are still a way to describe the elements of a document in an abstract way, so from a technical perspective it gets much easier to translate them from one format to another, because meaning doesn't rely on visual appearance, which is only recognizable by humans because humans can identify the context of a layout element while software can't. Additionally, layout concepts and description languages of two formats might be incompatible with each other, while style names can be easily mapped to their equivalent or at least to whatever resembles the visual representation the closest. To some extent, OpenOffice/LibreOffice might even be used as a data structuring tool, and as writing software as well as word processors aren't intended to do a great deal of typesetting or format conversion by themselves, semantic markup is pretty much the best way to provide a bridge to sophisticated processing systems, which are almost always based upon semantic concepts.
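The footnote case can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names "footnotemark" and "exponent" and the function name `strip_class` are made up for the example, and the input is assumed to be well-formed XHTML. Both spans would look the same (superscript) in the rendered text, but only the class tells the software which one may be dropped for the version without footnotes.

```python
import xml.etree.ElementTree as ET


def strip_class(xhtml: str, class_name: str) -> str:
    """Remove all spans carrying class_name, keeping the text that follows them."""
    root = ET.fromstring(xhtml)
    # Work on a snapshot of the elements, because children get removed.
    for parent in list(root.iter()):
        for child in list(parent):
            classes = child.get("class", "").split()
            if child.tag != "span" or class_name not in classes:
                continue
            # The text after the span (its tail) must survive the removal.
            tail = child.tail or ""
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + tail
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
    return ET.tostring(root, encoding="unicode")


source = (
    '<p>The area is 5 m<span class="exponent">2</span>.'
    '<span class="footnotemark">1</span> More text.</p>'
)
print(strip_class(source, "footnotemark"))
```

If both spans were only marked as "superscript", the same call would either remove the exponent as well or leave the footnote marker in.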

04-05-2014, 10:54 AM	#32
skreutzer Software Developer Posts: 189 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3	Thank you very much for pointing out this issue! One of the most important goals of the stuff I do is to make sure that nobody ever gets excluded, because the basic freedom of unrestricted access plays an important role for the foundation of a free (digital) society. As Vimeo and too YouTube are censored in China, hopefully my own personal homepage isn't: http://www.skreutzer.de/digitales_pu...1_workflow.ogv http://www.skreutzer.de/digitales_pu...1_workflow.ogv I use Vimeo and YouTube primarily for the outsourcing of video files and for SEO. I could host the files myself, but I don't want to waste the resources if chances are low that people would find it there, so I only do it on request, if it makes sense. Please note that the current links don't provide any sort of anonymity or encryption, and since I don't convert the video files into some proprietary video format, you would need to play them with a video player that is capable of the most open video format Ogg Theora, for which freely licensed player software is commonly available. Another option is that I write short articles about the software, as blog posts or as tutorial/description, only containing text and screenshots, maybe even in English. This way, much less bandwidth and webspace would be wasted to convey the same information. Last edited by skreutzer; 04-05-2014 at 10:57 AM.

04-06-2014, 06:15 AM	#33
roger64 Wizard Posts: 2,608 Karma: 3000161 Join Date: Jan 2009 Device: Kindle PW3 (wifi)	Hi Thank you very much for providing an alternative link which did work from China (all in all, I needed two hours to download your video, i.e. 100 minutes for the first three minutes, but the remaining part in 20 minutes). Your 142 meg video file clearly demonstrates how, from a clean html file (book), your bash script has the ability to produce both a clean EPUB and a clean PDF, -including metadata thanks to a second bash script. You also show how you can add some text and a cover to the original file and produce again a complete EPUB and PDF. So congratulations for this first but quite important step. What will be the next step? When do you plan to release your script(s)? Last edited by roger64; 04-06-2014 at 06:40 AM.

04-07-2014, 02:08 PM	#34
skreutzer Software Developer Posts: 189 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3	Actually, the toolset consists of several individual standalone programs written in Java (in order to be portable across computer architectures and operating system environments), and the non-portable shell scripts are currently used to glue them together. Those tools are already released, because I do open development in a GitHub repository (hopefully not blocked in China): https://github.com/skreutzer/automat...tal_publishing

I don't provide pre-built packages yet. If you're interested in trying those tools out, you have the option to build the executables yourself (an installed JavaSDK would be required), or I could release pre-built ones (an installed JavaVM would be required). Further, the shell scripts might not be portable across all kinds of operating system environments, but you could still use the tools manually. For PDF generation, a LaTeX installation is required, since LaTeX is the only supported backend up to now (XSL-FO and iText may follow).

The software is licensed under the GNU Affero General Public License version 3 or any later version, so you're permitted to:

- Run the program without any restrictions.
- Modify and customize the program, as long as nobody else uses it except you.
- Distribute the program to other people, commercially or non-commercially, as long as you ensure that receivers can obtain the source code.
- Modify the program and distribute the modified version to other people, commercially or non-commercially, as long as you ensure that receivers can obtain the source code, including all modifications.
- Use the program, with or without modifications you made to it, together with extensions you wrote for yourself, which you can release under a license of your choice, but I want to strongly encourage you to use GNU AGPL 3 or later.
- Run the program on a server which is accessible through a network in order to provide an online service, as long as you ensure that receivers can obtain the source code of the program you're running (with or without modifications).

You are prohibited from:

- Distributing executables without providing access to the corresponding source code.
- Removing copyright notices.
- Sublicensing files which were covered by the GNU AGPL 3 or later to other people under a different license than GNU AGPL 3 or later.
- Running the program on a server which is accessible through a network without providing access to the corresponding source code you're running on that server.
- Distributing the program in devices which are set up in a way that a user can't run a modified version of the program on them, while the device is generally able to load and execute custom programs.

This way, the license ensures that the software can be used to the greatest extent possible, only preventing you from actions which would violate the digital rights of other users. The license is intended to promote open collaboration and contribution to a free society, and I would really appreciate it if you would improve the software, since each contribution, be it a small one or a big one, would benefit all users and developers.

Please note that the current features of the software aren't special at all. They're quite common for publishing houses and institutes that use proprietary software, and there are already much more sophisticated solutions available under a free license:

- Booktype by Sourcefabric
- Softcover

Maybe it would be possible to interoperate with those systems, although my concept is a little different: the processing tools should also be usable outside a large and complex processing workflow, so that they can be used for small tasks as well and be re-arranged with greater flexibility for all kinds of custom and general tasks. Last edited by skreutzer; 04-07-2014 at 02:11 PM.

05-11-2014, 04:46 PM	#37
roger64 Wizard Posts: 2,608 Karma: 3000161 Join Date: Jan 2009 Device: Kindle PW3 (wifi)	Hi

Thank you very much for providing these tools and videos. I'll be very keen to test your EPUB and PDF output from ODT files. On your site I have only seen the 1.7 GNU version for HTML files, and I suppose you'll upload the updated packages soon.

You are progressing very fast, and it's a pleasure to see somebody again following paths that remind me of writer2xhtml, whose very promising development stopped two years ago for unexplained reasons. Last edited by roger64; 05-11-2014 at 05:03 PM.

05-12-2014, 04:54 PM	#38
skreutzer Software Developer Posts: 189 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3	I've just uploaded the 1.7 GNU package of the #46 commit. Please note that the odt2html tool will just convert the raw, plain text and the semantic structure information (style names) to HTML. This is quite a different approach from writer2xhtml, which tries to do a “visually lossless” conversion from ODT to HTML by translating ODT formatting into equivalent HTML formatting. My tools just ignore all ODT formatting and apply other styles to the text. OTT files (OpenOffice/LibreOffice document templates) can be used to pre-define such styles, and their actual implementation for HTML, EPUB or PDF can be changed by modifying the corresponding XSLTs (maybe later there could be helper tools for doing so).

Regarding execution performance, I guess there are some factors that slow it down and can be optimized. For instance, the first invocation of the JavaVM takes a very long time if it isn't already running. Further, shell output is comparatively slow, even though Unix shells are pretty fast in general. If output is dumped to a file instead, it could probably be even faster (pdflatex especially is very likely much faster if it doesn't have to wait for writing to the terminal buffer). Additionally, processing HTML involves the DTDs, which I guess get loaded into memory (as DOM?), and with some adjustments this might be changed to plain XML processing, which might be faster, too.

I myself usually don't optimize for performance in my own programming, because if a program is slow, one can still wait a little longer, and the hardware of today's ordinary user is incredibly fast. On the other hand, if an environment has tight memory limits, there's no way to do anything about it. So a slower program may still produce results where a memory-greedy one doesn't, but I guess I'm far from being affected by such considerations, because those small tools involve neither notable overhead nor excessive use of memory. In any case, it would be interesting to test the performance with the automatic generation of, let's say, a hundred book projects all at once, when some kind of book management facility allows such runs on an entire collection of titles.

After you mentioned writer2xhtml (which is part of the larger writer2latex package) to me, I looked into it a little and even built it from sources, if I remember correctly (I might be wrong about that), but there are several reasons why I preferred to write a new odt2html converter tool. First, translating ODT visual appearance to HTML visual appearance would have been a huge overhead if all that actually matters for automation is the semantic markup. Second, writer2xhtml is largely based upon OpenOffice/LibreOffice code, since it is intended as a plugin for it, so dependencies on the OpenOffice/LibreOffice code base would have needed to be maintained. Third, writer2xhtml is written as one large monolithic program, so it is hard to adjust for custom needs, requires a certain level of programming skill to maintain, and can't be used for other, similar tasks.

My approach is more or less a primitive one compared to writer2xhtml, because every step from the source file to the target format is done by a small tool which does nothing other than one single, defined and limited job. These individual tools can, however, be combined in all kinds of ways, are easily adjustable and are reusable for other source/target formats. At least to the extent of the currently implemented automated workflow ;-)

However, I still would like to see the work put into writer2xhtml be of further use, but it might require a huge commitment to investigate the current code and change it in a way that makes it less dependent and more flexible. Anyway, writer2xhtml could be integrated into such automated processing workflows. I even experimented with the writer2xhtml output initially, but writer2xhtml in itself doesn't change much in terms of the original problem, which is the lack of semantic markup in the source document. It would still amount to “garbage in, garbage out”, and all direct formatting would still need to be removed from the writer2xhtml output so that only the raw text and the semantic, structural information remains. Last edited by skreutzer; 05-12-2014 at 05:22 PM.
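
The idea of chaining small tools that each do one limited job, as described in the post above, can be illustrated with a short sketch. This is not the project's code (the actual tools are Java programs glued together by shell scripts); the two step functions and the `run_workflow` helper below are hypothetical stand-ins, chosen only to show how steps might be re-arranged freely.

```python
from functools import reduce
from typing import Callable, Iterable

# A step takes a document (here simply a string) and returns a new one.
Step = Callable[[str], str]


def strip_italic_tags(doc: str) -> str:
    """Hypothetical step: drop purely visual <i> markers, keep the text."""
    return doc.replace("<i>", "").replace("</i>", "")


def wrap_chapter(doc: str) -> str:
    """Hypothetical step: wrap the content in a chapter container."""
    return f'<section class="chapter">{doc}</section>'


def run_workflow(doc: str, steps: Iterable[Step]) -> str:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda current, step: step(current), steps, doc)


if __name__ == "__main__":
    source = "<p>Some <i>text</i></p>"
    # The same steps can be used alone or combined in a different order.
    print(run_workflow(source, [strip_italic_tags]))
    print(run_workflow(source, [strip_italic_tags, wrap_chapter]))
```

Because each step only agrees on its input and output format, a step can be replaced or reused in another workflow without touching the others.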

04-03-2014, 11:27 PM	#31
roger64 Wizard Posts: 2,608 Karma: 3000161 Join Date: Jan 2009 Device: Kindle PW3 (wifi)	Hi Thank you for this. German is not a problem. But vimeo seems to be a real one in China...

04-11-2014, 05:11 PM	#35
skreutzer Software Developer Posts: 189 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3	I've just started to provide pre-packaged downloads for the software: http://www.skreutzer.de/digitales_pu...ren/index.html

Depending on the version of the installed JavaVM, choose the 1.6 or 1.7 package; depending on the operating system environment, choose the GNU (Unix-like) or Windows package. After extracting the Zip archive, you will find run.sh or run.bat files, which can usually be executed with a double click.

In case of questions, difficulties or suggestions, just post them here!

05-11-2014, 02:22 PM	#36
skreutzer Software Developer Posts: 189 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3	And now there's support for OpenOffice/LibreOffice as a front-end for the publishing system, as the ODT to HTML conversion has been made compatible and combined with EPUB and PDF generation. A document template is used to provide a set of pre-defined styles for semantic markup, which serves as some kind of “contract” or “agreement” between front-end and back-end, so that the back-end knows what to expect and the formatter at the front-end knows which styles to apply to the text.

Demonstration video (again, only in German): http://vimeo.com/94766226

And for German speakers in China who are restricted from accessing Vimeo or YouTube: http://www.skreutzer.de/digitales_pu...l_workflow.ogv

05-13-2014, 11:30 AM	#40
roger64 Wizard Posts: 2,608 Karma: 3000161 Join Date: Jan 2009 Device: Kindle PW3 (wifi)	Hi

Forgive me, I made a very unorthodox attempt using your latest 1.7 GNU package (commit #46). I took a big ODT file (a whole book, a little under 300k) which I had previously used as the source for an EPUB. I ran your odt2html and got a sizable output.html file. I guessed I could open it after about 30 seconds, but I got no indication of when exactly the processing ended.

Since I did not know how to proceed, I created a new EPUB from this file with the Calibre editor. Then I imported the two stylesheets from my original EPUB and, within the EPUB, linked the output.html file to them. I checked with Calibre. It only complained that the text file was too big. I split it in two and had a working EPUB which I could certainly read anywhere.

I noticed of course some defects in the display.

- The paragraph styles were carried over almost correctly, though some style names were interspersed with _20_, like Text_20_body instead of Textbody, or Ital_20_droite instead of Italdroite. This was easily corrected. Other paragraph names were properly transcribed (Quotation, Centrage, ...).

The main missing points are these:

- The titles were treated as plain paragraphs (p class="Heading"). Intermediate h2 tags (chapters) disappeared, so I could not produce a usable toc.ncx.
- The small fry (I mean the i, sup, br, ... tags) were all treated like a common span without any parameter, which means there is of course some transcription work to do in this area.

All in all, I did not expect to get such a quick and workable result with an ODT file of this size. Congratulations! Last edited by roger64; 05-13-2014 at 12:06 PM.

05-14-2014, 07:06 PM	#41
skreutzer Software Developer Posts: 189 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3	It seems this GitHub repository is an attempt to continue development of writer2latex. I just sent another e-mail to Henrik Just, hopefully this time with a response, because at least the SourceForge repository would benefit from a status update regarding the current situation.

Yes, the “Ignore hard formatting” option of writer2xhtml might retain only the raw text and structural markup (with corresponding style names attached to it), which is quite advantageous for automatic processing based on the concept of semantic markup, and which is most likely exactly what my odt2html does.

For your 300k ODT input file, what exactly happened? How big was the output.html file? Couldn't you open it in an editor or browser, or didn't odt2html1 quit (or did it just disappear after execution, if you clicked on run.sh)? In any case, it sounds like you invoked $/odt2html/odt2html1 yourself on the terminal or via the $/odt2html/odt2html1/run.sh helper as a standalone tool instead of using $/workflows/odt2epub1.sh. Your document isn't based upon the style names defined in $/odt2html/templates/template1/template.ott, and therefore $/workflows/odt2epub1.sh wouldn't know how to automatically transform the flat structure of the ODT into a hierarchical structure, split the input into several smaller HTML files and package it as EPUB by itself. If run.sh was used, you could also look into the log.txt file, which gets written to the same directory with each new execution.

The names of the style classes correspond to the internal identifiers of the ODT. Since characters like space aren't in common use for identifiers, special characters are represented by their ASCII code in hex, enclosed in underscores, so it's quite easy to match the display names to their internal identifiers. One could translate them automatically back to their original display names, except for cases where a space was involved, because space is used as the separator between individual CSS classes in an HTML class attribute. If this textual description isn't clear enough, I might just show you an actual example for it.

If you use odt2html1 standalone, it's indeed quite primitive and does nothing other than convert the raw text and structural information from ODT to HTML; no other transformation is performed at all. Therefore, additional tools like html_flat2hierarchical1 or html2epub1_html_chapter.xsl are needed in order to get a more usable result.

The reason that titles get treated as plain paragraphs (p class="Heading") is how ODT treats them: there's no actual information about the order of headings within the document body itself (maybe in the style definition, I haven't looked into it yet). I wonder about the mention of h2 tags; they're probably technically not in the ODT itself, are they? Or is it the use of a "chapter" style? In any case, the processing back-end needs to make sense of styles like "Heading", "Text_20_body" or "Ital_20_droite" and translate them into something meaningful. For the styles you've used in your own document, such replacements for EPUB generation would be similar to what prepare4hierarchical.xsl, html2epub1_html_part.xsl and html2epub1_html_chapter.xsl are for template1.ott of the $/odt2html/templates/template1 directory, as long as there's neither a tool nor a GUI to do style matching between front-end and back-end yet.

I think i, sup and br get ignored completely, and the remaining common span without any parameter is left over from what OpenOffice/LibreOffice puts as “extra” into the ODT, if I remember correctly. Additionally, if OpenOffice/LibreOffice wasn't just used to apply formatting to an already existing raw text but to write the text in it initially, one can observe an incredible fragmentation of spans all over the place. I don't know the reason for it yet and don't tidy it up yet in order to get a clean output.

i, sup and br in almost any case imply visual markup for italic, superscript and line break, regardless of whether they were (are they?) in the actual ODT, in HTML or in the EPUB (if allowed at all). Basically, for automated processing based upon semantic markup, you're not supposed to mark something as “italic” or “superscript” within your text. You should rather define styles like “emphasis” and “footnotemark” and then later define how they should be represented visually (OpenOffice/LibreOffice still allows WYSIWYG text editing, even if you later get something that might look a little different, since the back-end applied other layouts to the input that was fed to the system).

Even if you're not supposed to click the “italic” button, I haven't investigated yet which markup goes into the ODT file, and I'm pretty confident that even a hard i, sup or br could be translated to a style name, as style names would improve the quality of the output file. But again, if that button gets clicked and the goal is quality output that can be used for automated processing, “italic” (even when it is used consistently) isn't associated with any particular meaning and also isn't distinguishable from other italic text of a completely different sort. Both uses would only share the same visual appearance, which of course isn't machine-readable and therefore can't be recognized as separate by an automated processing system.

Hopefully that's not too much text I've written now, but maybe we could experiment a little with use case examples and learn about the issues which aren't solved yet. In any case, thank you very much for your feedback. I'll try odt2html1 with larger documents within the next few days and look at the details a little more deeply, because up to the current version I was more concerned with the whole odt2epub workflow, in which odt2html is just the first of a set of steps, so more careful investigation is needed without doubt ;-)
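
The style-name encoding described in the post above (for example `Text_20_body` for the display name "Text body") can be sketched as follows. This is a minimal illustration and not part of odt2html1; it assumes that every escape has the form underscore, two hex digits, underscore, and the function names are hypothetical. The second helper shows one possible way around the problem that a space separates CSS classes: it replaces spaces with hyphens, which is an arbitrary choice made for this example.

```python
import re

# Assumed escape form: "_" + two hex digits + "_", e.g. "_20_" for a space.
_ESCAPE = re.compile(r"_([0-9A-Fa-f]{2})_")


def decode_style_name(internal: str) -> str:
    """Translate an internal ODT style identifier back to its display name."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), internal)


def to_css_class(internal: str) -> str:
    """Decode the name, but keep it usable as one single CSS class.

    A space would split the value of an HTML class attribute into two
    classes, so spaces are replaced by hyphens here (an arbitrary choice).
    """
    return decode_style_name(internal).replace(" ", "-")


if __name__ == "__main__":
    for name in ("Text_20_body", "Ital_20_droite", "Quotation"):
        print(name, "->", repr(decode_style_name(name)), "/", to_css_class(name))
```

Names without special characters, such as `Quotation`, pass through unchanged, which matches the observation that only some style names in the output contained `_20_`.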

05-15-2014, 02:36 AM	#42
roger64 Wizard Posts: 2,608 Karma: 3000161 Join Date: Jan 2009 Device: Kindle PW3 (wifi)	Hi

Thanks for your reply and your interest. I am afraid you'll get no reply from Henrik Just. He just disappeared from the radar screen. I tried to contact him several times to no avail, including through OpenOffice, where he had taken on work about 30 months ago, and even with the help of a Danish friend, by looking for him in the Danish phone directory.

If you think it could be useful, I will provide you with a link to my source ODT file, your converted output.html file, and the final "hybrid" EPUB. I was just curious to see how a "real life" ODT file of book size would be transcribed and, of course, I did not follow your recommended workflow for this.

writer2xhtml provides a very useful "style mapping" for some formatting attributes like superscript, bold and italic. The ones I use most frequently are: i, em, b, sup. There are also 14 others. Ignoring them would be quite inconvenient, because I would have to apply character styles to endless occurrences in OpenOffice to qualify them. That would be a real hassle and a disincentive. Last edited by roger64; 05-15-2014 at 02:46 AM.

05-17-2014, 04:48 AM	#43
skreutzer Software Developer Posts: 189 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3	It would be helpful to investigate the problem with the ODT file you've used, but on the other hand maybe there's a way to find out what went wrong without you disclosing your document to me. For instance, I could try odt2html1 on a large document I make up myself, because I've only tried short ones so far.

Of course you're free to use whatever workflow you like ;-) That's one of the goals of the project: users can build their own workflows, so all tools are modular building blocks and can be invoked separately.

I just looked it up: ODT itself doesn't use i or b (the non-semantic, meaningless “bold” and “italic” buttons in OpenOffice/LibreOffice). Internally, those direct-formatting visual appearances on character level are implemented as styles. The names of those styles are not static; they change if the document gets edited. At least the same internal style name gets used for portions of the text which have the same direct formatting (it seems so). Therefore, if you've used the “bold” and “italic” buttons consistently, you might want a tool which identifies “bold” and “italic” in the ODT and replaces them with a style name of your choice. “bold” and “italic” would be pretty bad choices for such a name, and there's still the disadvantage that different meanings behind the use of “bold” or “italic” can't be identified automatically, so additional manual effort could be required to make it a quality ODT file. In HTML, I wouldn't introduce i or b at all, because they're contrary to quality HTML and bad for automatic processing.

So the options are:

- to improve the ODT manually (since your document is a result of bad formatting habits, which are promoted by OpenOffice/LibreOffice just as by most other word processors),
- to use a helper to replace the direct formatting with styles, if the direct formatting was used consistently and corresponds to a meaning,
- to introduce i and b into the output and make it bad quality, for which the ODT would then need to be interpreted, whereas it only gets transformed at the moment.

My favorite is of course the first one, to get rid of the bad formatting habit, and possibly to get front-ends like OpenOffice/LibreOffice to stop promoting it by replacing it with a style-based approach. There are front-ends other than OpenOffice/LibreOffice which don't allow direct formatting at all, and OpenOffice/LibreOffice can be configured in a way that hides direct formatting from the user.

Sorry for your document, but in order to benefit from automatic processing, documents need at least to be converted, and if direct formatting was applied indiscriminately, manual work is inevitable. I guess writer2xhtml has to spend quite some code on fixing the wrong usage of OpenOffice/LibreOffice; odt2html1 instead is based upon the right usage of OpenOffice/LibreOffice.
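
The helper mentioned in the post above, one that identifies “bold” and “italic” direct formatting in an ODT, could start from a sketch like the following. It is not part of the toolset. It assumes that direct character formatting ends up as automatic text styles in the `content.xml` of the ODT archive, carrying `fo:font-style="italic"` or `fo:font-weight="bold"` on their `style:text-properties` element, and it only reports the generated style names (such as `T1`); mapping them to meaningful names remains a manual decision. The function names and the sample document are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, Set

# OpenDocument namespaces used in content.xml.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
}


def read_content_xml(odt_path: str) -> bytes:
    """An ODT file is a Zip archive; the document body lives in content.xml."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read("content.xml")


def find_direct_formatting(content_xml) -> Dict[str, Set[str]]:
    """Map automatic text style names to the direct formatting they carry."""
    root = ET.fromstring(content_xml)
    found: Dict[str, Set[str]] = {}
    for style in root.iterfind("office:automatic-styles/style:style", NS):
        if style.get(f"{{{NS['style']}}}family") != "text":
            continue  # only character-level styles are of interest here
        props = style.find("style:text-properties", NS)
        if props is None:
            continue
        kinds: Set[str] = set()
        if props.get(f"{{{NS['fo']}}}font-style") == "italic":
            kinds.add("italic")
        if props.get(f"{{{NS['fo']}}}font-weight") == "bold":
            kinds.add("bold")
        if kinds:
            found[style.get(f"{{{NS['style']}}}name")] = kinds
    return found


# Made-up miniature content.xml for demonstration purposes.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties fo:font-style="italic"/>
    </style:style>
    <style:style style:name="T2" style:family="text">
      <style:text-properties fo:font-weight="bold"/>
    </style:style>
    <style:style style:name="P1" style:family="paragraph"/>
  </office:automatic-styles>
</office:document-content>"""

if __name__ == "__main__":
    print(find_direct_formatting(SAMPLE))
```

For a real file, `find_direct_formatting(read_content_xml("book.odt"))` would be the entry point, with `book.odt` standing in for an actual document.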

05-18-2014, 12:59 PM	#44
roger64 Wizard Posts: 2,608 Karma: 3000161 Join Date: Jan 2009 Device: Kindle PW3 (wifi)	Hi

Don't be sorry for my document, it was just a trial. But to tell you the truth, I feel a little confused: I did not think I could do something wrong by doing it this way. So as to clear OpenOffice of any blame, I have attached a very basic excerpt from a written essay. We find here:

- a paragraph style (1)
- three "character" styles, i.e. italic (2), small-caps (3) and superscript (4).

I suppose the last three will be automatically processed as spans, with a class linking to their respective CSS characteristics. For example, the italic span will have to respect all the properties of the enclosing paragraph style (among them the font-size) and change the font style from normal to italic.

What am I supposed to do if I have to deal with this kind of text? What is wrong with a straight conversion from the "italic" button to the character style "italic"? Or from the "superscript" one to the character style "superscript"? Attached Thumbnails Last edited by roger64; 05-18-2014 at 01:06 PM.

05-18-2014, 04:39 PM	#45
skreutzer Software Developer Posts: 189 Karma: 89000 Join Date: Jan 2014 Location: Germany Device: PocketBook Touch Lux 3	No, you didn't do anything wrong at all ;-) Of course not, but as odt2html1 is only a small portion of a larger processing workflow, it might not lead to the results one expects from it, since its main purpose is to extract the raw text and structural information from an input ODT, while the processing back-end is supposed to take care of the visual representation in target formats. Yes, HTML is a target format, but the resulting HTML might get used as input for automated processing at any later time, and modern HTML separates structure from visual appearance as well. So the input file is expected to mark text by meaning, for which CSS classes can be used to define the actual visual appearance. This way, the visual appearance can be changed easily, or processing back-ends may react to it (building lists, filtering content, extending marked portions with additional material etc.).

Regarding your example: processing software isn't able to differentiate between all the superscript text in the entire document, which may indeed be of different types and therefore should be handled differently. Superscript might be used for footnote markers, in mathematical formulas or in measurements, and it would be impossible for software to recognize which superscript text portions are supposed to be the footnote markers if the software should generate one version with footnotes and one without.

There might be paragraphs which are part of the ordinary text and there might be paragraphs which are reminder boxes. Without semantic markup, it could be difficult to identify them, especially if other boxes are used as well. Even in common use cases, not all paragraphs will necessarily look the same, so if each of them gets its corresponding type attached, they can easily be translated to whatever target format, while at the same time reducing layout mistakes by the author/formatter.

Italic might be used for all kinds of things that need to be highlighted and which are of completely different sorts, be it emphasis, words in a foreign language or special names. Maybe some uses of italic should be changed to bold or should make up an automatically generated list. If there's no clue to determine which type those uses belong to other than their italic visual representation, which is equally true for all other types, there won't be a way to identify the actual type.

Even if none of those benefits are of relevance, semantic styles are still a way to describe the elements of a document in an abstract way. From a technical perspective, it gets much easier to translate them from one format to another, because meaning doesn't rely on visual appearance, which is only recognizable by humans, because humans can identify the context of a layout element while software can't. Additionally, layout concepts and description languages of two formats might be incompatible with each other, while style names can easily be mapped to their equivalent or at least to whatever resembles the visual representation most closely.

To some extent, OpenOffice/LibreOffice might even be used as a data structuring tool. As writing software and word processors aren't intended to do a great deal of typesetting or format conversion by themselves, semantic markup is pretty much the best way to provide a bridge to sophisticated processing systems, which are almost always based upon semantic concepts.
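
The footnote example from the post above can be made concrete with a small sketch. It is only an illustration under assumptions: the document is modelled as a list of text runs, each carrying an optional semantic style name, and the names `footnotemark` and `exponent` as well as the `render` function are hypothetical. Both styles may look identical (superscript) on screen, but because they are named by meaning, one of them can be left out when generating a version without footnotes.

```python
from html import escape
from typing import FrozenSet, Iterable, Optional, Tuple

# A run is a piece of text plus its semantic style name (None for plain text).
Run = Tuple[str, Optional[str]]


def render(runs: Iterable[Run], omit: FrozenSet[str] = frozenset()) -> str:
    """Render runs as HTML spans named by meaning, skipping omitted styles."""
    parts = []
    for text, style in runs:
        if style in omit:
            continue  # e.g. leave out all footnote markers
        if style is None:
            parts.append(escape(text))
        else:
            # The visual appearance is left to CSS rules for this class.
            parts.append(f'<span class="{style}">{escape(text)}</span>')
    return "".join(parts)


RUNS = [
    ("An area of 5 m", None),
    ("2", "exponent"),
    (" is small", None),
    ("1", "footnotemark"),
    (".", None),
]

if __name__ == "__main__":
    print(render(RUNS))
    print(render(RUNS, omit=frozenset({"footnotemark"})))
```

With a single “superscript” style for both runs, the second call could not remove the footnote marker without also removing the exponent.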
