MobileRead Forums - View Single Post - Automated Processing Workflows as and with Free Software

skreutzer · 05-12-2014, 04:54 PM

I've just uploaded the 1.7 GNU package of the #46 commit. Please note that the odt2html tool converts only the raw, plain text and the semantic structure information (style names) to HTML. This is quite a different approach from writer2xhtml, which tries to do a “visually lossless” conversion from ODT to HTML by reproducing ODT formatting and translating it to HTML formatting. My tools simply ignore all ODT formatting and apply other styles instead. OTT files (OpenOffice/LibreOffice document templates) can be used to pre-define such styles, and their actual implementation for HTML, EPUB or PDF can be changed by modifying the corresponding XSLTs (maybe later there could be helper tools for doing so).
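To illustrate the idea of keeping only text and style names, here is a minimal Python sketch. It is not the actual odt2html implementation; the function names are made up for this example, and it handles only top-level paragraphs and headings. It relies on the fact that an ODT file is a ZIP archive whose content.xml stores paragraphs as text:p and headings as text:h elements with a text:style-name attribute.

```python
import html
import xml.etree.ElementTree as ET
import zipfile

# Namespaces defined by the OpenDocument format.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def content_xml_to_html(content_xml):
    """Convert ODT content.xml to HTML, keeping only text and style names.

    Hypothetical helper for illustration; all visual formatting is ignored.
    """
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body/{{{OFFICE_NS}}}text")
    if body is None:
        return ""
    lines = []
    for element in body:
        if element.tag not in (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"):
            continue  # lists, tables etc. are out of scope for this sketch
        # The style name becomes the HTML class. Automatic styles such as
        # "P1" are not resolved to their parent style here.
        style = element.get(f"{{{TEXT_NS}}}style-name", "")
        # itertext() flattens inline spans, so direct formatting is dropped.
        text = "".join(element.itertext())
        tag = "p"
        if element.tag == f"{{{TEXT_NS}}}h":
            level = int(element.get(f"{{{TEXT_NS}}}outline-level", "1"))
            tag = f"h{min(level, 6)}"
        lines.append(
            f'<{tag} class="{html.escape(style)}">{html.escape(text)}</{tag}>'
        )
    return "\n".join(lines)


def odt_to_html(path):
    """Read content.xml from the ODT archive at `path` and convert it."""
    with zipfile.ZipFile(path) as archive:
        return content_xml_to_html(archive.read("content.xml"))
```

The resulting class names can then be styled by whatever stylesheet the target format uses, independently of how the ODT looked.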

Regarding execution performance, I guess there are some factors that slow it down and can be optimized. For instance, the first invocation of the Java VM takes a very long time if it isn't already running. Further, output to the terminal is comparatively slow, even though Unix shells are pretty fast in general; if the output is dumped to a file instead, it could probably be even faster (pdflatex in particular is very likely much faster if it doesn't have to wait for writes to the terminal buffer). Additionally, processing HTML involves the DTDs, which get loaded into memory, I guess (as DOM?); with some adjustments, this might be changed to plain XML processing, which might be faster, too. I usually don't optimize for performance in my own programming, because if a program is slow, one can still wait a little longer, and the hardware of today's ordinary user is incredibly fast. On the other hand, if an environment has tight memory limits, there's no way around them. So a slower program may still produce results where a memory-greedy one doesn't, but I guess I'm far from being affected by such considerations, because those small tools involve neither overhead of any kind nor excessive memory use. In any case, it would be interesting to test the performance with the automatic generation of, let's say, a hundred book projects all at once, once some kind of book management facility allows such runs on an entire collection of titles.
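Whether terminal output really is the bottleneck can be checked by timing a step with its output sent to a file. The following Python sketch does that; the helper name run_logged and the file names in the usage comment are assumptions for illustration, not part of the published tools.

```python
import subprocess
import time


def run_logged(command, log_path):
    """Run `command` with stdout and stderr written to `log_path`.

    Returns (exit_code, elapsed_seconds). Hypothetical helper for
    comparing a run against one that prints to the terminal.
    """
    start = time.perf_counter()
    with open(log_path, "wb") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    elapsed = time.perf_counter() - start
    return result.returncode, elapsed


# Example usage (book.tex is a placeholder file name):
#   code, seconds = run_logged(
#       ["pdflatex", "-interaction=batchmode", "book.tex"],
#       "pdflatex-output.log",
#   )
```

Running the same command once through this helper and once directly in the terminal gives a rough idea of how much time the terminal output costs on a given machine.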

After you mentioned writer2xhtml (which is part of the larger writer2latex package) to me, I looked into it a little and even built it from source, if I remember correctly (I might be wrong about that), but there are several reasons why I preferred to write a new odt2html converter tool. First, translating ODT visual appearance to HTML visual appearance would have been a huge overhead if all that actually matters for automation is the semantic markup. Second, writer2xhtml is largely based upon OpenOffice/LibreOffice code, since it is intended as a plugin for it, so dependencies on the OpenOffice/LibreOffice code base would have had to be maintained. Third, writer2xhtml is written as one large monolithic program, so it is hard to adjust for custom needs, requires a certain level of programming skill to maintain and can't be used for other, similar tasks. My approach is more or less a primitive one compared to writer2xhtml, because every step from the source file to the target format is done by a small tool which does nothing but one single, defined and limited job. But the individual tools in this chain can be combined in all kinds of ways, are easily adjustable and are reusable for other source/target formats. At least to the extent of the currently implemented automated workflow ;-)
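The chain-of-small-tools idea can be sketched in a few lines of Python. The step functions below are invented for this illustration and are much simpler than the real tools; the point is only that each step does one job and that the same steps can be recombined into different chains.

```python
import html


def normalize_whitespace(text):
    """Collapse runs of whitespace and drop empty lines."""
    return "\n".join(
        " ".join(line.split()) for line in text.splitlines() if line.strip()
    )


def paragraphs_to_html(text):
    """Wrap every line in a <p> element, escaping special characters."""
    return "\n".join(f"<p>{html.escape(line)}</p>" for line in text.splitlines())


def wrap_html_document(body):
    """Put an HTML body fragment into a minimal document skeleton."""
    return f"<html><body>\n{body}\n</body></html>"


def run_chain(steps, data):
    """Feed `data` through each step in order; each step does one job."""
    for step in steps:
        data = step(data)
    return data


# The same small steps are reused in different chains (hypothetical names).
HTML_CHAIN = [normalize_whitespace, paragraphs_to_html, wrap_html_document]
TEXT_CHAIN = [normalize_whitespace]
```

Replacing or adding a target format then means swapping or appending a step instead of modifying one large program.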

However, I would still like to see the work put into writer2xhtml be of further use, but it might require a huge commitment to investigate the current code and change it in a way that makes it less dependent and more flexible. Anyway, writer2xhtml could be integrated into such automated processing workflows; I even experimented with the writer2xhtml output initially. But writer2xhtml in itself doesn't change much in terms of the original problem, which is the lack of semantic markup in the source document, so it would still amount to “garbage in, garbage out”, and all direct formatting would still need to be removed from the writer2xhtml output so that only the raw text and the semantic, structural information remains.

Post #38. Last edited by skreutzer; 05-12-2014 at 05:22 PM.