MobileRead Forums - View Single Post - Sigil as front end for automated XML based processing workflows?

skreutzer · 01-27-2014, 05:15 PM

I just successfully built the writer2latex package from source (latest version of the public repository) on gNewSense 3.0 with OpenJDK 1.6 for OpenOffice 3.2.1 and did a first conversion from ODT to valid XHTML. The result looks slightly cleaner than what I got from the manual adjustments in my demo video (I used the same input ODT file). So writer2latex didn't save me much work there, but I guess I can now benefit from all of the command line options, so that the conversion results can feed directly into an automated processing workflow. Next I will also try ODT to EPUB and LaTeX conversion. For the operating system environment described above, I can probably provide some support, including small fixes, if needed.

However, the most interesting question is how dependent writer2latex is on the OpenOffice libraries and API. Hopefully not much, so that it could even be modified to accept input formats other than ODT, such as XHTML or custom XML. On the output side, the writer2latex package tries to reproduce the visual appearance of the input ODT file as closely as possible, which isn't quite the best concept for automated processing, since for several output formats there may be no sensible way to represent the OpenOffice WYSIWYG appearance even approximately. In addition, a flexible workflow needs to be able to replace portions of the data with custom content or styles.

It might be too early to estimate how useful the package is and if/how it could be changed, but it already works out of the box, and I might use it to build a fully automated workflow as a first brief demonstration of a real-world application.

In case such a solution is interesting for some of you, I guess it would be better to fork this conversation for writer2latex-specific issues and updates on my experiments with it.

Please note that I'll refer to the standalone tools of the writer2latex package rather than the OpenOffice extension, since clicking through things manually isn't a solution for bulk processing, and there is no need to start OpenOffice once an ODT file is available.
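A minimal sketch of what such a bulk run could look like, assuming the standalone converter can be called once per file from the command line. The function names and the command template with its `{input}` and `{output}` placeholders are made up for illustration; the actual converter invocation has to be taken from the writer2latex user manual.

```python
import subprocess
from pathlib import Path


def build_commands(input_dir, output_dir, command_template):
    """Return one argument list per .odt file found in input_dir.

    command_template is a list of strings in which "{input}" and "{output}"
    are placeholders. It is supplied by the caller, because the exact
    converter invocation is an assumption here, not taken from the manual.
    """
    commands = []
    for odt in sorted(Path(input_dir).glob("*.odt")):
        target = Path(output_dir) / (odt.stem + ".html")
        commands.append([
            part.replace("{input}", str(odt)).replace("{output}", str(target))
            for part in command_template
        ])
    return commands


def convert_all(input_dir, output_dir, command_template):
    """Run the converter once per file and collect the invocations that failed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for command in build_commands(input_dir, output_dir, command_template):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append((command, result.stderr.strip()))
    return failed
```

A call might look like `convert_all("manuscripts", "xhtml", ["java", "-jar", "writer2latex.jar", "-xhtml", "{input}", "{output}"])`, where the jar path and the option are placeholders to be checked against the manual. Keeping the command as a parameter means the same loop would serve for the EPUB and LaTeX targets as well.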

Update: Evidently, the writer2latex package makes extensive use of configuration files, which is ideal for automated processing. The user manual describes almost all the features I would want for the task, so my own solution would have looked quite similar (though probably without the goal of preserving the OpenOffice WYSIWYG appearance as closely as possible), and by integrating writer2latex, the development time invested from 2002 to 2012 might be saved. There's still the issue of the dependency on OpenOffice API libraries, which might be acceptable, or could be solved by replacing those dependencies with ordinary reading of the ODT XML.

The documentation doesn't mention EPUB output, but the XHTML conversion allows the removal of direct formatting, style name matching and the insertion of custom stylesheet references. Even if the EPUB conversion doesn't support such features, the XHTML output itself would be sufficient for integration into a good EPUB, or the EPUB converter might be extended to provide the same features.

If the dependency on OpenOffice could be removed from the input side of the standalone tool, and (for instance) XHTML or plain text added as input formats (or an ODT generated from XHTML, plain text or RTF as the first step of the processing workflow), the package would be an invaluable part of automated XML document processing. Missing output formats and customization features could be added, predesigned configuration files provided, and GUI tools developed to enable ordinary authors to edit the configuration files. Future efforts might aim to limit what the writer2latex package tries to do itself, and to combine it with specialized transformations that take care of complex processing steps in a more readable and customizable way. I'll continue to investigate.
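On the idea of replacing the OpenOffice API dependency with ordinary ODT XML reading: an ODT file is a ZIP archive whose document body is stored in `content.xml`, so a first look at the content needs nothing beyond a ZIP reader and an XML parser. The following is only a rough sketch of that idea, not how writer2latex reads its input; the function name `read_odt_paragraphs` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI of the OpenDocument text vocabulary (text:p, text:h, ...).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_odt_paragraphs(odt_path):
    """Return (style name, text) pairs for headings and paragraphs in an ODT.

    Simplifications of this sketch: paragraphs nested inside other
    paragraphs (footnotes, text frames) are reported twice, and
    text:s / text:tab / text:line-break are not expanded to whitespace.
    """
    with zipfile.ZipFile(odt_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    wanted = (f"{{{TEXT_NS}}}h", f"{{{TEXT_NS}}}p")
    paragraphs = []
    for element in root.iter():  # document order
        if element.tag in wanted:
            style = element.get(f"{{{TEXT_NS}}}style-name", "")
            text = "".join(element.itertext())
            paragraphs.append((style, text))
    return paragraphs
```

The style names returned here are the hooks that style name matching would work on, which is why direct formatting in the source ODT makes this kind of processing harder.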

Last edited by skreutzer; 01-27-2014 at 06:32 PM.