MobileRead Forums - View Single Post - Automated Processing Workflows as and with Free Software

skreutzer · 05-14-2014, 07:06 PM

It seems this GitHub repository is an attempt to continue development of writer2latex. I just sent another e-mail to Henrik Just, hopefully getting a response this time, because at least the SourceForge repository would benefit from a status update on the current situation.

Yes, the “Ignore hard formatting” option of writer2xhtml probably retains only the raw text and the structural markup (with the corresponding style names attached to it). That is quite advantageous for automatic processing based on the concept of semantic markup, and it is most likely exactly what my odt2html does.

For your 300k ODT input file, what exactly happened? How big was the output.html file? Could you not open it in an editor or browser, or did odt2html1 not quit (or just disappear after execution, if you clicked on run.sh)? In any case, it sounds like you invoked $/odt2html/odt2html1 yourself on the terminal or via the $/odt2html/odt2html1/run.sh helper as a standalone tool instead of $/workflows/odt2epub1.sh. That makes sense, since your document isn't based upon the style names defined in $/odt2html/templates/template1/template.ott, and therefore $/workflows/odt2epub1.sh wouldn't know how to automatically transform the flat structure of the ODT into a hierarchical structure, split the input into several smaller HTML files and package them as EPUB by itself. If run.sh was used, you could also look into the log.txt file, which gets written to the same directory with each new execution.

The names of the style classes correspond to the internal identifiers of the ODT. Since characters like space aren't in common use for identifiers, special characters are represented by their ASCII code in hex, enclosed in underscores, so it's quite easy to match the display names to their internal identifiers. One could translate them back to their original display names automatically, except for cases where a space is involved, because space is used as the separator between individual CSS classes in an HTML class attribute. If this textual description isn't clear enough, I can show you an actual example of it.
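To make the naming scheme a bit more concrete, here is a minimal Python sketch of the translation in both directions. It is not part of odt2html1; the function names are made up for illustration, and it assumes that every character which is not alphanumeric is written as its hex code between two underscores (as in "Text_20_body"), which may not cover everything OpenOffice/LibreOffice actually does.

```python
import re

# Assumption: an escaped character looks like "_20_" (hex code between underscores).
_ESCAPE = re.compile(r"_([0-9A-Fa-f]{2,4})_")


def decode_style_name(internal: str) -> str:
    """Turn an internal identifier such as 'Text_20_body' into its display name."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), internal)


def encode_style_name(display: str) -> str:
    """Turn a display name such as 'Text body' into an identifier-safe form."""
    return "".join(
        ch if ch.isalnum() else f"_{ord(ch):x}_"
        for ch in display
    )


if __name__ == "__main__":
    for name in ("Heading", "Text_20_body", "Ital_20_droite"):
        print(name, "->", decode_style_name(name))
```

The decoded form is only useful for display or for matching against the style list in the editor. As soon as a space is involved it can't be written back into a class attribute, because there it would be read as two separate classes.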

If you use odt2html1 standalone, it's indeed quite primitive: it does nothing other than convert the raw text and structural information from ODT to HTML, and no other transformation is performed at all. Therefore additional tools like html_flat2hierarchical1 or html2epub1_html_chapter.xsl are needed in order to get a more usable result. The reason that titles get treated as plain paragraphs (p class="Heading") is how ODT treats them: there's no actual info about the order of headings within the document body itself (maybe in the style definition, I haven't looked into it yet). I wonder about the mention of h2 tags, they're probably technically not in the ODT itself, are they? Or is it the use of a "chapter" style? In any case, the processing back-end needs to make sense of styles like "Heading", "Text_20_body" or "Ital_20_droite" and translate them into something meaningful. For the styles you've used in your own document, such replacements for EPUB generation would be similar to what prepare4hierarchical.xsl, html2epub1_html_part.xsl and html2epub1_html_chapter.xsl do for template.ott in the $/odt2html/templates/template1 directory, as long as there's neither a tool nor a GUI yet to do style matching between front-end and back-end. I think i, sup and br get ignored completely, and the remaining common span without any attribute is left over from what OpenOffice/LibreOffice puts as “extra” into the ODT, if I remember correctly. Additionally, if OpenOffice/LibreOffice wasn't just used to apply formatting to already existing raw text but to write the text in the first place, one can observe an incredible fragmentation of spans all over the place. I don't know the reason for that yet, and I don't tidy it up yet in order to get clean output.
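As a rough illustration of what "translating styles to something meaningful" means, here is a small Python sketch. It is only a stand-in for the idea; the real workflow does this with the XSLT stylesheets mentioned above, and both the mapping table and the function name are assumptions for the sake of the example, not what those stylesheets contain.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from style class to output element.
# A real document would need one entry per style it actually uses.
STYLE_TO_ELEMENT = {
    "Heading": "h1",
    "Text_20_body": "p",
}


def apply_style_mapping(root, mapping=STYLE_TO_ELEMENT):
    """Rename every element whose class attribute is listed in the mapping."""
    for element in root.iter():
        new_tag = mapping.get(element.get("class", ""))
        if new_tag is not None:
            element.tag = new_tag
    return root


# Flat output in the style of the standalone converter: everything is a paragraph.
flat = ET.fromstring(
    "<body>"
    '<p class="Heading">Chapter One</p>'
    '<p class="Text_20_body">Some text.</p>'
    "</body>"
)

apply_style_mapping(flat)

if __name__ == "__main__":
    print(ET.tostring(flat, encoding="unicode"))
```

The point is that the mapping table is the only part that depends on the document. With a different template you'd swap the table, not the code, which is also what a style matching tool or GUI would have to produce.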

i, sup and br in almost any case imply visual markup for italic, superscript and line break, regardless of whether they were (are they?) in the actual ODT, in HTML or in the EPUB (if allowed at all). Basically, for automated processing based upon semantic markup, you're not supposed to mark something as “italic” or “superscript” within your text. You should rather define styles like “emphasis” and “footnotemark” and then define later how they should be represented visually (OpenOffice/LibreOffice still allows WYSIWYG text editing, even if you later get something that might look a little different, since the back-end applies other layouts to the input that was fed to the system). Even though you're not supposed to click the “italic” button, I haven't investigated yet which markup goes into the ODT file when you do, and I'm pretty confident that even a hard i, sup and br could be translated to a style name, as style names would improve the quality of the output file. But again, if that button gets clicked and the goal is quality output that can be used for automated processing, “italic” (even when it is used consistently) isn't associated with any particular meaning and also isn't distinguishable from other italic text of a completely different sort. Both uses would only share the same visual appearance, which of course isn't machine-readable and therefore can't be recognized as separate by an automated processing system.
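A tiny Python sketch of that separation, just to show the idea: the text only carries the style names, and the visual side lives in a separate table that the back-end can change at will. The table contents and the function name are assumptions for illustration and don't come from any of the tools discussed here.

```python
# Hypothetical presentation table: semantic style name -> CSS declarations.
PRESENTATION = {
    "emphasis": {"font-style": "italic"},
    "footnotemark": {"vertical-align": "super", "font-size": "smaller"},
}


def to_css(rules):
    """Render one CSS rule per semantic style class."""
    blocks = []
    for class_name, declarations in rules.items():
        body = " ".join(f"{prop}: {value};" for prop, value in declarations.items())
        blocks.append(f".{class_name} {{ {body} }}")
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_css(PRESENTATION))
```

If “emphasis” should later be bold instead of italic, only the table changes. With a hard italic in the text there would be nothing to tell emphasis apart from, say, a book title that happens to be italic too.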

Hopefully I haven't written too much text now, but maybe we could experiment a little with use case examples and learn about the issues which aren't solved yet. In any case, thank you very much for your feedback. I'll try odt2html1 with larger documents within the next few days and look at the details a little more deeply, because up to the current version I was more concerned with the whole odt2epub workflow, in which odt2html is just the first of a set of steps, so more careful investigation is needed without doubt ;-)
