MobileRead Forums - View Single Post - Automated Processing Workflows as and with Free Software

skreutzer · 05-18-2014, 04:39 PM

No, you didn't do anything wrong at all ;-) But odt2html1 is only a small part of a larger processing workflow, so on its own it might not produce the results one expects. Its main purpose is to extract the raw text and structural information from an input ODT, while the processing back-end is supposed to take care of the visual representation in the target formats. Yes, HTML is a target format, but the resulting HTML might be used as input for automated processing at any later time, and modern HTML separates structure from visual appearance as well. So the input file is expected to mark text by meaning, and CSS classes can then be used to define the actual visual appearance. This way, the appearance can be changed easily, or processing back-ends can react to the markup (building lists, filtering, extending marked portions with additional material etc.).
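To illustrate the "building lists" case, here is a minimal Python sketch. It assumes the HTML is well-formed XHTML and that special names were marked with a class called "ship-name"; both the class name and the sample text are made up for this example and are not what odt2html1 necessarily emits.

```python
import xml.etree.ElementTree as ET

# Made-up sample input; the class name "ship-name" is hypothetical.
DOC = (
    "<div>"
    '<p class="paragraph">The <span class="ship-name">Nautilus</span> left port.</p>'
    '<p class="paragraph">Later the <span class="ship-name">Abraham Lincoln</span> '
    'sighted the <span class="ship-name">Nautilus</span>.</p>'
    "</div>"
)


def collect_marked(xhtml, class_name):
    """Return the sorted, de-duplicated texts of all elements with the given class."""
    root = ET.fromstring(xhtml)
    found = {
        "".join(element.itertext()).strip()
        for element in root.iter()
        if element.get("class") == class_name
    }
    return sorted(found)


if __name__ == "__main__":
    for name in collect_marked(DOC, "ship-name"):
        print(name)
```

The back-end never looks at how the names are displayed. It only needs the class, so the stylesheet can switch them from italic to bold without breaking the list.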

Regarding your example: processing software can't tell the different kinds of superscript text in a document apart, although they may well be of different types and should therefore be handled differently. Superscript might be used for footnote markers, in mathematical formulas or in measurements. If the software is supposed to generate one version with footnotes and one without, it can't recognize which superscript portions are the footnote markers. There might be paragraphs which are part of the ordinary text and paragraphs which are reminder boxes. Without semantic markup, it could be difficult to identify them, especially if other boxes are used as well. Even in common use cases, not all paragraphs will necessarily look the same, so if each of them gets its corresponding type attached, they can easily be translated to whatever target format is needed, while at the same time reducing layout mistakes by the author/formatter. Italic might be used for all kinds of highlighting of completely different sorts, be it emphasis, words in a foreign language or special names. Maybe some uses of italic should be changed to bold or should make up an automatically generated list. If the only clue to their type is the italic appearance, which all the other types share as well, there is no way to identify the actual type.
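The footnote case could look roughly like this in Python. This is only a sketch under assumptions: the input is well-formed XHTML, and the class names "footnote-marker", "unit-exponent" and "math-exponent" are hypothetical ones chosen for the example. All three are rendered as superscript, but only one class is removed.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the three class names are hypothetical.
SAMPLE = (
    '<p class="paragraph">The room measures 20 m'
    '<sup class="unit-exponent">2</sup>'
    '<sup class="footnote-marker">1</sup>, and E = mc'
    '<sup class="math-exponent">2</sup> holds.</p>'
)


def strip_by_class(xhtml, class_name):
    """Return the fragment with every element of the given class removed."""
    root = ET.fromstring(xhtml)
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.get("class") == class_name:
                # Keep the text that follows the removed element by
                # attaching it to the preceding sibling or to the parent.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_by_class(SAMPLE, "footnote-marker"))
```

If all three were plain superscript without a class, the function would have nothing to distinguish them by, which is exactly the problem described above.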

Even if none of those benefits is relevant, semantic styles are still a way to describe the elements of a document abstractly. From a technical perspective, this makes it much easier to translate them from one format to another, because meaning no longer depends on visual appearance. Appearance alone is only meaningful to humans, who can identify the context of a layout element, while software can't. Additionally, the layout concepts and description languages of two formats might be incompatible with each other, while style names can easily be mapped to their equivalent, or at least to whatever most closely resembles the intended visual representation. To some extent, OpenOffice/LibreOffice might even be used as a data structuring tool. Since writing software and word processors aren't intended to do a great deal of typesetting or format conversion by themselves, semantic markup is pretty much the best way to provide a bridge to sophisticated processing systems, which are almost always based on semantic concepts.
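The mapping of style names could be sketched like this in Python. The style names and the chosen HTML and LaTeX equivalents are assumptions made up for this example, and no escaping of special characters is done, so treat it as an illustration of the idea rather than a converter.

```python
# Hypothetical style names mapped to an opening and closing string per target.
STYLE_MAP = {
    "emphasis": {
        "html": ("<em>", "</em>"),
        "latex": ("\\emph{", "}"),
    },
    "foreign-word": {
        "html": ('<i class="foreign-word">', "</i>"),
        "latex": ("\\textit{", "}"),
    },
    "special-name": {
        "html": ('<span class="special-name">', "</span>"),
        "latex": ("\\textsc{", "}"),
    },
}


def render(style, text, target):
    """Wrap text in the target format's equivalent of the given style.

    Unknown styles or targets fall back to the plain text.
    """
    opening, closing = STYLE_MAP.get(style, {}).get(target, ("", ""))
    return opening + text + closing


if __name__ == "__main__":
    for target in ("html", "latex"):
        print(render("emphasis", "really", target))
        print(render("special-name", "Nautilus", target))
```

Adding another target format means adding one more entry per style, while the source document stays untouched.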

#45 · skreutzer · Software Developer · Posts: 190 · Karma: 89000 · Join Date: Jan 2014 · Location: Germany · Device: PocketBook Touch Lux 3