MobileRead Forums - View Single Post - What "Cleaning Up" Do Project Gutenberg Texts Need [closed]

ebookie · 11-04-2007, 09:21 PM

I'm hesitant to join this discussion, but since I've been thinking about some of these issues as well, I figured I would put in a couple of comments.

First, the difference between semantics and presentation. HTML (an application of SGML, defined by a DTD) mixed the two, on the notion that you were presenting documents in a browser window of variable size. It has some notion of semantics (H1 is a top-level heading) and some notion of presentation (B is boldface), with no clear line between them. If the Project Gutenberg (PG) texts could be converted into something that identified just the semantics of the text, then one could build formatters/presenters to "present" it on an electronic book.

Bowerbird's attempts are notable in that they try to embed semantics into a file as transparently as possible (a good goal if you might find yourself reading the file directly), but that same feature makes it pretty challenging to screen automatically for errors. For example, if a bit flip causes the letter 'M' (one bit different from <CR> in ASCII) to appear in one of the 5 lines between headers, what does the tool do? Does that screw up the presentation?
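The bit-flip case can be checked with a few lines of Python. This is only a sketch: the sample text and the choice of which byte gets corrupted are made up for illustration, and it says nothing about how any particular tool would react.

```python
# ASCII codes for carriage return and the letter 'M'.
CR = 0x0D          # 0000 1101
M = ord("M")       # 0100 1101

diff = CR ^ M                          # XOR leaves a 1 wherever the bits differ
bits_changed = bin(diff).count("1")
print(f"CR = {CR:08b}, 'M' = {M:08b}, differing bits = {bits_changed}")

# Flip that single bit in the <CR> of a blank line (hypothetical sample text).
text = b"Chapter One\r\n\r\nIt was a dark night."
corrupted = bytearray(text)
pos = text.index(b"\r\n\r\n") + 2      # position of the second <CR>
corrupted[pos] ^= 0x40                 # the one bit that separates <CR> from 'M'
print(bytes(corrupted))
```

The blank line that was carrying structural meaning has become a line containing "M", and nothing in the file says whether that is damage or text.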

Now there is a standard way to solve this issue: use the thing that sits between SGML (very complicated) and HTML (very confused), namely XML. Not XHTML, just XML. If the semantics of the book are automatically added into the PG text as XML tag pairs, then three benefits result:

1) An XML schema checker can verify that the semantic markup is valid.
2) An XSLT style sheet can easily, and on the fly, convert the book to ASCII, PostScript, HTML, etc.
3) New style sheets can leverage existing annotated books to support new formats.
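To make the three points concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tag names (book, title, chapter, heading, para) are invented, the check() function is a toy stand-in for a real schema validator, and the two converter functions stand in for XSLT style sheets, since Python's standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical semantic markup; the tag names are invented for this sketch.
BOOK = """<book>
  <title>A Sample Book</title>
  <chapter>
    <heading>Chapter One</heading>
    <para>It was a dark night.</para>
    <para>Nothing stirred &amp; no one spoke.</para>
  </chapter>
</book>"""

# Which child elements each element may contain (toy stand-in for a schema).
ALLOWED_CHILDREN = {
    "book": {"title", "chapter"},
    "chapter": {"heading", "para"},
    "title": set(),
    "heading": set(),
    "para": set(),
}

def check(elem):
    """Return a list of structural problems found under elem."""
    if elem.tag not in ALLOWED_CHILDREN:
        return [f"unknown element <{elem.tag}>"]
    problems = []
    for child in elem:
        if child.tag not in ALLOWED_CHILDREN[elem.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{elem.tag}>")
        problems.extend(check(child))
    return problems

def to_text(book):
    """One 'presenter': plain text with underlined headings."""
    lines = [book.findtext("title", "").upper(), ""]
    for chapter in book.iter("chapter"):
        for node in chapter:
            text = node.text or ""
            if node.tag == "heading":
                lines += [text, "-" * len(text), ""]
            elif node.tag == "para":
                lines += [text, ""]
    return "\n".join(lines)

def to_html(book):
    """Another 'presenter': the same semantics mapped onto HTML tags."""
    out = [f"<h1>{escape(book.findtext('title', ''))}</h1>"]
    for chapter in book.iter("chapter"):
        for node in chapter:
            tag = {"heading": "h2", "para": "p"}.get(node.tag)
            if tag:
                out.append(f"<{tag}>{escape(node.text or '')}</{tag}>")
    return "\n".join(out)

book = ET.fromstring(BOOK)      # raises ParseError if not well-formed
print(check(book))
print(to_text(book))
print(to_html(book))
```

Supporting a new output format means writing one more function like to_html(); the annotated book itself does not change.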

Given the existing support for parsing and processing XML, it would be straightforward (although perhaps not easy) to create a copy editing tool that sucks in a book, adds its best guess at what the semantics are (and there is great work to leverage from the ZML work here), and then generates an annotated result. One might hope that all copy editors/proofreaders can agree that something "is a heading" without having to agree on how headings should be presented, or treated in the book presentation.
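As a rough idea of what the "best guess" step might look like, here is a sketch in Python. The heuristic (a short, single-line block that is all caps or starts with CHAPTER, BOOK, or PART is a heading) and the tag names are my own assumptions, not anything taken from ZML or from PG's conventions; a real tool would need far better rules and a human to confirm the guesses.

```python
import re
from xml.sax.saxutils import escape

# Assumed heuristic: headings start with a structural word or are all caps.
HEADING = re.compile(r"^(CHAPTER|BOOK|PART)\b.*$|^[A-Z0-9 .,'-]+$")

def annotate(plain_text):
    """Wrap each blank-line-separated block in a guessed semantic tag."""
    blocks = re.split(r"\n\s*\n", plain_text.strip())
    out = ["<book>"]
    for block in blocks:
        text = " ".join(block.split())          # undo hard line wrapping
        is_heading = (
            "\n" not in block.strip()           # headings sit on a single line
            and len(text) < 60
            and HEADING.match(text)
        )
        tag = "heading" if is_heading else "para"
        out.append(f"  <{tag}>{escape(text)}</{tag}>")
    out.append("</book>")
    return "\n".join(out)

sample = """CHAPTER I

It was a dark night, and
nothing stirred.

"WHO GOES THERE?" cried the watchman.
"""
print(annotate(sample))
```

The proofreader's job then becomes confirming or correcting the heading/para guesses, which is a question about what the text is, not about how it should look.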

--Chuck

11-04-2007, 09:21 PM	#65
ebookie Entrepreneur Posts: 36 Karma: 10 Join Date: Oct 2007 Location: California Device: Iliad v2