MobileRead Forums - View Single Post - What "Cleaning Up" Do Project Gutenberg Texts Need [closed]

GregS · 11-09-2007, 07:25 AM

We may have a misunderstanding here. As I said, I have only skimmed the thread. I know nothing whatsoever of the system you are suggesting and have no opinion on it; I won't express one until I am acquainted with ZML. I have had a goodish look at epub, which is why I mentioned it.

Quote:

Originally Posted by bowerbird

greg said:
> I would propose that the Gutenberg problem
> does not lie in marking up for ebooks, but
> rather a markup that allows easy translation
> to things like epub (a very good move).

well, gee, that would be _nice_.

but the problem is that, in order to get .epub,
you _do_ have to do markup. quite a lot of it.
epub is xhtml/css underneath. (and not far...)

so yeah, it would certainly be lovely if we could
jump directly to epub without any markup, but
it's not really possible.

(you could also do what hadrien does at feedbooks,
which is to put the book into a structured database,
and _then_ churn out the .epub. he's doing markup,
he's just doing it another kind of way, via database.)

The reason I am suggesting this particular approach to large literature repositories has nothing to do with epublishing or readers per se, though they have a natural place in any such digital library.

It is all about the text; how it is used is all about translation. The primary thing is that the text be properly structured for the widest possible uses now and in the future.

Quote:

Originally Posted by bowerbird

> It is matter of finding a light markup that can be
> transformed coherently and consistently into
> heavy markup, they may include voice markup,
> reference markup, and complete structural markup,
> that is potentially well beyond what
> any present reader can handle.

so now you want light-markup as a middleman.
at first glance, that's an appealing position too...
you avoid the high costs of doing much markup,
but get the benefits that heavy markup "promises".

and once again, it would be nice if you could get it.

but you can't...

well -- to be completely frank -- you kind of can...

my routines can turn a light-markup file _into_
a heavy-markup file, and do a fairly good job...

but let me tell you why i think that's a dead-end.

consider the whole set of routines that will successfully
convert the light-markup file into a heavy-markup file,
which is then input to another app (call it "program p")
for "purposes of presentation" (whatever form it takes).

_instead_, put that set of routines right in "program p",
so it inputs the light-markup file, does the conversion
_itself_, and then goes on to act on the converted data.

that's better, isn't it? you didn't have to convert it yourself,
because "program p" did it for you. you avoided the mess
of the intermediate file. (because, really, were you going to
keep both the light-markup _and_ the heavy-markup files?
because that's just a bunch of unnecessary file-overhead.)

Again, I would have to look carefully at ZML before offering any kind of opinion on what you say. I don't understand how light markup can be converted into heavy, as I see most markup (especially heavy) to be a human interpretation of the text, helped by programs but not within their ken to create accurately.

I don't understand the intermediate file thing. What I was suggesting was a standard, very light (ultralight) use of TEI, because in terms of Literature it is the most developed markup in existence. As new editions are made from the source text, more markup for different purposes is added to it.

A fully marked-up TEI text has a huge number of element tags in proportion to the text; add Voice Synthesis tagging (TEI P5) and the thing is almost all tags.

There is no electronic efficiency in this; it is most inefficient. 99% of the element tags and attributes are not needed for any particular use. For ebook reading (as against the concept of a fully marked-up TEI text), only a tiny proportion of the markup is of any use at all.

The virtue is that they are just tags and can be easily filtered for particular purposes. What is more, the text becomes a multi-use resource, which is my point (databasing cannot do this).
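To make the filtering idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the sample sentence, the function name filter_tags and the choice of which tags to keep are all made up, and the sample omits the XML namespace that real TEI files carry. Tags outside the chosen subset are dropped (along with their attributes), but the words inside them stay.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def filter_tags(elem, keep):
    """Serialize elem, keeping only tags named in `keep`.

    Tags not in `keep` are removed but their text content is preserved.
    Attributes are dropped in this sketch.
    """
    inner = escape(elem.text or "")
    for child in elem:
        inner += filter_tags(child, keep)
        inner += escape(child.tail or "")
    if elem.tag in keep:
        return f"<{elem.tag}>{inner}</{elem.tag}>"
    return inner

# Hypothetical, namespace-free sample with more markup than a reader needs.
SAMPLE = (
    '<p>She said <emph>no</emph> to '
    '<persName ref="#x">Mr <surname>Darcy</surname></persName>.</p>'
)

# Keep only what a simple reading device might use.
reading_view = filter_tags(ET.fromstring(SAMPLE), {"p", "emph"})
print(reading_view)
```

The same source could be filtered with a different `keep` set for another purpose, which is the multi-use point.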

However, with TEI it is possible to reduce it to just a handful of tags, just enough in fact to translate into something as simple as epub, or for that matter ZML. Moreover, translating into PDF for printing etc., while not trivial at this stage, poses no insurmountable problem.

The idea is not applicable to people selling ebooks, or making them. However, Gutenberg is much more than this; potentially it and others like it are a new Alexandrian Library. And that requires a scholarly approach to how to keep the texts in their most useful form for all sorts of predictable and unpredictable uses. If we are talking of marked-up text for a purpose like that, TEI is it (the only system developed enough to thoroughly mark up literature, from manuscripts, scientific articles and novels to plays, corpus collections, dictionaries etc.).

I am not however talking about some mammoth operation to apply TEI, just the marking out of a simple cut down version compatible with building more into the markup as time goes by.

Making no reference to ZML, but to epub, with which I have a slight acquaintance: it could well be the model of what such a cut-down version should be. A stage one markup could well be nearly a one-to-one conversion, simply changing the element and attribute names and filtering out anything else.
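A rough sketch of what such a near one-to-one conversion might look like, in Python. The mapping tables here are my own guesses for illustration, not an agreed TEI-to-epub mapping, and the sample is a made-up, namespace-free fragment. Mapped elements and attributes are renamed; everything else is filtered out, keeping only the text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumed mappings, for illustration only.
TAG_MAP = {"div": "div", "head": "h2", "p": "p", "emph": "em"}
ATTR_MAP = {"rend": "class"}

def to_xhtml(elem):
    """Rename mapped elements/attributes; drop unmapped tags but keep their text."""
    inner = escape(elem.text or "")
    for child in elem:
        inner += to_xhtml(child) + escape(child.tail or "")
    new_tag = TAG_MAP.get(elem.tag)
    if new_tag is None:
        return inner  # unmapped element: filter the tag out
    attrs = "".join(
        f" {ATTR_MAP[name]}={quoteattr(value)}"
        for name, value in elem.attrib.items()
        if name in ATTR_MAP
    )
    return f"<{new_tag}{attrs}>{inner}</{new_tag}>"

# Hypothetical source fragment.
SAMPLE = (
    '<div><head>Chapter 1</head>'
    '<p rend="indent">It was <emph>very</emph> '
    '<date when="1813">late</date>.</p></div>'
)

xhtml = to_xhtml(ET.fromstring(SAMPLE))
print(xhtml)
```

A later "stage two" would only need more entries in the two tables; the stored text itself would not change.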

I am saying this approach is the most suitable for storing literature as a long term cultural asset. Not that it helps in any way eink readers or anything similar (catering for their use, a thousand times yes, but not designing the text markup for simply reading them on current devices).

Quote:

Originally Posted by bowerbird

> I would suggest, that TEI (text Encoding Initiative)
> is the only candidate.

oh sheesh! you want to jump _directly_
to the heaviest of the heavy, don't you? :+)

good luck with that. that's been the plan of
the technoid faction over at p.g. for... well,
going on 6 years now... going _nowhere_...

-bowerbird

I am no technoid, but an academically inclined teacher, eager to have good Literature made available in a flexible and future-proofed form. At least at the start of the thread this seemed to be the main concern: how to adapt Gutenberg's resources for this second digital revolution.

I am the first to agree that coding fully in TEI is a nightmare, that the editors for this are nowhere near developed enough, and that having non-scholars apply this form of markup is a recipe for disaster as it stands.

However, TEI P5, which is nearly fully developed, is a huge improvement on, and far more extensive than, anything before it, and it solves the dire problem of multiple different and incompatible markup schemes being applied to the same text in the same file.

There are ways it can be used in a very limited, cut-down, easy-to-apply fashion; in fact this could be done quickly by just a handful of people familiar with TEI and the needs of such things as ebook reading. There are also ways of using markup externally to the text file (a structural markup stylesheet).

The real virtue of TEI is its thoroughly developed element structure, and that it has been designed to cope with the most diverse textual material to a scholarly level.

I think we might therefore be at cross purposes. I will have a look at ZML when I get the chance; it is possible it might just be the thing. But I find it incredibly hard to imagine a simple solution to such a complex problem as marking up literature, storing it, and developing its structural analysis and use by applying more and more markup over time.

I have a little experience in dealing with quality voice synthesis, which in the near future may well be put into handheld readers. I can say with some authority that TEI's markup solution is superior to any other approach I have come across (on a number of grounds).

Voice direction is not as compatible with textual structure as one might assume. Speakers sometimes speak together (two different structures combined), voices may blend behind, and sometimes a SFX may play behind any number of speakers or dialogues. What I am saying is that the nested nature of markup is necessary, yet adding voice markup can violate that nesting. TEI has a compatible solution; I know of no other system that can mix the two.
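To show why nesting breaks down, here is a small Python sketch. It does not reproduce TEI's own mechanism; it assumes a generic "stand-off" arrangement in which each layer of markup is kept outside the text as character ranges. The text, the Span type and the labels are all invented for the example. A sound effect that starts in one sentence and ends in the next crosses the sentence boundaries, so it could never be written as one properly nested tag, yet as an external range it sits alongside the structure without trouble.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    layer: str   # e.g. "structure" or "voice"
    label: str

def crosses(a, b):
    """True if a and b overlap but neither contains the other (cannot be nested)."""
    overlap = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)

TEXT = "Stay here. I will go."

SENTENCE_1 = Span(0, 10, "structure", "sentence 1")
SENTENCE_2 = Span(11, 21, "structure", "sentence 2")
THUNDER = Span(5, 15, "voice", "thunder SFX")  # runs across the sentence break
SPANS = [SENTENCE_1, SENTENCE_2, THUNDER]

def labels_at(pos):
    """All annotations, from any layer, that cover character position pos."""
    return [s.label for s in SPANS if s.start <= pos < s.end]

print(TEXT[THUNDER.start:THUNDER.end])  # the stretch of text under the SFX
print(crosses(SENTENCE_1, THUNDER))     # structure and voice do not nest
print(labels_at(7))                     # both layers are still queryable
```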

Besides which, markup has to cope with a huge potential variety of text sources: ancient epigraphs, parallel commentary, embedded translations of odd terms, musical notation, dramatic pieces etc. These things need a very elaborate system to do justice to the content.

Several basic standard types of severely reduced TEI markup would be an ideal solution for Gutenberg. HTML and epub just cannot cut the mustard in the long term, nor should we expect them to.