> Though be grateful for the fact that
> the text is out there at all and
> you don't have to OCR it yourself!
well heck, i'm _extremely_ grateful for project gutenberg.
as the forerunner of _all_ the net collaboration projects,
including wikipedia, it has _tremendous_ value to me...
so that's first and foremost.
having said that, however, o.c.r. ain't difficult these days.
scanning (and all that it entails, including rounding up
a hard-copy to scan) is the hardest part of the equation,
and google (and others) are taking care of all that hassle.
but yeah, as i said, correcting that o.c.r. is where all the
p.g. e-texts will come in handy, in the next cyberlibrary.
> Also you can see the issue from the point of view
> of the original transcribers as well. For example
> I've just been restoring the italics in the PG text of Nostromo,
> and very often the transcriber uses initial caps for a word
> that was originally in italics - probably a more elegant and
> reader-friendly solution than using forward slashes for italicized words.
well, maybe. the problem is, though, that it's an ambiguous coding,
so it becomes impossible to restore things to their original state...
a forward-slashes method -- while maybe not "reader-friendly" --
would have at least been unambiguous enough to easily un-do...
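to make that concrete, here's a minimal sketch in python. it's only an illustration, not anything p.g. actually uses: the /slashes/ convention, the function names, and the sample sentences are all my own assumptions. the first function un-does slash-marked italics mechanically; the second shows why initial caps can't be un-done, since all a program can do is list the capitalized words, with no way to tell a former italic from a proper noun.

```python
import re

# assumed convention (hypothetical): /word or phrase/ marks italics.
# the span must not contain a slash or a newline.
SLASH_ITALICS = re.compile(r"/([^/\n]+)/")

def slashes_to_html(text):
    """restore /italic/ spans to <i>italic</i>; fully mechanical."""
    return SLASH_ITALICS.sub(r"<i>\1</i>", text)

def mid_sentence_caps(text):
    """list capitalized words that follow a lowercase letter or comma.

    this is the best we can do with initial-caps coding: the list
    mixes proper nouns with words that were once italic, and nothing
    in the text says which is which.
    """
    return re.findall(r"(?<=[a-z,] )([A-Z][a-z]+)", text)

slashed = "it was /magnifico/ to be in Sulaco."
capped = "the Capataz said it was Magnifico to be in Sulaco."

print(slashes_to_html(slashed))
print(mid_sentence_caps(capped))
```

a real text would need more care (two stray slashes on one line, as in a pair of "and/or"s or a url, would fool this pattern), but the point stands: one coding can be reversed by a one-line rule, and the other needs a human with the paper book in hand.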
> I don't understand why you would need a new mark-up,
> correctly used, html mark-up [eg h1 for the book title
> h2 for the part or section title and h3 for the chapter]
> gives you all the semantic information you need.
well, the problem with .html is that its obtrusive markup makes it
hard to maintain (e.g., correct, edit, compare, update, re-mix, etc.),
as well as to read in the underlying "master" format.
do a view-source on this page:
then compare that source-html to this page:
particularly since the .zml file actually _generated_ the .html one,
i think it's pretty easy to tell which file would be easier to maintain,
especially with a library of thousands of e-texts (let alone millions).
and then of course when you ratchet up the difficulty to the level of
the .epub format, where each e-text file needs accompanying files,
you're just asking for trouble. in my view, complex formats like that
are simply the old-guard dinosaur publishing-houses attempting to
raise the cost-of-entry for us "amateur" newbies, whose new capacity
for self-publishing will totally and completely subvert their business.
they're attempting to find a way to maintain their status as middlemen,
so they can continue to siphon off a good percentage of the revenue...
> Personally I believe that plain vanilla html
> (or its baby siblings markdown, textile etc) is the new ascii.
markdown and textile are both light-markup systems,
and thus of the same type as my zen markup language.
(except my z.m.l. is even less obtrusive than they are.)
but yes, this is the way of the future. authors want to write,
not be caught up in unnecessary complexities of file-formats.