What "Cleaning Up" Do Project Gutenberg Texts Need [closed] - Page 9

kovidgoyal · 11-08-2007, 07:26 PM

Sigh again with the accusations. zml does not support custom formatting of individual elements. When I make that statement it means that the *markup language* zml does not have support for specifying custom formatting of individual elements. Reader software will have support for individual formatting of *structural* elements not individual formatting of *arbitrary* elements.

Please read that paragraph three times before replying.

bowerbird · 11-08-2007, 07:44 PM

greg said:
> I would propose that the Gutenberg problem
> does not lie in marking up for ebooks, but
> rather a markup that allows easy translation
> to things like epub (a very good move).

well, gee, that would be _nice_.

but the problem is that, in order to get .epub,
you _do_ have to do markup. quite a lot of it.
epub is xhtml/css underneath. (and not far...)

so yeah, it would certainly be lovely if we could
jump directly to epub without any markup, but
it's not really possible.

(you could also do what hadrien does at feedbooks,
which is to put the book into a structured database,
and _then_ churn out the .epub. he's doing markup,
he's just doing it another kind of way, via database.)

> It is matter of finding a light markup that can be
> transformed coherently and consistently into
> heavy markup, they may include voice markup,
> reference markup, and complete structural markup,
> that is potentially well beyond what
> any present reader can handle.

so now you want light-markup as a middleman.
at first glance, that's an appealing position too...
you avoid the high costs of doing much markup,
but get the benefits that heavy markup "promises".

and once again, it would be nice if you could get it.

but you can't...

well -- to be completely frank -- you kind of can...

my routines can turn a light-markup file _into_
a heavy-markup file, and do a fairly good job...

but let me tell you why i think that's a dead-end.

consider the whole set of routines that will successfully
convert the light-markup file into a heavy-markup file,
which is then input to another app (call it "program p")
for "purposes of presentation" (whatever form it takes).

_instead_, put that set of routines right in "program p",
so it inputs the light-markup file, does the conversion
_itself_, and then goes on to act on the converted data.

that's better, isn't it? you didn't have to convert it yourself,
because "program p" did it for you. you avoided the mess
of the intermediate file. (because, really, were you going to
keep both the light-markup _and_ the heavy-markup files?
because that's just a bunch of unnecessary file-overhead.)

> I would suggest, that TEI (text Encoding Initiative)
> is the only candidate.

oh sheesh! you want to jump _directly_
to the heaviest of the heavy, don't you? :+)

good luck with that. that's been the plan of
the technoid faction over at p.g. for... well,
going on 6 years now... going _nowhere_...

-bowerbird

bowerbird · 11-08-2007, 07:46 PM

i could read it 100 times and i will not reply,
because i've gotten off that merry-go-round.

-bowerbird

bowerbird · 11-08-2007, 07:49 PM

rwood, i'm not sure where you're getting your "facts",
but i'm not interested in the fight you want to pick.

i'm not even gonna correct all the falsehoods you stated.

-bowerbird

kovidgoyal · 11-08-2007, 08:01 PM

Quote:

Originally Posted by bowerbird

i could read it 100 times and i will not reply,
because i've gotten off that merry-go-round.

-bowerbird

bye bye

bowerbird · 11-08-2007, 09:37 PM

good.

now, for anyone who wants to know _if_ z.m.l. _can_ do something
-- something specific, something they _need_ -- feel free to ask me,
and i'll be happy to tell you if it can, and how you would accomplish it.

there are lots of people who seemingly want to tell you what z.m.l.
can _not_ do, but i don't suggest you ask them, because they simply
don't know my system like i know it. which only makes sense, yeah?

there are lots of things that z.m.l. cannot do. if you are an author
who wants to dictate the font(s) used in your book, you can't do it.
you can't dictate the fontsize -- not even the _relative_ font-size --
or the color of any of the text, or background color(s), or margins,
or the leading, or the pagesize, none of it, absolutely _none_ of it.
can't even make _suggestions_ about the settings of those things...

so, you know, if you _need_ those things, z.m.l. isn't right for you.

because all of those variables are controlled _solely_ by the reader.

oh, well, for the _record_, i have _considered_ a mechanism whereby
the author could make "suggestions" about some of those dimensions,
but i haven't made the decision whether i will actually _implement_ it.
of course, the final say in the matter will always rest in _the_reader_.

that is -- just for anyone who has been mistaken about it all along --
they'll be controlled by the _human_being_ who is _reading_ the book,
who i call "the reader". (when i'm talking about the viewer-program,
i call it "the viewer-program", the "viewer-app", or just "the viewer".
but when i say "the reader", i'm talking about the breathing human...
and it's that breathing human -- the one who is absorbing the words --
who makes the decisions about presentational aspects of a z.m.l. text.

-bowerbird

DaleDe · 11-09-2007, 02:14 AM

Quote:

Originally Posted by bowerbird

good.

now, for anyone who wants to know _if_ z.m.l. _can_ do something
-- something specific, something they _need_ -- feel free to ask me,
and i'll be happy to tell you if it can, and how you would accomplish it.

there are lots of people who seemingly want to tell you what z.m.l.
can _not_ do, but i don't suggest you ask them, because they simply
don't know my system like i know it. which only makes sense, yeah?

there are lots of things that z.m.l. cannot do. if you are an author
who wants to dictate the font(s) used in your book, you can't do it.
you can't dictate the fontsize -- not even the _relative_ font-size --
or the color of any of the text, or background color(s), or margins,
or the leading, or the pagesize, none of it, absolutely _none_ of it.
can't even make _suggestions_ about the settings of those things...

so, you know, if you _need_ those things, z.m.l. isn't right for you.

because all of those variables are controlled _solely_ by the reader.

oh, well, for the _record_, i have _considered_ a mechanism whereby
the author could make "suggestions" about some of those dimensions,
but i haven't made the decision whether i will actually _implement_ it.
of course, the final say in the matter will always rest in _the_reader_.

that is -- just for anyone who has been mistaken about it all along --
they'll be controlled by the _human_being_ who is _reading_ the book,
who i call "the reader". (when i'm talking about the viewer-program,
i call it "the viewer-program", the "viewer-app", or just "the viewer".
but when i say "the reader", i'm talking about the breathing human...
and it's that breathing human -- the one who is absorbing the words --
who makes the decisions about presentational aspects of a z.m.l. text.

-bowerbird

While I applaud user choice, there should be guidance in what the author intended. Bold, italics, font size and even switching fonts can be a useful mechanism to let the user know they are now reading a letter, or a sign, or some other special effect that needs to be communicated.

I really like the way html started out back in version 3. It was great. The author hinted at the weight and import of the data and the user controlled the presentation. No CSS where the author attempts to control everything and makes things too complicated. What happened? (My web site is still built by hand with html.)

Well, the source of the documents had to take over control of the documents and had to publish the page rather than present the data. Too bad, IMHO. And, for the user, the flashiness of the presentation overruled the accuracy or the content. I am amazed in the business community how much people will believe data presented in PowerPoint when they would challenge it on a typewritten page. I am in the minority, it seems, and you are even further from the mainstream than I am, it would seem.

Sorry, your post triggered a rant. I am better now.

Dale

bowerbird · 11-09-2007, 06:03 AM

dalede said:
> While I applaud user choice there should be
> guidance in what the author intended.
> Bold, italics, font size and even switching fonts
> can be a useful mechanism to let the user know

bold and italics are indeed things that the author indicates in z.m.l.
bold is represented with *asterisks*, and italics with _underscores_.
(now you know why i'm always using the underscores in posts.)

of course, the user can exercise an option to change the way that
bold and italics are _rendered_. *bold* might be rendered in red,
and _italics_ might be rendered instead with green underlined text.
the author can also use other characters to indicate special marking;
> $this$might$be$the$signal$to$indicate$computer$code.$
> `and`this`might`indicate`a`monospaced`font`should`be`used.`
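
As a rough illustration of the split described above (the author marks what a span is, the reader decides how it looks), here is a minimal Python sketch. The function name `render_inline`, the `styles` dictionary and the HTML output are assumptions made for illustration; they are not taken from the z.m.l. viewer, whose code is not shown in this thread.

```python
import html
import re

# Hypothetical reader preferences: how each author-marked role is displayed.
# The role names and CSS strings are illustrative, not part of z.m.l. itself.
DEFAULT_STYLES = {
    "bold": "font-weight: bold",
    "italic": "font-style: italic",
}

# Marker character -> role, following the convention described in the post.
MARKERS = {"*": "bold", "_": "italic"}


def render_inline(text, styles=DEFAULT_STYLES):
    """Turn *bold* and _italic_ spans into HTML spans styled by the reader.

    Simplification: joined forms such as _the_reader_ are left untouched.
    """
    out = html.escape(text)
    for marker, role in MARKERS.items():
        m = re.escape(marker)
        # marker, a run that starts and ends on a non-space character and
        # contains no marker, then the closing marker, not glued to a word
        pattern = re.compile(rf"(?<!\w){m}([^{m}\s](?:[^{m}]*[^{m}\s])?){m}(?!\w)")
        out = pattern.sub(
            lambda match, role=role: (
                f'<span style="{styles[role]}">{match.group(1)}</span>'
            ),
            out,
        )
    return out


if __name__ == "__main__":
    # The same marked-up text, rendered under two different reader settings.
    reader = {"bold": "color: red",
              "italic": "color: green; text-decoration: underline"}
    print(render_inline("a _very_ good *idea*"))
    print(render_inline("a _very_ good *idea*", reader))
```

The text file never changes between the two calls; only the reader's `styles` mapping does, which is the point being argued.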

> can be a useful mechanism to let the user know
> they are now reading a letter, or a sign, or some
> other special effect that needs to be communicated.

a letter or a sign would be set off specifically as a _block_.

here's an example, from the first page of p.g. e-text #22589:

Quote:

The sign said:
~tab~~tab~ JUBILATION, U.S.A.!! ~tab~~tab~
~tab~~tab~ The doggondest, cheeriest ~tab~~tab~
~tab~~tab~ little town in America! ~tab~~tab~

The two aliens smiled at each other. Unaccustomed to oral conversation,
they exchanged thoughts.

> http://www.gutenberg.org/files/22589...-h/22589-h.htm

the "~tab~" thingee indicates a tab, just so you can see it there.

in z.m.l., if you have two tabs at the start of a line, and two at the end,
it means that line is supposed to be centered. further, when you have
several such successive lines, it means you've got a _block_... voila!
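
The rule just stated (two tabs at each end of a line means centered, and a run of such lines forms a block) can be sketched in a few lines of Python. This is only an assumed reading of the rule as described here; the function name `find_centered_blocks` and the `min_lines` threshold are hypothetical, not part of any published z.m.l. code.

```python
def find_centered_blocks(lines, min_lines=2):
    """Group consecutive centered lines into blocks.

    A line counts as centered when it starts and ends with two tab
    characters and has some text in between. Returns a list of
    (index_of_first_line, [text_of_each_line]) tuples.
    """
    blocks = []
    run_start, run = None, []
    for i, line in enumerate(lines):
        body = line.rstrip("\r\n")
        text = body.strip()
        centered = (
            bool(text) and body.startswith("\t\t") and body.endswith("\t\t")
        )
        if centered:
            if run_start is None:
                run_start = i
            run.append(text)
        else:
            # A non-centered line closes whatever run was in progress.
            if run_start is not None and len(run) >= min_lines:
                blocks.append((run_start, run))
            run_start, run = None, []
    if run_start is not None and len(run) >= min_lines:
        blocks.append((run_start, run))
    return blocks


if __name__ == "__main__":
    sample = [
        "The sign said:",
        "\t\t JUBILATION, U.S.A.!! \t\t",
        "\t\t The doggondest, cheeriest \t\t",
        "\t\t little town in America! \t\t",
        "",
        "The two aliens smiled at each other.",
    ]
    print(find_centered_blocks(sample))
```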

z.m.l. doesn't know what _kind_ of block it is, and it doesn't really care.

i've built routines that look for certain words in the text around a block,
to ascertain what _kind_ of block, words like "invitation" and "sign" and
"letter" and "note" and "warning" and "figure" and "table" and so on...

and the routines actually work very well, which _amazed_ me at first,
until i realized that authors will _generally_ inform their readers about
something out of the ordinary like this. it's not merely the typography
that indicates what it is, it's the author explicitly _telling_ the reader.
just like the author did in the example above. check it out for yourself,
across a number of books, and you will see that it's actually the case...
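
A keyword lookup of the kind described above might look like the following Python sketch. It is a guess at the general approach, not the actual routines: the keyword list comes from the post, but the function name `guess_block_kind`, the `window` size, and the choice to look only at the lines before the block (the post says "around") are assumptions.

```python
import re

# Keywords named in the post; a real list would presumably be longer.
BLOCK_KEYWORDS = ("invitation", "sign", "letter", "note",
                  "warning", "figure", "table")


def guess_block_kind(lines, block_start, window=2):
    """Guess what kind of block starts at lines[block_start].

    Looks at up to `window` lines before the block and returns the
    keyword that occurs closest to the block, or None if none occurs.
    """
    first = max(0, block_start - window)
    context = " ".join(lines[first:block_start]).lower()
    best, best_pos = None, -1
    for keyword in BLOCK_KEYWORDS:
        # Whole-word match, allowing a plain plural ("signs", "letters").
        for match in re.finditer(rf"\b{keyword}s?\b", context):
            if match.start() > best_pos:
                best, best_pos = keyword, match.start()
    return best


if __name__ == "__main__":
    page = ["The sign said:", "\t\t JUBILATION, U.S.A.!! \t\t"]
    print(guess_block_kind(page, 1))
```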

so i have no plans to include these routines in my viewer-app presently.
if later on, there arises some _need_ for the program to _identify_ certain
types of blocks, i'll put it in. but for the time being, i don't see that need.

but yeah, good point, and i think i've got that covered well enough...

-bowerbird

GregS · 11-09-2007, 07:25 AM

We may have a misunderstanding here. As I said, I have only skimmed the thread; I know nothing whatsoever of the system you are suggesting and have no opinion on it. I won't express an opinion until I am acquainted with ZML. I have had a goodish look at epub, which is why I mentioned it.

Quote:

Originally Posted by bowerbird

greg said:
> I would propose that the Gutenberg problem
> does not lie in marking up for ebooks, but
> rather a markup that allows easy translation
> to things like epub (a very good move).

well, gee, that would be _nice_.

but the problem is that, in order to get .epub,
you _do_ have to do markup. quite a lot of it.
epub is xhtml/css underneath. (and not far...)

so yeah, it would certainly be lovely if we could
jump directly to epub without any markup, but
it's not really possible.

(you could also do what hadrien does at feedbooks,
which is to put the book into a structured database,
and _then_ churn out the .epub. he's doing markup,
he's just doing it another kind of way, via database.)

The reason I am suggesting this particular approach to large literature repositories has nothing to do with epublishing or readers per se, though they have a natural place in any such digital library.

It is all about the text, how it is used is all about translation. The primary thing is that the text be properly structured for the widest possible uses now and in the future.

Quote:

Originally Posted by bowerbird

> It is matter of finding a light markup that can be
> transformed coherently and consistently into
> heavy markup, they may include voice markup,
> reference markup, and complete structural markup,
> that is potentially well beyond what
> any present reader can handle.

so now you want light-markup as a middleman.
at first glance, that's an appealing position too...
you avoid the high costs of doing much markup,
but get the benefits that heavy markup "promises".

and once again, it would be nice if you could get it.

but you can't...

well -- to be completely frank -- you kind of can...

my routines can turn a light-markup file _into_
a heavy-markup file, and do a fairly good job...

but let me tell you why i think that's a dead-end.

consider the whole set of routines that will successfully
convert the light-markup file into a heavy-markup file,
which is then input to another app (call it "program p")
for "purposes of presentation" (whatever form it takes).

_instead_, put that set of routines right in "program p",
so it inputs the light-markup file, does the conversion
_itself_, and then goes on to act on the converted data.

that's better, isn't it? you didn't have to convert it yourself,
because "program p" did it for you. you avoided the mess
of the intermediate file. (because, really, were you going to
keep both the light-markup _and_ the heavy-markup files?
because that's just a bunch of unnecessary file-overhead.)

Again, I would have to look carefully at ZML before offering any kind of opinion on what you say. I don't understand how light markup can be converted into heavy, as I see most markup (especially heavy) as a human interpretation of the text, helped by programs but not within their ken to create accurately.

I don't understand the intermediate file thing. What I was suggesting was a standard, very light (ultralight) use of TEI, because in terms of literature it is the most developed markup in existence. As editions of the same text are made from the source text, more markup for different purposes is added to it.

A fully marked-up TEI text has a huge number of element tags in proportion to the text; add in voice synthesis tagging (TEI 5) and the thing is almost all tags.

There is no electronic efficiency in this; it is most inefficient. 99% of the element tags and attributes are not needed for any particular use. For ebook reading (as against the concept of a fully marked-up TEI text), only a tiny proportion of the markup is of any use at all.

The virtue is that they are just tags and can be easily filtered for particular purposes. What is more the text becomes a multi-use resource, which is my point (databasing cannot do this).

However, with TEI it is possible to reduce it to just a handful of tags, just enough in fact to translate into something as simple as epub, or for that matter ZML. Moreover, translating into PDF for printing etc., while not trivial at this stage, poses no insurmountable problem.

The idea is not applicable to people selling ebooks, or making them. However, Gutenberg is much more than this; potentially it and others like it are a new Alexandrian Library. And that requires a scholarly approach to how to keep the texts in their most useful form for all sorts of predictable and unpredictable uses. If we are talking of marked-up text for a purpose like that, TEI is it (the only system developed enough to thoroughly mark up literature, from manuscripts, scientific articles and novels to plays, corpus collections, dictionaries, etc.).

I am not however talking about some mammoth operation to apply TEI, just the marking out of a simple cut down version compatible with building more into the markup as time goes by.

Making no reference to ZML, but to epub, with which I have a slight acquaintance: it could well be the model of what such a cut-down version should be. A stage-one markup could well be nearly a one-to-one conversion, simply changing the element and attribute names and filtering out anything else.
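
To make the "rename and filter" idea concrete, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The TEI element names (`p`, `head`, `div`, `hi` with a `rend` attribute, `persName`) are real TEI names, but the mapping table, the function names and the choice of XHTML targets are assumptions for illustration only, not an agreed Gutenberg or epub profile.

```python
import xml.etree.ElementTree as ET

# Hypothetical stage-one mapping: TEI element name -> XHTML element name.
TAG_MAP = {"div": "div", "head": "h2", "p": "p"}


def target_tag(element):
    """Return the XHTML tag for a TEI element, or None to drop the tag."""
    local = element.tag.rsplit("}", 1)[-1]  # ignore any XML namespace
    if local == "hi":
        return "strong" if element.get("rend") == "bold" else "em"
    return TAG_MAP.get(local)


def append_text(parent, text):
    """Append text to parent, after its last child if it has children."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def convert(source, dest_parent):
    """Copy source under dest_parent, renaming known tags and
    keeping only the text content of unknown ones."""
    tag = target_tag(source)
    target = ET.SubElement(dest_parent, tag) if tag else dest_parent
    append_text(target, source.text)
    for child in source:
        convert(child, target)
        append_text(target, child.tail)


def reduce_tei(tei_source):
    """Return a simplified XHTML body for a TEI fragment given as a string."""
    body = ET.Element("body")
    convert(ET.fromstring(tei_source), body)
    return ET.tostring(body, encoding="unicode")


if __name__ == "__main__":
    tei = ('<text><p>He said <persName>Tom</persName> was '
           '<hi rend="italic">late</hi>.</p></text>')
    print(reduce_tei(tei))
```

The richer tags (here `persName`) stay in the stored source and are simply dropped on the way out, which is the filtering described in the post.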

I am saying this approach is the most suitable for storing literature as a long term cultural asset. Not that it helps in any way eink readers or anything similar (catering for their use, a thousand times yes, but not designing the text markup for simply reading them on current devices).

Quote:

Originally Posted by bowerbird

> I would suggest, that TEI (text Encoding Initiative)
> is the only candidate.

oh sheesh! you want to jump _directly_
to the heaviest of the heavy, don't you? :+)

good luck with that. that's been the plan of
the technoid faction over at p.g. for... well,
going on 6 years now... going _nowhere_...

-bowerbird

I am no technoid, but an academically inclined teacher, eager to have good Literature made available in a flexible and future-proofed form. At least at the start of the thread this seemed to be the main concern: how to adapt Gutenberg's resources for this Second digital revolution.

I am the first to agree that coding fully in TEI is a nightmare, that the editors for this are nowhere near developed enough, and that trying to apply this form of markup by non-scholars is a recipe for disaster as it stands.

However, TEI 5, which is nearly fully developed, is a huge improvement on and far more extensive than anything before it, and it solves the dire problem of multiple different and incompatible markup schemes being applied to the same text and in the same file.

There are ways it can be used in a very limited, cut-down, easy-to-apply fashion; in fact this could be done quickly by just a handful of people familiar with TEI and the needs of such things as ebook reading. There are also ways of using markup externally to the text file (a structural markup stylesheet).

The real virtue of TEI is its thoroughly developed element structure, and that it has been designed to cope with the most diverse textual material to a scholarly level.

I think we might therefore be at cross purposes. I will have a look at ZML when I get the chance; it is possible it might just be the thing, but I find it incredibly hard to imagine a simple solution to such a complex problem of marking up literature, storing it, and developing its structural analysis and use by applying more and more markup over time.

I have a little experience in dealing with quality voice synthesis, which in the near future may well be put into handheld readers. I can say with some authority that TEI's markup solution is superior to any other approach I have come across (on a number of grounds).

Voice direction is not as compatible with textual structure as one might assume. Speakers sometimes speak together (two different structures combined), voices may blend behind, and sometimes an SFX may play behind any number of speakers or dialogues. What I am saying is that the nested nature of markup is necessary, yet adding voice markup can violate that. TEI has a compatible solution; I know of no other system that can mix the two.

Besides which, marking up a huge potential variety of text sources (ancient epigraphs, parallel commentary, embedded translations of odd terms, musical notation, dramatic pieces, etc.) needs a very elaborate system to do justice to the content.

Several basic standard types of severely reduced TEI markup would be an ideal solution for Gutenberg - HTML and epub just cannot cut the mustard in the long term, nor should we expect them to.

GregS · 11-09-2007, 07:37 AM

PS to add one aside - the problem of Unicode.

Because we must deal with the languages of the world, I very much favour moving away from ASCII.

However, the problem is not the scripts whatever the language, but page decoration and punctuation.

In XML/XHTML/TEI whatever, entities solve a good deal of this because they are unambiguous. Fonts do not cover everything and even curly quotes pose problems when rendered into direct code.

I am suggesting that entity sets need to be applied heavily, to maintain the long term integrity of texts. It is at the cost of size, but an apostrophe is not a closing single quote, though the glyph is.
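
A short Python sketch may help show what is at stake. The first half uses only standard HTML entity names and the standard library. The second half uses made-up markers (`&apostrophe;`, `&closequote;`, `&openquote;`) that are purely hypothetical and not part of any real entity set; they stand in for whatever role-based names an archive might define.

```python
import html
import unicodedata

# Standard HTML named entities resolve to fixed code points.
closing_quote = html.unescape("&rsquo;")
typewriter_apostrophe = html.unescape("&apos;")
print(unicodedata.name(closing_quote))          # RIGHT SINGLE QUOTATION MARK
print(unicodedata.name(typewriter_apostrophe))  # APOSTROPHE

# Hypothetical role-based markers (NOT a real entity set): the stored text
# records what each mark is, not which glyph is used to draw it.
ROLE_TO_CHAR = {
    "&openquote;": "\u2018",
    "&closequote;": "\u2019",
    "&apostrophe;": "\u2019",  # typographic apostrophe shares the glyph
}


def resolve(text):
    """Replace each role marker with the character used to render it."""
    for marker, char in ROLE_TO_CHAR.items():
        text = text.replace(marker, char)
    return text


source = "&openquote;Don&apostrophe;t&closequote;"
rendered = resolve(source)
print(rendered)
# After rendering, the apostrophe and the closing quote are the same
# character, so the distinction survives only in the stored source.
print(rendered.count("\u2019"))  # 2
```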

We cannot get the method of glyphs mixed up with the system of original writing. How something might be rendered should not become a substitute for the mark being rendered.

GregS · 11-09-2007, 08:42 AM

PPS

Is this the ZML?

http://rx4rdf.liminalzone.org/ZMLMarkupRules

If it is I have some preliminary opinions.

"Like the Wiki and SLiP formats its goal is to be a human-friendly markup language: simple, clear, and concise."

What I like about the bulkiness of XML is that it is passive and robust, and simple to repair - how simple, clear and concise it is, from this point of view, is largely irrelevant. Beginning and end tags are bulky and inefficient, but they make things robust. A typo, an accidental deletion etc., can leave clues behind - more efficient marking up also means more fragile.
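
The "clues left behind" point can be demonstrated with Python's standard XML parser: when an end tag goes missing, the parser reports where the structure stopped making sense. This is a minimal sketch; the helper name `check_well_formed` is hypothetical, while `ParseError` and its `position` attribute are part of `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


if __name__ == "__main__":
    intact = "<p>some <i>text</i></p>"
    damaged = "<p>some <i>text</p>"  # the closing </i> was deleted
    print(check_well_formed(intact))
    print(check_well_formed(damaged))
```

With a single-character marker scheme, the equivalent deletion would usually just produce differently formatted text with no error to point at the damage.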

But no one in their right mind wants to directly work with XML tags ( or should not, I suspect that many that do are not in their right minds, or just trapped by current technology).

As a processing and composing language ZML has some virtues, but for me this is not enough. Things have to be future-proofed and robustly constructed, which means that redundancies (such as element tags for closure) are in fact a good thing for storage, stability, flexibility and preservation. In terms of rendering, transmission, etc., the overhead of preprocessing XML into another form is well worth the cost.

ZML as a method no doubt has its uses.

Two other approaches I prefer: REBOL and Lua read XML directly into their data structures, which can then be manipulated by the script language like any other data. Both can save out as XML at any time.

I don't know if this makes sense in this context, but I favour lean applications and fat data. I have been with computers since the Apple II, from when everything was a squeeze. I don't really miss the old applications, but the loss of data still hurts. I have long since thrown out a lot of things I had written (half-written books, notes, articles, etc.) that simply could not be read.

Digital Preservation is a critically important issue. The solution is robust redundancies, rich data, and most of all simplicity. The cost is fatter files, a little extra processing overhead - and it is all well worth the price, if we can preserve unambiguously what we already have.

Writing on paper disappeared only when the paper itself rotted away. No new improvements in the press, bindings or reading glasses affected what had been preserved.

What was written before did not disappear when a new pair of glasses was purchased. Until recently, new digital glasses made previous words disappear. XML and XHTML preserve some aspects well, but other standards are needed to preserve it better (TEI in the end, I suggest, for Literature).

ZML has its place, but not as a method of storage.

bowerbird · 11-09-2007, 01:36 PM

greg said:
> Is this the ZML?
> http://rx4rdf.liminalzone.org/ZMLMarkupRules
> If it is I have some preliminary opinions.

um, no. most emphatically not.

i've pointed to my site several times:
> http://z-m-l.com

for the latest summary of the work available
-- almost all of it demos, proof-of-concept, etc. -- see:
> http://z-m-l.com/go/pudding_sampler.html

-bowerbird

bowerbird · 11-09-2007, 02:08 PM

greg said:
> The primary thing is that the text be properly structured
> for the widest possible uses now and in the future.

that _sounds_ good. until you realize that -- depending on
how one defines "properly structured", and how one considers
"the widest possible uses", not to mention the crystal-ball on
"the future" -- doing heavy markup might be _very_ expensive.

so expensive -- quite literally -- that we cannot afford to do it.

heck, did you notice that -- until google decided to step in --
we couldn't even find funds to _scan_ the books in our libraries.
and scanning is dirt-cheap compared to applying heavy-markup.

and the other thing to keep in mind is that society is generating
new content at a numbing rate, a rate that's even ever-increasing!
and precious little of that content is marked up, not even in .html.

so, you know, in _my_ humble opinion (as people say), the idea
that we can make the _assumption_ that our data is marked up
-- and marked up with something intense like .tei -- is _silly_...
to the point of -- in my _humble_ opinion -- being ridiculous...
(a strong word, even extreme, but i think it is fully appropriate,
since -- as far as i can see -- this assumption has zero reality.)

would it be _nice_ if all our text could be extensively marked up,
such that it could magically be transformed any way we wanted?
well, _sure_ it would. it'd be _great_.

but, considered from the cost-benefit perspective that everything
must work under in our world, the benefits don't even come _close_
to justifying the very high costs of applying that extensive markup.

so we need something else, which gives us _most_ of the benefits,
at a _much_ lower cost. and that "something else" is light-markup.

> At least at the start of the thread this seemed to be the main concern,
> how to adapt Gutenberg's resources for this Second digital revolution.

actually, the thread started out only as an attempt to
make a checklist of techniques that people have used
to make p.g. e-texts look _typographically_beautiful_...

> I will have a look at ZML when I get the chance,
> it is possible it might just be the thing

you're welcome to look at it, but i can pretty much tell you now that it
won't be a good fit, because your head wants an "ideal" markup system
-- which anticipates "any possible use, now or in the future" -- whereas
z.m.l. is fully grounded in the tradeoffs that a cost-benefit ratio demands.

> I find it incredibly hard to imagine a simple solution
> to such a complex problem of marking up literature, storing it

well, once you find out how little it costs, and the huge benefits it returns,
you might be surprised. but it's because you see the problem as "complex"
that i said you wouldn't be a good fit with z.m.l. you're one of those people
who love complexity. that's ok. doesn't mean that you're a bad person... :+)

-bowerbird

bob_ninja · 11-09-2007, 05:49 PM

This discussion is too long to read. Can someone summarize for me if you actually came up with any software for "cleaning up" G. text files?
I started writing my own tool and would like to avoid reinventing the wheel.
thanks

kovidgoyal · 11-09-2007, 05:55 PM

Short answer: no
Long answer: bowerbird claims to have a tool, but (as far as I understand) he's not going to release it, only use it to create his own mirror of p.g.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
The "Closed Circle" is open for business	pholy	Deals and Resources (No Self-Promotion or Affiliate Links)	0	12-20-2009 10:24 PM
"SuperBook" project - British School studies e-books usage	TadW	News	2	06-28-2007 11:46 PM
Introducing the book: Gutenberg offers "in-home" tech support (humor)	nekokami	Lounge	1	05-07-2007 09:40 PM
"Gutenberg 2.0: le futur du livre" / iRex demoes Mobipocket on iLiad	Hadrien	News	4	03-27-2007 12:45 PM

11-08-2007, 07:26 PM	#121
kovidgoyal creator of calibre Posts: 45,659 Karma: 28549046 Join Date: Oct 2006 Location: Mumbai, India Device: Various	Sigh again with the accusations. zml does not support custom formatting of individual elements. When I make that statement it means that the markup language zml does not have support for specifying custom formatting of individual elements. Reader software will have support for individual formatting of structural elements not individual formatting of arbitrary elements. Please read that paragraph three times before replying.

11-08-2007, 07:44 PM	#122
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	greg said: > I would propose that the Gutenberg problem > does not lie in marking up for ebooks, but > rather a markup that allows easy translation > to things like epub (a very good move). well, gee, that would be _nice_. but the problem is that, in order to get .epub, you _do_ have to do markup. quite a lot of it. epub is xhtml/css underneath. (and not far...) so yeah, it would certainly be lovely if we could jump directly to epub without any markup, but it's not really possible. (you could also do what hadrien does at feedbooks, which is to put the book into a structured database, and _then_ churn out the .epub. he's doing markup, he's just doing it another kind of way, via database.) > It is matter of finding a light markup that can be > transformed coherently and consistently into > heavy markup, they may include voice markup, > reference markup, and complete structural markup, > that is potentially well beyond what > any present reader can handle. so now you want light-markup as a middleman. at first glance, that's an appealing position too... you avoid the high costs of doing much markup, but get the benefits that heavy markup "promises". and once again, it would be nice if you could get it. but you can't... well -- to be completely frank -- you kind of can... my routines can turn a light-markup file _into_ a heavy-markup file, and do a fairly good job... but let me tell you why i think that's a dead-end. consider the whole set of routines that will successfully convert the light-markup file into a heavy-markup file, which is then input to another app (call it "program p") for "purposes of presentation" (whatever form it takes). _instead_, put that set of routines right in "program p", so it inputs the light-markup file, does the conversion _itself_, and then goes on to act on the converted data. that's better, isn't it? 
you didn't have to convert it yourself, because "program p" did it for you. you avoided the mess of the intermediate file. (because, really, were you going to keep both the light-markup _and_ the heavy-markup files? because that's just a bunch of unnecessary file-overhead.) > I would suggest, that TEI (text Encoding Initiative) > is the only candidate. oh sheesh! you want to jump _directly_ to the heaviest of the heavy, don't you? :+) good luck with that. that's been the plan of the technoid faction over at p.g. for... well, going on 6 years now... going _nowhere_... -bowerbird

11-08-2007, 07:46 PM	#123
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	i could read it 100 times and i will not reply, because i've gotten off that merry-go-round. -bowerbird

11-08-2007, 07:49 PM	#124
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	rwood, i'm not sure where you're getting your "facts", but i'm not interested in the fight you want to pick. i'm not even gonna correct all the falsehoods you stated. -bowerbird

11-08-2007, 09:37 PM	#126
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	good. now, for anyone who wants to know _if_ z.m.l. _can_ do something -- something specific, something they _need_ -- feel free to ask me, and i'll be happy to tell you if it can, and how you would accomplish it. there are lots of people who seemingly want to tell you what z.m.l. can _not_ do, but i don't suggest you ask them, because they simply don't know my system like i know it. which only makes sense, yeah? there are lots of things that z.m.l. cannot do. if you are an author who wants to dictate the font(s) used in your book, you can't do it. you can't dictate the fontsize -- not even the _relative_ font-size -- or the color of any of the text, or background color(s), or margins, or the leading, or the pagesize, none of it, absolutely _none_ of it. can't even make _suggestions_ about the settings of those things... so, you know, if you _need_ those things, z.m.l. isn't right for you. because all of those variables are controlled _solely_ by the reader. oh, well, for the _record_, i have _considered_ a mechanism whereby the author could make "suggestions" about some of those dimensions, but i haven't made the decision whether i will actually _implement_ it. of course, the final say in the matter will always rest in _the_reader_. that is -- just for anyone who has been mistaken about it all along -- they'll be controlled by the _human_being_ who is _reading_ the book, who i call "the reader". (when i'm talking about the viewer-program, i call it "the viewer-program", the "viewer-app", or just "the viewer". but when i say "the reader", i'm talking about the breathing human... and it's that breathing human -- the one who is absorbing the words -- who makes the decisions about presentational aspects of a z.m.l. text. -bowerbird

11-09-2007, 07:37 AM	#130
GregS Zealot Posts: 107 Karma: 308 Join Date: Oct 2007 Location: Perth Australia Device: EZ Reader 5", Iliad	PS to add one aside - the problem of Unicode. I very much favour moving away from ASCII, because we must deal with the languages of the world. However, the problem is not the scripts, whatever the language, but page decoration and punctuation. In XML, XHTML, TEI or whatever, entities solve a good deal of this because they are unambiguous. Fonts do not cover everything, and even curly quotes pose problems when rendered into direct code. I am suggesting that entity sets need to be applied heavily to maintain the long-term integrity of texts. It comes at the cost of size, but an apostrophe is not a closing single quote, though the glyph is the same. We cannot get the method of glyphs mixed up with the system of original writing. How something might be rendered should not become a substitute for the mark being rendered.
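As a rough illustration of the point about entities, here is a minimal Python sketch that stores non-ASCII punctuation as numeric character references and restores it afterwards. The function names and the sample string are made up for this example. It assumes the input contains no literal ampersands or existing entities, since those would also be unescaped on the way back.

```python
import html


def to_ascii_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_ascii_refs(text: str) -> str:
    """Turn character references back into the characters they name."""
    return html.unescape(text)


# Hypothetical sample: U+2019 appears once as an apostrophe and once as a
# closing single quote; U+2018 is the opening single quote.
sample = "It\u2019s a \u2018quoted\u2019 word"

stored = to_ascii_refs(sample)
print(stored)
print(from_ascii_refs(stored) == sample)
```

Note that the apostrophe and the closing quote come out as the same reference, because they are the same code point. A character reference pins down the character without relying on any font or encoding, but telling the two marks apart would still take some explicit marking in the text itself.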

11-09-2007, 08:42 AM	#131
GregS Zealot Posts: 107 Karma: 308 Join Date: Oct 2007 Location: Perth Australia Device: EZ Reader 5", Iliad	PPS Is this the ZML? http://rx4rdf.liminalzone.org/ZMLMarkupRules If it is I have some preliminary opinions. "Like the Wiki and SLiP formats its goal is to be a human-friendly markup language: simple, clear, and concise." What I like about the bulkiness of XML is that it is passive and robust, and simple to repair - how simple, clear and concise it is from this point of view is largely irrelevant. Beginning and end tags are bulky and inefficient, but they make things robust. A typo, an accidental deletion etc., can leave clues behind - more efficient marking up also means more fragile. But no one in their right mind wants to work directly with XML tags (or should not; I suspect that many who do are not in their right minds, or are just trapped by current technology). As a processing and composing language ZML has some virtues, but for me this is not enough: things have to be future-proofed and robustly constructed. That means that redundancies (such as element tags for closure) are in fact a good thing for storage, stability, flexibility and preservation. In terms of rendering, transmission etc., the overhead of preprocessing XML into another form is well worth the cost. ZML as a method no doubt has its uses. Two other approaches I prefer, REBOL and Lua, read XML directly into their data structures, which can then be manipulated by the script language like any other data. Both can save out as XML at any time. I don't know if this makes sense in this context, but I favour lean applications and fat data. I have been with computers since the Apple II, from when everything was a squeeze. I don't really miss the old applications, but the loss of data still hurts. I have long since thrown out a lot of things I had written (half-written books, notes, articles etc.) that simply could not be read. Digital preservation is a critically important issue.
The solution is robust redundancies, rich data, and most of all simplicity. The cost is fatter files and a little extra processing overhead - and it is all well worth the price, if we can preserve unambiguously what we already have. Writing on paper disappeared only when the paper itself rotted away. No new improvements in the press, bindings or reading glasses affected what had been preserved. What was written before did not disappear when a new pair of glasses was purchased. Until recently, new digital glasses made previous words disappear. XML and XHTML preserve some aspects well, but other standards are needed to preserve it better (in the end I suggest TEI for literature). ZML has its place, but not as a method of storage.
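The two claims in the post above, that XML can be read into ordinary data structures and written back out, and that closing tags leave clues when something is accidentally deleted, can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration in Python rather than REBOL or Lua. The element names, the sample documents, and the helper functions are invented for the example.

```python
import xml.etree.ElementTree as ET


def retitle(xml_text: str, new_title: str) -> str:
    """Parse XML into a tree, change the <title> text, and serialize it again."""
    root = ET.fromstring(xml_text)
    root.find("title").text = new_title
    return ET.tostring(root, encoding="unicode")


def find_damage(xml_text: str):
    """Return (line, column) where parsing fails, or None if the text is well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return err.position
    return None


# Hypothetical documents: the second one has lost its </title> closing tag.
intact = "<book><title>A Sample Book</title><p>Chapter one.</p></book>"
damaged = "<book><title>A Sample Book<p>Chapter one.</p></book>"

print(retitle(intact, "A Sample Book, Revised"))
print(find_damage(intact))
print(find_damage(damaged))
```

The missing closing tag makes the damaged document fail to parse, and the parser reports where it gave up. That is the sense in which the redundancy "leaves clues behind": the damage is detected rather than silently absorbed into the text.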

11-09-2007, 01:36 PM	#132
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	greg said:
> Is this the ZML?
> http://rx4rdf.liminalzone.org/ZMLMarkupRules
> If it is I have some preliminary opinions.

um, no. most emphatically not.

i've pointed to my site several times:
> http://z-m-l.com

for the latest summary of the work available -- almost all of it demos, proof-of-concept, etc. -- see:
> http://z-m-l.com/go/pudding_sampler.html

-bowerbird

11-09-2007, 02:08 PM	#133
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	greg said:
> The primary thing is that the text be properly structured
> for the widest possible uses now and in the future.

that _sounds_ good.

until you realize that -- depending on how one defines "properly structured", and how one considers "the widest possible uses", not to mention the crystal-ball on "the future" -- doing heavy markup might be _very_ expensive.

so expensive -- quite literally -- that we cannot afford to do it.

heck, did you notice that -- until google decided to step in -- we couldn't even find funds to _scan_ the books in our libraries. and scanning is dirt-cheap compared to applying heavy-markup.

and the other thing to keep in mind is that society is generating new content at a numbing rate, a rate that's even ever-increasing! and precious little of that content is marked up, not even in .html.

so, you know, in _my_ humble opinion (as people say), the idea that we can make the _assumption_ that our data is marked up -- and marked up with something intense like .tei -- is _silly_... to the point of -- in my _humble_ opinion -- being ridiculous... (a strong word, even extreme, but i think it is fully appropriate, since -- as far as i can see -- this assumption has zero reality.)

would it be _nice_ if all our text could be extensively marked up, such that it could magically be transformed any way we wanted? well, _sure_ it would. it'd be _great_.

but, considered from the cost-benefit perspective that everything must work under in our world, the benefits don't even come _close_ to justifying the very high costs of applying that extensive markup.

so we need something else, which gives us _most_ of the benefits, at a _much_ lower cost. and that "something else" is light-markup.

> At least at the start of the thread this seem to be the main concern,
> how to adapt Gutenberg's resources for this Second digital revolution.
actually, the thread started out only as an attempt to make a checklist of techniques that people have used to make p.g. e-texts look _typographically_beautiful_...

> I will have a look at ZML when I get the chance,
> it is possible it might just be the thing

you're welcome to look at it, but i can pretty much tell you now that it won't be a good fit, because your head wants an "ideal" markup system -- which anticipates "any possible use, now or in the future" -- whereas z.m.l. is fully grounded in the tradeoffs that a cost-benefit ratio demands.

> I find it incredibly hard to imagine a simple solution
> to such a complex problem of marking up literature, storing it well,

once you find out how little it costs, and the huge benefits it returns, you might be surprised.

but it's because you see the problem as "complex" that i said you wouldn't be a good fit with z.m.l. you're one of those people who love complexity. that's ok. doesn't mean that you're a bad person... :+)

-bowerbird
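To make the "light-markup at low cost" argument concrete, here is a small Python sketch that turns a very light plain-text convention into XHTML-style paragraphs. It is not z.m.l. and does not reproduce any of its rules, which are not described in this thread. The two conventions used (blank lines separate paragraphs, `_underscores_` mark emphasis) and the function name are assumptions chosen only for illustration.

```python
import html
import re


def light_to_xhtml(text: str) -> str:
    """Convert blank-line-separated paragraphs with _emphasis_ into <p>/<em> markup."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    out = []
    for para in paragraphs:
        # Collapse hard line breaks inside a paragraph, then escape &, < and >.
        body = html.escape(" ".join(para.split()), quote=False)
        # Assumed convention: text wrapped in single underscores is emphasized.
        body = re.sub(r"_([^_]+)_", r"<em>\1</em>", body)
        out.append(f"<p>{body}</p>")
    return "\n".join(out)


sample = "it is _very_ cheap.\n\nsecond para & more"
print(light_to_xhtml(sample))
```

A converter this naive will misread identifiers such as `snake_case_name` as emphasis. That is the kind of tradeoff the cost-benefit argument is about: far less marking effort, in exchange for rules that have to guess at the author's intent.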

11-09-2007, 05:49 PM	#134
bob_ninja Addict Posts: 208 Karma: 582 Join Date: Aug 2006 Device: Zire71	This discussion is too long to read. Can someone summarize for me if you actually came up with any software for "cleaning up" G. text files? I started writing my own tool and would like to avoid reinventing the wheel. thanks

11-09-2007, 05:55 PM	#135
kovidgoyal creator of calibre Posts: 45,659 Karma: 28549046 Join Date: Oct 2006 Location: Mumbai, India Device: Various	Short answer: no
Long answer: bowerbird claims to have a tool, but (as far as I understand) he's not going to release it, only use it to create his own mirror of p.g.