What "Cleaning Up" Do Project Gutenberg Texts Need [closed] - Page 10

GregS · 11-09-2007, 06:34 PM

bowerbird
"that _sounds_ good. until you realize that -- depending on
how one defines "properly structured", and how one considers
"the widest possible uses", not to mention the crystal-ball on
"the future" -- doing heavy markup might be _very_ expensive."

I do not understand this at all.

Full marking up in TEI is not being suggested; I am only suggesting a lightweight standard no more difficult than Xhtml or epub, but adapted for text repositories rather than display (though with CSS display poses no problem whatsoever, and epub requires only a trivial translation).

It was after having a look at Gutenberg Marker (which solves a lot of problems really very well) that I realised a couple of extra steps would make such software very useful for establishing a standard ultra-light TEI.

As for future proofing, you seem to miss the point altogether. Scholars have already developed concepts of textual structural analysis. If the structure can be unambiguously marked, the text is future proofed, because one way or another it is the structure that has been the most elusive aspect of text and of handling it for different purposes.

No magic involved, just tags progressively added by editors who do know what they are doing (TEI is, as the texts themselves are, inherently hard to do, because the complexity of the markup meets the complexity of the text itself).

I have taken at least seven or eight Shakespeare plays from Gutenberg, turned them into word processing documents, cleaned them and then through stylesheets reedited them and finally after a lot of effort (plays are really hard to do compared to novels) produce a pdf to print out copies for my students.

In short, though the tools are wrong for the purpose I have been doing just sort of thing because I had no choice - but the end result did not give anyone else useful. I could just as well have been properly marking up the play (using TEI derived tags) and placed back in Gutenberg something useful to others.

This is what I mean by progressively tagging repository texts. Academics occasionally resort to Gutenberg, but whatever they do to the text is lost to the repository as well. Students no doubt use the texts for study in their literature degrees; whatever they do is lost too.

I have dealt with HTML and text versions of a variety of literature. For some purposes just being in HTML makes things very easy, but it also can make things a lot harder as well.

The other thing is that just reading texts (or printing them) is only one aspect of digitizing texts. Storing them as virtual texts (with their structure preserved and readable) is vitally important for the preservation of literature. This is not just academic prejudice; it is what makes the texts adaptable to unpredictable future usage.

A chapter is not a title (a small criticism of Gutenberg Marker); it is a division that may or may not have a title. Hence I may in the future, for whatever reason, desire to quickly retrieve Chapter Seven of "Pride and Prejudice". How can this be done unless the computer has a means of finding exactly what I asked it to find?

You don't need a crystal ball just an understanding of text itself from a scholarly point of view. These people have not been wasting their time, their precision is not useless but vital, and their knowledge (a part of which resides in the very code of TEI) cannot be ignored.

And I repeat: an ultra-light version of TEI need be no more difficult than what we are already using, but it is not closed off like XHTML/epub or any other display technology. Being XML, it is probably just as displayable in most contexts anyhow.

I looked at your references and had seen them, but as I could only see html markup in the source I went looking elsewhere. Sorry for the mistake.

My description of this thread as being about the Second Digital Revolution was not misplaced. The whole problem with Gutenberg at the moment is that it is rooted in the First, hence the compounding problems and the variety of solutions being proposed.

I have no prejudice against your system, except that, until I trawl through this long thread, I have no clear idea of what it is; and, based on other readers' comments, I am not too sure I will be that much clearer if I do.

"you're welcome to look at it, but i can pretty much tell you now that it
won't be a good fit, because your head wants an "ideal" markup system
-- which anticipates "any possible use, now or in the future" -- whereas
z.m.l. is fully grounded in the tradeoffs that a cost-benefit ratio demands."

For me this just places it amongst display technologies, which is no bad place to be. The problem of text repositories is a different thing altogether.

Consider a text that is fully marked up in TEI (i.e. more tags than text). A huge labour but one that can accumulate over time in a systematic and reliable way. What is needed to translate it? "Find these tags and change them thus...." "Ignore every other tag" - the end result could be anything. As I wrote (with some help) something very similar in REBOL, I know that such a script can be less than a page of code and adds only a tiny delay over simply copying the file to a new location.
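A minimal sketch of that kind of "change these tags, ignore every other tag" pass, written in Python rather than REBOL. The mapping from TEI-style element names to XHTML names is purely illustrative, the function name is hypothetical, and a regular expression is used only to keep the sketch short; a real converter would more likely use an XML parser.

```python
import re

# Hypothetical mapping, for illustration only: TEI-style names -> XHTML names.
TAG_MAP = {"head": "h2", "p": "p", "hi": "em"}

# Matches an opening, closing or self-closing tag and captures its name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")


def translate_tags(text, tag_map=TAG_MAP):
    """Rename the tags listed in tag_map and drop every other tag.

    Attributes are discarded; the text between tags is left untouched.
    """
    def swap(match):
        closing, name = match.group(1), match.group(2)
        target = tag_map.get(name)
        if target is None:
            return ""  # "ignore every other tag"
        return f"<{closing}{target}>"

    return TAG_RE.sub(swap, text)


if __name__ == "__main__":
    sample = ('<div type="chapter"><head>Chapter 7</head>'
              '<p>An <hi rend="italic">early</hi> visit.</p></div>')
    print(translate_tags(sample))
```

Swapping in a different mapping is all it takes to target another output format, which is the point being made: the heavy markup stays in the repository copy and each use picks out only what it needs.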

I also hold out some hope that students around the world may, in the mid-term future, look forward to having a notebook/reader device capable of displaying TEI-encoded documents in an academically useful way. However, the technology has to develop, and for that it needs to establish a good market. For that in turn, the most important step is to establish standards such as epub, which may not be the most efficient or versatile, but which make it possible to buy and keep literature with some assurance that on future devices it is either directly readable or can be made so.

My opinion is that efficient display codes as you propose are not the real problem. I don't doubt anything you say about it, but I have serious reservations that it answers the right question.

However, as a display technology it may well have a place. I would add the proviso that if it can easily translate from epub and to epub then this would be a vital attribute in its acceptance, especially if it is as easy as you say to code with it.

To everyone else: please carefully consider the idea, far removed from mobileread though it is, of the separate problem of text repositories.

For my part, when time permits, I intend to look carefully at epub and try and make a version of TEI to fit it, and then write a small program to translate it one way and the other.

If this looks any good in the end, I will set up a site, make people aware of it here, at various text repositories I know of and of course the TEI consortium. However, it is at least a month before I can seriously sit down with it, and maybe not even then.

If anyone is in a better position to do the same thing, I will help in any way I can. Ideally, if it works well, epub software might easily be adapted to display it as well and thus solve the problem in one blow.

Greg Schofield
Perth Australia
(An English High School teacher)

GregS · 11-09-2007, 06:45 PM

bob_ninja, I found references to at least two when I first looked at this thread (somewhere). I looked at one, http://www.sandroid.org/GutenMark/, which I have mistakenly referred to as Gutenberg Marker.

Having been dealing with Gutenberg texts for a good many years now, I wish I had found it earlier. For many purposes it looks really useful. I will certainly be using it.

Unfortunately I did not look at the other software that was mentioned, and now cannot remember what it was. I will look again at the thread, and if I can read the whole thing through (time is scarce) I will try to write a small summary.

But have a look at GutenMark. I have not downloaded it yet, but it seems to do all the most boring bits really well. I cannot say much more until I use it.

bowerbird · 11-09-2007, 06:54 PM

ninja bob said:
> I started writing my own tool
> and would like to avoid reinventing the wheel.

bob, i highly recommend that you proceed...

i wrote such a tool, so i can tell you that it was
one of the greatest programming experiences
that i've had in over 25 years of for/next loops.

it's not an easy task, but if you have persistence
-- ok, "tenacity" might be a more accurate word --
the resistance _will_ crumble from your onslaught.

the secret formula is to cumulate your successes,
not to seek one magic bullet. that's all i can say...

on the other hand, if you just want the p.g. library
to be _consistent_, so you can then add value to it,
and you have no particular appetite for the challenge
of coding the application to produce that consistency,
just wait until i release my mirror of the p.g. library...

it will be a consistent, structured version of the library,
in a format that makes it easy to write the routines to
recognize the structural elements within each e-text...

as just one example, if a line is preceded by more than
three blank lines, that line is a header for a new section.
the header is terminated by two successive blank lines.
if there is one blank line within the header, you have a
two-part header (e.g., chapter x / the lobster quadrille).
so there's the pseudo-code for finding headers in z.m.l.
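The rule above can be sketched in Python roughly as follows. This is only a reading of the pseudo-code in this post, not the actual z.m.l. tooling: the function name is made up for illustration, and a "blank line" is assumed to mean a line that is empty or whitespace-only.

```python
def find_headers(text):
    """Return a list of headers; each header is a list of its parts.

    A header starts at the first non-blank line that follows more than
    three blank lines, and ends at two successive blank lines. A single
    blank line inside the header separates the parts of a multi-part header.
    """
    lines = text.split("\n")
    headers = []
    blank_run = 0
    i = 0
    while i < len(lines):
        if lines[i].strip() == "":
            blank_run += 1
            i += 1
            continue
        if blank_run > 3:
            parts = [[]]
            j = i
            while j < len(lines):
                if lines[j].strip() == "":
                    # Two blanks in a row (or end of text) close the header.
                    if j + 1 >= len(lines) or lines[j + 1].strip() == "":
                        break
                    parts.append([])  # one blank line: start the next part
                else:
                    parts[-1].append(lines[j].strip())
                j += 1
            headers.append([" ".join(p) for p in parts if p])
            i = j
        else:
            i += 1
        blank_run = 0
    return headers


if __name__ == "__main__":
    sample = ("\n\n\n\nchapter x\n\nthe lobster quadrille\n\n\n"
              "body text here\nmore body\n\n\n\n\nchapter xi\n\n\nbody")
    print(find_headers(sample))
```

On the sample, the first header comes back as two parts ("chapter x" and "the lobster quadrille") and the second as a single part ("chapter xi").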

-bowerbird

kovidgoyal · 11-09-2007, 07:23 PM

Quote:

Originally Posted by bowerbird

ninja bob said:
just wait until i release my mirror of the p.g. library...

You wouldn't happen to have a timeline on this, would you?

bowerbird · 11-09-2007, 08:04 PM

> a timeline

i sure do. when the mirror is available, i'll let people know.

-bowerbird

bowerbird · 11-09-2007, 08:16 PM

greg said:
> I do not understand this at all.

i'm sorry. but that's ok. :+)

> I am only suggesting a lightweight standard
> no more difficult than Xhtml or epub

we differ on what constitutes "lightweight".

> no more difficult than Xhtml or epub

let me tell you the way i am framing this matter...

actually, let me just send you to the place directly:
> http://pgdp.net

that's the website for _distributed_proofreaders_.

they are the _volunteers_ who actually _digitize_
most of the project gutenberg e-texts these days.

they scan books, or find scan-sets from elsewhere,
do o.c.r. (or get that from elsewhere too), and then
subject the o.c.r. results to proofing and formatting
rounds administered right there in a web-based system.
after which results are assembled by a "postprocessor"
into the files that are submitted to and posted by p.g.

unless you've got an idea about somebody else doing it,
_these_ are the _volunteers_ who would do your markup.

i emphasized that these people are _volunteers_ because
they aren't getting paid to do this. they walk in the door
and are put to work. they don't necessarily have training.
they're motivated to work, but you can't order 'em around.
you can't force 'em to do something they don't want to do.

although a technoid faction has tried to entice them into
doing markup in tei-light, it's been some very slow going.
they spent about 6 years in various stages of "planning".
(yes, you did indeed read that right, i said _six_years_...)
over the last year, a few people did some 200 .tei e-texts.

the thought of treating the backlog of 15,000+ e-texts
hasn't even considered moving off of the back-burner...
(p.g.'s count is much higher, but they have _duplicates_
and an ever-increasing percentage of non-book items,
most recently .mp3 audio-books done by librivox.org.)

the person who developed the brand of .tei they're using
-- pgtei -- isn't interested in building any tools for them,
so there hasn't been much interest from the volunteers...

that's the situation as it exists today. that's how i frame it.
if you frame it differently, we won't understand each other...
perhaps you still consider this to be merely "hypothetical"?
that's not a bad thing. but there's a reality that's here now.

> As for future proofing, you seem to miss the point altogether.

i'm sorry. but that's ok... :+)

> I have taken at least seven or eight Shakespeare plays from Gutenberg,
> turned them into word processing documents, cleaned them and
> then through stylesheets reedited them and finally after a lot of effort
> (plays are really hard to do compared to novels) produce a pdf
> to print out copies for my students.

ok, now _that_ i understand. perfectly well.

my intention is to take those same plays -- and everything else --
from project gutenberg, run them through my app to convert them
into z.m.l., after which they can be printed out to .pdf immediately.

in other words, my converter-program does all the clean-up and
formatting work that took "a lot of effort" for you to do manually.

you saw -- from gutenmark -- how such a program can save time.
i intend for my program to be even better than gutenmark at that.

(and i _sincerely_ wish ron burkey was still developing gutenmark,
because i believe a healthy competition between us would be fun,
and push the state-of-the-art to a high level good for all of us...)

> In short, though the tools are wrong for the purpose I have
> been doing just sort of thing because I had no choice -
> but the end result did not give anyone else useful.

so you're aware of the same thing confronting the d.p. people,
that there are no tools which can help you do what you want...

i'm sorry about that.

and i don't have any suggestions for you, either.

except maybe to consider _why_ you don't have any tools, and
how that ramifies on the wisdom of the path you're choosing...

i _do_ think you are too hard on yourself when you say
"the end result did not give anyone else useful", because
your students _did_ get those plays in .pdf format, right?
(but you might want to consider rewriting that sentence.)

> I could just as well have been properly marking up the play
> (using TEI derived tags) and placed back in Gutenberg
> something useful to others.

i'm not altogether sure how "useful" a .tei file is these days.
but i'm sure you'll inform me that it _will_be_ useful later on.

> I have dealt with HTML and text versions of a variety of literature.
> For some purposes just being in HTML makes things very easy,
> but it also can make things a lot harder as well.

if you care to expand on that, i'm sure i would find it interesting. :+)

> Hence I may in the future for whatever reason,
> desire to quickly retrieve Chapter Seven of "Pride and Prejudice"
> how can this be done unless the computer has
> a means of finding exactly what I asked it to find?

well, in z.m.l. it would be simple for you to specify such a request...

i just made a post telling ninja bob how to find headers in a z.m.l. file.
so he'd point his program at pride and prejudice, and fish out chapter 7,
i.e., the chapter heading that said "chapter 7", or -- failing that -- just "7".
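As a rough illustration of "fishing out chapter 7", here is a short Python sketch. It assumes the blank-line convention described earlier in the thread (more than three blank lines before a header, two blank lines after it) and that blank lines are truly empty; the function names are hypothetical and the sample text is a placeholder, not the real novel.

```python
import re


def split_sections(text):
    """Split a text into (header, body) pairs.

    Five or more newlines in a row means more than three blank lines,
    which is taken to start a new section. Inside a section, the header
    runs up to the first pair of blank lines (three newlines in a row).
    """
    sections = []
    for chunk in re.split(r"\n{5,}", text):
        header, _, body = chunk.strip("\n").partition("\n\n\n")
        sections.append((header, body))
    return sections


def find_chapter(text, number):
    """Return the body under 'chapter N', or failing that under just 'N'."""
    sections = split_sections(text)
    for wanted in (f"chapter {number}", str(number)):
        for header, body in sections:
            first_line = header.split("\n", 1)[0].strip().lower().rstrip(".")
            if first_line == wanted:
                return body
    return None


if __name__ == "__main__":
    sample = ("Pride and Prejudice\n\n\n\n\n"
              "Chapter 6\n\n\ntext of chapter six\n\n\n\n\n"
              "Chapter 7\n\n\ntext of chapter seven\n")
    print(find_chapter(sample, 7))
```

Note that this matches on the wording of the heading, so it only finds what the heading literally says; a heading such as "Chapter Seven" or "VII" would need an extra rule.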

> These people have not been wasting their time,
> their precision is not useless but vital,
> and their knowledge (a part of which resides
> in the very code of TEI) cannot be ignored.

i don't ignore their knowledge.
i don't say they're "wasting their time".
and i certainly don't say their precision is "useless".
i think it would be _very_ useful. if only we could afford it...

> And I repeat an ultra-light version of TEI
> need be no more difficult than what we are already using

well, "we" -- as in you and i -- aren't "already using" the same thing.

.tei-light (or your "ultra-light") might not be more "difficult" than what
_you_ are already using, but it's _way_ more difficult than what i'm using.

but i'm not your target-audience anyway...

and neither are the people here at mobileread.

your target-audience is the volunteers over at distributed proofreaders.

and they don't even need to be "convinced" -- they already _agree_,
or at least they haven't mounted an outright revolt against pgtei --
but in order for them to actually start doing .tei, they need _tools_...

so if you want them to act, find some tools, and they'll be very happy...

if you don't _have_ any tools, you're just crying out in the wilderness...

> For me this just places it amongst display technologies,
> which is no bad place to be. The problems of text repositories
> is a different thing altogether.

cost-benefit ratio. that's all i can say: cost-benefit ratio.

> A huge labour but one that can accumulate
> over time in a systematic and reliable way.

go build such a library, and prove it has a superior cost-benefit ratio.

i predict you'll go broke before you ever start returning any benefits...
prove me wrong.

> I have serious reservations that it answers the right question.

i say that you're answering the wrong question, and
you're saying that i'm answering the wrong question.

that's what makes a horse-race.

> However, as a display technology it may well have a place.

thanks and all, but this horse-race is not for second-place.
it's to see who can _win_. i tell you my horse is gonna win.
your horse? certainly won't finish, and might not even get
out of the gate in the first place. so, does my saying that
give you motivation to prove me wrong? fine. then do it...

> I would add the proviso that if it can easily translate
> from epub and to epub then this would be a vital attribute
> in its acceptance, especially if it is as easy as you say to code with it.

i'm not even gonna turn on that capacity. i don't have to.
plus doing so would only help out _your_horse_...
do your own markup. it's important that you feel its pain.
that cost is why your cost-benefit ratio will never be worth it.

> For my part, when time permits, I intend to look carefully
> at epub and try and make a version of TEI to fit it, and then
> write a small program to translate it one way and the other.

there are people doing that, if you don't want to duplicate effort.
they might love to have your help. do the research to find them.
of course, if you want to do it yourself, go ahead. i always do...
(however, i still do the research, because that's the smart thing.)

-bowerbird

p.s. if your target is project gutenberg, you should
_not_ develop your own "ultra-light" .tei, because
the guy who built their pgtei (who is their webmaster)
is _extremely_ protective of it, and will _attack_ you
if you try to suggest any alternative. (because, after all,
he's spent all these years developing it, so of course
he'll always believe he knows more about .tei than you.
he's a nasty character. don't cross him unless you dare.)

kovidgoyal · 11-09-2007, 09:02 PM

Quote:

Originally Posted by bowerbird

> a timeline

i sure do. when the mirror is available, i'll let people know.

-bowerbird

I meant when do you expect to be able to make the p.g. mirror available to the public, as in a date, or a length of time. If you don't have an estimate, that's fine too :-)

GregS · 11-09-2007, 09:59 PM

bowerbird, many thanks for the reference to PGTEI, which I had missed somehow (not surprising; I miss a lot of things).

It looks like much of the most important stuff has been done. This has been a great reference for me, and although you disagree with the direction, it is just what I was hoping for.

Given what seems to be there, a subset definition (which requires no change to PGTEI), especially for epub or whatever, is not hard to do, nor is a script to produce it fundamentally difficult.

I have a little long-term project of my own, all about marking up text graphically, and have done just a little preliminary coding for it. I am waiting on the release of REBOL 3 to move that particular project forward; I cannot make even vague promises about it though.

Suffice it to say, the program tool/application problem is the biggy. However, it does not necessarily mean big applications to solve it, but a slightly different approach to how tags are in fact applied.

In short you have made my day.

In terms of your own project, I am still very hazy. Later today or tomorrow I will go through the thread carefully, because you have given examples, and see what I can make of them.

bob_ninja · 11-09-2007, 10:55 PM

Greg,
Thanks for the link. I examined its description. I actually have much more modest goals: simply to adjust the formatting of plain text, not to upgrade it to HTML or any other richer format. I just want to make better use of the screen on smaller readers.

For instance, I want to remove the annoying end-of-line markers that break up a paragraph into many segments and cause my reader to waste a lot of space:

After the line markers are removed:

Now the screen is filled and used very well, hence less scrolling.
I'll check the requirements list from the initial post and try to add more options. I'll post it as freeware in a new post.

bowerbird,
I use Java which has regular expressions capability. I plan to simply build some search/replace regexp patterns and allow a user to enter his/her own to customize according to individual preferences. Shouldn't be too bad. Actually most of the work is for interfaces, GUI and CLI.
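A small sketch of the "built-in patterns plus user-supplied patterns" idea. The post describes a Java tool; this example is in Python only for illustration, though the lookbehind and lookahead syntax it relies on is also available in Java's regex package. The rule list, names and defaults are assumptions, not bob_ninja's actual design.

```python
import re

# Illustrative default rules, applied in order: (pattern, replacement).
DEFAULT_RULES = [
    (r"\r\n?", "\n"),                    # normalise line endings
    (r"[ \t]+\n", "\n"),                 # strip trailing whitespace
    (r"(?<=[^\n])\n(?=[^\n])", " "),     # join a lone line break with a space
    (r"\n{3,}", "\n\n"),                 # collapse runs of blank lines
]


def apply_rules(text, rules=DEFAULT_RULES, extra_rules=()):
    """Run each search/replace rule over the text, defaults first,
    then whatever extra rules the user has supplied."""
    for pattern, replacement in list(rules) + list(extra_rules):
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    sample = ("This paragraph was wrapped\nat a fixed width\nby hand.\n\n"
              "The next paragraph\nfollows.\n")
    print(apply_rules(sample))
    # A user-defined rule: drop the _underscores_ used for emphasis.
    print(apply_rules("a _very_ good\nbook\n", extra_rules=[(r"_(.+?)_", r"\1")]))
```

A plain pass like this joins every hard-wrapped line, so it will also join verse, tables and anything else that was meant to keep its line breaks.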

GregS · 11-09-2007, 11:41 PM

bob_ninja, no worries and thanks.

I habitually use OpenOffice to clean up text, more for general convenience because its layout tools are reasonable for my purposes and its pdf creation reliable.

I always forget the correct grep code for end of line markers, and have to experiment a little each time in find/replace to get it right (I swear I always seem to forget the most used things - it is an annoying habit).

Hopefully in the not too distant future we may see a plethora of lightweight GUI tools that really do make lots of little jobs much, much easier to perform.

bowerbird · 11-09-2007, 11:41 PM

ninja bob, when all you want to do is unwrap the hard line-breaks, try this:
> http://z-m-l.com/unwrap.pl

it works, for the most part, except it _also_ unwraps tables, poetry, and
other things which should _not_ be unwrapped. this is because of one of
the _biggest_ problems with the project gutenberg e-texts, namely that
these lines which should not be unwrapped are not unequivocally marked.

so one of the changes that i make when i convert a p.g. e-text to z.m.l. is to
_detect_ these lines, and then _mark_ them by giving them a leading space.
later, my unwrapping routines for z.m.l. _respect_ a leading space in a line
as a signal that that line should not be unwrapped. mission accomplished...
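The second half of that scheme, unwrapping while respecting a leading space, might look like this in Python. It is only a sketch of the convention as described in this post, with a made-up function name; the hard part, detecting which lines deserve the leading space in the first place, is not shown.

```python
def unwrap(text):
    """Join hard-wrapped lines into paragraphs.

    Blank lines separate paragraphs. A line that begins with a space is
    treated as marked "do not unwrap" and is copied through unchanged.
    """
    out = []
    paragraph = []

    def flush():
        # Emit the paragraph collected so far as one long line.
        if paragraph:
            out.append(" ".join(paragraph))
            paragraph.clear()

    for line in text.split("\n"):
        if line.strip() == "":
            flush()
            out.append("")
        elif line.startswith(" "):
            flush()
            out.append(line)  # marked line: keep it as it is
        else:
            paragraph.append(line.strip())
    flush()
    return "\n".join(out)


if __name__ == "__main__":
    sample = ("The quick brown fox\njumps over the lazy dog.\n\n"
              " Twinkle, twinkle, little bat!\n"
              " How I wonder what you're at!\n\n"
              "And so on\nand on.")
    print(unwrap(sample))
```

The two prose paragraphs come out as single lines, while the two lines of verse keep their own line breaks and their leading space.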

this is just one example of one change that needs to be done to a p.g. e-text
in order to make it more functional. z.m.l. as a whole is a _collection_ of
_all_ of those changes my focused research has deemed to be necessary.

some people will tell you a _human_ has to go through the e-text to "decide"
which lines should be marked as immune from rewrapping, that the decision
takes human intelligence, and cannot be programmed into a computer. well,
i won't tell you that my routines never make any mistakes, because they do.
but i _can_ inform you that they make _most_ of the decisions correctly, and
that's because i worked, and worked some more, then worked even _more_,
so that they _would_ make most of the decisions correctly.

so one part of the better functionality my z.m.l. mirror will give developers
will be the ability to unwrap the text at will, without introducing problems...

-bowerbird

bowerbird · 11-09-2007, 11:58 PM

> I meant when do you expect to be able to make the p.g. mirror available to the public,
> as in a date, or a length of time. If you don't have an estimate, that's fine too :-)

yeah, i know that's what you meant. and that's why i answered like i did...

i learned a long time ago not to make estimates. don't expect it until it actually arrives...

having said that, i'll also say "it depends". (which, yeah, isn't any more informative.)

the body of each e-text is pretty much already in z.m.l. format.
to the extent that it's not, the changes are pretty much automatic.
if that was all i was concerned about, i could do it in a week or two.

the problem area for each e-text is the front-matter: the title-page,
table of contents, dedication, list of illustrations, all that type of stuff.
what i _want_ to do is edit all of that to an extremely high standard...

but it's pretty slow going. even at 5 minutes per e-text, that's 12/hour,
or 100 for an 8-hour day. and slackers like me don't work 8-hour days.
so when you've got 15,000 of the suckers, even 5 minutes per adds up.
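For what it is worth, the arithmetic can be spelled out with the figures given in this post (5 minutes per e-text, 15,000 e-texts, an 8-hour day). The function name is illustrative; note that the post rounds 96 per day up to 100.

```python
def front_matter_estimate(etexts=15_000, minutes_each=5, hours_per_day=8):
    """Return (e-texts per hour, e-texts per day, total hours, working days)."""
    per_hour = 60 / minutes_each
    per_day = per_hour * hours_per_day
    total_hours = etexts * minutes_each / 60
    working_days = total_hours / hours_per_day
    return per_hour, per_day, total_hours, working_days


if __name__ == "__main__":
    per_hour, per_day, total_hours, working_days = front_matter_estimate()
    print(f"{per_hour:.0f} per hour, {per_day:.0f} per 8-hour day")
    print(f"{total_hours:.0f} hours, or about {working_days:.0f} full working days")
```

That comes to 1,250 hours, a little over 156 full eight-hour days of hand-editing, before allowing for anyone working shorter days.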

eventually, after enough time goes on and i keep avoiding this task,
i'll undoubtedly drop my desire to hit that high standard, and go with
something more quick-and-dirty. i've noticed that hadrien settles for
the title and the author on the title-page and then jumps into the book.
if i did that, i could pull the info from the catalog, and it'd be very quick.
if i decided to try and write some code to rework the front-matter that
is actually present in each e-text, then that might or might not be quick,
depending on how well the programming went. could even be very slow.
i can't even do an estimate on that until i've hand-edited enough e-texts
to get a handle on what the typical edits are, and how to automate 'em...

i've also considered building a wiki and asking the public to go at it...

so, depending on how all of this shakes out, it could be relatively soon,
or it could drag on for a little while, or it could drag on for a long while.

but i certainly don't advise that anyone hold their breath waiting for it...

indeed, don't even _expect_ it until it has actually arrived...

-bowerbird

bowerbird · 11-10-2007, 12:07 AM

greg said:
> However, it does not necessarily mean big applications to solve it,
> but a slightly different approach to how tags are in fact applied.

ok.

i'm not sure _exactly_ what that means, but ok... :+)

do keep in mind that, in the real-world of project gutenberg today,
the tags are being applied by distributed proofreader volunteers...

now maybe you have something completely different in mind...

but if in your mind those volunteers would be applying .tei tags,
then you really need to go over and introduce yourself to them.
they pride themselves on being very friendly -- they'll tell you
that over and over and over -- but another truth is that they
don't take kindly to strangers telling them how to do their job.
so you will need to bow down and ingratiate yourself to them
before you should even whisper a suggestion about what to do.
_especially_ about .tei, because it's been a "plan" for so long...

> In short you have made my day.

great.

not everyone else feels the same, but
that's the way life is in the honest lane.

glad i could be of some help...

-bowerbird

kovidgoyal · 11-10-2007, 12:08 AM

Doesn't gutenmark already handle front matter? You could just lift the routines from there.

bowerbird · 11-10-2007, 12:23 AM

> Doesn't gutenmark already handle front matter?

we have different definitions of "handle".

the title-page in a z.m.l. file is highly structured,
because its info is collected into a library catalog.

many of the other parts of z.m.l. front-matter are
expected to conform to a certain framework too...

on the other hand, front-matter in p.g. e-texts is
probably _the_ most wildly inconsistent element
in the entire catalog, which is not surprising when
you consider that it is coming from a wide range
of different publishers, so taming it is difficult...

-bowerbird


11-09-2007, 06:34 PM	#136
GregS Zealot Posts: 107 Karma: 308 Join Date: Oct 2007 Location: Perth Australia Device: EZ Reader 5", Iliad	bowerbird "that _sounds_ good. until you realize that -- depending on how one defines "properly structured", and how one considers "the widest possible uses", not to mention the crystal-ball on "the future" -- doing heavy markup might be _very_ expensive." I do not understand this at all. Full marking up in TEI is not being suggested, I am only suggesting a lightweight standard no more difficult than Xhtml or epub, but adapted for text repositories rather than display (though with CSS this poses no problem whatsoever - epub requires a trivial translation). It was after having a look at Gutenberg Marker (which solves a lot of problems really very well), that a couple of extra steps would make such software very useful for establishing a standard ultra light TEI. As for future proofing, you seem to miss the point altogether. Scholars have already developed concepts of textual structural analysis. If the structure can be unambiguously marked the text is future proofed, because one way or another it is the structure that has been the most elusive aspect of text and handling it for different purposes. No magic involved, just progressively adding in tags by editors who do know what they are doing (TEI is as the texts are themselves inherently hard to do as the complexity of the markup meets the complexity of text itself). I have taken at least seven or eight Shakespeare plays from Gutenberg, turned them into word processing documents, cleaned them and then through stylesheets reedited them and finally after a lot of effort (plays are really hard to do compared to novels) produce a pdf to print out copies for my students. In short, though the tools are wrong for the purpose I have been doing just sort of thing because I had no choice - but the end result did not give anyone else useful. 
I could just as well have been properly marking up the play (using TEI derived tags) and placed back in Gutenberg something useful to others. This is what I mean by progressively taging repository texts. Academics occasionally resort to Gutenberg, but whatever they do the text is lost to the repository as well. Students no doubt use the texts for study in their literature degrees, whatever they do is lost. I have dealt with HTML and text versions of a variety of literature. For some purposes just being in HTML makes things very easy, but it also can make things a lot harder as well. The other thing is that just reading texts (or printing them) is only one aspect of deigitizing texts. Storing them as virtual texts (with their structure preserved and readable) is vitally important for the preservation of literature. This is not just academic prejudice, it is what makes the texts adjustable to unpredictable future usage. A chapter is not a title (a small criticism of Gutenberg Marker) it is a division that may or may not have a title. Hence I may in the future for whatever reason, desire to quickly retrieve Chapter Seven of "Pride and Prejudice" how can this be done unless the computer has a means of finding exactly what I asked it to find? You don't need a crystal ball just an understanding of text itself from a scholarly point of view. These people have not been wasting their time, their precision is not useless but vital, and their knowledge (a part of which resides in the very code of TEI) cannot be ignored. And I repeat an ultr-light version of TEI need be no more difficult than what we are already using, but it is not closed off like XHTML/epub or any other display technology. Being XML it is probably just as displayable in most contexts anyhow. I looked at your references and had seen them, but as I could only see html markup in the source I went looking elsewhere. Sorry for the mistake. 
My description of this thread as being about the Second Digital Revolution was not misplaced. The whole problem with Gutenberg at the moment is that it is rooted in the First, hence the compounding problems and the variety of solutions being proposed. I have no prejudice against your system, except of course, until I trawl through this long thread I have no clear idea of what it is, I am based on other readers comments, not too sure I will be that much clearer if I do. "you're welcome to look at it, but i can pretty much tell you now that it won't be a good fit, because your head wants an "ideal" markup system -- which anticipates "any possible use, now or in the future" -- whereas z.m.l. is fully grounded in the tradeoffs that a cost-benefit ratio demands." For me this just places it amongst display technologies, which is no bad place to be. The problems of text repositories is a different thing altogether. Consider that a text is fully marked up in TEI (ie more tags than text). A huge labour but one that can accumulate over time in a systematic and reliable way. What is needed to translate it? "Find these tags and change them thus...." "Ignore every other tag" - the end result could be anything. As I wrote (with some help) something very similar in REBOL, I know that such a script be less than a page of code and is only a tiny delay in simply copying the file to a new location. I also hold out some hope that students around the world may in the mid-term future look forward to having a device that is a notebook/reader capable of displaying TEI encoded documents in an academically useful way. However, the technology has to develop and for that it needs to establish a good market, for that the most important aspect is to establish standards such as epub, which may not be the most efficient or versatile, but make it possible to buy and keep literature with some assurance that on future devices it is either directly readable or can be made so. 
My opinion is that efficient display codes as you propose are not the real problem. I don't doubt anything you say about it, but I have serious reservations that it answers the right question. However, as a display technology it may well have a place. I would add the proviso that if it can easily translate from epub and to epub then this would be a vital attribute in its acceptance, especially if it is as easy as you say to code with it. To everyone else, please carefully consider the idea of the separate problem of text repositories, though it be far removed from mobileread. For my part, when time permits, I intend to look carefully at epub and try and make a version of TEI to fit it, and then write a small program to translate it one way and the other. If this looks any good in the end, I will set up a site and make people aware of it here, at various text repositories I know of and of course at the TEI consortium. However, it is at least a month before I can seriously sit down with it, and maybe not even then. If anyone is in a better position to do the same thing, I will help in any way I can. Ideally, if it works well, epub software might easily be adapted to display it as well and thus solve the problem in one blow. Greg Schofield Perth Australia (An English High School teacher)
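
Editorial sketch of the translation step described in the post above ("find these tags and change them thus", "ignore every other tag"). It is written in Python rather than the REBOL mentioned, and everything specific in it is an assumption for illustration: the tag mapping, the sample fragment and the function name `translate` come from neither PGTEI nor epub. It also assumes un-namespaced markup and that an "ignored" tag keeps its text but loses its wrapper.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical mapping from a few TEI-style tags to display tags.
TAG_MAP = {"text": "body", "div": "div", "head": "h2", "p": "p", "hi": "em"}


def translate(elem, tag_map):
    """Rebuild elem as a string, renaming mapped tags and unwrapping the rest."""
    inner = escape(elem.text or "")
    for child in elem:
        inner += translate(child, tag_map)
        inner += escape(child.tail or "")
    new_tag = tag_map.get(elem.tag)
    if new_tag is None:
        # "Ignore every other tag": keep the content, drop the wrapper.
        return inner
    return f"<{new_tag}>{inner}</{new_tag}>"


source = (
    "<text><div><head>Chapter 7</head>"
    "<p><name>Elizabeth</name> was <hi>not</hi> amused.</p></div></text>"
)
result = translate(ET.fromstring(source), TAG_MAP)
print(result)
```

The whole conversion is one recursive function plus a lookup table, which is roughly the "less than a page of code" claim; attributes are discarded here, and a real converter would have to decide what to do with them.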

11-09-2007, 06:45 PM	#137
GregS Zealot Posts: 107 Karma: 308 Join Date: Oct 2007 Location: Perth Australia Device: EZ Reader 5", Iliad	bob_ninja I found reference to at least two when I first looked at this thread (somewhere). I looked at one, http://www.sandroid.org/GutenMark/, which I have mistakenly referred to as Gutenberg Marker. Having been dealing with Gutenberg texts for a good many years now, I wish I had found it earlier. For many purposes it looks really useful. I will certainly be using it. I unfortunately did not look at the other software that was mentioned, and now cannot remember what it was. I will look again at the thread and if I can read the whole thing through (time is scarce) I will try and write a small summary. But have a look at GutenMark. I have not downloaded it yet, but it seems to do all the most boring bits really well. I cannot say much more until I use it.

11-09-2007, 06:54 PM	#138
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	ninja bob said: > I started writing my own tool > and would like to avoid reinventing the wheel. bob, i highly recommend that you proceed... i wrote such a tool, so i can tell you that it was one of the greatest programming experiences that i've had in over 25 years of for/next loops. it's not an easy task, but if you have persistence -- ok, "tenacity" might be a more accurate word -- the resistance _will_ crumble from your onslaught. the secret formula is to cumulate your successes, not to seek one magic bullet. that's all i can say... on the other hand, if you just want the p.g. library to be _consistent_, so you can then add value to it, and you have no particular appetite for the challenge of coding the application to produce that consistency, just wait until i release my mirror of the p.g. library... it will be a consistent, structured version of the library, in a format that makes it easy to write the routines to recognize the structural elements within each e-text... as just one example, if a line is preceded by more than three blank lines, that line is a header for a new section. the header is terminated by two successive blank lines. if there is one blank line within the header, you have a two-part header (e.g., chapter x / the lobster quadrille). so there's the pseudo-code for finding headers in z.m.l. -bowerbird
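
Editorial sketch of the header rule given in the post above, as runnable Python. Only the three rules come from the post (more than three blank lines before a line starts a header, two successive blank lines end it, a single blank line inside it splits it into parts). The function name `find_headers`, the sample text and the choice to join multi-line parts with a space are assumptions, and this is not bowerbird's actual code.

```python
def find_headers(text):
    """Return a list of headers; each header is a list of its parts."""
    lines = text.split("\n")
    headers = []
    blank_run = 0  # blank lines seen since the last non-blank line
    i = 0
    while i < len(lines):
        if lines[i].strip() == "":
            blank_run += 1
            i += 1
            continue
        if blank_run > 3:
            # This line starts a header: read until two successive blank lines.
            parts = [[]]
            blanks = 0
            while i < len(lines):
                if lines[i].strip() == "":
                    blanks += 1
                    if blanks == 2:
                        break
                else:
                    if blanks == 1 and parts[-1]:
                        parts.append([])  # one blank line: next header part
                    blanks = 0
                    parts[-1].append(lines[i].strip())
                i += 1
            headers.append([" ".join(p) for p in parts])
            blank_run = 1  # the first of the two closing blank lines
            continue
        blank_run = 0
        i += 1
    return headers


sample = (
    "intro text\n\n\n\n\nchapter x\n\nthe lobster quadrille\n\n\n"
    "body line one\nbody line two\n\n\n\n\nchapter xi\n\n\nmore body\n"
)
print(find_headers(sample))
```

On the sample, the first header comes back in two parts and the second in one, matching the "chapter x / the lobster quadrille" example in the post.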

11-09-2007, 08:04 PM	#140
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	> a timeline i sure do. when the mirror is available, i'll let people know. -bowerbird

11-09-2007, 08:16 PM	#141
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	greg said: > I do not understand this at all. i'm sorry. but that's ok. :+) > I am only suggesting a lightweight standard > no more difficult than Xhtml or epub we differ on what constitutes "lightweight". > no more difficult than Xhtml or epub let me tell you the way i am framing this matter... actually, let me just send you to the place directly: > http://pgdp.net that's the website for _distributed_proofreaders_. they are the _volunteers_ who actually _digitize_ most of the project gutenberg e-texts these days. they scan books, or find scan-sets from elsewhere, do o.c.r. (or get that from elsewhere too), and then subject the o.c.r. results to proofing and formatting rounds administered right there in a web-based system. after which results are assembled by a "postprocessor" into the files that are submitted to and posted by p.g. unless you've got an idea about somebody else doing it, _these_ are the _volunteers_ who would do your markup. i emphasized that these people are _volunteers_ because they aren't getting paid to do this. they walk in the door and are put to work. they don't necessarily have training. they're motivated to work, but you can't order 'em around. you can't force 'em to do something they don't want to do. although a technoid faction has tried to entice them into doing markup in tei-light, it's been some very slow going. they spent about 6 years in various stages of "planning". (yes, you did indeed read that right, i said _six_years_...) over the last year, a few people did some 200 .tei e-texts. the thought of treating the backlog of 15,000+ e-texts hasn't even moved off of the back-burner... (p.g.'s count is much higher, but they have _duplicates_ and an ever-increasing percentage of non-book items, most recently .mp3 audio-books done by librivox.org.) 
the person who developed the brand of .tei they're using -- pgtei -- isn't interested in building any tools for them, so there hasn't been much interest from the volunteers... that's the situation as it exists today. that's how i frame it. if you frame it differently, we won't understand each other... perhaps you still consider this to be merely "hypothetical"? that's not a bad thing. but there's a reality that's here now. > As for future proofing, you seem to miss the point altogether. i'm sorry. but that's ok... :+) > I have taken at least seven or eight Shakespeare plays from Gutenberg, > turned them into word processing documents, cleaned them and > then through stylesheets reedited them and finally after a lot of effort > (plays are really hard to do compared to novels) produce a pdf > to print out copies for my students. ok, now _that_ i understand. perfectly well. my intention is to take those same plays -- and everything else -- from project gutenberg, run them through my app to convert them into z.m.l., after which they can be printed out to .pdf immediately. in other words, my converter-program does all the clean-up and formatting work that took "a lot of effort" for you to do manually. you saw -- from gutenmark -- how such a program can save time. i intend for my program to be even better than gutenmark at that. (and i _sincerely_ wish ron burkey was still developing gutenmark, because i believe a healthy competition between us would be fun, and push the state-of-the-art to a high level good for all of us...) > In short, though the tools are wrong for the purpose I have > been doing just sort of thing because I had no choice - > but the end result did not give anyone else useful. so you're aware of the same thing confronting the d.p. people, that there are no tools which can help you do what you want... i'm sorry about that. and i don't have any suggestions for you, either. 
except maybe to consider _why_ you don't have any tools, and how that ramifies on the wisdom of the path you're choosing... i _do_ think you are too hard on yourself when you say "the end result did not give anyone else useful", because your students _did_ get those plays in .pdf format, right? (but you might want to consider rewriting that sentence.) > I could just as well have been properly marking up the play > (using TEI derived tags) and placed back in Gutenberg > something useful to others. i'm not altogether sure how "useful" a .tei file is these days. but i'm sure you'll inform me that it _will_be_ useful later on. > I have dealt with HTML and text versions of a variety of literature. > For some purposes just being in HTML makes things very easy, > but it also can make things a lot harder as well. if you care to expand on that, i'm sure i would find it interesting. :+) > Hence I may in the future for whatever reason, > desire to quickly retrieve Chapter Seven of "Pride and Prejudice" > how can this be done unless the computer has > a means of finding exactly what I asked it to find? well, in z.m.l. it would be simple for you to specify such a request... i just made a post telling ninja bob how to find headers in a z.m.l. file. so he'd point his program at pride and prejudice, and fish out chapter 7, i.e., the chapter heading that said "chapter 7", or -- failing that -- just "7". > These people have not been wasting their time, > their precision is not useless but vital, > and their knowledge (a part of which resides > in the very code of TEI) cannot be ignored. i don't ignore their knowledge. i don't say they're "wasting their time". and i certainly don't say their precision is "useless". i think it would be _very_ useful. if only we could afford it... > And I repeat an ultr-light version of TEI > need be no more difficult than what we are already using well, "we" -- as in you and i -- aren't "already using" the same thing. 
.tei-light (or your "ultra-light") might not be more "difficult" than what _you_ are already using, but it's _way_ more difficult than what i'm using. but i'm not your target-audience anyway... and neither are the people here at mobileread. your target-audience is the volunteers over at distributed proofreaders. and they don't even need to be "convinced" -- they already _agree_, or at least they haven't mounted an outright revolt against pgtei -- but in order for them to actually start doing .tei, they need _tools_... so if you want them to act, find some tools, and they'll be very happy... if you don't _have_ any tools, you're just crying out in the wilderness... > For me this just places it amongst display technologies, > which is no bad place to be. The problems of text repositories > is a different thing altogether. cost-benefit ratio. that's all i can say: cost-benefit ratio. > A huge labour but one that can accumulate > over time in a systematic and reliable way. go build such a library, and prove it has a superior cost-benefit ratio. i predict you'll go broke before you ever start returning any benefits... prove me wrong. > I have serious reservations that it answers the right question. i say that you're answering the wrong question, and you're saying that i'm answering the wrong question. that's what makes a horse-race. > However, as a display technology it may well have a place. thanks and all, but this horse-race is not for second-place. it's to see who can _win_. i tell you my horse is gonna win. your horse? certainly won't finish, and might not even get out of the gate in the first place. so, does my saying that give you motivation to prove me wrong? fine. then do it... > I would ad the proviso that if it can easily translate > from epub and to epub then this would be a vital attribute > in its acceptance, especially if it is as easy as you say to code with it. i'm not even gonna turn on that capacity. i don't have to. 
plus doing so would only help to help out _your_horse_... do your own markup. it's important that you feel its pain. that cost is why your cost-benefit ratio will never be worth it. > For my part, when time permits, I intend to look carefully > at epub and try and make a version of TEI to fit it, and then > write a small program to translate it one way and the other. there are people doing that, if you don't want to duplicate effort. they might love to have your help. do the research to find them. of course, if you want to do it yourself, go ahead. i always do... (however, i still do the research, because that's the smart thing.) -bowerbird p.s. if your target is project gutenberg, you should _not_ develop your own "ultra-light" .tei, because the guy who built their pgtei (who is their webmaster) is _extremely_ protective of it, and will _attack_ you if you try to suggest any alternative. (because, after all, he's spent all these years developing it, so of course he'll always believe he knows more about .tei than you. he's a nasty character. don't cross him unless you dare.)
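
Editorial sketch of the lookup described in the post above: find the section whose heading says "chapter 7" or, failing that, just "7". The function name `get_section` and the sample book are assumptions. So is the simplification that blank lines are truly empty, so that "more than three blank lines" shows up as five or more consecutive newlines. Headings such as "Chapter VII" or "Chapter Seven" are not handled.

```python
import re


def get_section(text, number):
    """Return the body of the section headed 'chapter N' or, failing that, 'N'."""
    # More than three blank lines between sections = five or more newlines.
    sections = re.split(r"\n{5,}", text)
    for label in (f"chapter {number}", str(number)):
        # The first chunk is skipped: nothing before it marks it as a header.
        for section in sections[1:]:
            # Two successive blank lines end the header.
            header, _, body = section.partition("\n\n\n")
            # Only the first part of a two-part header is compared.
            first_part = header.split("\n\n")[0].strip().lower()
            if first_part == label:
                return body.strip()
    return None


book = (
    "Title page stuff\n\n\n\n\nChapter 6\n\n\nSixth body.\n\n\n\n\n"
    "Chapter 7\n\n\nSeventh body,\nsecond line.\n\n\n\n\nChapter 8\n\n\nEighth body.\n"
)
print(get_section(book, 7))
```

The lookup works only as well as the headings are consistent, which is the point both posters are arguing from opposite sides.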

11-09-2007, 09:59 PM	#143
GregS Zealot Posts: 107 Karma: 308 Join Date: Oct 2007 Location: Perth Australia Device: EZ Reader 5", Iliad	bowerbird many thanks for the reference to PGTEI, which I had missed somehow (not surprising, I miss a lot of things). It looks like much of the most important stuff has been done. This has been a great reference for me, and although you disagree with the direction it is just what I was hoping for. Given what seems to be there, a subset definition (which requires no change to PGTEI), especially for epub or whatever, is not hard to do, nor is a script to produce it fundamentally difficult. I have a little long-term project of my own, all about marking up text graphically, and have done just a little preliminary coding for it. I am waiting on the release of REBOL 3 to move that particular project forward; I cannot make even vague promises about it though. Suffice it to say, the program tool/application problem is the biggy. However, it does not necessarily mean big applications to solve it, but a slightly different approach to how tags are in fact applied. In short you have made my day. In terms of your own project, I am still very hazy. I will later today or tomorrow go through the thread carefully, because you have given examples, and see what I can make of them.

11-09-2007, 10:55 PM	#144
bob_ninja Addict Posts: 208 Karma: 582 Join Date: Aug 2006 Device: Zire71	Greg, Thanks for the link. I examined its description. I actually have much more modest goals: to simply adjust the text format content, not upgrade it to HTML or any other richer format. I just want to get better screen use on smaller reader screens. For instance, I want to remove the annoying end-of-line markers that break up a paragraph into many segments and cause my reader to waste a lot of space: After the line markers are removed: Now the screen is filled and used very well, hence less scrolling. I'll check the requirements list from the initial post and try to add more options. I'll post it as freeware in a new post. bowerbird, I use Java, which has regular expression capability. I plan to simply build some search/replace regexp patterns and allow a user to enter his/her own to customize according to individual preferences. Shouldn't be too bad. Actually most of the work is for interfaces, GUI and CLI.
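
Editorial sketch of the plan in the post above: a list of built-in search/replace patterns, plus whatever patterns the user supplies. It is in Python rather than bob_ninja's Java (the patterns themselves should carry over to `java.util.regex` largely unchanged). The rule list, the names `DEFAULT_RULES` and `apply_rules`, and the sample strings are all assumptions, not his tool.

```python
import re

# Hypothetical built-in rules, applied in order as (pattern, replacement).
DEFAULT_RULES = [
    # A single newline between two non-newline characters is a hard wrap.
    (r"(?<=[^\n])\n(?=[^\n])", " "),
    # Collapse runs of spaces or tabs left over from the join.
    (r"[ \t]{2,}", " "),
]


def apply_rules(text, extra_rules=()):
    """Apply the default rules, then any user-supplied (pattern, replacement) pairs."""
    for pattern, replacement in list(DEFAULT_RULES) + list(extra_rules):
        text = re.sub(pattern, replacement, text)
    return text


wrapped = "one\ntwo\n\nthree\nfour"
print(apply_rules(wrapped))

# A user-supplied rule: strip _underscore_ emphasis markers.
print(apply_rules("_very_ good", extra_rules=[(r"_(\w+)_", r"\1")]))
```

Blank lines between paragraphs survive because neither newline in a blank-line pair has a non-newline character on both sides. The join is blind, though: verse and tables are unwrapped along with ordinary paragraphs.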

11-09-2007, 11:41 PM	#145
GregS Zealot Posts: 107 Karma: 308 Join Date: Oct 2007 Location: Perth Australia Device: EZ Reader 5", Iliad	bob_ninja no worries and thanks. I habitually use OpenOffice to clean up text, more for general convenience because its layout tools are reasonable for my purposes and its pdf creation reliable. I always forget the correct grep code for end of line markers, and have to experiment a little each time in find/replace to get it right (I swear I always seem to forget the most used things - it is an annoying habit). Hopefully in the not too distant future we may see a plethora of lightweight GUI tools that really do make lots of little jobs much much easier to perform.

11-09-2007, 11:41 PM	#146
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	ninja bob, when all you want to do is unwrap the hard line-breaks, try this: > http://z-m-l.com/unwrap.pl it works, for the most part, except it _also_ unwraps tables, poetry, and other things which should _not_ be unwrapped. this is because of one of the _biggest_ problems with the project gutenberg e-texts, namely that these lines which should not be unwrapped are not unequivocally marked. so one of the changes that i make when i convert a p.g. e-text to z.m.l. is to _detect_ these lines, and then _mark_ them by giving them a leading space. later, my unwrapping routines for z.m.l. _respect_ a leading space in a line as a signal that that line should not be unwrapped. mission accomplished... this is just one example of one change that needs to be done to a p.g. e-text in order to make it more functional. z.m.l. as a whole is a _collection_ of _all_ of those changes my focused research has deemed to be necessary. some people will tell you a _human_ has to go through the e-text to "decide" which lines should be marked as immune from rewrapping, that the decision takes human intelligence, and cannot be programmed into a computer. well, i won't tell you that my routines never make any mistakes, because they do. but i _can_ inform you that they make _most_ of the decisions correctly, and that's because i worked, and worked some more, then worked even _more_, so that they _would_ make most of the decisions correctly. so one part of the better functionality my z.m.l. mirror will give developers will be the ability to unwrap the text at will, without introducing problems... -bowerbird
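
Editorial sketch of the convention described in the post above: a leading space marks a line as one that must not be unwrapped. It is in Python, not the Perl of the linked script, and it covers only the unwrapping half. Deciding which lines get the leading space is the hard part and is not attempted. The function name `unwrap` and the sample text are assumptions.

```python
def unwrap(text):
    """Join hard-wrapped paragraph lines; keep blank and space-led lines as they are."""
    out = []
    para = []

    def flush():
        # Emit the paragraph collected so far as one long line.
        if para:
            out.append(" ".join(para))
            para.clear()

    for line in text.split("\n"):
        if line.strip() == "":
            flush()
            out.append("")
        elif line.startswith(" "):
            # Marked line (verse, table row, ...): pass it through untouched.
            flush()
            out.append(line)
        else:
            para.append(line.strip())
    flush()
    return "\n".join(out)


sample = (
    "It was a dark\nand stormy night.\n\n"
    "  Roses are red,\n  violets are blue.\n\n"
    "The end\nof it."
)
print(unwrap(sample))
```

With the marker in place the routine needs no judgment of its own, which is why the marking step carries all the difficulty.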

11-09-2007, 11:58 PM	#147
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	> I meant when do you expect to be able to make the p.g. mirror available to the public, > as in a date, or a length of time. If you dont have an estimate, that's fine too :-) yeah, i know that's what you meant. and that's why i answered like i did... i learned a long time ago not to make estimates. don't expect it until it actually arrives... having said that, i'll also say "it depends". (which, yeah, isn't any more informative.) the body of each e-text is pretty much already in z.m.l. format. to the extent that it's not, the changes are pretty much automatic. if that was all i was concerned about, i could do it in a week or two. the problem area for each e-text is the front-matter: the title-page, table of contents, dedication, list of illustrations, all that type of stuff. what i _want_ to do is edit all of that to an extremely high standard... but it's pretty slow going. even at 5 minutes per e-text, that's 12/hour, or 100 for an 8-hour day. and slackers like me don't work 8-hour days. so when you've got 15,000 of the suckers, even 5 minutes per adds up. eventually, after enough time goes on and i keep avoiding this task, i'll undoubtedly drop my desire to hit that high standard, and go with something more quick-and-dirty. i've noticed that hadrien settles for the title and the author on the title-page and then jumps into the book. if i did that, i could pull the info from the catalog, and it'd be very quick. if i decided to try and write some code to rework the front-matter that is actually present in each e-text, then that might or might not be quick, depending on how well the programming went. could even be very slow. i can't even do an estimate on that until i've hand-edited enough e-texts to get a handle on what the typical edits are, and how to automate 'em... i've also considered building a wiki and asking the public to go at it... 
so, depending on how all of this shakes out, it could be relatively soon, or it could drag on for a little while, or it could drag on for a long while. but i certainly don't advise that anyone hold their breath waiting for it... indeed, don't even _expect_ it until it has actually arrived... -bowerbird
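
Editorial worked arithmetic for the figures quoted in the post above (5 minutes per e-text, 8-hour days, 15,000 e-texts). Only those three numbers come from the post. The function name `editing_estimate` is an assumption, and the result is a back-of-envelope total, not a schedule.

```python
def editing_estimate(n_texts, minutes_each, hours_per_day):
    """Return (texts per hour, texts per day, total hours, total working days)."""
    per_hour = 60 / minutes_each
    per_day = per_hour * hours_per_day
    total_hours = n_texts * minutes_each / 60
    total_days = total_hours / hours_per_day
    return per_hour, per_day, total_hours, total_days


per_hour, per_day, total_hours, total_days = editing_estimate(15000, 5, 8)
print(per_hour, per_day, total_hours, total_days)
```

At 12 e-texts per hour an 8-hour day yields 96, which the post rounds to 100. The full 15,000 come to 1,250 hours, or a little over 156 eight-hour days of hand-editing front-matter.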

11-10-2007, 12:07 AM	#148
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	greg said: > However, it does not necessarily mean big applications to solve it, > but a slightly different approach to how tags are in fact applied. ok. i'm not sure _exactly_ what that means, but ok... :+) do keep in mind that, in the real-world of project gutenberg today, the tags are being applied by distributed proofreader volunteers... now maybe you have something completely different in mind... but if in your mind those volunteers would be applying .tei tags, then you really need to go over and introduce yourself to them. they pride themselves on being very friendly -- they'll tell you that over and over and over -- but another truth is that they don't take kindly to strangers telling them how to do their job. so you will need to bow down and ingratiate yourself to them before you should even whisper a suggestion about what to do. _especially_ about .tei, because it's been a "plan" for so long... > In short you have made my day. great. not everyone else feels the same, but that's the way life is in the honest lane. glad i could be of some help... -bowerbird

11-10-2007, 12:08 AM	#149
kovidgoyal creator of calibre Posts: 46,286 Karma: 29630860 Join Date: Oct 2006 Location: Mumbai, India Device: Various	Doesn't gutenmark already handle front matter? You could just lift the routines from there.

11-10-2007, 12:23 AM	#150
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	> Doesn't gutenmark already handle front matter? we have different definitions of "handle". the title-page in a z.m.l. file is highly structured, because its info is collected into a library catalog. many of the other parts of z.m.l. front-matter are expected to conform to a certain framework too... on the other hand, front-matter in p.g. e-texts is probably _the_ most wildly inconsistent element in the entire catalog, which is not surprising when you consider that it is coming from a wide range of different publishers, so taming it is difficult... -bowerbird