MobileRead Forums - View Single Post - What "Cleaning Up" Do Project Gutenberg Texts Need [closed]

bowerbird · 11-09-2007, 08:16 PM

greg said:
> I do not understand this at all.

i'm sorry. but that's ok. :+)

> I am only suggesting a lightweight standard
> no more difficult than Xhtml or epub

we differ on what constitutes "lightweight".

> no more difficult than Xhtml or epub

let me tell you the way i am framing this matter...

actually, let me just send you to the place directly:
> http://pgdp.net

that's the website for _distributed_proofreaders_.

they are the _volunteers_ who actually _digitize_
most of the project gutenberg e-texts these days.

they scan books, or find scan-sets from elsewhere,
do o.c.r. (or get that from elsewhere too), and then
subject the o.c.r. results to proofing and formatting
rounds administered right there in a web-based system.
after which results are assembled by a "postprocessor"
into the files that are submitted to and posted by p.g.

unless you've got an idea about somebody else doing it,
_these_ are the _volunteers_ who would do your markup.

i emphasized that these people are _volunteers_ because
they aren't getting paid to do this. they walk in the door
and are put to work. they don't necessarily have training.
they're motivated to work, but you can't order 'em around.
you can't force 'em to do something they don't want to do.

although a technoid faction has tried to entice them into
doing markup in tei-light, it's been some very slow going.
they spent about 6 years in various stages of "planning".
(yes, you did indeed read that right, i said _six_years_...)
over the last year, a few people did some 200 .tei e-texts.

the idea of treating the backlog of 15,000+ e-texts
hasn't even moved off of the back-burner...
(p.g.'s count is much higher, but they have _duplicates_
and an ever-increasing percentage of non-book items,
most recently .mp3 audio-books done by librivox.org.)

the person who developed the brand of .tei they're using
-- pgtei -- isn't interested in building any tools for them,
so there hasn't been much interest from the volunteers...

that's the situation as it exists today. that's how i frame it.
if you frame it differently, we won't understand each other...
perhaps you still consider this to be merely "hypothetical"?
that's not a bad thing. but there's a reality that's here now.

> As for future proofing, you seem to miss the point altogether.

i'm sorry. but that's ok... :+)

> I have taken at least seven or eight Shakespeare plays from Gutenberg,
> turned them into word processing documents, cleaned them and
> then through stylesheets reedited them and finally after a lot of effort
> (plays are really hard to do compared to novels) produce a pdf
> to print out copies for my students.

ok, now _that_ i understand. perfectly well.

my intention is to take those same plays -- and everything else --
from project gutenberg, run them through my app to convert them
into z.m.l., after which they can be printed out to .pdf immediately.

in other words, my converter-program does all the clean-up and
formatting work that took "a lot of effort" for you to do manually.

you saw -- from gutenmark -- how such a program can save time.
i intend for my program to be even better than gutenmark at that.

(and i _sincerely_ wish ron burkey was still developing gutenmark,
because i believe a healthy competition between us would be fun,
and push the state-of-the-art to a high level good for all of us...)

> In short, though the tools are wrong for the purpose I have
> been doing just sort of thing because I had no choice -
> but the end result did not give anyone else useful.

so you're aware of the same thing confronting the d.p. people,
that there are no tools which can help you do what you want...

i'm sorry about that.

and i don't have any suggestions for you, either.

except maybe to consider _why_ you don't have any tools, and
how that ramifies on the wisdom of the path you're choosing...

i _do_ think you are too hard on yourself when you say
"the end result did not give anyone else useful", because
your students _did_ get those plays in .pdf format, right?
(but you might want to consider rewriting that sentence.)

> I could just as well have been properly marking up the play
> (using TEI derived tags) and placed back in Gutenberg
> something useful to others.

i'm not altogether sure how "useful" a .tei file is these days.
but i'm sure you'll inform me that it _will_be_ useful later on.

> I have dealt with HTML and text versions of a variety of literature.
> For some purposes just being in HTML makes things very easy,
> but it also can make things a lot harder as well.

if you care to expand on that, i'm sure i would find it interesting. :+)

> Hence I may in the future for whatever reason,
> desire to quickly retrieve Chapter Seven of "Pride and Prejudice"
> how can this be done unless the computer has
> a means of finding exactly what I asked it to find?

well, in z.m.l. it would be simple for you to specify such a request...

i just made a post telling ninja bob how to find headers in a z.m.l. file.
so he'd point his program at pride and prejudice, and fish out chapter 7,
i.e., the chapter heading that said "chapter 7", or -- failing that -- just "7".
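
a rough sketch of that kind of lookup, in python. to be clear, this is _not_ the actual z.m.l. rule-set (the header rules aren't spelled out in this post). it assumes, purely for illustration, that a chapter heading sits alone on its own line, and the function name find_chapter is made up. it tries "chapter 7" first and, failing that, just "7":

```python
import re

def find_chapter(text, number):
    """return the chapter headed 'chapter N' or, failing that, just 'N'.

    assumption (not the real z.m.l. rules): a heading is a line that
    contains nothing except the heading itself.
    """
    lines = text.splitlines()
    # try the full heading first, then fall back to the bare number
    for template in (r"chapter\s+{n}", r"{n}"):
        wanted = re.compile(template.format(n=number) + r"\.?$", re.IGNORECASE)
        any_heading = re.compile(template.format(n=r"\d+") + r"\.?$", re.IGNORECASE)
        for start, line in enumerate(lines):
            if not wanted.match(line.strip()):
                continue
            # the chapter runs until the next heading of the same style
            end = len(lines)
            for j in range(start + 1, len(lines)):
                if any_heading.match(lines[j].strip()):
                    end = j
                    break
            return "\n".join(lines[start:end]).strip()
    return None  # no heading of either style was found
```

this sketch only handles arabic numerals. an edition that heads its chapters "chapter vii" would need one more pattern.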

> These people have not been wasting their time,
> their precision is not useless but vital,
> and their knowledge (a part of which resides
> in the very code of TEI) cannot be ignored.

i don't ignore their knowledge.
i don't say they're "wasting their time".
and i certainly don't say their precision is "useless".
i think it would be _very_ useful. if only we could afford it...

> And I repeat an ultr-light version of TEI
> need be no more difficult than what we are already using

well, "we" -- as in you and i -- aren't "already using" the same thing.

.tei-light (or your "ultra-light") might not be more "difficult" than what
_you_ are already using, but it's _way_ more difficult than what i'm using.

but i'm not your target-audience anyway...

and neither are the people here at mobileread.

your target-audience is the volunteers over at distributed proofreaders.

and they don't even need to be "convinced" -- they already _agree_,
or at least they haven't mounted an outright revolt against pgtei --
but in order for them to actually start doing .tei, they need _tools_...

so if you want them to act, find some tools, and they'll be very happy...

if you don't _have_ any tools, you're just crying out in the wilderness...

> For me this just places it amongst display technologies,
> which is no bad place to be. The problems of text repositories
> is a different thing altogether.

cost-benefit ratio. that's all i can say: cost-benefit ratio.

> A huge labour but one that can accumulate
> over time in a systematic and reliable way.

go build such a library, and prove it has a superior cost-benefit ratio.

i predict you'll go broke before you ever start returning any benefits...
prove me wrong.

> I have serious reservations that it answers the right question.

i say that you're answering the wrong question, and
you're saying that i'm answering the wrong question.

that's what makes a horse-race.

> However, as a display technology it may well have a place.

thanks and all, but this horse-race is not for second-place.
it's to see who can _win_. i tell you my horse is gonna win.
your horse? certainly won't finish, and might not even get
out of the gate in the first place. so, does my saying that
give you motivation to prove me wrong? fine. then do it...

> I would ad the proviso that if it can easily translate
> from epub and to epub then this would be a vital attribute
> in its acceptance, especially if it is as easy as you say to code with it.

i'm not even gonna turn on that capacity. i don't have to.
plus doing so would only help out _your_horse_...
do your own markup. it's important that you feel its pain.
that cost is why your cost-benefit ratio will never be worth it.

> For my part, when time permits, I intend to look carefully
> at epub and try and make a version of TEI to fit it, and then
> write a small program to translate it one way and the other.

there are people doing that, if you don't want to duplicate effort.
they might love to have your help. do the research to find them.
of course, if you want to do it yourself, go ahead. i always do...
(however, i still do the research, because that's the smart thing.)

-bowerbird

p.s. if your target is project gutenberg, you should
_not_ develop your own "ultra-light" .tei, because
the guy who built their pgtei (who is their webmaster)
is _extremely_ protective of it, and will _attack_ you
if you try to suggest any alternative. (because, after all,
he's spent all these years developing it, so of course
he'll always believe he knows more about .tei than you.
he's a nasty character. don't cross him unless you dare.)
