What "Cleaning Up" Do Project Gutenberg Texts Need [closed] - Page 4

bowerbird · 11-03-2007, 09:03 PM

tompe said:
> I didn't realize an authoring tool was required.

it's not "required". but it's available if you want.

or just pull your text into it every once in a while,
to make sure it behaves the way you want it to...

here's a screenshot of the authoring tool:
> http://z-m-l.com/go/zml-sandbox01.jpg

also:
> http://www.z-m-l.com/go/rieger/oya-cover.html

as you can see, one side is the text editfield,
and the other is how it will look in the viewer.

> I like to be able to write things in a text editor

me too.

> and I thought the goal was to make this possible.

that's one of the goals, yes.

but that doesn't mean we can't give people a dedicated
authoring-tool too. different strokes, and all that rot...

> If you are going to use an authoring tool
> I do not see the reason for this kind of markup.

there's a lot of utility in wysiwyg. that's why it's popular.

and in terms of _learning_ z.m.l., the authoring-tool is great.
once you've internalized the simple rule-set, you don't need it.
though wysiwyg is still nice. but if you prefer workin' blind, do...

> You said in the specification that quotes could be replaced
> to whatever the user wanted and for this you have to be able
> to distinguish between a quote and constructions like 'em.

aha, i see what you're talking about now -- curling the quotes.
yeah, it takes a little bit of magic in your coding to do it right...

when i release my program, i will enjoy seeing if you can fool it. :+)

(i'm sure you've noticed that microsoft's routines are quite brain-dead.)

> And it seems impossible to do this with your rules.

the impossible just takes a few more processing cycles... ;+)

seriously, when i say "it's done now", just type naturally,
and see if the program figures it out. if not, let me know.

if a human can puzzle it out, my routines should be able to do it too.
(of course, if it's ambiguous even to a human, then all bets are off.)

> And you have examples like: "The coordinate was 49° 12' 27" N"

if i need to (and it won't be for a clear example like this, but if i need to...),
i'll fall back to the position that z.m.l. uses utf8, so use that to disambiguate.
magic i can do. but mind-reading is something else entirely...
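
To make the quote-curling problem concrete, here is a minimal sketch in Python of the kind of rule-based pass being discussed. It is not the unreleased z.m.l. routine; the function name `curl_quotes`, the short `ELISIONS` list, and the rule order are all assumptions made for illustration. It treats a quote mark after a digit as a prime, handles a few leading-apostrophe elisions such as 'em, and falls back to position for everything else.

```python
import re

# Hypothetical, deliberately tiny list of elisions that start with an apostrophe.
ELISIONS = ("em", "tis", "twas", "til", "cause")


def curl_quotes(text: str) -> str:
    """Replace straight quotes with curly quotes, primes, and apostrophes."""
    # 1. A quote mark right after a digit is read as a prime (12' 27").
    text = re.sub(r"(?<=\d)'", "\u2032", text)
    text = re.sub(r'(?<=\d)"', "\u2033", text)
    # 2. An apostrophe opening a known elision ('em, 'tis) is not an opening quote.
    elision = r"(?<![A-Za-z])'(?=(?:%s)\b)" % "|".join(ELISIONS)
    text = re.sub(elision, "\u2019", text, flags=re.IGNORECASE)
    # 3. An apostrophe between two letters (don't, it's).
    text = re.sub(r"(?<=[A-Za-z])'(?=[A-Za-z])", "\u2019", text)
    # 4. Whatever is left: opening at the start of the text or after
    #    whitespace or an opening bracket, closing everywhere else.
    text = re.sub(r'(?<![^\s(\[{])"', "\u201c", text)
    text = text.replace('"', "\u201d")
    text = re.sub(r"(?<![^\s(\[{])'", "\u2018", text)
    text = text.replace("'", "\u2019")
    return text


print(curl_quotes('"The coordinate was 49° 12\' 27" N"'))
print(curl_quotes('"give \'em what they want," he said'))
```

A sketch this small is easy to fool: a quotation that ends in a number, such as "in 1971", would have its closing quote read as a double prime. Cases like that are where a longer rule set, or explicit utf8 characters in the source, would have to take over.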

-bowerbird

Panurge · 11-03-2007, 11:28 PM

> 14. don't put pagenumbers inside the text/paragraphs.

For the casual reader, this may not be an important point, but for someone who publishes scholarly texts, which require documentation, it is. The page numbers of the original text do matter, as does the exact text that lies between them.

I am the director of a library, and ours was one of the first libraries in the country to install an automatic checkout system (in 1971 or so). When we tried to migrate from our IBM punchcards to a more up-to-date system fifteen years later, we discovered that the EBCDIC coding could not be converted to ASCII (not enough computer power), and we had to re-enter every single record by hand. I can understand that no one wants to repeat this kind of conversion every time we move to new hardware and formats, hence the mild controversy over a new proposed encoding standard. But for scholars who have to show in their footnotes where to locate the authority for the text they cite, a lack of representation of the pagination of the original renders the e-text useless.

Now PG has performed an outstanding service in making available many an obscure and difficult-to-find text, and the use of unadorned ASCII text, the only practical standard usable at the time it was begun, was obvious. One of the benefits of PG is its attempt to check the accuracy of the texts being transcribed. I haven't checked their efforts, but I respect the intention. The Google scanning project is a laudable one, but it is so imperfect (sloppily executed scanning evident in far too many examples, obviously done hastily and unchecked) that I'm afraid much will have to be redone. It's hard to get it right the first time, and even if one does, the evolution of format and hardware means that there has to be a thoughtful plan for future migration.

At the same time, we who are scholars have to decide whether the original print text-source is what we're going to refer to, or the e-text facsimile. If the latter, do we regard it as a new edition or as a faithful representation of the print copy? If we don't account for these needs in our re-encoding now, we'll simply have to redo the e-texts in the future if we expect electronic texts to gain much of a foothold in the world of scholarship and education.

jbenny · 11-03-2007, 11:44 PM

Quote:

Originally Posted by Panurge

> 14. don't put pagenumbers inside the text/paragraphs.

For the casual reader, this may not be an important point, but for someone who publishes scholarly texts, which require documentation, it is. The page numbers of the original text do matter, as does the exact text that lies between them.

You bring up a very valid point that most of us don't think of (me included). Can you suggest a way to handle this without having the page numbers in-line with the text? Most of us would find the visible page numbers too obnoxious.

For XHTML markup, one thing that comes to mind (just off the top of my head) would be to enclose all the text that makes up an original page with a surrounding tag that uses the "id" attribute to hold the page number. This would not display, but could be accessed if needed. Also, by using "id", you could construct a special hyperlinked table of pages that would allow you to jump to specific pages in the ebook. I'll have to try this and see how it works.

Using XHTML, this would work with epub and possibly Mobipocket and other formats based on HTML. Anyone have ideas on other ways to do this, either in XHTML or other formats?

jbenny · 11-04-2007, 01:07 AM

The attached Zip file is really an epub. To use it as such, rename it with an epub extension. The forum software won't let me upload it as an epub, even though it is really a zip file. You can either use the epub as is, or just unzip it and view the HTML file in your browser.

The content is totally bogus. I just made it up for this test. I used a <span> tag to mark the beginning few words of each page. Since a physical page is likely to fall mid-sentence, you can't use a block-level tag like <div>. Well, you could, but that would also break a sentence in the ebook, which is not what you want.

As for using a background color on the words that start a physical page, that isn't exactly ideal for ebook reading, either. I just did it to make it easier to see exactly where my imaginary page breaks were. Without some visual clue, you'd have to carefully scan the first few lines to match up the words, after jumping via the "table of pages" that I made at the end.

This is far from an ideal method, but it was the first thing that I tried. Perhaps someone has a better suggestion? How to delimit the page breaks for those who need them, while not being in-your-face for the average ebook reader? In a web browser, some javascript could make this a lot easier. However, I don't know of any ebook readers that do javascript (not counting PDAs).
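
The approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the contents of the attached file: the `[[p.N]]` page marker in the source text, the `pagestart` class name, and the function names are all invented here. The sketch wraps the first few words of each original page in a `<span>` whose `id` carries the page number, then builds a linked "table of pages".

```python
import re
from html import escape

# Hypothetical marker for "original page N starts here", followed by
# up to three words that will be wrapped in the span.
PAGE_MARK = re.compile(r"\[\[p\.(\d+)\]\]\s*((?:(?!\[\[p\.)\S+\s*){1,3})")


def add_page_anchors(text):
    """Return (xhtml_fragment, page_numbers) for text containing page markers."""
    pages = []

    def wrap(match):
        number, words = match.group(1), match.group(2)
        pages.append(number)
        core = words.rstrip()
        trailing = words[len(core):]  # keep the whitespace outside the span
        return '<span class="pagestart" id="page%s">%s</span>%s' % (number, core, trailing)

    # Escape first; the marker itself contains no characters that get escaped.
    body = PAGE_MARK.sub(wrap, escape(text, quote=False))
    return body, pages


def page_table(pages):
    """Build a linked list of pages that jumps to the spans created above."""
    items = "".join('<li><a href="#page%s">Page %s</a></li>' % (p, p) for p in pages)
    return '<ul class="pages">%s</ul>' % items


sample = "[[p.1]] It was a dark night, and the rain [[p.2]] fell in torrents."
body, pages = add_page_anchors(sample)
print(body)
print(page_table(pages))
```

Because the span is inline, a page that starts mid-sentence does not break the sentence in the ebook. Whether the span gets any visible styling is left to the stylesheet.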

bowerbird · 11-04-2007, 02:14 AM

panurge, i feel where you're coming from. but let me run through a few thoughts.

so first, point #14 is about the embedding of pagenumbers inside of the text flow.
that's not a good idea, because they're a distraction that just needs to be removed
when we want to copy the text out for remixing. that's why point #14 is there.

my next comment -- which i say because it must be said -- is that it's not our job
to do your job. if the pagenumbers are valuable to you, it's your job to save them.
i'm sorry if that sounds cold, but that's the way it is.

having said that, however, let me move on to my next comment, which is that
i am in 100% agreement with you. even though pagenumbers are _irrelevant_,
in many senses, when we move a book to the digital sphere, i'm convinced that
we still need to retain pagenumber information, simply because so much of our
archival history uses pagenumbers as pointer-information. we cannot afford to
sacrifice that. indeed, i go one step further and argue that we should also be
retaining the _linebreak_information_ from all the paper-books that we digitize.
i won't go into all the arguments here, but in my mind, the answer is now clear.

furthermore, i put my money where my mouth is. in my digitization examples,
i maintain linebreaks and pagebreaks, and put the image-scan up next to the text,
so the end-user can verify the accuracy of my digitization if they want to do that.
i consider this checking by end-users to be the last fine line of the proofing process,
and i want them to feel like a part of the "march to perfection" that the text makes,
because i believe we need to make the public feel like "joint owners" of these books.
"the public domain belongs to _you_, the public, and you have responsibility for them,
so if there are errors here, you need to fill out an error-report so they are corrected."

to see some of my examples, check these out:
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html

you can thumb through these e-books just like they were the p-books,
and verify that the linebreaks and pagebreaks are exactly as they were.
and if you find an error, you can fill out an error-report right on the page.
and once someone has made a report, it's immediately visible to everyone,
even if it might take an administrator a little bit of time to fix the error...

now examine the plain-text versions of the files that created those books above:
> http://z-m-l.com/go/myant/myant.zml
> http://z-m-l.com/go/mabie/mabie.zml
> http://z-m-l.com/go/sgfhb/sgfhb.zml

you'll see how the pagebreak information was recorded in those plain-text files.
i think you'll also see how easily that pagebreak information can be eliminated,
for the situations where an end-user doesn't care about the original pagebreaks.
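
As a rough illustration of how little work that elimination takes, here is a minimal Python sketch. The `{{page N}}` marker on its own line is invented for this example and is not z.m.l.'s actual notation (see the linked files for that); the function name `reflow` is likewise hypothetical. The sketch assumes the text keeps the original linebreaks, with blank lines between paragraphs.

```python
import re

# Hypothetical pagebreak marker: a line containing only "{{page N}}".
PAGE_LINE = re.compile(r"^\{\{page \d+\}\}$")


def reflow(text, keep_pages=False):
    """Join the original lines into paragraphs, dropping or keeping page markers."""
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if PAGE_LINE.match(stripped):
            if keep_pages:
                current.append(stripped)
            continue  # a pagebreak never ends a paragraph by itself
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


sample = (
    "The quick brown\nfox jumps over\n{{page 2}}\nthe lazy dog.\n\n"
    "A new paragraph\nbegins here.\n"
)
print(reflow(sample))                   # for readers who don't care about pages
print(reflow(sample, keep_pages=True))  # for readers who cite by page
```

The same source file serves both groups; only the flag changes.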

this is the kind of flexibility we want from our digitization efforts, so each group
gets the information they like, without inconveniencing what another group gets.

what is also useful about this format is that it's extremely close to what we get
_naturally_ when we scan a book, so it's not hard to go from scan output to final.

now, having said all _that_, let me proceed to my final point, which is a variant
on the "don't expect us to do your job for you". and it is _not_ our job to make
"a faithful representation of the print copy". we don't even _want_ to do that --
even if we could -- and we _cannot_, because any time you move a document
from one medium to a completely different one, you're creating a new edition.
whether you mean to do it or not. and like i said, at least from my perspective,
i don't even think twice about things like the correcting of typos. heck, i'll even
rework headers -- or even the _body_ of the text -- if that is what it takes to
make this _digital_version_ a _good_ digital version. i'm a republisher, who is
moving this book into a new medium for a new world in a new century, and
i'm going to do justice to the new. it's simply not my job to snapshot the old.
if you want to see what the old pages looked like, you can look at the scans.

so, anyway, there's some feedback for you to think about... :+)

-bowerbird

jharker · 11-04-2007, 12:04 PM

Perhaps I'm missing something, but it seems to me that gutenmark does pretty much everything listed in the first post. In addition, it features output in LaTeX format, which means that with the right style file you can output your book with pretty much whatever formatting options you want.

How do your goals differ from gutenmark? That is, what would your program do that gutenmark doesn't?

bowerbird · 11-04-2007, 12:33 PM

my scope is to give people a full toolchain for the entire workflow,
from initial authoring through web-publishing and on into remixing.

for that, you need a good format, and authoring-tools for that format,
and viewer-programs for it, and conversion-routines to other formats.

the goal of "making a typographically beautiful e-book" is simply one
of many issues which can be incorporated into the conversion aspect.

so my scope _includes_ that, but it also goes _far_beyond_ that.

to the extent that gutenmark helps automate the .html conversion of
project gutenberg e-texts _and_ helps the output become _beautiful_,
i respect it, and i respect it greatly.

but i'm doing more than that. so, for my purposes, it's not enough.
and since ron isn't maintaining it anymore, it never will be enough...
not for me, anyway. especially since i have a rather stringent set
of requirements that i expect of any e-book viewer-program i use:
> http://onlinebooks.library.upenn.edu...t=2004-01-08,3
review my list, and observe that a web-browser falls laughably short.

if gutenmark is good for _you_ and your purposes, i'm happy for you,
and i have absolutely no desire to upset your applecart of happiness...
or if you prefer to use indesign, or word, or whatever, to make _your_
e-books beautiful, i laud you for bringing some beauty into the world...

-bowerbird

kovidgoyal · 11-04-2007, 12:43 PM

Quote:

Originally Posted by bowerbird

um, to repeat, there's a good reason why no one will write an "improved"
gutenberg-to-html converter. it's the same reason that made ron burkey
give up on gutenmark, namely, the inconsistencies riddling p.g. e-texts.

until those inconsistencies are cleaned up, a converter is a pipe-dream...

however, once those inconsistencies _are_ cleaned up, we no longer need
to _convert_ the e-texts to _any_ other format, because their consistency
will mean that viewer-programs can be made to handle their native format.

this presents the existential conundrum of heavy-markup.
until it can be applied _automatically_, its cost is too high.
but once it _can_ be applied automatically, it's unnecessary,
because the very same routines that convert text to xhtml so
that xhtml can be rendered by a display-program can instead
be put into a viewer-app that eliminates the xhtml middleman,
by working directly with the text as its input to create its output.

once you understand this, deeply, markup becomes a bad joke.

we take simple text and turn it into complicated markup, and then
we need a complicated program to handle the complicated markup
and turn it back into simple text that can be displayed. it's just silly.

once i show people markup is unnecessary, they'll laugh at you for doing it.
and i don't say that to _mock_ you; i say it so you can avoid looking stupid...

-bowerbird

On automatically converting gutenberg e-texts:

There is absolutely no reason why a converter cannot be developed that handles most of the inconsistencies correctly. Your problem seems to be that you aim for perfect conversion of all texts. That's never going to happen. And how does inventing a new lightweight markup language (when there are already tons of them out there) solve anything? The gutenberg etexts are still going to have to be converted to that markup. Any converter written by somebody who knows what he's doing will be designed to represent semantic information internally using an object model, then adding output formats will be trivial.
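
A minimal Python sketch of what "an object model" means in this context, under stated assumptions: the `Heading` and `Paragraph` classes, the all-caps heading heuristic, and the function names are invented for illustration and do not come from any existing converter. The text is parsed once into objects, and each output format is a small function over those objects.

```python
import re
from dataclasses import dataclass
from html import escape
from typing import List, Union


@dataclass
class Heading:
    text: str


@dataclass
class Paragraph:
    text: str


Block = Union[Heading, Paragraph]


def parse(raw: str) -> List[Block]:
    """Turn plain text into a list of semantic objects."""
    blocks: List[Block] = []
    for chunk in re.split(r"\n\s*\n", raw):
        text = " ".join(chunk.split())
        if not text:
            continue
        # Assumed heuristic: a short, all-uppercase chunk is a heading.
        if text.isupper() and len(text) < 60:
            blocks.append(Heading(text))
        else:
            blocks.append(Paragraph(text))
    return blocks


def to_html(blocks: List[Block]) -> str:
    out = []
    for block in blocks:
        tag = "h2" if isinstance(block, Heading) else "p"
        out.append("<%s>%s</%s>" % (tag, escape(block.text), tag))
    return "\n".join(out)


def to_latex(blocks: List[Block]) -> str:
    # LaTeX special characters are not escaped in this sketch.
    out = []
    for block in blocks:
        if isinstance(block, Heading):
            out.append("\\section*{%s}" % block.text)
        else:
            out.append(block.text)
    return "\n\n".join(out)


doc = parse("CHAPTER I\n\nIt was a dark\nand stormy night.\n")
print(to_html(doc))
print(to_latex(doc))
```

All the guesswork about the source text lives in `parse`; adding another output format means writing one more function like `to_html`.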

On using lightweight markup in general:

1. You think of html as "heavy" markup. Not everyone is as limited.

2. I'd have no problem with lightweight markup if all I cared about was simple texts with headings, a few links, and some images. I don't want my documents limited to the very small set of features imposed by lightweight markup.

bowerbird · 11-04-2007, 03:24 PM

kovidgoyal said:
> There is absolutely no reason why
> a converter cannot be developed that
> handles most of the inconsistencies correctly.

i agree. in fact, i've developed that converter.

> Your problem seems to be that
> you aim for perfect conversion of all texts.

ok, here's the thing. why "handle" inconsistencies
when you can _remove_inconsistencies_entirely_?

i intend to mount a mirror of the p.g. library which
has all of their inconsistencies removed, so that
no other developers have to deal with that rubbish.

in other words, i'm doing what the "whitewashers"
at project gutenberg should have done all along,
i.e., ensured that their e-texts were _consistent_.

> That's never going to happen.

a perfect converter that handles all inconsistencies
might not happen, but we don't really need _that_.

we need a darn-good converter to clean up _most_
of them, and then we need to be _diligent_ about
finding and correcting inconsistencies that remain...

at the point where you have lots of developers who
are adding value to the library with new features
-- features that will depend on consistent e-texts --
the inconsistencies will reveal themselves naturally.

> And how does inventing a new
> lightweight markup language
> (when there are already tons of them
> out there) solve anything?

well, none of them seemed perfect enough for me.
specifically, they didn't seem "light" enough for me.
i want "zen" markup, maybe even "zero" markup...

even markdown, which is the best of the bunch,
often seems like an "abbreviated" form of markup,
and not the radical departure that i'm looking for...

and that became even more true when i factored in
the types of features that i wanted to be automatic.

for instance, i want the table of contents linked to
the chapter-headings automatically, with no work.
further, i want the chapter-headings linked _back_
to the table of contents, again without _any_ work.
plus, i want to let the users jump from one chapter
to the previous and next chapters, automatically...
even in the middle of a chapter, i want to let them
jump to the beginning of that chapter, and to the
beginning of the _next_ chapter, _automatically_...

i want a link from a footnote referent in the body
to its note in the notes section, automatically, and
i want an auto-backlink from there to the referent.
(and if there are two referents to the same note
-- it happens -- then i want auto-backlinks to both.)

and when there's a pointer-reference in the text,
such as a reference to "chapter 2", then i want for
that pointer-reference to be treated as a hotlink...

likewise, if there's a u.r.l., i want it to be a hotlink.

with the other forms of light-markup, you have to
code in all of those links manually. that's a pain...
avoiding such pain is the purpose of light-markup,
at least as far as i'm concerned. so i built my own.
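
To show what "automatic" linking of this kind can look like, here is a minimal Python sketch. It is not the z.m.l. viewer or converter; the function names, the `chN` id scheme, and the rule that a short block starting with "chapter" is a heading are assumptions made for illustration. From unmarked text it builds a linked table of contents, back-links and previous/next links on each heading, hotlinks for references such as "chapter 2", and hotlinks for URLs. Footnote linking is left out.

```python
import re
from html import escape

URL = re.compile(r'https?://[^\s<>"]*[^\s<>".,;:!?)]')
CHAPTER_REF = re.compile(r"\bchapter\s+(\w+)", re.IGNORECASE)


def is_heading(block):
    # Assumed rule: a short block that starts with "chapter <something>".
    return len(block) < 60 and re.match(r"chapter\s+\S+", block, re.IGNORECASE) is not None


def build_html(text):
    blocks = [" ".join(b.split()) for b in re.split(r"\n\s*\n", text) if b.strip()]
    titles = [b for b in blocks if is_heading(b)]
    # Map "1", "ii", ... (the word after "chapter") to the anchor id.
    ids = {t.split()[1].lower().rstrip(".:"): "ch%d" % i for i, t in enumerate(titles, 1)}

    def link_refs(match):
        target = ids.get(match.group(1).lower())
        if target is None:
            return match.group(0)
        return '<a href="#%s">%s</a>' % (target, match.group(0))

    def link_url(match):
        return '<a href="%s">%s</a>' % (match.group(0), match.group(0))

    out = ['<ul id="toc">']
    for i, title in enumerate(titles, 1):
        out.append('<li><a href="#ch%d">%s</a></li>' % (i, escape(title)))
    out.append("</ul>")

    n = 0
    for block in blocks:
        if is_heading(block):
            n += 1
            nav = ['<a href="#toc">contents</a>']
            if n > 1:
                nav.append('<a href="#ch%d">previous</a>' % (n - 1))
            if n < len(titles):
                nav.append('<a href="#ch%d">next</a>' % (n + 1))
            out.append('<h2 id="ch%d">%s</h2>' % (n, escape(block)))
            out.append('<p class="nav">%s</p>' % " | ".join(nav))
        else:
            para = CHAPTER_REF.sub(link_refs, escape(block, quote=False))
            out.append("<p>%s</p>" % URL.sub(link_url, para))
    return "\n".join(out)


sample = "CHAPTER 1\n\nSee chapter 2 and http://example.com/a.\n\nCHAPTER 2\n\nThe end.\n"
print(build_html(sample))
```

None of the links appear in the input text; they are all inferred from its structure, which is why consistent source texts matter so much for this approach.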

plus, i did it as a puzzle, a challenge for my mind.
surely you can understand that? or maybe not...
because i just don't comprehend such questions...

> The gutenberg etexts are still going to have to
> be converted to that markup.

right. that's another reason i built my own version.
because i wanted it to be as close to "native" p.g.
as possible, to minimize the cost of bulk conversion.

as it is, the vast majority of most p.g. e-texts is
"already in" z.m.l. format. the big exception is
the front-matter at the top (e.g., the title-page).

> Any converter written by somebody who knows
> what he's doing will be designed to represent
> semantic information internally using an object
> model, then adding output formats will be trivial.

i don't know what "an object model" is.

and frankly, i don't really care, not in the slightest,
since "adding output formats" is not a big concern.

and evidently i don't even need to know what it is,
because i've been able to do conversions just fine.

> 1. You think of html as "heavy" markup.

actually, i judge html as "medium" markup.
you have to jump to xml/css to be "heavy",
and go to .tei or docbook if you're serious.
but i dunno, maybe you are not "serious"...

> Not everyone is as limited.

nope. just 92% of the population. my user-base,
as i refer to them. i'm content to give up the rest.
heck, i'll be happy with "authors who wanna write,
and not have to waste time doing stupid markup."

> 2. I'd have no problem with lightweight markup
> if all I cared about was simple texts with
> headings, a few links, and some images.

evidently you haven't looked at my test-suite.

i can handle all the features commonly found in
the p.g. e-texts, and indeed in almost all books...

and when i discover a need for new capabilities,
i just invent a way for the format to handle it...
(and that's the _easy_ part. the difficult part is
coding the viewer-program for the new feature.)

and frankly, what i can't handle, i don't need...

> I don't want my documents limited to
> the very small set of features imposed by
> lightweight markup.

well, when you say _that_, you're just betraying
that you don't have a clue about light-markup...

(and, by the way, we do call it "light markup",
not "lightweight markup", because "lightweight"
implies what you are trying to say directly here,
i.e., that it is "limited" in some way, and it's not.)

markdown, for instance, lets you include _any_
(x)html code right in your markdown document
-- it just passes it on through without treating it --
so there's absolutely _nothing_ that you cannot
include, so there is no "very small set of features"
that is being "imposed" on you by the framework.

but even aside from that, the number of things
which cannot be handled within the _standard_
markdown framework is quickly vanishing away.

and if you include the additions to the standard
being implemented by stuff like multimarkdown,
you will find that you encounter no "limitations".

no offense intended, but if you want to criticize
light-markup, you will need to do some homework.

-bowerbird

kovidgoyal · 11-04-2007, 04:22 PM

You say that light markup (and you use markdown as an example) can handle anything by including xhtml, which means a viewer app that is designed to view a lightweight markup language will have to parse xhtml anyway to display the file. In which case any viewer app advantage in using lightweight markup is negated. Incidentally, I actually use markdown and have even contributed patches to the python markdown project, so try not to jump straight to the "you don't know what you're talking about" defense. It leaves me with the feeling that you don't have any real points to make.

As for authors not wanting to learn markup: those that are too lazy to learn markup will be too lazy to learn lightweight markup as well. They will demand a WYSIWYG GUI to take care of the markup for them.

You have the attitude that creating a markup language that is just sufficient for all of today's needs is the right approach. You'll then "add more features" as you see the need. But it's not easy to "add features" to a lightweight markup language. Case in point is markdown and how you have to jump to html for any advanced features.

So yes, it is more effort to develop applications for authoring/converting/viewing a "heavy" markup language, but in the end it's worth it. To say that we must limit ourselves to a lightweight language simply because developing applications for a heavy language is too difficult is ridiculous. Let me leave you with the example of TeX, a publishing system that is not lightweight and that has lasted decades.

Lightweight markup is a good fit for gutenberg, but little else. And even there, I suspect they'd have a hard time getting their digitizers to follow the rules. As far as creating modern digital books, there is really no reason to be restricted to a lightweight markup language. And note that I continue to call it lightweight, because that is precisely what it is.

You say you want to maintain a mirror of gutenberg. An excellent idea. If you support export of gutenberg texts to HTML, I might even use it.

kovidgoyal · 11-04-2007, 04:24 PM

And if you've developed a converter, do you mind releasing it to the public, so that we can use it to convert gutenberg texts and see how well it does for ourselves?

jbenny · 11-04-2007, 04:57 PM

I don't think replying to bowerbird's posts with questions and reasonable arguments is accomplishing anything. No matter what anyone says, his replies generally discount it and accuse the poster of not knowing what they are talking about. He doesn't seem to be open to discussion or suggestions, only to promoting his own way of doing things. I'm sure he will reply, denying this (and probably insult me in the process). However, his posts are the best evidence in support of my assertion.

One recent post in particular that illustrates bowerbird's low opinion of everyone who contributes to this forum (which I find full of very useful information): https://www.mobileread.com/forums/sho...5&postcount=56

bowerbird · 11-04-2007, 06:15 PM

kovidgoyal said:
> You say that light markup (and you use markdown as an example)
> can handle anything by including xhtml which means a viewer app
> that is designed to view a lightweight markup language will have to
> parse xhtml anyway to display the file. In which case any viewer app
> advantage in using lightweight markup is negated.

we seem to be talking past each other.

every light-markup system -- with the _exception_ of mine -- is
geared toward creating output formatted for an external viewer...

in some cases it's docbook, or .tei, or latex, but -- most usually --
it's (x)html, and it's aimed squarely at a web-browser as the agent.

so if you make a general statement about light-markup systems,
it will be interpreted with that understanding. if you want to say
they're "limiting", you're saying they're limiting _in_that_sphere_.

except markdown -- runaway market-leader in the genre -- has
_no_ limitations in that regard, since it can contain _any_ (x)html.

if you want to poke accusations at _my_ particular light-markup,
in the form of a claim that it cannot support every (x)html feature,
then you would be absolutely correct. but if that's what you meant,
then you should have said _that_.

and, in case i haven't said it before, or said it directly enough yet,
my particular system is aimed squarely at use for electronic-books.
i will support all the features needed by e-books, but nothing more.
and i'm aiming z.m.l. at _my_ viewer-program, not at a web-browser.

but heck yes, i can pass through (x)html just as good as the next format.
so if someone wants to use z.m.l. to target a web-browser, via the .html
conversion ability, then go ahead and include whatever (x)html you want.

so i'm still seeing absolutely no substance to your point. none at all.
but maybe we're still talking past each other... proceed if you wish...

> Incidentally I actually use markdown and have even contributed patches
> to the python markdown project

so then why did you say what you did, which was highly misleading?
you must have known it bordered on totally false when you said it...

> As for authors not wanting to learn markup: those that are too lazy
> to learn markup will be too lazy to learn lightweight markup as well.
> They will demand a WYSIWYG GUI to take care of the markup for them.

did you not read in this very thread where i said i'll give them wysiwyg?

> You have the attitude that creating a markup language that is
> just sufficient for all of today's needs is the right approach.

well, as i said above, the needs of _e-books_ in particular, and that's it.

> You'll then "add more features" as you see the need.
> But it's not easy to "add features" to a lightweight markup language.

well, i just disagree with you about the difficulty of adding features.
and since that's _my_ problem and not your problem, we don't need
to go back and forth about it. it's a "difficulty" i'm willing to handle...

but the fact is, i've done a lot of work up-front to make sure that
i was knowledgeable about the features that i would actually _need_.
that's why i devised a test-suite. and i've lived with it for two years,
and i've convinced myself that it's sufficiently complete for the job...

(there _is_ stuff that might not be completely visible on its surface,
but i haven't yet put it in because i want to learn which observer is
smart enough to see the "shortcomings" and draw attention to 'em.)

moreover, i did the work of specifying the features that i demand of
my ideal e-book viewer-app, so i know what my format needs to do:
> http://onlinebooks.library.upenn.edu...t=2004-01-08,3

given my preparation on both sides of the equation, i feel i'm covered.
i've also looked at a very large number of paper-books over the years,
so i'm quite confident i'm aware of the sphere of things that's needed.

> Case in point is markdown and how you have to
> jump to html for any advanced features.

i was years into development of z.m.l. before markdown even started.
and i am moving slower than they are, with more advance planning...
that means i can benefit from their experience, and i certainly have...

i also have the advantage that my scope is narrower than their scope,
in that my arena is electronic-books. at the same time, i have concerns
they do not have, including file-format interaction with my viewer-app.
so you really can't generalize from their experience to mine... sorry...

but i would also say you're misconstruing their situation just a little...
it's not my sense they had to "jump to html" for "advanced features".
i think they deliberately put in that option early, to retain simplicity.

> So yes, it is more effort to develop applications for
> authoring/converting/viewing a "heavy" markup language,
> but in the end it's worth it.

hey, i'm glad you feel that way, so _you_ will bear the costs of that,
and _other_people_ -- perhaps even me! -- will accrue the benefits.

likewise, if other people are willing to pay the costs of heavy-markup,
then i have no objection to it. (except maybe a general dislike for cruft;
but, you know, if it gives me a ton of benefits, i can even live with that.)

it's only when _i_ have to pay the price of doing heavy-markup that i balk.

and, you know, i have been waiting for the heavy-markup advocates over
on the p.g. listserve to start marking up the e-texts for over 4 years now,
and they're still just as uncoordinated about the task as they've ever been.
indeed, they seem totally unwilling to do the job themselves, and instead
seem bent on trying to "convince" the p.g. volunteers to do it for them!...

needless to say, the volunteers aren't eager to pick up this complex task.

> To say that we must limit ourselves to a lightweight language
> simply because developing applications for a heavy language
> is too difficult, is ridiculous.

well, then, you know, maybe you should trot over to the p.g. listserves,
or maybe the d.p. forums, because they keep bemoaning the lack of tools
that would help 'em take on the complicated job of doing heavy-markup.

because you make it sound easy...

> Lightweight markup is a good fit for gutenberg, but little else.

well, yes and no. it's gonna be good for project gutenberg e-texts...

if i didn't believe that, i would not have put in several years of work...
and i certainly wouldn't be willing to convert the whole catalog myself,
and spend my time and energy on maintaining an independent mirror.

but i don't believe it'll be "a good fit" for "little else". indeed, i'm viewing
project gutenberg's corpus as mere "proof of concept" for a cyberlibrary
composed of the _tens_of_millions_ of books that google is now scanning.
i don't intend on maintaining _that_ myself, just giving them a good model.

> And even there, I suspect they'd have a hard time getting their digitizers
> to follow the rules.

are you reading my messages here? as i said before, most of the text in
almost all of the project gutenberg e-texts is _already_ in z.m.l. format...
that is, they are already "following the rules"...

there are usually a few inconsistencies in each one, which my routines
can find and fix -- automatically, for the most part -- so i'm satisfied...

now, of course, it would be far better if p.g. tracked down the glitches,
so _their_ versions would be completely consistent as well, but oh well,
at least i know mine will be. and other developers will learn that too...

> As far as creating modern digital books, there is really no reason
> to be restricted to a lightweight markup language. And note that
> I continue to call it lightweight, because that is precisely what it is.

perhaps you misunderstood... i just told you what _we_ call it, and why.
i don't really care what you call it. it doesn't really care what you call it.

and i don't care if you imply it is limited. or even if you say that directly.

as long as it does what _i_ want it to, the things _i_ consider necessary,
i will be happy with it. and i'm certain others will be happy with it too...
especially those authors who don't wanna waste any time doing markup.

> You say you want to maintain a mirror of gutenberg. An excellent idea.
> If you support export of gutenberg texts to HTML, I might even use it

of course i'll support conversion of my files to .html. and of course people
will use it. but they'll quickly learn that conversion is an unnecessary step,
because e-texts in the native z.m.l. format are a better e-book experience,
thanks to the high-powered z.m.l. viewer-program...

> And if you've developed a converter,
> do you mind releasing it to the public,
> so that we can use it to convert gutenberg texts
> and see how well it does for ourselves?

well, yeah, actually i _do_ mind "releasing it to the public".
i have no intention of releasing any source code, thank you.
but it's available for sale, with a price in the 6-figure range...

however, you will receive the _fruits_ of the conversion process
-- in the form of totally consistent e-texts in my z.m.l. format --
when i mount my mirror. but that'll be sometime down the line,
because part of that job involves reformatting of the front-matter.

and -- for the 4th time now -- if you want to "see how well it does",
then just visit the web-pages that i gave up at the top of this page:
> http://z-m-l.com/go/vl3.pl
> http://z-m-l.com/go/zmldingus093.pl

the second of those two is a "live" converter, which you can use to
convert a project gutenberg e-text if you like. you'll have to clean it
up a bit first -- so that it's in z.m.l. format -- but then it'll work ok...
with, of course, the caveats i've given all along -- "in-progress", etc.

-bowerbird

bowerbird · 11-04-2007, 06:23 PM

jbenny said:
> He doesn't seem to be open to discussion or suggestions,
> but only in promoting his own way of doing things.

what, precisely, is it that you think you have "taught" me?

i've been working on this for many _years_ now, and i know
what my system does. and you've got -- at the very best! --
a sketchy understanding. yet somehow, you think you can
come up with something that i haven't considered? my word...

heavy-markup advocates like yourself have been attacking me
from the very first time i ever uttered a word about this system,
and they've stayed in attack-mode for -- quite literally -- _years_,
and yet you think you've come up with something unique? what?

that's rich. i mean, that's really _rich_...

this is a serious question: what is it you think i've "discounted"?

-bowerbird

kovidgoyal · 11-04-2007, 07:56 PM

To re-iterate my points which you haven't answered in your rather rambling response:

1) Light markup has minimal features. If you add more features, your viewer apps will become more complex anyway. That negates your viewer argument. Heavy markup is heavy for a reason: it supports features. A design philosophy that limits features in order to improve program simplicity is the wrong approach in these times of ever-increasing CPU power.

2) If authors use a GUI to generate ebooks, then they don't care about the markup, which then negates your argument for lightweight markup from the perspective of authors.

3) Lightweight markup is suitable for people who digitize books (like p.g.) but not for people who create books, since people who digitize/convert books typically don't care about advanced features, while people who create them do.

Some new points:
1) If you aren't open-sourcing your code, then goodbye and good luck. All you're doing then is defining a specification. Any 10-year-old who spends a week thinking about the requirements for an ebook format could do that.

2) Considering that you are designing a limited specification with closed-source authoring/viewing software, support for changes to that format (which will have to be made over time) will be spotty at best.

Finally:

When it comes to designing format converters, the key is the output format.
If you choose an output format that is a superset of all input formats you might consider, it is then possible to use the converter to convert all input formats to a single output format. You do this by using an object model internally in the converter software, with plugins for input formats. And it then becomes easy to output to different formats using the object model.
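To make the object-model idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the `Book` and `Block` classes, the `reader`/`writer` registries, and the toy rule that an all-caps block is a heading are assumptions of mine, not the API of any existing converter and not a description of how z.m.l. works.

```python
import html
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Block:
    """One structural unit of a book in the internal object model."""
    kind: str  # "heading" or "paragraph" (hypothetical, minimal set)
    text: str


@dataclass
class Book:
    """Format-neutral representation shared by every reader and writer."""
    blocks: List[Block] = field(default_factory=list)


# Plugin registries: format name -> function.
READERS: Dict[str, Callable[[str], Book]] = {}
WRITERS: Dict[str, Callable[[Book], str]] = {}


def reader(name: str):
    """Decorator that registers an input-format plugin."""
    def register(func):
        READERS[name] = func
        return func
    return register


def writer(name: str):
    """Decorator that registers an output-format plugin."""
    def register(func):
        WRITERS[name] = func
        return func
    return register


@reader("txt")
def read_txt(source: str) -> Book:
    """Toy plain-text reader: blank lines separate blocks,
    and an all-caps block is assumed to be a heading."""
    book = Book()
    for chunk in source.split("\n\n"):
        text = " ".join(chunk.split())  # unwrap hard line breaks
        if not text:
            continue
        kind = "heading" if text.isupper() else "paragraph"
        book.blocks.append(Block(kind, text))
    return book


@writer("html")
def write_html(book: Book) -> str:
    tags = {"heading": "h2", "paragraph": "p"}
    return "\n".join(
        f"<{tags[b.kind]}>{html.escape(b.text)}</{tags[b.kind]}>"
        for b in book.blocks
    )


@writer("markdown")
def write_markdown(book: Book) -> str:
    return "\n\n".join(
        "## " + b.text if b.kind == "heading" else b.text
        for b in book.blocks
    )


def convert(source: str, source_format: str, target_format: str) -> str:
    """Parse into the object model once, then hand it to any writer."""
    book = READERS[source_format](source)
    return WRITERS[target_format](book)
```

Calling `convert(text, "txt", "html")` and `convert(text, "txt", "markdown")` runs the same parse and differs only in the last step. Supporting a new output format means writing one more function against `Book`, without touching any reader.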

Starting with an output format that is more limited than possible input formats is simply ass-backwards. As I said before, zml *might* be a good idea for conversion of txt files for p.g. but little else. And without an open-source converter from zml to html, it is emphatically not a good idea.


11-03-2007, 09:03 PM	#46
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	tompe said: > I didn't realize an authoring tool was required. it's not "required". but it's available if you want. or just pull your text into it every once in a while, to make sure it behaves the way you want it to... here's a screenshot of the authoring tool: > http://z-m-l.com/go/zml-sandbox01.jpg also: > http://www.z-m-l.com/go/rieger/oya-cover.html as you can see, one side is the text editfield, and the other is how it will look in the viewer. > I like to be able to write things in a text editor me too. > and I thought the goal was to make this possible. that's one of the goals, yes. but that doesn't mean we can't give people a dedicated authoring-tool too. different strokes, and all that rot... > If you are going to use an authoring tool > I do not see the reason for this kind of markup. there's a lot of utility in wysiwyg. that's why it's popular. and in terms of _learning_ z.m.l., the authoring-tool is great. once you've internalized the simple rule-set, you don't need it. though wysiwyg is still nice. but if you prefer workin' blind, do... > You said in the specification that quotes could be replaced > to whatever the user wanted and for this you have to be able > to distinguish between a quote and constructions like 'em. aha, i see what you're talking about now -- curling the quotes. yeah, it takes a little bit of magic in your coding to do it right... when i release my program, i will enjoy seeing if you can fool it. :+) (i'm sure you've noticed that microsoft's routines are quite brain-dead.) > And it seems impossible to do this with your rules. the impossible just takes a few more processing cycles... ;+) seriously, when i say "it's done now", just type naturally, and see if the program figures it out. if not, let me know. if a human can puzzle it out, my routines should be able to do it too. (of course, if it's ambiguous even to a human, then all bets are off.) 
> And you have examples like: "The coordinate was 49° 12' 27" N" if i need to (and it won't be for a clear example like this, but if i need to...), i'll fall back to the position that z.m.l. uses utf8, so use that to disambiguate. magic i can do. but mind-reading is something else entirely... -bowerbird

11-03-2007, 11:28 PM	#47
Panurge Enthusiast Posts: 34 Karma: 336 Join Date: Dec 2006 Location: Texas Device: Sony Reader	> 14. don't put pagenumbers inside the text/paragraphs. For the casual reader, this may not be an important point, but for someone who publishes scholarly texts, which require documentation, it is. The page numbers of the original text do matter, as does the exact text that lies between them. I am the director of a library, and we had one of the first libraries in the country to install an automatic checkout system (in 1971 or so). When we tried to migrate from our IBM punchcards to a more up-to-date system fifteen years later, we discovered that the EBSDC coding could not be converted to ASCII (not enough computer power), and we had to re-enter every single record by hand. I can understand that no one wants to repeat this kind of conversion every time we move to new hardware and formats, hence the mild controversy over a new proposed encoding standard. But what really matters for scholars who have to show in their footnotes where to locate the authority for the text they cite, a lack of representation of the pagination of the original renders the e-text useless. Now PG has performed an outstanding service in making available many an obscure and difficult-to-find text, and the use of unadorned ASCII text, the only practical standard usable at the time it was begun, was obvious. One of the benefits of PG is its attempt to check the accuracy of the texts being transcribed. I haven't checked their efforts, but I respect the intention. The Google scanning project is a laudable one, but it is so imperfect (sloppily-executed scanning evident in far too many examples, obviously done hastily and unchecked) so that I'm afraid much will have to be redone. It's hard to get it right the first time, and even if one does, the evolution of format and hardware means that there has to be a thoughtful plan for future migration. 
At the same time, we who are scholars have to decide whether or not the original print text-source is what we're going to refer to or the e-text facsimile. If the latter, do we regard it as a new edition or as a faithful representation of the print copy? If we don't account for these needs in our re-encoding now, we'll simply have to redo the e-texts in the future if we expect electronic texts to gain much of a oothold in the world of scholarship and education.

11-04-2007, 02:14 AM	#50
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	panurge, i feel where you're coming from. but let me run through a few thoughts. so first, point #14 is about the embedding of pagenumbers inside of the text flow. that's not a good idea, because they're a distraction that just needs to be removed when we want to copy the text out for remixing. that's why point #14 is there. my next comment -- which i say because it must be said -- is that it's not our job to do your job. if the pagenumbers are valuable to you, it's your job to save them. i'm sorry if that sounds cold, but that's the way it is. having said that, however, let me move on to my next comment, which is that i am in 100% agreement with you. even though pagenumbers are _irrelevant_, in many senses, when we move a book to the digital sphere, i'm convinced that we still need to retain pagenumber information, simply because so much of our archival history uses pagenumbers as pointer-information. we cannot afford to sacrifice that. indeed, i go one step further and argue that we should also be retaining the _linebreak_information_ from all the paper-books that we digitize. i won't go into all the arguments here, but in my mind, the answer is now clear. furthermore, i put my money where my mouth is. in my digitization examples, i maintain linebreaks and pagebreaks, and put the image-scan up next to the text, so the end-user can verify the accuracy of my digitization if they want to do that. i consider this checking by end-users to be the last fine line of the proofing process, and i want them to feel like a part of the "march to perfection" that the text makes, because i believe we need to make the public feel like "joint owners" of these books. "the public domain belongs to _you_, the public, and you have responsibility for them, so if there are errors here, you need to fill out an error-report so they are corrected." 
to see some of my examples, check these out: > http://z-m-l.com/go/myant/myantp001.html > http://z-m-l.com/go/mabie/mabiep001.html > http://z-m-l.com/go/sgfhb/sgfhbp001.html you can thumb through these e-books just like they were the p-books, and verify that the linebreaks and pagebreaks are exactly as they were. and if you find an error, you can fill out an error-report right on the page. and once someone has made a report, it's immediately visible to everyone, even if it might take an administrator a little bit of time to fix the error... now examine the plain-text versions of the files that created those books above: > http://z-m-l.com/go/myant/myant.zml > http://z-m-l.com/go/mabie/mabie.zml > http://z-m-l.com/go/sgfhb/sgfhb.zml you'll see how the pagebreak information was recorded in those plain-text files. i think you'll also see how easily that pagebreak information can be eliminated, for the situations where an end-user doesn't care about the original pagebreaks. this is the kind of flexibility we want from our digitization efforts, so each group gets the information they like, without inconveniencing what another group gets. what is also useful about this format is that it's extremely close to what we get _naturally_ when we scan a book, so it's not hard to go from scan output to final. now, having said all _that_, let me proceed to my final point, which is a variant on the "don't expect us to do your job for you". and it is _not_ our job to make "a faithful representation of the print copy". we don't even _want_ to do that -- even if we could -- and we _cannot_, because any time you move a document from one medium to a completely different one, you're creating a new edition. whether you mean to do it or not. and like i said, at least from my perspective, i don't even think twice about things like the correcting of typos. 
heck, i'll even rework headers -- or even the _body_ of the text -- if that is what it takes to make this _digital_version_ a _good_ digital version. i'm a republisher, who is moving this book into a new medium for a new world in a new century, and i'm going to do justice to the new. it's simply not my job to snapshot the old. if you want to see what the old pages looked like, you can look at the scans. so, anyway, there's some feedback for you to think about... :+) -bowerbird

11-04-2007, 12:04 PM	#51
jharker Developer Posts: 345 Karma: 3473 Join Date: Apr 2007 Location: Brooklyn, NY, USA Device: iRex iLiad v1, Blackberry Tour, Kindle DX, iPad.	Perhaps I'm missing something, but it seems to me that gutenmark does pretty much everything listed in the first post. In addition, it features output in LaTeX format, which means that with the right style file you can output your book with pretty much whatever formatting options you want. How do your goals differ from gutenmark? That is, what would your program do that gutenmark doesn't?

11-04-2007, 12:33 PM	#52
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	my scope is to give people a full toolchain for the entire workflow, from initial authoring through web-publishing and on into remixing. for that, you need a good format, and authoring-tools for that format, and viewer-programs for it, and conversion-routines to other formats. the goal of "making a typographically beautiful e-book" is simply one of many issues which can be incorporated into the conversion aspect. so my scope _includes_ that, but it also goes _far_beyond_ that. to the extent that gutenmark helps automate the .html conversion of project gutenberg e-texts _and_ helps the output become _beautiful_, i respect it, and i respect it greatly. but i'm doing more than that. so, for my purposes, it's not enough. and since ron isn't maintaining it anymore, it never will be enough... not for me, anyway. especially since i have a rather stringent set of requirements that i expect of any e-book viewer-program i use: > http://onlinebooks.library.upenn.edu...t=2004-01-08,3 review my list, and observe that a web-browser falls laughably short. if gutenmark is good for _you_ and your purposes, i'm happy for you, and i have absolutely no desire to upset your applecart of happiness... or if you prefer to use indesign, or word, or whatever, to make _your_ e-books beautiful, i laud you for bringing some beauty into the world... -bowerbird

11-04-2007, 03:24 PM	#54
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	kovidgoyal said: > There is absolutely no reason why > a converter cannot be developed that > handles most of the iconsistencies correctly. i agree. in fact, i've developed that converter. > Your problem seems to be that > you aim for perfect conversion of all texts. ok, here's the thing. why "handle" inconsistencies when you can _remove_inconsistencies_entirely_? i intend to mount a mirror of the p.g. library which has all of their inconsistencies removed, so that no other developers have to deal with that rubbish. in other words, i'm doing what the "whitewashers" at project gutenberg should have done all along, i.e., ensured that their e-texts were _consistent_. > That's never going to happen. a perfect converter that handles all inconsistencies might not happen, but we don't really need _that_. we need a darn-good converter to clean up _most_ of them, and then we need to be _diligent_ about finding and correcting inconsistencies that remain... at the point where you have lots of developers who are adding value to the library with new features -- features that will depend on consistent e-texts -- the inconsistencies will reveal themselves naturally. > And how does inventing a new > lightweight markup language > (when there are already tons of them > out there) solve anything? well, none of them seemed perfect enough for me. specifically, they didn't seem "light" enough for me. i want "zen" markup, maybe even "zero" markup... even markdown, which is the best of the bunch, often seems like an "abbreviated" form of markup, and not the radical departure that i'm looking for... and that became even more true when i factored in the types of features that i wanted to be automatic. for instance, i want the table of contents linked to the chapter-headings automatically, with no work. further, i want the chapter-headings linked _back_ to the table of contents, again without _any_ work. 
plus, i want to let the users jump from one chapter to the previous and next chapters, automatically... even in the middle of a chapter, i want to let them jump to the beginning of that chapter, and to the beginning of the _next_ chapter, _automatically_... i want a link from a footnote referent in the body to its note in the notes section, automatically, and i want an auto-backlink from there to the referent. (and if there are two referents to the same note -- it happens -- then i want auto-backlinks to both.) and when there's a pointer-reference in the text, such as a reference to "chapter 2", then i want for that pointer-reference to be treated as a hotlink... likewise, if there's a u.r.l., i want it to be a hotlink. with the other forms of light-markup, you have to code in all of those links manually. that's a pain... avoiding such pain is the purpose of light-markup, at least as far as i'm concerned. so i built my own. plus, i did it as a puzzle, a challenge for my mind. surely you can understand that? or maybe not... because i just don't comprehend such questions... > The gutenberg etexts are still going to have to > be converted to that markup. right. that's another reason i built my own version. because i wanted it to be as close to "native" p.g. as possible, to minimize the cost of bulk conversion. as it is, the vast majority of most p.g. e-texts is "already in" z.m.l. format. the big exception is the front-matter at the top (e.g., the title-page). > ANy converter written by somebody who knows > what he's doing will be designed to represent > semantic information internally using an object > model, then adding output formats will be trivial. i don't know what "an object model" is. and frankly, i don't really care, not in the slightest, since "adding output formats" is not a big concern. and evidently i don't even need to know what it is, because i've been able to do conversions just fine. > 1. You think of html as "heavy" markup. 
actually, i judge html as "medium" markup. you have to jump to xml/css to be "heavy", and go to .tei or docbook if you're serious. but i dunno, maybe you are not "serious"... > Not everyone is as limited. nope. just 92% of the population. my user-base, as i refer to them. i'm content to give up the rest. heck, i'll be happy with "authors who wanna write, and not have to waste time doing stupid markup." > 2. I'd have no problem with lightweight markup > if all I cared about was simple texts with > headings a few links and some images. evidently you haven't looked at my test-suite. i can handle all the features commonly found in the p.g. e-texts, and indeed in almost all books... and when i discover a need for new capabilities, i just invent a way for the format to handle it... (and that's the _easy_ part. the difficult part is coding the viewer-program for the new feature.) and frankly, what i can't handle, i don't need... > I don't want my documents limited to > the very small set of features imposed by > lightweight markup. well, when you say _that_, you're just betraying that you don't have a clue about light-markup... (and, by the way, we do call it "light markup", not "lightweight markup", because "lightweight" implies what you are trying to say directly here, i.e., that it is "limited" in some way, and it's not.) markdown, for instance, lets you include _any_ (x)html code right in your markdown document -- it just passes it on through without treating it -- so there's absolutely _nothing_ that you cannot include, so there is no "very small set of features" that is being "imposed" on you by the framework. but even aside from that, the number of things which cannot be handled within the _standard_ markdown framework is quickly vanishing away. and if you include the additions to the standard being implemented by stuff like multimarkdown, you will find that you encounter no "limitations". 
no offense intended, but if you want to criticize light-markup, you will need do some homework. -bowerbird

11-04-2007, 04:22 PM	#55
kovidgoyal creator of calibre Posts: 43,881 Karma: 22666666 Join Date: Oct 2006 Location: Mumbai, India Device: Various	You say that light markup (and you use markdown as an example) can handle anything by including xhtml which means a viewer app that is designed to view a lightweight markup language will have to parse xhtml anyway to display the file. In which case any viewer app advantage in using light weight markup is negated. Incidentally I actually use markdown and have even contributed patches to the python markdown project, so try not to jump straight to the "you dont know what you're talking about" defense. It leaves me with the feeling that you dont have any real points to make. As for authors not wanting to learn markup. Those that are too lazy to learn markup will be too lazy to learn lightweight markup as well. They will demand a WYSWYG GUI to take care of the markup for them. You have the attitude that creating a markup language that is just sufficient for all of todays needs is the right approach. You'll then "add more features" as you see the need. But it's not easy to "add features" to a lightweight markup language. Case in point is markdown and how you have to jump to html for any advanced features. So yes, it is more effort to develop applications for authoring/converting/viewing a "heavy" markup language, but in the end its worth it. To say that we must limit ourselves to a lightweight language simply because developing applications for a heavy language is too difficult, is ridiculous. Let me leave you with the example of TeX. A publishing system that is not lightweight and that has lasted decades. Lightweight markup is a good fit for gutenberg, but little else. And even there, I suspect they'd have a hard time getting their digitizers to follow the rules. As far as creating modern digital books, there is really no reason to be restricted to a lightweight markup language. 
And note that I continue to call it lightweight, because that is precisely what it is. You say you want to maintain a mirror of gutenberg. An excellent idea. If you support export of gutenberg texts to HTML, I might even use it

11-04-2007, 04:24 PM	#56
kovidgoyal creator of calibre Posts: 43,881 Karma: 22666666 Join Date: Oct 2006 Location: Mumbai, India Device: Various	And if you've developed a converter, do you mind releasing it to the public, so that we can use it to convert gutenberg texts and see how well it does for ourselves?

11-04-2007, 04:57 PM	#57
jbenny Addict Posts: 323 Karma: 358 Join Date: May 2007 Device: Tablet PC and Nokia N800	I don't think it is accomplishing anything by replying to bowerbird's posts with questions and reasonable arguments. No matter what anyone says, his replies generally discount what anyone else says and accuses them of not knowing what they are talking about. He doesn't seem to be open to discussion or suggestions, but only in promoting his own way of doing things. I'm sure he will reply, denying this (and probably insult me in the process). However, his posts are the best evidence in support of my assertion. One recent post in particular that illustrates bowerbird's low opinion of everyone who contributes to this forum (which I find full of very useful information): https://www.mobileread.com/forums/sho...5&postcount=56

11-04-2007, 06:15 PM	#58
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	kovidgoyal said: > You say that light markup (and you use markdown as an example) > can handle anything by including xhtml which means a viewer app > that is designed to view a lightweight markup language will have to > parse xhtml anyway to display the file. In which case any viewer app > advantage in using light weight markup is negated. we seem to be talking past each other. every light-markup system -- with the _exception_ of mine -- is geared toward creating output formatted for an external viewer... in some cases it's docbook, or .tei, or latex, but -- most usually -- it's (x)html, and it's aimed squarely at a web-browser as the agent. so if you make a general statement about light-markup systems, it will be interpreted with that understanding. if you want to say they're "limiting", you're saying they're limiting _in_that_sphere_. except markdown -- runaway market-leader in the genre -- has _no_ limitations in that regard, since it can contain _any_ (x)html. if you want to poke accusations at _my_ particular light-markup, in the form of a claim that it cannot support every (x)html feature, then you would be absolutely correct. but if that's what you meant, then you should have said _that_. and, in case i haven't said it before, or said it directly enough yet, my particular system is aimed squarely at use for electronic-books. i will support all the features needed by e-books, but nothing more. and i'm aiming z.m.l. at _my_ viewer-program, not at a web-browser. but heck yes, i can pass through (x)html just as good as the next format. so if someone wants to use z.m.l. to target a web-browser, via the .html conversion ability, then go ahead and include whatever (x)html you want. so i'm still seeing absolutely no substance to your point. none at all. but maybe we're still talking past each other... proceed if you wish... 
> Incidentally I actually use markdown and have even contributed patches > to the python markdown project so then why did you say what you did, which was highly misleading? you must have known it bordered on totally false when you said it... > As for authors not wanting to learn markup. Those that are too lazy > to learn markup will be too lazy to learn lightweight markup as well. > They will demand a WYSWYG GUI to take care of the markup for them. did you not read in this very thread where i said i'll give them wysiwyg? > You have the attitude that creating a markup language that is > just sufficient for all of todays needs is the right approach. well, as i said above, the needs of _e-books_ in particular, and that's it. > You'll then "add more features" as you see the need. > But it's not easy to "add features" to a lightweight markup language. well, i just disagree with you about the difficulty of adding features. and since that's _my_ problem and not your problem, we don't need to go back and forth about it. it's a "difficulty" i'm willing to handle... but the fact is, i've done a lot of work up-front to make sure that i was knowledgable about the features that i would actually _need_. that's why i devised a test-suite. and i've lived with it for two years, and i've convinced myself that it's sufficiently complete for the job... (there _is_ stuff that might not be completely visible on its surface, but i haven't yet put it in because i want to learn which observer is smart enough to see the "shortcomings" and draw attention to 'em.) moreover, i did the work of specifying the features that i demand of my ideal e-book viewer-app, so i know what my format needs to do: > http://onlinebooks.library.upenn.edu...t=2004-01-08,3 given my preparation on both sides of the equation, i feel i'm covered. i've also looked at a very large number of paper-books over the years, so i'm quite confident i'm aware of the sphere of things that's needed. 
> Case in point is markdown and how you have to > jump to html for any advanced features. i was years into development of z.m.l. before markdown even started. and i am moving slower than they are, with more advance planning... that means i can benefit from their experience, and i certainly have... i also have the advantage that my scope is narrower than their scope, in that my arena is electronic-books. at the same time, i have concerns they do not have, including file-format interaction with my viewer-app. so you really can't generalize from their experience to mine... sorry... but i would also say you're misconstruing their situation just a little... it's not my sense they had to "jump to html" for "advanced features". i think they deliberately put in that option early, to retain simplicity. > So yes, it is more effort to develop applications for > authoring/converting/viewing a "heavy" markup language, > but in the end its worth it. hey, i'm glad you feel that way, so _you_ will bear the costs of that, and _other_people_ -- perhaps even me! -- will accrue the benefits. likewise, if other people are willing to pay the costs of heavy-markup, then i have no objection to it. (except maybe a general dislike for cruft; but, you know, if it gives me a ton of benefits, i can even live with that.) it's only when _i_ have to pay the price of doing heavy-markup that i balk. and, you know, i have been waiting for the heavy-markup advocates over on the p.g. listserve to start marking up the e-texts for over 4 years now, and they're still just as uncoordinated about the task as they've ever been. indeed, they seem totally unwilling to do the job themselves, and instead seem bent on trying to "convince" the p.g. volunteers to do it for them!... needless to say, the volunteers aren't eager to pick up this complex task. > To say that we must limit ourselves to a lightweight language > simply because developing applications for a heavy language > is too difficult, is ridiculous. 
well, then, you know, maybe you should trot over to the p.g. listserves,
or maybe the d.p. forums, because they keep moaning the lack of tools
that would help 'em take on the complicated job of doing heavy-markup.
because you make it sound easy...

> Lightweight markup is a good fit for gutenberg, but little else.

well, yes and no. it's gonna be good for project gutenberg e-texts...
if i didn't believe that, i would not have put in several years of work...
and i certainly wouldn't be willing to convert the whole catalog myself,
and spend my time and energy on maintaining an independent mirror.

but i don't believe it'll be "a good fit" for "little else". indeed, i'm viewing
project gutenberg's corpus as mere "proof of concept" for a cyberlibrary
composed of the _tens_of_millions_ of books that google is now scanning.
i don't intend on maintaining _that_ myself, just giving them a good model.

> And even there, I suspect they'd have a hard time getting their digitizers
> to follow the rules.

are you reading my messages here?

as i said before, most of the text in almost all of the project gutenberg e-texts
is _already_ in z.m.l. format... that is, they are already "following the rules"...

there are usually a few inconsistencies in each one, which my routines
can find and fix -- automatically, for the most part -- so i'm satisfied...

now, of course, it would be far better if p.g. tracked down the glitches,
so _their_ versions would be completely consistent as well, but oh well,
at least i know mine will be. and other developers will learn that too...

> As far as creating modern digital books, there is really no reason
> to be restricted to a lightweight markup language. And note that
> I continue to call it lightweight, because that is precisely what it is.

perhaps you misunderstood... i just told you what _we_ call it, and why.

i don't really care what you call it. it doesn't really care what you call it.

and i don't care if you imply it is limited.
or even if you say that directly.

as long as it does what _i_ want it to, the things _i_ consider necessary,
i will be happy with it. and i'm certain others will be happy with it too...
especially those authors who don't wanna waste any time doing markup.

> You say you want to maintain a mirror of gutenberg. An excellent idea.
> If you support export of gutenberg texts to HTML, I might even use it

of course i'll support conversion of my files to .html.
and of course people will use it.

but they'll quickly learn that conversion is an unnecessary step,
because e-texts in the native z.m.l. format are a better e-book experience,
thanks to the high-powered z.m.l. viewer-program...

> And if you've developed a converter,
> do you mind releasing it to the public,
> so that we can use it to convert gutenberg texts
> and see how well it does for ourselves?

well, yeah, actually i _do_ mind "releasing it to the public".
i have no intention of releasing any source code, thank you.
but it's available for sale, with a price in the 6-figure range...

however, you will receive the _fruits_ of the conversion process
-- in the form of totally consistent e-texts in my z.m.l. format --
when i mount my mirror. but that'll be sometime down the line,
because part of that job involves reformatting of the front-matter.

and -- for the 4th time now -- if you want to "see how well it does",
then just visit the web-page that i gave up at the top of this page:
> http://z-m-l.com/go/vl3.pl
> http://z-m-l.com/go/zmldingus093.pl

the second of those two is a "live" converter, which you can use
to convert a project gutenberg e-text if you like. you'll have to
clean it up a bit first -- so that it's in z.m.l. format -- but then
it'll work ok... with, of course, the caveats i've given all along --
"in-progress", etc.

-bowerbird

11-04-2007, 06:23 PM	#59
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles

jbenny said:
> He doesn't seem to be open to discussion or suggestions,
> but only in promoting his own way of doing things.

what, precisely, is it that you think you have "taught" me?

i've been working on this for many _years_ now, and i know
what my system does. and you've got -- at the very best! --
a sketchy understanding. yet somehow, you think you can
come up with something that i haven't considered? my word...

heavy-markup advocates like yourself have been attacking me
from the very first time i ever uttered a word about this system,
and they've stayed in attack-mode for -- quite literally -- _years_,
and yet you think you've come up with something unique? what?

that's rich. i mean, that's really _rich_...

this is a serious question: what is it you think i've "discounted"?

-bowerbird

11-04-2007, 07:56 PM	#60
kovidgoyal creator of calibre Posts: 43,881 Karma: 22666666 Join Date: Oct 2006 Location: Mumbai, India Device: Various

To re-iterate my points, which you haven't answered in your rather rambling response:

1) Light markup has minimal features. If you add more features, your viewer apps will become more complex anyway. That negates your viewer argument. Heavy markup is heavy for a reason: it supports features. A design philosophy that limits features in order to improve program simplicity is the wrong approach in these times of ever increasing CPU power.

2) If authors use a GUI to generate ebooks, then they don't care about the markup, which negates your argument for lightweight markup from the perspective of authors.

3) Lightweight markup is suitable for people who digitize books (like p.g.) but not for people who create books, since people who digitize/convert books typically don't care about advanced features, while people who create them do.

Some new points:

1) If you aren't open sourcing your code, then goodbye and good luck. All you're doing then is defining a specification. Any 10 year old who spends a week thinking about the requirements for an ebook format could do that.

2) Considering that you are designing a limited specification with closed source authoring/viewing software, support for changes to that format (which will have to be made over time) will be spotty at best.

Finally:

When it comes to designing format converters, the key is the output format. If you choose an output format that is a superset of all the input formats you might consider, it is then possible to use the converter to convert all input formats to a single output format. You do this by using an object model internally in the converter software, with plugins for input formats. And it then becomes easy to output to different formats using the object model. Starting with an output format that is more limited than the possible input formats is simply ass-backwards.

As I said before, zml might be a good idea for conversion of txt files for p.g. but little else. And without an open source converter from zml to html, it is emphatically not a good idea.
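
The converter design described above (input plugins feeding a shared internal object model, with a writer per output format) can be sketched roughly as follows. This is only an illustration and not code from any real converter: the names `Block`, `Book`, `read_plain_text`, `write_html`, `write_text`, `READERS`, `WRITERS`, and `convert` are all hypothetical, and the "all-caps chunk is a heading" rule is an assumption made to keep the example small.

```python
import html
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Block:
    kind: str  # "heading" or "paragraph" (hypothetical, deliberately tiny model)
    text: str


@dataclass
class Book:
    """Internal object model that every reader produces and every writer consumes."""
    blocks: List[Block] = field(default_factory=list)


def read_plain_text(source: str) -> Book:
    """Hypothetical input plugin: blank-line-separated chunks become blocks.

    Assumption for illustration only: an all-caps chunk is a heading.
    """
    book = Book()
    for chunk in source.split("\n\n"):
        text = " ".join(chunk.split())  # unwrap hard line breaks
        if not text:
            continue
        kind = "heading" if text.isupper() else "paragraph"
        book.blocks.append(Block(kind, text))
    return book


def write_html(book: Book) -> str:
    """Output writer: render the object model as minimal HTML."""
    tags = {"heading": "h1", "paragraph": "p"}
    return "\n".join(
        f"<{tags[b.kind]}>{html.escape(b.text)}</{tags[b.kind]}>"
        for b in book.blocks
    )


def write_text(book: Book) -> str:
    """Output writer: render the object model back to plain text."""
    return "\n\n".join(b.text for b in book.blocks)


# Registries: a new format is supported by adding one entry, nothing else changes.
READERS: Dict[str, Callable[[str], Book]] = {"txt": read_plain_text}
WRITERS: Dict[str, Callable[[Book], str]] = {"html": write_html, "txt": write_text}


def convert(source: str, src_fmt: str, dst_fmt: str) -> str:
    """Parse with the input plugin, then render from the shared object model."""
    book = READERS[src_fmt](source)
    return WRITERS[dst_fmt](book)
```

For example, `convert(sample, "txt", "html")` goes through the same `Book` as `convert(sample, "txt", "txt")`. The point of the design is that the object model has to be at least as expressive as every input format, otherwise information is lost before any writer sees it.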