What "Cleaning Up" Do Project Gutenberg Texts Need [closed]


bowerbird
11-02-2007, 05:08 PM
Editor's Note: bowerbird graciously allowed us to move this post to its own thread; it was originally posted here (http://www.mobileread.com/forums/showthread.php?p=111859#post111859).
-- NatCh

vivaldirules said:
> I had dreamy visions of downloading all of Project Gutenberg and
> carrying a fair fraction of all mankind's knowledge and wisdom with me
> everywhere. This was to be a monumental step-change in my life.
> The first thing I did was to download a book from there in TXT format,
> import and copy it to my Reader, and with great anticipation I sat down
> to enjoy it. I was instantly dismayed by how distracting it was to read
> with broken lines, forced hyphenation, poor pagination, and no indexing.

this has been something i've been working on for some time...

and, ironically, i just sent a message to the p.g. listserves yesterday,
constructing a list of the things that need to be done to a p.g. e-text
in order to make it typographically beautiful, and asking for input...
so i will repeat the list -- and the request for input -- here, for you...

> the idea is that you've loaded a plain-ascii p.g. e-text into
> your word-processor or desktop-publishing program with
> the objective of making it beautiful. what exactly do you do?
>
> please add to this, the start of a list, off the top of my head:
>
> 1. get rid of that ugly legalese at the top of the file.
> 2. make the title-page and front-matter look nice.
> 3. hotlink the table of contents. make one if necessary.
> 4. make all the headers big, bold, and distinctive, and
> 5. start chapters on a new page, maybe even a recto.
> 6. get rid of the empty lines between paragraphs, and
> 7. use book-style indents on each paragraph instead.
> 8. use full justification. or at least half-ragged.
> 9. use a reasonable line-width. full-screen is too wide.
> 10. white-space is free in an e-book, so use it liberally.
> 11. make block-quotes distinctive, for remix purposes.
> 12. links are great, but spare us the ugly blue underlines.
> 13. is an unlucky number.
> 14. don't put pagenumbers inside the text/paragraphs.
> 15. turn pg-ascii underscored text into _real_ italics.
> 16. pictures (even doodad thingees) enliven the text.
> 17. navigation aids among chapters are quite useful.
> 18. footnotes should have links going _both_ ways.
> 19. if it works better that way, turn a table on its side.
> 20. resize tables and images so they fit on one screen.
> 21. give your readers the luxury of generous leading!
> 22. block-quotes should be indented on the left and right.
> 23. create running heads and/or footers on each page.
> 24. (leaving some space for you...)
> 25. (leaving some space for you...)
> 26. show where we are in the book (page 39 of 208).
> 27. make the framework of the document _obvious_.
> 28. what the heck, just for the fun of it, make an index!
> 29. make the typesize big enough to be read easily!
> 30. get rid of that ugly legalese at the bottom of the file.
>
> these are general strategies. not all of them will be
> applicable to any one specific situation, and some
> (e.g., #8) are up to the preferences of the individual.
>
> and obviously, some of these could be fragmented
> into a very large number of sub-points, like #10...
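Item 15 in the list above (underscored text into real italics) is one of the more mechanical steps. A minimal sketch, assuming HTML output and assuming emphasis never spans a line break; the function name and the pattern are illustrative, not taken from any existing tool:

```python
import re

# Hypothetical pattern: an underscore-delimited span whose underscores hug the text.
# The lookarounds skip underscores inside identifiers such as snake_case.
ITALIC = re.compile(r"(?<![A-Za-z0-9_])_(?=\S)(.+?)(?<=\S)_(?![A-Za-z0-9_])")

def underscores_to_em(text: str) -> str:
    """Replace _emphasised_ spans with <em>emphasised</em>."""
    return ITALIC.sub(r"<em>\1</em>", text)

print(underscores_to_em("turn pg-ascii underscored text into _real_ italics."))
```

Because "." does not match a newline here, emphasis that wraps across two hard-wrapped lines is left untouched, so paragraphs would need to be unwrapped first.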

again, if there's anything you can add, i would appreciate it.

my aim is to write a program that will do a mass-beautification
of the entire project gutenberg library. i've made good progress.


> I don't have to tell you what happened when I then tried PDF files
> from the Internet Archive.

i wish you would've told us. i assume the text was too small to read.

-bowerbird

vivaldirules
11-02-2007, 05:18 PM
bowerbird said:
> i wish you would've told us. i assume the text was too small to read.

Yes, all the PDFs there are from scanned images that are far larger than the Sony screen and require some enhancement from PDFLRF (another useful workaround by MR folks) or something similar. That means a lot more time and effort fooling around and ends with huge files when all I was after in the first place was some simple well-formatted text.

Nice list of wants from PG. I'll give this more thought.

DaleDe
11-02-2007, 05:27 PM
bowerbird said:
> again, if there's anything you can add, i would appreciate it.
>
> my aim is to write a program that will do a mass-beautification
> of the entire project gutenberg library. i've made good progress.
>
> > I don't have to tell you what happened when I then tried PDF files
> > from the Internet Archive.
>
> i wish you would've told us. i assume the text was too small to read.
>
> -bowerbird

Part of beautification I do is to make curly quotes (double and single), curly apostrophes, and dashes that are really dashes. Sometimes I adjust hyphenation to make better looking lines.
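The quote and dash cleanup described here can be approximated with a few substitutions. This is only a sketch under assumptions: it treats two or more hyphens as an em dash and decides "opening" versus "closing" purely from the preceding character, so leading apostrophes (as in 'tis) will come out wrong and need a human check. The function name is made up for illustration.

```python
import re

def smarten(text: str) -> str:
    """Heuristic typographic cleanup: dashes first, then double and single quotes."""
    # Assumed convention: "--" (or longer) stands in for an em dash.
    text = re.sub(r"-{2,}", "\u2014", text)
    # A double quote at the start, or after whitespace, a bracket or a dash, opens.
    text = re.sub(r'(^|[\s(\[\u2014])"', "\\1\u201c", text)
    # Every remaining straight double quote is treated as closing.
    text = text.replace('"', "\u201d")
    # Same idea for single quotes; the leftovers are apostrophes or closing quotes.
    text = re.sub(r"(^|[\s(\[\u2014\u201c])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text

print(smarten('"Don\'t go," she said--"not yet."'))
```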

On small screens the margin needs to be almost as wide or as wide as the screen. Too many page changes interrupt the reading pleasure.

Images should be adjusted for color (perhaps converted to gray scale) and adjusted for the screen.

For Gutenberg you need to check paragraph splits, bad scan errors and other items. I am afraid you almost have to read the whole book to get it right.

Dale

JSWolf
11-02-2007, 05:45 PM
The thing to do is take the book and convert it, then read it, fix it, and post the fixes. If someone else is reading it, please post any corrections you find so they can be fixed; then a new version can be made and posted. It's not feasible to read every book before posting. With Music of the Spheres, though, I fixed a few things I found right away and fixed the rest as I read, until I was done and had fixed everything I found.

NatCh
11-02-2007, 05:46 PM
What you say is true, JSWolf, but bowerbird is trying to come up with an app to fix as much of the obvious stuff automagically as possible. :nice:

JSWolf
11-02-2007, 05:49 PM
bowerbird said:
> 26. show where we are in the book (page 39 of 208)

Isn't that up to the software doing the displaying of the file as to how it displays the page number?

JSWolf
11-02-2007, 05:51 PM
NatCh said:
> What you say is true, JSWolf, but bowerbird is trying to come up with an app to fix as much of the obvious stuff automatically as possible. :nice:

It should not be too hard to make such an app to do some of what's needed for an initial clean up. The hardest thing, I think, for a lot of people is the page numbers. I know Word and Book Designer have regexp. But if you don't know regexps, or don't know you can use them to remove the page numbers, then you will have to do it manually, leave them in, or not convert.
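For anyone who has not used a regexp for this, a rough sketch of the idea follows. The marker shapes are assumptions ("[Pg 12]", "[Page 12]", "[p. 12]" or "{12}"); each book has to be inspected first, and roman-numeral page markers are not handled.

```python
import re

# Assumed marker shapes only; adjust the alternatives to what the book actually uses.
PAGE_MARK = re.compile(r"[ \t]*(?:\[(?:Pg|Page|p\.)\s*\d+\]|\{\d+\})")

def strip_page_numbers(text: str) -> str:
    """Delete inline page markers together with the spaces in front of them."""
    return PAGE_MARK.sub("", text)

print(strip_page_numbers("the long road [Pg 39] wound on"))
```

Only spaces and tabs in front of a marker are removed, never newlines, so a marker sitting between two paragraphs does not glue them together.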

bowerbird
11-02-2007, 07:31 PM
dalede said:
> curly quotes (double and single), curly apostrophes,
> and dashes that are really dashes.

curly-quotes and em-dashes, how could i forget those? :+)
oh well, like i said, it was a list off the top of my head...

***

jswolf said:
> Isn't that up to the software
> doing the displaying of the file as to
> how it displays the page number?

yes sir, there is a mixture of things in the list,
some of which aren't relevant for all situations,
and some of which are geared to functionality,
not beauty. (except functionality _is_ beautiful.)
practically none of them are fully cut-and-dried.

for that particular item, i was thinking of .pdf,
and other formats of the fixed-page persuasion,
where the number of total pages is known for
any particular conversion (e.g., at textsize=12),
so as to be a basis for a relativistic comparison
with another conversion (e.g., at textsize=16)...

something like "page 180" has very little meaning
if we don't know if there are 200 or 800 or 1600
pages in any one particular conversion of a book.
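The point about relative position is simple arithmetic. A small sketch, with a hypothetical function name, showing how the same page number means different things in conversions of different lengths:

```python
def describe_position(page: int, total_pages: int) -> str:
    """Format a page number together with its share of the whole conversion."""
    if total_pages <= 0 or not 1 <= page <= total_pages:
        raise ValueError("page must lie between 1 and total_pages")
    return f"page {page} of {total_pages} ({page / total_pages:.1%})"

print(describe_position(180, 200))  # near the end of a 200-page conversion
print(describe_position(180, 800))  # less than a quarter through an 800-page one
```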


> It should not be too hard to make such an app
> to do some of what's needed for an initial clean up.

actually, it's not really easy, for the simple reason that
p.g. e-texts have maddeningly inconsistent formatting...

even where there is a straightforward rule on something --
e.g., there should be 4 blank lines before a chapter heading
-- consistency checks weren't done to ensure that's the case.
so even something as relatively simple as finding headings
must be engineered to catch inconsistencies, and of course,
since you don't _know_ all the ways they were inconsistent,
it's not that simple to know what code you have to engineer.
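One way to tolerate that kind of inconsistency is to relax the "4 blank lines" rule and back it up with other cues. The sketch below is an assumption-laden heuristic, not bowerbird's code: it accepts a short line followed by a blank line if it has at least two blank lines before it, or just one when the line is all caps or starts with a word like "chapter".

```python
import re

CHAPTER_WORD = re.compile(r"^(chapter|book|part)\b", re.IGNORECASE)

def find_headings(lines: list[str], min_blank: int = 2, max_len: int = 60) -> list[int]:
    """Return the indexes of lines that look like headings."""
    hits = []
    blanks = min_blank  # treat the start of the file as preceded by blank lines
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            blanks += 1
            continue
        followed_by_blank = i + 1 == len(lines) or not lines[i + 1].strip()
        looks_like_title = bool(CHAPTER_WORD.match(stripped)) or stripped.isupper()
        enough_space = blanks >= min_blank or (blanks >= 1 and looks_like_title)
        if len(stripped) <= max_len and followed_by_blank and enough_space:
            hits.append(i)
        blanks = 0
    return hits

sample = ["CHAPTER I", "", "It was a dark night.", "The wind howled.",
          "", "", "", "", "Chapter II", "", "Morning came.", ""]
print(find_headings(sample))
```

A short one-line paragraph after two blank lines would still be flagged, which is why the output of any such routine needs a person to review it.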

the worst part of all -- as i'm sure you volunteers who have
converted p.g. e-texts already know -- is that p.g. employed
no method to inform us what lines should _not_ be rewrapped,
such as lines of poetry, lines in a table, lines in address-blocks,
and so on. finding and fixing these lines can be time-consuming.
and writing routines that can root them out is not entirely trivial.
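To give a feel for why this is not trivial, here is a deliberately crude sketch. It assumes prose paragraphs are made of unindented lines that (apart from the last) nearly fill the wrap width, and it leaves every other block exactly as found. The threshold of 55 characters is a guess, and short-lined prose or long-lined verse will fool it.

```python
import re

def rewrap(text: str, min_full: int = 55) -> list[str]:
    """Unwrap blocks that look like prose; keep poetry, tables and the like verbatim."""
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip("\n"))
    out = []
    for block in blocks:
        lines = block.split("\n")
        body = lines[:-1]  # the last line of a prose paragraph may be short
        is_prose = bool(body) and all(
            len(line) >= min_full and not line.startswith((" ", "\t"))
            for line in body
        )
        out.append(" ".join(line.strip() for line in lines) if is_prose else block)
    return out

prose = ("It was the best of times, it was the worst of times, it was the\n"
         "age of wisdom, it was the age of foolishness, it was the epoch\n"
         "of belief.")
poem = "  Tyger Tyger, burning bright,\n  In the forests of the night;"
for paragraph in rewrap(prose + "\n\n" + poem):
    print(repr(paragraph))
```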

my intention is to fix the files, and mount a mirror with my files.
michael hart graciously agreed to provide diskspace/bandwidth...

-bowerbird

p.s. i agree entirely that whatever beautification is done needs to
be subjected to the quality-control process of being read by people.
my hope is that the beauty of the files will be an alluring invitation.
for my part, i intend to make error-reporting and feedback _much_
more simple and responsive than it is with project gutenberg itself.
it will be more wiki-like, in that reports will be immediately visible...

Nate the great
11-02-2007, 07:49 PM
This first set will be fairly easy to implement (if outputting in HTML).
> 1. get rid of that ugly legalese at the top of the file.
> 3. hotlink the table of contents. make one if necessary.
> 4. make all the headers big, bold, and distinctive, and
> 6. get rid of the empty lines between paragraphs, and
> 7. use book-style indents on each paragraph instead.
> 8. use full justification. or at least half-ragged.
> 10. white-space is free in an e-book, so use it liberally.
> 11. make block-quotes distinctive, for remix purposes.
> 12. links are great, but spare us the ugly blue underlines.
> 15. turn pg-ascii underscored text into _real_ italics.
> 18. footnotes should have links going _both_ ways.
> 22. block-quotes should be indented on the left and right.
> 29. make the typesize big enough to be read easily!
> 30. get rid of that ugly legalese at the bottom of the file.
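Items 1 and 30 are a fair example of the "easy" group. A minimal sketch, assuming the file marks the body with lines beginning "*** START OF" and "*** END OF"; files that use other wording for their small print fall through unchanged rather than being cut in the wrong place.

```python
import re

# Assumed marker lines; not every Project Gutenberg file uses exactly this wording.
START = re.compile(r"^\*\*\*\s*START OF .*$", re.MULTILINE)
END = re.compile(r"^\*\*\*\s*END OF .*$", re.MULTILINE)

def strip_legalese(text: str) -> str:
    """Return only the text between the start and end marker lines."""
    start = START.search(text)
    end = END.search(text)
    if not start or not end or end.start() < start.end():
        return text  # markers missing or out of order: leave the file alone
    return text[start.end():end.start()].strip("\n") + "\n"

sample = (
    "License text...\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n\n"
    "CHAPTER I\n\nText.\n\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "More license text...\n"
)
print(strip_legalese(sample))
```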


I am not sure what these mean. Can someone elaborate?
> 27. make the framework of the document _obvious_.
> 9. use a reasonable line-width. full-screen is too wide.
> 14. don't put pagenumbers inside the text/paragraphs.
> 17. navigation aids among chapters are quite useful.
> 21. give your readers the luxury of generous leading!


The following are indeterminate because they are dependent on input or output. For instance, PG ASCII files don't really have tables. Creating a definition sufficiently broad to cover all possibilities but not screw with the surrounding text will be an interesting exercise.
> 5. start chapters on a new page, maybe even a recto.
> 16. pictures (even doodad thingees) enliven the text.
> 19. if it works better that way, turn a table on its side.
> 20. resize tables and images so they fit on one screen.
> 23. create running heads and/or footers on each page.
> 26. show where we are in the book (page 39 of 208).
> 28. what the heck, just for the fun of it, make an index! (This last one might take a lot of computing power. )

@bowerbird
Will the output be in HTML? That's the closest I know to a universal file type. You could create a BD file (in HTML0). I don't know yet if the specs are accessible.

Are you familiar with flex and yacc? They are what I would use to do this.


EDIT: @bowerbird I did not see your post until after I posted mine. Some of my questions have been answered.

bowerbird
11-02-2007, 09:32 PM
nate said:
> Will the output be in HTML? That's the closest I know to a universal file type.

i will transform the pg-ascii files into my own format -- z.m.l. --
which stands for "zen markup language", a light-markup system.

i designed z.m.l. based on pg-ascii, to speed the transformation.

or, more accurately, to tell the story the way it really evolved,
i originally wrote a viewer-program for project gutenberg e-texts.

what happened was that writing the viewer-program was easy,
but resolving inconsistencies in the p.g. e-texts made it complex.

at some point, i threw up my hands and said, "it will be easier to
create a set of dirt-simple rules, and then make all the e-texts be
consistent with that rule-set, than continue efforts at overcoming
never-ending p.g. inconsistencies." fortunately, i'd already coded
routines to resolve the bulk of the inconsistencies, so it was simple
to "convert" an e-text just by outputting a _consistent_ version of it.
(for example, making sure it had 4 blank lines before each header.)

in the end, it's better to have a consistent library, because then other
developers can concentrate on adding value, instead of parsing text...

z.m.l. is a set of simple rules for expressing the _structure_ of a book.
that is to say, all the structural elements are indicated in a unique way.

this means a zml-viewer can take a z.m.l. file as _input_ and render it.

it also means that converter-routines can transform a z.m.l. file into
outputs of various types. thus far, i have focused on .html and .pdf...

z.m.l. viewer-apps are easy to program, because the z.m.l. format is simple.
i've already written various iterations in 3 languages, including basic and perl.

i'm biased, of course, but i think my viewer kicks other programs to the curb...
so, eventually, if you've got a book in z.m.l., you won't even want to convert it,
because there will be a zml-viewer-program that will run wherever it's needed...

long-run, i even expect browsers to accept z.m.l. and display it correctly.

here's a webpage where you can see zml-to-html canned demos:
> http://z-m-l.com/go/vl3.pl
click the linked book-titles to see the z.m.l., or the button to convert it.

if you're brave, you can even experiment with live zml-to-html conversion:
> http://z-m-l.com/go/zmldingus093.pl
click in the same canned demos from the url above, or click "skeleton"
to bring in a skeleton book, which you can edit, and then click "do it"...

-bowerbird

jbenny
11-02-2007, 09:40 PM
Two existing tools that do most of what you want: Gutenmark and HTML Book Fixer. Both free. They can be used separately or in combination, as they each have unique capabilities.

bowerbird
11-03-2007, 02:58 AM
jbenny said:
> Two existing tools that do most of what you want: Gutenmark and HTML Book Fixer.

um, no. not even close, really. honorable efforts, but my intention is much wider.

-bowerbird

Robert Marquard
11-03-2007, 03:20 AM
Please, bowerbird has been pestering PG with his ZML for a long time. Distributed Proofreaders banned him recently. Please do not let him promote his inadequate format here.

HappyMartin
11-03-2007, 05:04 AM
A lot of this stuff is way over my head. The only thing I'm not keen on is the "white space is free so use it liberally". I have some books with so much white space that I need to change pages very frequently. Drains the battery and slows everything down with the slower e ink refresh rates. Just a suggestion.

RWood
11-03-2007, 10:36 AM
So his questions were really a set-up for his ZML scheme.

Just what we need, another new format.

I'll stick to what I use now thank you.

vivaldirules
11-03-2007, 11:41 AM
Robert Marquard said:
> Distributed Proofreaders

Now there's a group that could use some help. It certainly benefits us here if the texts at Gutenberg are accurate. Think I'll sign up to volunteer. Thanks for mentioning it! :book2:

kacir
11-03-2007, 12:38 PM
> 1. get rid of that ugly legalese at the top of the file.
> 2. make the title-page and front-matter look nice.
> 3. hotlink the table of contents. make one if necessary.
> 4. make all the headers big, bold, and distinctive, and
> 5. start chapters on a new page, maybe even a recto.
> 6. get rid of the empty lines between paragraphs, and
> 7. use book-style indents on each paragraph instead.
> 8. use full justification. or at least half-ragged.
> 9. use a reasonable line-width. full-screen is too wide.
> 10. white-space is free in an e-book, so use it liberally.
> 11. make block-quotes distinctive, for remix purposes.
> 12. links are great, but spare us the ugly blue underlines.
> 13. is an unlucky number.
> 14. don't put pagenumbers inside the text/paragraphs.
> 15. turn pg-ascii underscored text into _real_ italics.
> 16. pictures (even doodad thingees) enliven the text.
> 17. navigation aids among chapters are quite useful.
> 18. footnotes should have links going _both_ ways.
> 19. if it works better that way, turn a table on its side.
> 20. resize tables and images so they fit on one screen.
> 21. give your readers the luxury of generous leading!
> 22. block-quotes should be indented on the left and right.
> 23. create running heads and/or footers on each page.
> 24. (leaving some space for you...)
> 25. (leaving some space for you...)
> 26. show where we are in the book (page 39 of 208).
> 27. make the framework of the document _obvious_.
> 28. what the heck, just for the fun of it, make an index!
> 29. make the typesize big enough to be read easily!
> 30. get rid of that ugly legalese at the bottom of the file.


At point 1.
I do agree that that legalese is ugly.
I do not think, however that we should remove it entirely.
I suggest moving it to the end of the file and placing a line like this "for licence see bottom of file" instead.

At point 4.
I personally do not like fancy headers. When I format a book for myself, I remove all fancy formatting from headers.

at point 8.
This is again a matter of personal preference. Fully justified text does *look* better. When it comes to hours and hours of continuous reading many people find those "ragged" left justified paragraphs easier to read.

point 23.
Here again I have to politely disagree with you

point 24.
Set page margins as small as possible. We have paid a lot of money to have the screen as big as possible, so why would we want to waste 20% of the screen real estate on margins? Wide margins make sense in printed books. The Reader, however, has a "built-in margin" around the screen. This is the issue that bothers me the most when I see an e-book from the Connect store.

point 25.
Use a sans-serif font. Just as with fully justified text, the serif font looks better in a printed book. On a low-resolution display (and anything below 300dpi *is* low) the sans-serif font is much more readable.

point 30.
see point 1.

If you want to use Microsoft Word for formatting the documents, beware. Microsoft products use symbols (like left and right quotes) that are not according to standards. This is most notable when you use a standards-compliant browser (like Firefox) to view an html page that was generated using MS Word, or if you have text with fancy curly quotes and you upload it as an rtf file to the Reader. The plain, simple, basic quote " does not look as nice as a typographical one, but it will display correctly on any reading device.

Nate the great
11-03-2007, 12:48 PM
@kacir
We remove the license stuff because we do not wish to agree to the license.

JSWolf
11-03-2007, 12:57 PM
The problem (in my opinion) with ZML is that it's non-standard. There are NO converters that will read it and convert it into something that can then be used to generate a book that will be readable on a portable book reader. Book Designer won't read it, libprs500 won't convert it. So none of the devices we have (cell phones, PDAs, e-ink readers) will actually have anything to do with ZML. So why not just do the cleanup of the PG text and output standard HTML that is 100% adequate for the task at hand and thus can either be read as is or converted easily to be read on a portable reading device? What is it about ZML that makes it that much better than HTML? And what is there in a PG text that HTML cannot handle?

bowerbird, you do make some valid points about the consistency (standardization) and some of the problems with PG text. And then you go wanting to use ZML, which is not a standard at all and which nobody can convert to something they can actually read. Can you see why we have a problem here? You are trying to fix something that may possibly be broken and adding a layer of inconsistency to it, so it comes out even more broken than before you started.

JSWolf
11-03-2007, 01:04 PM
Nate the great said:
> @kacir
> We remove the license stuff because we do not wish to agree to the license.

I thought we removed the license stuff because it is easier to remove than to format so it looks decent.

What I do is keep the PG stuff at the beginning of the book, as it's only one page. I do remove the multi-page license at the end. I have (in the past) formatted it so it looks decent and is more readable. But nobody (IMHO) is really going to bother to read it.

tompe
11-03-2007, 01:10 PM
How do you represent the semantic structure with HTML?

I am surprised that no "exchange format" has been used for conversion, with that format put up in the book section here. I now find books formatted for Sony that look interesting, but I want them in Mobipocket format and they are not available.

jbenny
11-03-2007, 02:09 PM
bowerbird said:
> jbenny said:
> > Two existing tools that do most of what you want: Gutenmark and HTML Book Fixer.
>
> um, no. not even close, really. honorable efforts, but my intention is much wider.
>
> -bowerbird

Boy, I can feel your disdain all the way across the internet. The two tools I mentioned may not be perfect, but they already exist and are very useful. When your vaporware program reaches perfection and is actually available, let me know and I'll give it a try.

DaleDe
11-03-2007, 03:41 PM
JSWolf said:
> I thought we removed the license stuff because it is easier to remove than to format so it looks decent.
>
> What I do is keep the PG stuff at the beginning of the book, as it's only one page. I do remove the multi-page license at the end. I have (in the past) formatted it so it looks decent and is more readable. But nobody (IMHO) is really going to bother to read it.

Having it at the beginning is problematic for many ebook readers that can only jump back to the beginning of a document. The eb1150 requires a tag name="toc" to allow it to find the TOC, and many people leave it out, so then it has the same problem. The TOC needs to be near the beginning so that it can be reached easily. It should not be 10 pages in! Therefore I vote for moving it or removing it.

Dale

DaleDe
11-03-2007, 03:45 PM
tompe said:
> How do you represent the semantic structure with HTML?
>
> I am surprised that no "exchange format" has been used for conversion, with that format put up in the book section here. I now find books formatted for Sony that look interesting, but I want them in Mobipocket format and they are not available.

The exchange format should be xhtml and its logical container epub. This is what we should promote and focus on. BD's format html0 is horrible html and coming up with another one is silly IMHO since we have one already. We should be active in ensuring the xhtml does what we need and epub does what we need.

Mobipocket is committed to importing epub even if they don't support it natively.

Dale

JSWolf
11-03-2007, 03:53 PM
kacir said:
> At point 1.
> I do agree that that legalese is ugly.
> I do not think, however that we should remove it entirely.
> I suggest moving it to the end of the file and placing a line like this "for licence see bottom of file" instead.

If we move it to the end, we can put in a ToC entry to it in case anyone does want to read it.

> At point 4.
> I personally do not like fancy headers. When I format a book for myself, I remove all fancy formatting from headers.

My way of doing it: main text in a serif font, and headers in sans-serif, larger and bold (I think). Nothing fancy is needed. Just an easy way to let us know this is a chapter title.

> at point 8.
> This is again a matter of personal preference. Fully justified text does *look* better. When it comes to hours and hours of continuous reading many people find those "ragged" left justified paragraphs easier to read.

Personally I prefer justified with small margins or no margins (on the Sony Readers).

> point 23.
> Here again I have to politely disagree with you

I do have headers and/or footers for the books I create in LRF. But there is no need for these in the actual text. It just makes it like a converted PDF where we then have to strip them out. In my case Book Designer or html2lrf creates the headers/footers without them being in the text.

> point 24.
> Set page margins as small as possible. We have paid a lot of money to have the screen as big as possible, so why would we want to waste 20% of the screen real estate on margins? Wide margins make sense in printed books. The Reader, however, has a "built-in margin" around the screen. This is the issue that bothers me the most when I see an e-book from the Connect store.

I have taken to setting the margins to 0, as I like to have as much on screen as I can fit. And it takes fewer pages and less battery to read a given book that way. Wide margins annoy me. In a paper book, there are reasons for the margins and they are fine. But on a reading device, it makes no sense to have wide margins.

> point 25.
> Use a sans-serif font. Just as with fully justified text, the serif font looks better in a printed book. On a low-resolution display (and anything below 300dpi *is* low) the sans-serif font is much more readable.

This is personal preference. I like the serif better.

> If you want to use Microsoft Word for formatting the documents, beware. Microsoft products use symbols (like left and right quotes) that are not according to standards. This is most notable when you use a standards-compliant browser (like Firefox) to view an html page that was generated using MS Word, or if you have text with fancy curly quotes and you upload it as an rtf file to the Reader. The plain, simple, basic quote " does not look as nice as a typographical one, but it will display correctly on any reading device.

The MS Word generated curly quotes do work in Book Designer for generating LRF and Mobi format books.

bowerbird
11-03-2007, 04:03 PM
jbenny said:
> Boy, I can feel your disdain all the way across the internet.

you know, i _really_ dislike it when people try to put words in my mouth,
or try to tell other people what _i_ think, or how _i_ feel about something.

"honorable" is _not_ a term of "disdain", not in my neck of the woods.

and i've had lots of friendly discussion with ron burkey, who made gutenmark,
so i really resent that you'd try to soil the pleasant nature of our relationship...

ron would tell you himself that my scope is entirely different than his was...
he would also tell you -- as he told a lot of people -- that as gutenmark grew
better and better, his ambitions grew even faster, so it "fell short" of them...

and since he'd gotten tired of maintaining it -- an important consideration,
wouldn't you say? -- he was very supportive of me and my similar efforts.

***

as for the other tool you cited, it's aimed at .html files, so its focus is limited.
and even more importantly -- to me -- it's windows-only. so, as a mac person,
it _doesn't_exist_ for people like me. my tools are cross-platform, thank you,
mac and windows, and i'll even make linux versions if i get some beta-testers.

***

plus, your characterization of my work as "vaporware" is an entirely vapid insult.
if you really want to "give it a try", then visit the websites i pointed to up above.
because when you call something "vaporware", and someone else can simply
visit a website and see it in action, it makes you look kind of... well, _stupid_.

anyway, where's a moderator when some _real_ disrespect is being manifested?
(and no, that's not a call for a moderator. it's better to let rudeness reveal itself.)

-bowerbird

Nate the great
11-03-2007, 04:09 PM
Since bowerbird is going off in his own little world, I decided to create my own autoformatter for PG ebooks. My planned output format is very basic html. It's the closest to a universal format right now. I think I will be able to implement most of the suggestions in the first post of this thread.

DaleDe said:
> The exchange format should be xhtml and its logical container epub. This is what we should promote and focus on. BD's format html0 is horrible html and coming up with another one is silly IMHO since we have one already. We should be active in ensuring the xhtml does what we need and epub does what we need.
>
> Mobipocket is committed to importing epub even if they don't support it natively.
>
> Dale

One problem with xhtml as an exchange format is lack of backward compatibility. (I also don't know what the tags are, so if you could point me at a tutorial I would appreciate it.) On another point, does any device use epub yet?

kovidgoyal
11-03-2007, 04:19 PM
Since I have a little experience writing converters, I'd just like to say that if somebody does write a new, improved Gutenberg-to-HTML converter, it should use a well-defined semantic scheme based on CSS classes. This would make the HTML much better suited to conversion into an ebook format like epub or LRF.

Some important things to have in the generated HTML would be

1. A meta tag identifying the type of file (i.e. identifying it as the output of that automatic converter). This is necessary for parsing the semantic information.

2. CSS classes for things like page breaks, chapter titles, chapter subtitles, inline vs. block vs full page images.

3. Use of semantic HTML tags like <em> and <strong> instead of <i> and <b>

etc.
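A rough sketch of what output following those three points might look like. Everything named here is hypothetical: the generator string, the class names ("pagebreak", "chapter-title", "para") and the function itself are placeholders that a real converter and its downstream tools would have to agree on. Paragraphs are assumed to be HTML fragments that already use <em> and <strong>.

```python
from html import escape

def render_chapter(title: str, paragraphs: list[str]) -> str:
    """Emit one chapter as HTML with a generator meta tag and semantic classes."""
    parts = [
        "<!DOCTYPE html>",
        "<html><head>",
        '<meta charset="utf-8">',
        # Hypothetical generator name; a later converter could key on it.
        '<meta name="generator" content="pg-autoformat-sketch 0.1">',
        f"<title>{escape(title)}</title>",
        "</head><body>",
        '<div class="pagebreak"></div>',
        f'<h2 class="chapter-title">{escape(title)}</h2>',
    ]
    parts += [f'<p class="para">{p}</p>' for p in paragraphs]
    parts.append("</body></html>")
    return "\n".join(parts)

print(render_chapter("Chapter I", ["It was <em>late</em>."]))
```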

tompe
11-03-2007, 04:26 PM
And it would be nice if new tools are platform independent.

jbenny
11-03-2007, 04:46 PM
Nate the great said:
> Since bowerbird is going off in his own little world, I decided to create my own autoformatter for PG ebooks. My planned output format is very basic html. It's the closest to a universal format right now. I think I will be able to implement most of the suggestions in the first post of this thread.
>
> One problem with xhtml as an exchange format is lack of backward compatibility. (I also don't know what the tags are, so if you could point me at a tutorial I would appreciate it.) On another point, does any device use epub yet?

Not a tutorial, but a very nicely organized reference: http://xhtml.com/en/xhtml/reference/

I agree with DaleDe that the use of XHTML makes the most sense. For those who don't know how it differs from HTML, the reference I mentioned above will tell you exactly what is supported. XHTML is not some strange new beast. It is simply a more formalized version of the HTML that we all know. Some of the bad habits of HTML have been eliminated, some tags deprecated and a consistent structure is required. There is XHTML 1.0 Transitional, XHTML 1.0 Frameset, XHTML 1.0 Strict and XHTML 1.1. The latest 1.1 version is essentially 1.0 Strict and continues the process of clearing out some of the crud from HTML. There is a good reason that the IDPF folks specified XHTML 1.1 for use in epub.

Oh, and there is no reason that XHTML can't be backwards compatible, since it is really just HTML that has been cleaned up a bit. Even for unclosed HTML tags like <br>, in XHTML you can use either <br/> or <br />. The second version should work correctly in older browsers.

The only major issue with XHTML is the use of <?xml version="1.0" encoding="utf-8"?> at the start of each document. This is the proper declaration to use, but Internet Explorer in particular has a problem with this and goes into "quirks" mode. If you need IE compatibility, you can leave this line out. For other purposes (especially epub), you should have it.
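The "consistent structure is required" part is easy to see with any XML parser. A small sketch using Python's standard library; it checks only well-formedness of a fragment, not validity against the XHTML 1.1 DTD, and the helper name is made up:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the fragment parses as XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>line one<br />line two</p>"))  # self-closed break parses
print(is_well_formed("<p>line one<br>line two</p>"))    # unclosed break does not
```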

jbenny
11-03-2007, 05:01 PM
bowerbird said:
> you know, i _really_ dislike it when people try to put words in my mouth

Talk about the pot calling the kettle black. The disdain I felt was your reaction to my suggestion, not to the tools mentioned. I'm sure you want to have the last word, so go ahead. I won't be replying.

Nate the great
11-03-2007, 05:47 PM
@jbenny

Thanks for the assistance. I am currently making a list of all the tags I am planning to use; I removed the icky ones.

bowerbird
11-03-2007, 06:03 PM
happymartin said:
> The only thing I'm not keen on is the "white space is free so use it liberally".
> I have some books with so much white space that I need to change pages
> very frequently. Drains the battery and slows everything down with the
> slower e ink refresh rates.

thank you so much. this is something that never would've occurred to me.

one reason, of course, is because my focus was on "beautiful typography".

but another reason is just because i don't consider issues like battery life,
so your post has helped push out the edges of my thinking, and i like that.

however, this is just a list of tactics i have seen used. some of 'em might
be totally inappropriate in some situations. that's the nature of the beast.

moreover, some of 'em are subjective, depending on personal preference.
or, as in this case, subject to other concerns, such as battery-life effects.

what it boils down to is that the e-book must be created by the individual;
it should be customized to the preferences active in the particular situation.

this is one of my _guiding_ principles. the z.m.l. format is _a_format_,
which is brought to life by a viewer (or converter) that lets an end-user
customize the book to their specs. so if you want big margins, order them.
if you want small margins, just say so. specify the pagesize, the fontsize,
the font, the colors, nearly everything in the e-book is under your control.

it's folly to think that you can create a .pdf that will make everyone happy.
there's just too much idiosyncrasy in our preferences. so we need to make
our e-books as _flexible_ as possible, so they can take whatever form is
desired by each specific individual in each specific situation. any system
that doesn't give such flexibility to readers is, in my eyes, simply doomed.

beauty is in the eye of the beholder.

so we don't need to even discuss justified/ragged or tight/loose leading
or any of the many issues. we need to make sure we give customization.

***

robert said:
> inadequate

interesting how some people want to make your decisions for you, isn't it?
the proof is in the pudding. the proof is in the pudding.

***

rwood said:
> So his questions were really a set-up for his ZML scheme.

and interesting how quickly some people jump to the wrong conclusion.

no, the question isn't a "set-up" for anything. it's a legitimate topic --
and it grew here directly out of the observation made by vivaldirules,
an observation back in the thread in which this was originally posted.

and, as indicated by the substantial head-start i provided with my list,
it's a topic i've given a lot of thought to, thought that i shared with you,
in a list people can now use as a guide in their own beautification efforts.


> Just what we need, another new format.

well gee, i suppose if i introduced it as "the format to end all formats",
maybe it would be ok with you? just like openreader, or oeb, or epub?

the fact of the matter is that it'll be the marketplace that decides on
which format prevails, and until that decision is made (which might
take several decades), there's no reason for any contender to defer...

contrary to what some people believe, a large number of formats does
_not_ hold back a technology when its time has come, as proven by
the huge growth of word-processing despite a plethora of formats...


> I'll stick to what I use now thank you.

what _you_ use, now or later, is immaterial to the rest of the world...

but let me be totally clear. i'm not making a big push for my format.
that kind of hype and spin won't work anyway, as openreader showed.

what i'm going to do is mount the entire p.g. library using my format,
and demonstrate how a simple-but-powerful format can prove itself
by delivering a tremendous cost-benefit ratio to authors and readers,
even without backing from deep pockets like those of amazon or adobe.

besides, plain-ascii as a format has _already_ proven its value clearly.
the only problem with the format is that it's _ugly_, and that is why
i'm concentrating on making it _beautiful_ and even _more_ powerful.

***

jswolf said:
> The problem (in my opinion) with ZML is that it's non-standard.

it will be a standard some day. well, maybe not z.m.l. per se, but
_some_ form of light markup. the current leader is "markdown",
which is being used all over cyberspace, including in wordpress...
it's getting play in all kinds of software. heck, the actual _manual_
for textmate -- a text-editor highly regarded by mac technoids --
is made in markdown. the light-markup revolution is unstoppable,
because ordinary people don't want the hassle of doing markup...

heck, you can see traces of it in the comment-box in this forum,
where a u.r.l. is automatically turned into the appropriate hyperlink.
that's the exact type of thing that light-markup does for a person,
and there's no earthly reason why we wouldn't want such assistance.
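
As a rough illustration of the kind of assistance described here, the following Python sketch turns bare URLs in plain text into hyperlinks. It is only a guess at how such a feature might work, not the forum software's or any markdown implementation's actual code; the regular expression and the name `autolink` are assumptions made for the example.

```python
import html
import re

# Assumed pattern: anything starting with http:// or https:// up to whitespace.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def autolink(text: str) -> str:
    """Escape text for HTML and wrap bare http(s) URLs in anchor tags."""
    parts = []
    last = 0
    for match in URL_RE.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?)")
        parts.append(html.escape(text[last:match.start()]))
        escaped = html.escape(url, quote=True)
        parts.append('<a href="{0}">{0}</a>'.format(escaped))
        last = match.start() + len(url)
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(autolink("try it at http://z-m-l.com/go/vl3.pl."))
```

The author types nothing but the address itself; the markup is supplied by the routine.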

furthermore, you will find that markdown has a _lot_ of support.
(if you want to do a google search, use "markdown and textile" to
avoid false alarms from merchants offering a "markdown" in price.
"textile" is another form of light markup.) one of my _favorite_
markdown-related sites is a live authoring site called "showdown":
> http://www.attacklab.net/showdown-gui.html
i encourage you to try it. see how easy and powerful markdown is.

oh yeah, most of the light-markup systems output (x)html, so
the argument that light-markup is "nonstandard" is kinda silly.
heck, markdown was made, and is maintained by, standardistas.

z.m.l. and other light-markup systems know the _structure_ of
the document -- they know exactly what each element _is_ --
so they can generate any kind of "standard" output you want...


> There are NO converters that will read it and convert it into
> something that can then be used to generate a book that will be
> readable on a portable book reader.

well, _i_ am making converters for z.m.l.

i have already done some, and i will do more...
so it's totally inaccurate to say there are none.

more importantly, the rules are so simple that any programmer can
make a converter, really. it doesn't take much coding chops to do it.

even a z.m.l. _viewer-program_ is a fairly easy programming task, and
converters are an order of magnitude easier than making a viewer-app.
(you mean i don't have to _handle_ the footnote, just _mark_it_? cool!)

.html from my converter works fine on the web, my current objective.
soon i will do whatever tweaks are necessary to make sure it converts
well to mobipocket, since amazon will keep that format alive for years.
and i'll probably target the rocketbook, for romantic historical reasons.

as for other formats, the numbers on plucker indicate i can ignore that.

and with the iliad and sony supporting .pdf (which i can convert well),
i can see no good reason to make a converter into their other formats.

and again, z.m.l. viewer-apps are easy enough to program and port
that there'll be one for every important platform before you know it,
so i don't see conversions being all that important in the long run...

i'm of the opinion that the iphone will become the common denominator
we have to consider, and through its web-browser, i already support it...


> What is it about ZML that makes it that much better than HTML?

it's easier for developers to add value around the simple format of z.m.l.
than to try and engineer around all of the markup cruft in an .html file...

surely you know the gamut of rationale in support of plain-text files,
and the usefulness of the tool-chain that can take advantage of them...

plus, the z.m.l. viewer-program is built specifically for the task of books
(and long documents composed of many "chapters"/sections in general),
so it gives a superior reading experience over the clunky web-browser.

but again, i'm not trying to "sell" you on z.m.l. if you like your .html, fine.
go ahead and use it. i'm aiming at authors who don't want to do markup,
but still want to offer to their readers a high-powered e-book experience.


> And then you go wanting to use ZML which is not a standard at all
> that nobody can use to convert to something they can actually read.

well, you're simply wrong that "nobody" can use z.m.l. to do conversions.

did you visit the web-pages i pointed to up above?
> http://z-m-l.com/go/vl3.pl
> http://z-m-l.com/go/zmldingus093.pl

you can see conversions in action.

and here's a .pdf conversion i did as a demo that you can look at:
> http://z-m-l.com/oyayr/oya-sunday.pdf
40+ sections and 120+ footnotes in that baby, most with weblinks.

i wanted to retain the same linebreaks (mostly) as the .pdf i was
duplicating, which meant that i had to make the type too small
for you to read it on a sonyreader, but i think you will agree that
i've demonstrated that z.m.l. can convert well to the .pdf format.
(although i'm wide open to any constructive criticism you have.)

i understand and actively support your desire for a range of tools
that support a format. i've been working for a long time to create
a z.m.l. toolchain that arcs the entire workflow of electronic-books,
from the beginning stages of authoring, to web-based publishing,
continuing through conversion, and extending through remixing...

-bowerbird

p.s. sorry for the long message. i guess igorsk is gonna have
a whole lotta whitespace in his browser when he encounters this.

jbenny
11-03-2007, 06:06 PM
Nate the great said:
> @jbenny
>
> Thanks for the assistance. I am currently making a list of all the tags I am planning to use; I removed the icky ones.

Yep, we need to banish all those "icky" tags :)

BTW, the "quirks mode" problem with IE may or may not apply to IE7. I don't use IE7 so I can't test this. Does anyone else know for sure?

bowerbird
11-03-2007, 06:13 PM
jbenny said:
> The disdain I felt was your reaction to my suggestion

well, you imagined that. i had no "disdain" for your suggestion.
it simply indicated you had no grasp of the scope of my intent,
which is much wider than the scope of the tools you mentioned.

and when you followed up the word "disdain" with the ridiculously
untrue charge of "vaporware", you brought about your own trouble.

my tools are in-progress and proof-of-concept, so there are _many_
criticisms that can be leveled against them. but "vaporware"? nope.

-bowerbird

DaleDe
11-03-2007, 06:22 PM
jbenny said:
> Yep, we need to banish all those "icky" tags :)
>
> BTW, the "quirks mode" problem with IE may or may not apply to IE7. I don't use IE7 so I can't test this. Does anyone else know for sure?

IE7 handles this fine. By the way IE7 is my favorite RSS reader for a PC. It can handle XML RSS files fine as well. I am not an IE fan by the way. I use SeaMonkey for a browser.

Dale

bowerbird
11-03-2007, 06:29 PM
kovidgoyal said:
> Since I have a little experience writing converters, I'd just like to say that
> if somebody does write a new improved gutenberg to html converter
> to use a well defined semantic scheme by CSS classes.

um, to repeat, there's a good reason why no one will write an "improved"
gutenberg-to-html converter. it's the same reason that made ron burkey
give up on gutenmark, namely, the inconsistencies riddling p.g. e-texts.

until those inconsistencies are cleaned up, a converter is a pipe-dream...

however, once those inconsistencies _are_ cleaned up, we no longer need
to _convert_ the e-texts to _any_ other format, because their consistency
will mean that viewer-programs can be made to handle their native format.

this presents the existential conundrum of heavy-markup.
until it can be applied _automatically_, its cost is too high.
but once it _can_ be applied automatically, it's unnecessary,
because the very same routines that convert text to xhtml so
that xhtml can be rendered by a display-program can instead
be put into a viewer-app that eliminates the xhtml middleman,
by working directly with the text as its input to create its output.

once you understand this, deeply, markup becomes a bad joke.

we take simple text and turn it into complicated markup, and then
we need a complicated program to handle the complicated markup
and turn it back into simple text that can be displayed. it's just silly.

once i show people markup is unnecessary, they'll laugh at you for doing it.
and i don't say that to _mock_ you; i say it so you can avoid looking stupid...

-bowerbird

JSWolf
11-03-2007, 06:30 PM
But can ZML be read on any currently available portable reading device?

bowerbird
11-03-2007, 06:39 PM
jswolf said:
> But can ZML be read on any currently available portable reading device?

no. you will need to click one button to turn it into .html,
or click another button to turn it into a .pdf, after which
you'll have an html-book or a pdf-book that's customized
to your preferences on a number of different dimensions,
including ones like page-size, margin-size, and font-size.
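
A sketch of what such a one-button conversion could look like, in Python. This is not the z.m.l. converter; the function name `render_html`, its parameters, and the default values are all assumptions chosen to show reader preferences being applied at conversion time rather than baked into the source text.

```python
import html


def render_html(paragraphs, font_size_pt=12, margin_em=2.0, line_width_em=33):
    """Wrap plain-text paragraphs in HTML, styled by the reader's preferences."""
    style = "body { font-size: %dpt; margin: %.1fem auto; max-width: %dem; }" % (
        font_size_pt,
        margin_em,
        line_width_em,
    )
    body = "\n".join("<p>%s</p>" % html.escape(p) for p in paragraphs)
    return "<html><head><style>%s</style></head>\n<body>\n%s\n</body></html>" % (
        style,
        body,
    )


# Same text, two readers with different tastes.
text = ["It was a dark & stormy night."]
print(render_html(text, font_size_pt=10, margin_em=0.5))
print(render_html(text, font_size_pt=16, margin_em=3.0, line_width_em=25))
```

The source paragraphs never change; only the arguments do, so each reader gets a differently shaped book from the same file.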

in the future, if those "portable reading devices" can be
_programmed_, then someone will create a zml-viewer
for them, because otherwise p.g. e-texts are too ugly...

it might even be the case that big companies like adobe
end up handling z.m.l., because they'll become the butt of
jokes if they "fail" to support such a simple, useful format
that has gained widespread use from the general populace.

-bowerbird

bowerbird
11-03-2007, 06:44 PM
in other words, i'm betting on simplicity against complexity...

because i can give my simplicity _vastly_greater_functionality_
while other developers are busy tripping over their complexity...

and i think that's the winning argument here.

but that's not what's important, because "winning arguments"
in forum "discussions" don't carry any weight in the real world.

what matters in the real world is the functionality you give users.

the proof is in the pudding, my friends. the proof is in the pudding.

-bowerbird

JSWolf
11-03-2007, 07:52 PM
Bowerbird can you show us some samples so we'll know exactly what you are on about? I think that would help your case most of all.

bowerbird
11-03-2007, 08:07 PM
jswolf said:
> can you show us some samples

ok, for the _third_ time now, go here:
> http://z-m-l.com/go/vl3.pl

first click the blue underlined link to view the .zml file.
then go back and click the button to convert to .html.

notice that among these documents is one containing
"the 11 rules of z.m.l.". another one is a "test-suite"
that lists structures that are commonly found in books.

***

if you want to try some z.m.l. out yourself, go to this page:
> http://z-m-l.com/go/zmldingus093.pl
you can click any of the buttons to convert a document.
(these documents are the same ones on the page above.)

you can also click "skeleton" to bring in the _skeleton_ of
a document for you to edit. make some changes to convince
yourself it is a live demo, and then click "do it" to convert it.

again, this is all "proof of concept" and "in-progress" and
whatever other label you need to realize it's not done yet.

i'm working on a toolchain that stretches across the workflow,
and it's just starting to cohere as a whole, so i am not focused
on any one particular part now, so some things might not work.

and, again, i'm not trying to "convince" anyone of anything, so
there's no sense "resisting". it doesn't matter if you ain't buyin',
because i ain't sellin'... :+)

nonetheless, constructive criticism will be welcomed.

and as you can tell, i have thick skin, so don't worry about
being "polite". i can handle it.

-bowerbird

tompe
11-03-2007, 08:27 PM
Number of empty lines and space as significant markup seems like a bad idea. Also I do not see how to markup quotes inside quotes, e.g: "Allan's word 'green' is blue".

But I think I get your idea. And I can sympathize with an easily editable format to get people to use it.

bowerbird
11-03-2007, 09:07 PM
tompe said:
> Number of empty lines and space as significant markup seems like a bad idea.

as for the number of empty lines, the zml-authoring-tool does the counting for you,
displaying headers appropriately, so in practice, it actually works out pretty well...

the same with elements that depend upon spaces. plus, as one person pointed out,
python's dependence upon spaces hasn't seemed to have many deleterious effects...
i also have the benefit of the experience other light-markup systems have gained...
markdown depends even more on spaces than z.m.l., and nobody complains about it.
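
For readers wondering what "doing the counting for you" might involve, here is a small Python sketch. It does not implement the actual z.m.l. rules (those are described on the pages linked earlier in the thread); it only reports how many empty lines come before each non-blank line, which is the raw information a tool would need if blank-line counts are significant. The function name `blank_lines_before` is made up for the example.

```python
def blank_lines_before(text: str):
    """Yield (blank_count, line) for each non-blank line, where blank_count
    is the number of empty lines immediately preceding that line."""
    blanks = 0
    for line in text.splitlines():
        if line.strip():
            yield blanks, line
            blanks = 0
        else:
            blanks += 1


sample = "title\n\n\n\nchapter one\n\nfirst paragraph\nstill first paragraph\n"
for count, line in blank_lines_before(sample):
    print(count, line)
```

On the sample, "chapter one" is reported with 3 preceding blank lines and "first paragraph" with 1, so a tool could display the two differently without the author counting anything.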


> Also I do not see how to markup quotes inside quotes,
> e.g: "Allan's word 'green' is blue".

i'm not sure i understand you. you'd do it just like you did it in your example...

-bowerbird

tompe
11-03-2007, 09:26 PM
I didn't realize an authoring tool was required. I like to be able to write things in a text editor and I thought the goal was to make this possible. If you are going to use an authoring tool I do not see the reason for this kind of markup.

You said in the specification that quotes could be replaced to whatever the user wanted and for this you have to be able to distinguish between a quote and constructions like 'em. And it seems impossible to do this with your rules. And you have examples like: "The coordinate was 49 12' 27" N"

bowerbird
11-03-2007, 10:03 PM
tompe said:
> I didn't realize an authoring tool was required.

it's not "required". but it's available if you want.

or just pull your text into it every once in a while,
to make sure it behaves the way you want it to...

here's a screenshot of the authoring tool:
> http://z-m-l.com/go/zml-sandbox01.jpg

also:
> http://www.z-m-l.com/go/rieger/oya-cover.html

as you can see, one side is the text editfield,
and the other is how it will look in the viewer.


> I like to be able to write things in a text editor

me too.


> and I thought the goal was to make this possible.

that's one of the goals, yes.

but that doesn't mean we can't give people a dedicated
authoring-tool too. different strokes, and all that rot...


> If you are going to use an authoring tool
> I do not see the reason for this kind of markup.

there's a lot of utility in wysiwyg. that's why it's popular.

and in terms of _learning_ z.m.l., the authoring-tool is great.
once you've internalized the simple rule-set, you don't need it.
though wysiwyg is still nice. but if you prefer workin' blind, do...


> You said in the specification that quotes could be replaced
> to whatever the user wanted and for this you have to be able
> to distinguish between a quote and constructions like 'em.

aha, i see what you're talking about now -- curling the quotes.
yeah, it takes a little bit of magic in your coding to do it right...

when i release my program, i will enjoy seeing if you can fool it. :+)

(i'm sure you've noticed that microsoft's routines are quite brain-dead.)


> And it seems impossible to do this with your rules.

the impossible just takes a few more processing cycles... ;+)

seriously, when i say "it's done now", just type naturally,
and see if the program figures it out. if not, let me know.

if a human can puzzle it out, my routines should be able to do it too.
(of course, if it's ambiguous even to a human, then all bets are off.)


> And you have examples like: "The coordinate was 49 12' 27" N"

if i need to (and it won't be for a clear example like this, but if i need to...),
i'll fall back to the position that z.m.l. uses utf8, so use that to disambiguate.
magic i can do. but mind-reading is something else entirely...
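
To show that the cases raised above are at least tractable, here is a deliberately simple Python sketch of quote-curling. It is not bowerbird's routine (which is not shown in this thread); the rule order, the short `ELISIONS` list, and the name `curl_quotes` are assumptions for illustration. It treats a straight quote right after a digit as a prime, so a quotation that ends in a digit (such as "in 1999") would fool it.

```python
import re

# Assumed short list of words that commonly start with a dropped letter.
ELISIONS = ("em", "tis", "twas", "til", "cause")


def curl_quotes(text: str) -> str:
    # 1. A straight quote right after a digit is read as a prime (12' 27").
    text = re.sub(r"(?<=\d)'", "\u2032", text)
    text = re.sub(r'(?<=\d)"', "\u2033", text)
    # 2. Apostrophes inside a word (Allan's) or opening a known elision ('em).
    text = re.sub(r"(?<=\w)'(?=\w)", "\u2019", text)
    text = re.sub(r"'(?=(?:%s)\b)" % "|".join(ELISIONS), "\u2019", text)
    # 3. Quotes at the start of the text, or after whitespace or an opening
    #    bracket, are opening quotes; whatever is left must be closing.
    text = re.sub(r'(?<![^\s(\[])"', "\u201c", text)
    text = re.sub(r"(?<![^\s(\[])'", "\u2018", text)
    return text.replace('"', "\u201d").replace("'", "\u2019")


print(curl_quotes("\"Allan's word 'green' is blue\""))
print(curl_quotes("\"The coordinate was 49 12' 27\" N\""))
print(curl_quotes("some of 'em"))
```

Both of tompe's examples come out with the outer quotes curled, the apostrophe in "Allan's" and "'em" treated as an apostrophe, and the minute and second marks turned into primes. Genuinely ambiguous input still needs a human or explicit UTF-8 characters, as noted above.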

-bowerbird

Panurge
11-04-2007, 12:28 AM
> 14. don't put pagenumbers inside the text/paragraphs.

For the casual reader, this may not be an important point, but for someone who publishes scholarly texts, which require documentation, it is. The page numbers of the original text do matter, as does the exact text that lies between them.

I am the director of a library, and ours was one of the first libraries in the country to install an automatic checkout system (in 1971 or so). When we tried to migrate from our IBM punchcards to a more up-to-date system fifteen years later, we discovered that the EBCDIC coding could not be converted to ASCII (not enough computer power), and we had to re-enter every single record by hand. I can understand that no one wants to repeat this kind of conversion every time we move to new hardware and formats, hence the mild controversy over a new proposed encoding standard. But what really matters is this: for scholars who have to show in their footnotes where to locate the authority for the text they cite, a lack of representation of the pagination of the original renders the e-text useless.

Now PG has performed an outstanding service in making available many an obscure and difficult-to-find text, and the use of unadorned ASCII text, the only practical standard usable at the time it was begun, was obvious. One of the benefits of PG is its attempt to check the accuracy of the texts being transcribed. I haven't checked their efforts, but I respect the intention. The Google scanning project is a laudable one, but it is so imperfect (sloppily-executed scanning evident in far too many examples, obviously done hastily and unchecked) that I'm afraid much will have to be redone. It's hard to get it right the first time, and even if one does, the evolution of format and hardware means that there has to be a thoughtful plan for future migration.

At the same time, we who are scholars have to decide whether the original print text-source is what we're going to refer to or the e-text facsimile. If the latter, do we regard it as a new edition or as a faithful representation of the print copy? If we don't account for these needs in our re-encoding now, we'll simply have to redo the e-texts in the future if we expect electronic texts to gain much of a foothold in the world of scholarship and education.

jbenny
11-04-2007, 12:44 AM
Panurge said:
> > 14. don't put pagenumbers inside the text/paragraphs.
>
> For the casual reader, this may not be an important point, but for someone who publishes scholarly texts, which require documentation, it is. The page numbers of the original text do matter, as does the exact text that lies between them.

You bring up a very valid point that most of us don't think of (me included). Can you suggest a way to handle this without having the page numbers in-line with the text? Most of us would find the visible page numbers too obnoxious.

For XHTML markup, one thing that comes to mind (just off the top of my head) would be to enclose all the text that makes up an original page with a surrounding tag that uses the "id" attribute to hold the page number. This would not display, but could be accessed if needed. Also, by using "id", you could construct a special hyperlinked table of pages that would allow you to jump to specific pages in the ebook. I'll have to try this and see how it works.

Using XHTML, this would work with epub and possibly Mobipocket and other formats based on HTML. Anyone have ideas on other ways to do this, either in XHTML or other formats?
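
A rough Python sketch of the "id" idea, under a few assumptions: page labels are simple values such as "12" or "xiv", each page's text is already known, and an empty inline <span> marks where a page begins (so a page that starts mid-sentence does not break the paragraph). The function name `mark_pages` and the "page-" prefix are made up; the prefix is there because an XHTML id has to start with a letter.

```python
import html


def mark_pages(pages):
    """pages: list of (page_label, text) pairs in reading order.

    Returns (body_html, page_table_html): the running text with an invisible
    anchor at the start of each original page, plus a list of links to them.
    """
    body_parts = []
    links = []
    for label, text in pages:
        anchor = "page-%s" % label
        # An empty inline span carries the id and displays nothing.
        body_parts.append('<span id="%s"></span>%s' % (anchor, html.escape(text)))
        links.append('<li><a href="#%s">Page %s</a></li>' % (anchor, label))
    body = "<p>" + " ".join(body_parts) + "</p>"
    table = "<ul>\n" + "\n".join(links) + "\n</ul>"
    return body, table


body, table = mark_pages([("1", "It was a dark"), ("2", "and stormy night.")])
print(body)
print(table)
```

The sentence stays in one paragraph even though the page break falls in the middle of it, and the "table of pages" can jump to either anchor.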

jbenny
11-04-2007, 01:07 AM
The attached Zip file is really an epub. To use it as such, rename it with an epub extension. The forum software won't let me upload it as an epub, even though it is really a zip file. You can either use the epub as is, or just unzip it and view the HTML file in your browser.

The content is totally bogus. I just made it up for this test. I used a <span> tag to mark the beginning few words of each page. Since a physical page is likely to fall mid-sentence, you can't use a block-level tag like <div>. Well, you could, but that would also break a sentence in the ebook, which is not what you want.

As for using a background color on the words that start a physical page, that isn't exactly ideal for ebook reading, either. I just did it to make it easier to see exactly where my imaginary page breaks were. Without some visual clue, you'd have to carefully scan the first few lines to match up the words, after jumping via the "table of pages" that I made at the end.

This is far from an ideal method, but it was the first thing that I tried. Perhaps someone has a better suggestion? The question is how to delimit the page breaks for those who need them without being in-your-face for the average ebook reader. In a web browser, some javascript could make this a lot easier. However, I don't know of any ebook readers that do javascript (not counting PDAs).

bowerbird
11-04-2007, 03:14 AM
panurge, i feel where you're coming from. but let me run through a few thoughts.

so first, point #14 is about the embedding of pagenumbers inside of the text flow.
that's not a good idea, because they're a distraction that just needs to be removed
when we want to copy the text out for remixing. that's why point #14 is there.

my next comment -- which i say because it must be said -- is that it's not our job
to do your job. if the pagenumbers are valuable to you, it's your job to save them.
i'm sorry if that sounds cold, but that's the way it is.

having said that, however, let me move on to my next comment, which is that
i am in 100% agreement with you. even though pagenumbers are _irrelevant_,
in many senses, when we move a book to the digital sphere, i'm convinced that
we still need to retain pagenumber information, simply because so much of our
archival history uses pagenumbers as pointer-information. we cannot afford to
sacrifice that. indeed, i go one step further and argue that we should also be
retaining the _linebreak_information_ from all the paper-books that we digitize.
i won't go into all the arguments here, but in my mind, the answer is now clear.

furthermore, i put my money where my mouth is. in my digitization examples,
i maintain linebreaks and pagebreaks, and put the image-scan up next to the text,
so the end-user can verify the accuracy of my digitization if they want to do that.
i consider this checking by end-users to be the last fine line of the proofing process,
and i want them to feel like a part of the "march to perfection" that the text makes,
because i believe we need to make the public feel like "joint owners" of these books.
"the public domain belongs to _you_, the public, and you have responsibility for them,
so if there are errors here, you need to fill out an error-report so they are corrected."

to see some of my examples, check these out:
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html

you can thumb through these e-books just like they were the p-books,
and verify that the linebreaks and pagebreaks are exactly as they were.
and if you find an error, you can fill out an error-report right on the page.
and once someone has made a report, it's immediately visible to everyone,
even if it might take an administrator a little bit of time to fix the error...

now examine the plain-text versions of the files that created those books above:
> http://z-m-l.com/go/myant/myant.zml
> http://z-m-l.com/go/mabie/mabie.zml
> http://z-m-l.com/go/sgfhb/sgfhb.zml

you'll see how the pagebreak information was recorded in those plain-text files.
i think you'll also see how easily that pagebreak information can be eliminated,
for the situations where an end-user doesn't care about the original pagebreaks.

this is the kind of flexibility we want from our digitization efforts, so each group
gets the information they like, without inconveniencing what another group gets.

what is also useful about this format is that it's extremely close to what we get
_naturally_ when we scan a book, so it's not hard to go from scan output to final.

now, having said all _that_, let me proceed to my final point, which is a variant
on the "don't expect us to do your job for you". and it is _not_ our job to make
"a faithful representation of the print copy". we don't even _want_ to do that --
even if we could -- and we _cannot_, because any time you move a document
from one medium to a completely different one, you're creating a new edition.
whether you mean to do it or not. and like i said, at least from my perspective,
i don't even think twice about things like the correcting of typos. heck, i'll even
rework headers -- or even the _body_ of the text -- if that is what it takes to
make this _digital_version_ a _good_ digital version. i'm a republisher, who is
moving this book into a new medium for a new world in a new century, and
i'm going to do justice to the new. it's simply not my job to snapshot the old.
if you want to see what the old pages looked like, you can look at the scans.

so, anyway, there's some feedback for you to think about... :+)

-bowerbird

jharker
11-04-2007, 01:04 PM
Perhaps I'm missing something, but it seems to me that gutenmark does pretty much everything listed in the first post. In addition, it features output in LaTeX format, which means that with the right style file you can output your book with pretty much whatever formatting options you want.

How do your goals differ from gutenmark? That is, what would your program do that gutenmark doesn't?

bowerbird
11-04-2007, 01:33 PM
my scope is to give people a full toolchain for the entire workflow,
from initial authoring through web-publishing and on into remixing.

for that, you need a good format, and authoring-tools for that format,
and viewer-programs for it, and conversion-routines to other formats.

the goal of "making a typographically beautiful e-book" is simply one
of many issues which can be incorporated into the conversion aspect.

so my scope _includes_ that, but it also goes _far_beyond_ that.

to the extent that gutenmark helps automate the .html conversion of
project gutenberg e-texts _and_ helps the output become _beautiful_,
i respect it, and i respect it greatly.

but i'm doing more than that. so, for my purposes, it's not enough.
and since ron isn't maintaining it anymore, it never will be enough...
not for me, anyway. especially since i have a rather stringent set
of requirements that i expect of any e-book viewer-program i use:
> http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2004&post=2004-01-08,3
review my list, and observe that a web-browser falls laughably short.

if gutenmark is good for _you_ and your purposes, i'm happy for you,
and i have absolutely no desire to upset your applecart of happiness...
or if you prefer to use indesign, or word, or whatever, to make _your_
e-books beautiful, i laud you for bringing some beauty into the world...

-bowerbird

kovidgoyal
11-04-2007, 01:43 PM
bowerbird said:
> um, to repeat, there's a good reason why no one will write an "improved"
> gutenberg-to-html converter. it's the same reason that made ron burkey
> give up on gutenmark, namely, the inconsistencies riddling p.g. e-texts.
>
> until those inconsistencies are cleaned up, a converter is a pipe-dream...
>
> however, once those inconsistencies _are_ cleaned up, we no longer need
> to _convert_ the e-texts to _any_ other format, because their consistency
> will mean that viewer-programs can be made to handle their native format.
>
> this presents the existential conundrum of heavy-markup.
> until it can be applied _automatically_, its cost is too high.
> but once it _can_ be applied automatically, it's unnecessary,
> because the very same routines that convert text to xhtml so
> that xhtml can be rendered by a display-program can instead
> be put into a viewer-app that eliminates the xhtml middleman,
> by working directly with the text as its input to create its output.
>
> once you understand this, deeply, markup becomes a bad joke.
>
> we take simple text and turn it into complicated markup, and then
> we need a complicated program to handle the complicated markup
> and turn it back into simple text that can be displayed. it's just silly.
>
> once i show people markup is unnecessary, they'll laugh at you for doing it.
> and i don't say that to _mock_ you; i say it so you can avoid looking stupid...
>
> -bowerbird


On automatically converting gutenberg e-texts:

There is absolutely no reason why a converter cannot be developed that handles most of the inconsistencies correctly. Your problem seems to be that you aim for perfect conversion of all texts. That's never going to happen. And how does inventing a new lightweight markup language (when there are already tons of them out there) solve anything? The gutenberg etexts are still going to have to be converted to that markup. Any converter written by somebody who knows what he's doing will be designed to represent semantic information internally using an object model; then adding output formats will be trivial.
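
A toy Python sketch of the object-model design, to show why extra output formats become cheap once the parsing is done. Everything here is an assumption made for illustration: the `Block` class, the two block kinds, and especially the "an all-caps chunk is a heading" rule, which is far cruder than anything a real Gutenberg converter would need.

```python
import html
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str  # "heading" or "paragraph"
    text: str


def parse(plain: str) -> List[Block]:
    """Toy parser: chunks are separated by blank lines; all-caps means heading."""
    blocks = []
    for chunk in plain.split("\n\n"):
        text = " ".join(chunk.split())  # rejoin hard-wrapped lines
        if not text:
            continue
        kind = "heading" if text.isupper() else "paragraph"
        blocks.append(Block(kind, text))
    return blocks


def to_html(blocks: List[Block]) -> str:
    tag = {"heading": "h2", "paragraph": "p"}
    return "\n".join(
        "<%s>%s</%s>" % (tag[b.kind], html.escape(b.text), tag[b.kind])
        for b in blocks
    )


def to_text(blocks: List[Block]) -> str:
    return "\n\n".join(b.text for b in blocks)


blocks = parse("CHAPTER I\n\nIt was a dark\nand stormy night.\n")
print(to_html(blocks))
print(to_text(blocks))
```

All the guesswork about the source text lives in `parse`; each output format is a few lines that walk the same list of blocks.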

On using lightweight markup in general:

1. You think of html as "heavy" markup. Not everyone is as limited.

2. I'd have no problem with lightweight markup if all I cared about was simple texts with headings, a few links, and some images. I don't want my documents limited to the very small set of features imposed by lightweight markup.

bowerbird
11-04-2007, 04:24 PM
kovidgoyal said:
> There is absolutely no reason why
> a converter cannot be developed that
> handles most of the inconsistencies correctly.

i agree. in fact, i've developed that converter.


> Your problem seems to be that
> you aim for perfect conversion of all texts.

ok, here's the thing. why "handle" inconsistencies
when you can _remove_inconsistencies_entirely_?

i intend to mount a mirror of the p.g. library which
has all of their inconsistencies removed, so that
no other developers have to deal with that rubbish.

in other words, i'm doing what the "whitewashers"
at project gutenberg should have done all along,
i.e., ensured that their e-texts were _consistent_.


> That's never going to happen.

a perfect converter that handles all inconsistencies
might not happen, but we don't really need _that_.

we need a darn-good converter to clean up _most_
of them, and then we need to be _diligent_ about
finding and correcting inconsistencies that remain...

at the point where you have lots of developers who
are adding value to the library with new features
-- features that will depend on consistent e-texts --
the inconsistencies will reveal themselves naturally.


> And how does inventing a new
> lightweight markup language
> (when there are already tons of them
> out there) solve anything?

well, none of them seemed perfect enough for me.
specifically, they didn't seem "light" enough for me.
i want "zen" markup, maybe even "zero" markup...

even markdown, which is the best of the bunch,
often seems like an "abbreviated" form of markup,
and not the radical departure that i'm looking for...

and that became even more true when i factored in
the types of features that i wanted to be automatic.

for instance, i want the table of contents linked to
the chapter-headings automatically, with no work.
further, i want the chapter-headings linked _back_
to the table of contents, again without _any_ work.
plus, i want to let the users jump from one chapter
to the previous and next chapters, automatically...
even in the middle of a chapter, i want to let them
jump to the beginning of that chapter, and to the
beginning of the _next_ chapter, _automatically_...

i want a link from a footnote referent in the body
to its note in the notes section, automatically, and
i want an auto-backlink from there to the referent.
(and if there are two referents to the same note
-- it happens -- then i want auto-backlinks to both.)

and when there's a pointer-reference in the text,
such as a reference to "chapter 2", then i want for
that pointer-reference to be treated as a hotlink...

likewise, if there's a u.r.l., i want it to be a hotlink.

with the other forms of light-markup, you have to
code in all of those links manually. that's a pain...
avoiding such pain is the purpose of light-markup,
at least as far as i'm concerned. so i built my own.
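
As a rough illustration of the automatic linking described above, here is a minimal Python sketch. It is not bowerbird's code and does not follow the z.m.l. rules, which are not given in this thread. It assumes, purely for illustration, that a chapter heading is any line starting with "chapter" plus a number; the names `build_navigation` and `find_links` are made up for this example.

```python
import re

# Hypothetical convention for this sketch only: a chapter heading is a line
# that begins with "chapter" followed by a number.
HEADING = re.compile(r"^chapter\s+(\d+)\b", re.IGNORECASE)
POINTER = re.compile(r"\bchapter\s+(\d+)\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")


def build_navigation(lines):
    """Return one record per chapter heading, with previous/next chapter numbers."""
    starts = []
    for line_no, line in enumerate(lines):
        match = HEADING.match(line)
        if match:
            starts.append((line_no, int(match.group(1))))
    nav = []
    for pos, (line_no, number) in enumerate(starts):
        nav.append({
            "chapter": number,
            "line": line_no,
            "prev": starts[pos - 1][1] if pos > 0 else None,
            "next": starts[pos + 1][1] if pos + 1 < len(starts) else None,
        })
    return nav


def find_links(lines, nav):
    """Find in-text "chapter N" pointers and bare URLs that could become hotlinks."""
    heading_lines = {entry["line"] for entry in nav}
    known_chapters = {entry["chapter"] for entry in nav}
    links = []
    for line_no, line in enumerate(lines):
        if line_no in heading_lines:
            continue  # the heading itself is a link target, not a pointer
        for match in POINTER.finditer(line):
            if int(match.group(1)) in known_chapters:
                links.append((line_no, "chapter", int(match.group(1))))
        for match in URL.finditer(line):
            # Drop trailing punctuation that belongs to the sentence, not the URL.
            links.append((line_no, "url", match.group(0).rstrip(".,;)")))
    return links
```

The point of the sketch is only that none of these links needs to be written by hand: every one is derived from text that is already in the file.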

plus, i did it as a puzzle, a challenge for my mind.
surely you can understand that? or maybe not...
because i just don't comprehend such questions...


> The gutenberg etexts are still going to have to
> be converted to that markup.

right. that's another reason i built my own version.
because i wanted it to be as close to "native" p.g.
as possible, to minimize the cost of bulk conversion.

as it is, the vast majority of most p.g. e-texts is
"already in" z.m.l. format. the big exception is
the front-matter at the top (e.g., the title-page).


> Any converter written by somebody who knows
> what he's doing will be designed to represent
> semantic information internally using an object
> model, then adding output formats will be trivial.

i don't know what "an object model" is.

and frankly, i don't really care, not in the slightest,
since "adding output formats" is not a big concern.

and evidently i don't even need to know what it is,
because i've been able to do conversions just fine.


> 1. You think of html as "heavy" markup.

actually, i judge html as "medium" markup.
you have to jump to xml/css to be "heavy",
and go to .tei or docbook if you're serious.
but i dunno, maybe you are not "serious"...


> Not everyone is as limited.

nope. just 92% of the population. my user-base,
as i refer to them. i'm content to give up the rest.
heck, i'll be happy with "authors who wanna write,
and not have to waste time doing stupid markup."


> 2. I'd have no problem with lightweight markup
> if all I cared about was simple texts with
> headings, a few links, and some images.

evidently you haven't looked at my test-suite.

i can handle all the features commonly found in
the p.g. e-texts, and indeed in almost all books...

and when i discover a need for new capabilities,
i just invent a way for the format to handle it...
(and that's the _easy_ part. the difficult part is
coding the viewer-program for the new feature.)

and frankly, what i can't handle, i don't need...


> I don't want my documents limited to
> the very small set of features imposed by
> lightweight markup.

well, when you say _that_, you're just betraying
that you don't have a clue about light-markup...

(and, by the way, we do call it "light markup",
not "lightweight markup", because "lightweight"
implies what you are trying to say directly here,
i.e., that it is "limited" in some way, and it's not.)

markdown, for instance, lets you include _any_
(x)html code right in your markdown document
-- it just passes it on through without treating it --
so there's absolutely _nothing_ that you cannot
include, so there is no "very small set of features"
that is being "imposed" on you by the framework.

but even aside from that, the number of things
which cannot be handled within the _standard_
markdown framework is quickly vanishing away.

and if you include the additions to the standard
being implemented by stuff like multimarkdown,
you will find that you encounter no "limitations".

no offense intended, but if you want to criticize
light-markup, you will need to do some homework.

-bowerbird

kovidgoyal
11-04-2007, 05:22 PM
You say that light markup (and you use markdown as an example) can handle anything by including xhtml, which means a viewer app that is designed to view a lightweight markup language will have to parse xhtml anyway to display the file. In which case any viewer app advantage in using lightweight markup is negated. Incidentally, I actually use markdown and have even contributed patches to the python markdown project, so try not to jump straight to the "you don't know what you're talking about" defense. It leaves me with the feeling that you don't have any real points to make.

As for authors not wanting to learn markup: those who are too lazy to learn markup will be too lazy to learn lightweight markup as well. They will demand a WYSIWYG GUI to take care of the markup for them.

You have the attitude that creating a markup language that is just sufficient for all of today's needs is the right approach. You'll then "add more features" as you see the need. But it's not easy to "add features" to a lightweight markup language. Case in point is markdown and how you have to jump to html for any advanced features.

So yes, it is more effort to develop applications for authoring/converting/viewing a "heavy" markup language, but in the end it's worth it. To say that we must limit ourselves to a lightweight language simply because developing applications for a heavy language is too difficult is ridiculous. Let me leave you with the example of TeX: a publishing system that is not lightweight and that has lasted decades.

Lightweight markup is a good fit for gutenberg, but little else. And even there, I suspect they'd have a hard time getting their digitizers to follow the rules. As far as creating modern digital books, there is really no reason to be restricted to a lightweight markup language. And note that I continue to call it lightweight, because that is precisely what it is.

You say you want to maintain a mirror of gutenberg. An excellent idea. If you support export of gutenberg texts to HTML, I might even use it :)

kovidgoyal
11-04-2007, 05:24 PM
And if you've developed a converter, do you mind releasing it to the public, so that we can use it to convert gutenberg texts and see how well it does for ourselves?

jbenny
11-04-2007, 05:57 PM
I don't think anything is accomplished by replying to bowerbird's posts with questions and reasonable arguments. No matter what anyone says, his replies generally discount what anyone else says and accuse them of not knowing what they are talking about. He doesn't seem to be open to discussion or suggestions, but only to promoting his own way of doing things. I'm sure he will reply, denying this (and probably insult me in the process). However, his posts are the best evidence in support of my assertion.

One recent post in particular that illustrates bowerbird's low opinion of everyone who contributes to this forum (which I find full of very useful information): http://www.mobileread.com/forums/showpost.php?p=112355&postcount=56

bowerbird
11-04-2007, 07:15 PM
kovidgoyal said:
> You say that light markup (and you use markdown as an example)
> can handle anything by including xhtml, which means a viewer app
> that is designed to view a lightweight markup language will have to
> parse xhtml anyway to display the file. In which case any viewer app
> advantage in using lightweight markup is negated.

we seem to be talking past each other.

every light-markup system -- with the _exception_ of mine -- is
geared toward creating output formatted for an external viewer...

in some cases it's docbook, or .tei, or latex, but -- most usually --
it's (x)html, and it's aimed squarely at a web-browser as the agent.

so if you make a general statement about light-markup systems,
it will be interpreted with that understanding. if you want to say
they're "limiting", you're saying they're limiting _in_that_sphere_.

except markdown -- runaway market-leader in the genre -- has
_no_ limitations in that regard, since it can contain _any_ (x)html.

if you want to poke accusations at _my_ particular light-markup,
in the form of a claim that it cannot support every (x)html feature,
then you would be absolutely correct. but if that's what you meant,
then you should have said _that_.

and, in case i haven't said it before, or said it directly enough yet,
my particular system is aimed squarely at use for electronic-books.
i will support all the features needed by e-books, but nothing more.
and i'm aiming z.m.l. at _my_ viewer-program, not at a web-browser.

but heck yes, i can pass through (x)html just as good as the next format.
so if someone wants to use z.m.l. to target a web-browser, via the .html
conversion ability, then go ahead and include whatever (x)html you want.

so i'm still seeing absolutely no substance to your point. none at all.
but maybe we're still talking past each other... proceed if you wish...


> Incidentally I actually use markdown and have even contributed patches
> to the python markdown project

so then why did you say what you did, which was highly misleading?
you must have known it bordered on totally false when you said it...


> As for authors not wanting to learn markup. Those that are too lazy
> to learn markup will be too lazy to learn lightweight markup as well.
> They will demand a WYSIWYG GUI to take care of the markup for them.

did you not read in this very thread where i said i'll give them wysiwyg?


> You have the attitude that creating a markup language that is
> just sufficient for all of today's needs is the right approach.

well, as i said above, the needs of _e-books_ in particular, and that's it.


> You'll then "add more features" as you see the need.
> But it's not easy to "add features" to a lightweight markup language.

well, i just disagree with you about the difficulty of adding features.
and since that's _my_ problem and not your problem, we don't need
to go back and forth about it. it's a "difficulty" i'm willing to handle...

but the fact is, i've done a lot of work up-front to make sure that
i was knowledgeable about the features that i would actually _need_.
that's why i devised a test-suite. and i've lived with it for two years,
and i've convinced myself that it's sufficiently complete for the job...

(there _is_ stuff that might not be completely visible on its surface,
but i haven't yet put it in because i want to learn which observer is
smart enough to see the "shortcomings" and draw attention to 'em.)

moreover, i did the work of specifying the features that i demand of
my ideal e-book viewer-app, so i know what my format needs to do:
> http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2004&post=2004-01-08,3

given my preparation on both sides of the equation, i feel i'm covered.
i've also looked at a very large number of paper-books over the years,
so i'm quite confident i'm aware of the sphere of things that's needed.


> Case in point is markdown and how you have to
> jump to html for any advanced features.

i was years into development of z.m.l. before markdown even started.
and i am moving slower than they are, with more advance planning...
that means i can benefit from their experience, and i certainly have...

i also have the advantage that my scope is narrower than their scope,
in that my arena is electronic-books. at the same time, i have concerns
they do not have, including file-format interaction with my viewer-app.
so you really can't generalize from their experience to mine... sorry...

but i would also say you're misconstruing their situation just a little...
it's not my sense they had to "jump to html" for "advanced features".
i think they deliberately put in that option early, to retain simplicity.


> So yes, it is more effort to develop applications for
> authoring/converting/viewing a "heavy" markup language,
> but in the end it's worth it.

hey, i'm glad you feel that way, so _you_ will bear the costs of that,
and _other_people_ -- perhaps even me! -- will accrue the benefits.

likewise, if other people are willing to pay the costs of heavy-markup,
then i have no objection to it. (except maybe a general dislike for cruft;
but, you know, if it gives me a ton of benefits, i can even live with that.)

it's only when _i_ have to pay the price of doing heavy-markup that i balk.

and, you know, i have been waiting for the heavy-markup advocates over
on the p.g. listserve to start marking up the e-texts for over 4 years now,
and they're still just as uncoordinated about the task as they've ever been.
indeed, they seem totally unwilling to do the job themselves, and instead
seem bent on trying to "convince" the p.g. volunteers to do it for them!...

needless to say, the volunteers aren't eager to pick up this complex task.


> To say that we must limit ourselves to a lightweight language
> simply because developing applications for a heavy language
> is too difficult, is ridiculous.

well, then, you know, maybe you should trot over to the p.g. listserves,
or maybe the d.p. forums, because they keep bemoaning the lack of tools
that would help 'em take on the complicated job of doing heavy-markup.

because you make it sound easy...


> Lightweight markup is a good fit for gutenberg, but little else.

well, yes and no. it's gonna be good for project gutenberg e-texts...

if i didn't believe that, i would not have put in several years of work...
and i certainly wouldn't be willing to convert the whole catalog myself,
and spend my time and energy on maintaining an independent mirror.

but i don't believe it'll be "a good fit" for "little else". indeed, i'm viewing
project gutenberg's corpus as mere "proof of concept" for a cyberlibrary
composed of the _tens_of_millions_ of books that google is now scanning.
i don't intend on maintaining _that_ myself, just giving them a good model.


> And even there, I suspect they'd have a hard time getting their digitizers
> to follow the rules.

are you reading my messages here? as i said before, most of the text in
almost all of the project gutenberg e-texts is _already_ in z.m.l. format...
that is, they are already "following the rules"...

there are usually a few inconsistencies in each one, which my routines
can find and fix -- automatically, for the most part -- so i'm satisfied...

now, of course, it would be far better if p.g. tracked down the glitches,
so _their_ versions would be completely consistent as well, but oh well,
at least i know mine will be. and other developers will learn that too...


> As far as creating modern digital books, there is really no reason
> to be restricted to a lightweight markup language. And note that
> I continue to call it lightweight, because that is precisely what it is.

perhaps you misunderstood... i just told you what _we_ call it, and why.
i don't really care what you call it. it doesn't really care what you call it.

and i don't care if you imply it is limited. or even if you say that directly.

as long as it does what _i_ want it to, the things _i_ consider necessary,
i will be happy with it. and i'm certain others will be happy with it too...
especially those authors who don't wanna waste any time doing markup.


> You say you want to maintain a mirror of gutenberg. An excellent idea.
> If you support export of gutenberg texts to HTML, I might even use it

of course i'll support conversion of my files to .html. and of course people
will use it. but they'll quickly learn that conversion is an unnecessary step,
because e-texts in the native z.m.l. format are a better e-book experience,
thanks to the high-powered z.m.l. viewer-program...


> And if you've developed a converter,
> do you mind releasing it to the public,
> so that we can use it to convert gutenberg texts
> and see how well it does for ourselves?

well, yeah, actually i _do_ mind "releasing it to the public".
i have no intention of releasing any source code, thank you.
but it's available for sale, with a price in the 6-figure range...

however, you will receive the _fruits_ of the conversion process
-- in the form of totally consistent e-texts in my z.m.l. format --
when i mount my mirror. but that'll be sometime down the line,
because part of that job involves reformatting of the front-matter.

and -- for the 4th time now -- if you want to "see how well it does",
then just visit the web-page that i gave up at the top of this page:
> http://z-m-l.com/go/vl3.pl
> http://z-m-l.com/go/zmldingus093.pl

the second of those two is a "live" converter, which you can use to
convert a project gutenberg e-text if you like. you'll have to clean it
up a bit first -- so that it's in z.m.l. format -- but then it'll work ok...
with, of course, the caveats i've given all along -- "in-progress", etc.

-bowerbird

bowerbird
11-04-2007, 07:23 PM
jbenny said:
> He doesn't seem to be open to discussion or suggestions,
> but only in promoting his own way of doing things.

what, precisely, is it that you think you have "taught" me?

i've been working on this for many _years_ now, and i know
what my system does. and you've got -- at the very best! --
a sketchy understanding. yet somehow, you think you can
come up with something that i haven't considered? my word...

heavy-markup advocates like yourself have been attacking me
from the very first time i ever uttered a word about this system,
and they've stayed in attack-mode for -- quite literally -- _years_,
and yet you think you've come up with something unique? what?

that's rich. i mean, that's really _rich_...

this is a serious question: what is it you think i've "discounted"?

-bowerbird

kovidgoyal
11-04-2007, 08:56 PM
To re-iterate my points which you haven't answered in your rather rambling response:

1) Light markup has minimal features. If you add more features, your viewer apps will become more complex anyway. That negates your viewer argument. Heavy markup is heavy for a reason: it supports features. A design philosophy that limits features in order to improve program simplicity is the wrong approach in these times of ever-increasing CPU power.

2) If authors use a GUI to generate ebooks, then they don't care about the markup, which then negates your argument for lightweight markup from the perspective of authors.

3) Lightweight markup is suitable for people who digitize books (like p.g.) but not for people who create books, since people who digitize/convert books typically don't care about advanced features, while people who create them do.

Some new points:
1) If you aren't open sourcing your code then goodbye and good luck. All you're doing then is defining a specification. Any 10 year old that spends a week thinking about the requirements for an ebook format could do that.

2) Considering that you are designing a limited specification with closed source authoring/viewing software, support for changes to that format (which will have to be made over time) will be spotty at best.

Finally:

When it comes to designing format converters, the key is the output format.
If you choose an output format that is a superset of all input formats you might consider, it is then possible to use the converter to convert all input formats to a single output format. You do this by using an object model internally in the converter software, with plugins for input formats. And it then becomes easy to output to different formats using the object model.
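
To make the "object model with plugins" idea concrete, here is a minimal Python sketch under stated assumptions. The `Block` and `Document` classes, the `INPUTS` and `OUTPUTS` registries, and the rule that an all-capitals chunk is a heading are all invented for illustration; they do not describe any existing converter.

```python
import textwrap
from dataclasses import dataclass, field
from html import escape


@dataclass
class Block:
    kind: str  # "heading" or "paragraph"
    text: str


@dataclass
class Document:
    blocks: list = field(default_factory=list)


def parse_plain_text(source):
    """Input plugin. Hypothetical rule: chunks are separated by blank lines,
    and a chunk written entirely in capitals is treated as a heading."""
    doc = Document()
    for chunk in source.split("\n\n"):
        text = " ".join(chunk.split())  # undo hard line breaks inside the chunk
        if not text:
            continue
        kind = "heading" if text.isupper() else "paragraph"
        doc.blocks.append(Block(kind, text))
    return doc


def to_html(doc):
    """Output plugin: render the object model as simple HTML."""
    tags = {"heading": "h2", "paragraph": "p"}
    return "\n".join(
        f"<{tags[b.kind]}>{escape(b.text)}</{tags[b.kind]}>" for b in doc.blocks
    )


def to_text(doc, width=60):
    """Output plugin: render the object model as re-wrapped plain text."""
    parts = []
    for b in doc.blocks:
        parts.append(b.text if b.kind == "heading" else textwrap.fill(b.text, width))
    return "\n\n".join(parts)


# Supporting a new format means adding one entry here, not a new converter.
INPUTS = {"txt": parse_plain_text}
OUTPUTS = {"html": to_html, "txt": to_text}


def convert(source, src_format, dst_format):
    return OUTPUTS[dst_format](INPUTS[src_format](source))
```

With N input plugins and M output plugins, this shape needs N + M pieces of code rather than N x M pairwise converters, which is the sense in which adding an output format becomes cheap.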

Starting with an output format that is more limited than possible input formats is simply ass-backwards. As I said before, zml *might* be a good idea for conversion of txt files for p.g. but little else. And without an open-source converter from zml to html it is emphatically not a good idea.

jbenny
11-04-2007, 09:15 PM
bowerbird said:
> jbenny said:
> > He doesn't seem to be open to discussion or suggestions,
> > but only in promoting his own way of doing things.
>
> what, precisely, is it that you think you have "taught" me?
>
> i've been working on this for many _years_ now, and i know
> what my system does. and you've got -- at the very best! --
> a sketchy understanding. yet somehow, you think you can
> come up with something that i haven't considered? my word...
>
> heavy-markup advocates like yourself have been attacking me
> from the very first time i ever uttered a word about this system,
> and they've stayed in attack-mode for -- quite literally -- _years_,
> and yet you think you've come up with something unique? what?
>
> that's rich. i mean, that's really _rich_...
>
> this is a serious question: what is it you think i've "discounted"?
>
> -bowerbird

Another post that not only doesn't address any of the points that he is supposedly responding to, but adds in his own persecution complex induced version of what he thinks was said.

And if you think you can get six figures for your system and code, then why are you over here, hassling everyone who you think knows less than you do? Sounds like it is time for you to take your medication again. Rant on. I for one will be ignoring your ravings from now on and hoping you go away.

bowerbird
11-04-2007, 10:06 PM
kovidgoyal said:
> 1) Light markup has minimal features.

you make that assertion, but you do absolutely nothing to support it.
what features are found in books that are lacking from my test-suite?


> 2) If authors use a GUI to generate ebooks, then they don't care about the markup,
> which then negates your argument for lightweight markup from the perspective of authors.

then it should be the case that i will have no users for my format.
so let's see if that's what happens. if it is, no sweat off your nose.
so why do you care?


> 3) Lightweight markup is suitable for people who digitize books (like p.g.)
> but not for people who create books, since people who digitize/convert books
> typically don't care about advanced features, while people who create them do.

once again, you seem to be awfully concerned about something that
should pose absolutely no threat to you, if what you're saying is true.


> Some new points:
> 1) If you aren't open sourcing your code then goodbye and good luck.
> All you're doing then is defining a specification. Any 10 year old that
> spends a week thinking about the requirements for an ebook format
> could do that.

you can say and think whatever you like. but, you know, it's my time, and
i'm the one who decides how i spend it. and, for the people reading along,
i'm doing much more than "defining a specification". i'm giving you tools
to put that specification to work, turning plain-text files into e-books that
are beautiful _and_ have superior functionality. whether you care or not,
well, that's up to you... i don't expect everyone to care, maybe not anyone.


> 2) Considering that you are designing a limited specification
> with closed source authoring/viewing software,
> support for changes to that format (which will have to be made
> over time) will be spotty at best.

again, if the format and the tools don't prove to be useful, right away and/or
in the long run, i suspect that there won't be a lot of people using it, correct?
so, ya'know, what's the big deal? lots of e-book formats have come and gone.
i've had a good time solving this little challenge. better than doing sudoku...


> When it comes to designing format converters, the key is the output format.
> If you choose an output format that is a superset of all input formats
> you might consider, it is then possible to use the converter to convert
> all input formats to a single output format. You do this by using
> an object model internally in the converter software, with plugins for input formats.
> And it then becomes easy to output to different formats using the object model.

and once again, i don't find that relevant to the work that i have done.
the work i _have_done_, as in _past_tense_, as in _already_completed_.

i'm not telling you that it's not useful for _you_, because it might well be.
but it's not useful for me. so you can make all the posts you want saying
_otherwise_, but you're not going to have an effect. that's all i'm saying...


> Starting with an output format that is more limited than possible input formats
> is simply ass-backwards. As I said before zml *might* be a good idea for
> conversion of txt files for p.g. but little else. And without an opensource
> converter from zml to html it is emphatically not a good idea.

look, i'm not telling anyone that they can't make an open-source converter
from .zml to .html. i've laid out a very simple spec, precisely so they _can_.
and i'll even help them if they run into any problems in attacking the task,
because i have done it, so i know how, and i believe they will find it to be as
simple and straightforward as i found it to be. i'll even give 'em a gold star,
providing they do the job right, in order to avoid confusion about the spec.
heck, if they do the thing correctly, i'll even host their converter on my site...

same goes for a converter to .pdf, or rocketbook, or mobipocket, or whatever.
if they don't, that's fine too, because i will. but anyone certainly _can_ do it...
heck, i'd even host a converter for .epub, just to show i got a generous heart,
if someone writes the silly thing.

and if someone wants to write a viewer-program, i'll help them do that as well.
and if they wanna make it closed-source, i don't care. i don't even care if they
charge people for it, since i'm giving away my viewer-program free of charge,
so if their app is so much better that someone will actually pay 'em for it, fine!
i'll even collect the darn money for them, because if they're making sales, then
it must be because they're doing _something_ right, and i want to reward them.

same goes with anyone else making any other programs that add value to .zml.

so, all in all, i'm not sure why you've got that bug up your butt... :+)

but i can take a good guess. because, like i said, i've gotten flak before...
there's a lot of technoids out there who've spent lots of time and energy
mastering the complexities of heavy-markup, and a simple system that
matches their benefits without imposing their costs is a threat to them...
it's a big threat to their expertise. so they attack me. but i'm very strong.
i have a thick skin, and i've been through it time and time and time again.

i went through _many_years_ of it over on the project gutenberg listserve.
unlike here, however, i didn't show my poker hand to people right away...
i just let them argue the points on a "theoretical" basis -- over and over --
so they ended up wagering all their credibility. over time, very gradually,
i introduced more and more evidence indicating that my system did work,
until now, when it's absolutely clear that they were wrong all along, they've
lost all of their credibility. so don't make the same mistake that they made.
there are plenty of holes in the in-progress proof-of-concept models that
i've made available. if you want to play this game, go and find those holes.
but if you wanna argue this on a "theoretical" basis, you'll lose to my demos.

as i'd tell my antagonists on the p.g. listserve, "the proof is in the pudding".
and i'm starting to dish out pudding. you can match me, or be left behind.

-bowerbird

bowerbird
11-04-2007, 10:09 PM
jbenny said:
> And if you think you can get six figures for your system and code,
> then why are you over here, hassling everyone

i'm not "hassling" anyone. this thread was started as an inquiry into _beauty_...

but you set out to make it ugly. why? don't answer, just go, like you promised.

as for a 6-figure pricetag, that's cheap. mobipocket got _7_figures_ from amazon.

-bowerbird

Nate the great
11-04-2007, 10:14 PM
bowerbird said:
> jbenny said:
> > And if you think you can get six figures for your system and code,
> > then why are you over here, hassling everyone
>
> i'm not "hassling" anyone. this thread was started as an inquiry into _beauty_...
>
> but you set out to make it ugly. why? don't answer, just go, like you promised.
>
> as for a 6-figure pricetag, that's cheap. mobipocket got _7_figures_ from amazon.
>
> -bowerbird

:rofl:

I hope you aren't really comparing yourself to Mobipocket.

ebookie
11-04-2007, 10:21 PM
I'm hesitant to join into this discussion :knife: but since I've been thinking about some of these issues as well I figured I would put in a couple of comments.

First, the difference between semantics and presentation. So HTML (as a DTD of SGML) mixed these two with the notion that you were presenting documents in a browser of variable size. There is some notion of semantics (like H1 is a top level heading) and some notion of presentation (like B is boldface) and not a clear line between them. If the Project Gutenberg (PG) texts could be converted into something that identified just the semantics around the text then one could build formatter/presenters to "present" it on an electronic book.

Bowerbird's attempts are notable in that they attempt to embed semantics into a file as transparently as possible (which is a good goal if you might find yourself reading the file directly), but that feature makes it pretty challenging to screen automatically for errors. (For example, if a bit flip causes the letter 'M' (one bit different from <CR> in ASCII) to appear in one of the 5 lines between headers, what does it do? Does that screw up the presentation?)
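
The bit-flip scenario can be checked directly: ASCII <CR> is 0x0D and 'M' is 0x4D, so flipping bit 6 turns one into the other. The Python sketch below shows the effect on a format that infers structure from blank lines. The rule it uses (a heading is recognized by the number of blank lines before it) is an assumption made for illustration, not the actual z.m.l. rule, and the function names are invented.

```python
def flip_bit(raw, index, bit):
    """Return a copy of `raw` with one bit of one byte inverted."""
    data = bytearray(raw)
    data[index] ^= 1 << bit
    return bytes(data)


def blank_lines_before(raw, marker):
    """Count the consecutive blank lines immediately preceding the line `marker`."""
    lines = [line.rstrip(b"\r") for line in raw.split(b"\n")]
    target = lines.index(marker)
    count = 0
    while target - count - 1 >= 0 and lines[target - count - 1] == b"":
        count += 1
    return count


# Four blank lines separate the end of one chapter from the next heading.
original = b"end of chapter." + b"\r\n" * 5 + b"CHAPTER 2\r\n"

# Byte 19 is the <CR> of the second blank line; flipping bit 6 makes it 'M'.
damaged = flip_bit(original, 19, 6)

print(blank_lines_before(original, b"CHAPTER 2"))
print(blank_lines_before(damaged, b"CHAPTER 2"))
```

The damaged file is still perfectly legal text, so nothing flags the error; the heading simply has fewer blank lines in front of it and may no longer be recognized as one.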

Now there is a standard way to solve this issue: it's by using the stuff between SGML (very complicated) and HTML (very confused), called XML. Not XHTML, but just XML. If the semantics of the book are automatically added into the PG text as XML tag pairs, then three benefits will result:

1) An XML schema checker can validate that the semantics
are valid.
2) An XSLT style sheet can easily, and on the fly, convert the book
to ASCII, PostScript, HTML, Etc.
3) New style sheets can leverage existing annotated books to support
new formats.
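
As a rough sketch of those three points, assuming a tiny made-up vocabulary (`book`, `heading`, `para` are hypothetical tag names, not any real schema): the parser rejects text that is not well formed, a small check stands in for schema validation, and two plain Python functions stand in for style sheets. Python's standard library has no XSLT processor, so this is not XSLT, only the same idea of one annotated source feeding several presentations.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical tag names, invented for this sketch.
SAMPLE = """<book>
  <heading>Chapter I</heading>
  <para>First paragraph of the chapter.</para>
  <para>Second paragraph.</para>
</book>"""

ALLOWED = {"book", "heading", "para"}
HTML_TAGS = {"heading": "h1", "para": "p"}


def unknown_tags(root):
    """Crude stand-in for schema validation: list element names we do not know."""
    return [el.tag for el in root.iter() if el.tag not in ALLOWED]


def to_text(root):
    """One presentation: upper-case headings, indented paragraphs."""
    lines = []
    for el in root:
        text = (el.text or "").strip()
        lines.append(text.upper() if el.tag == "heading" else "    " + text)
    return "\n".join(lines)


def to_html(root):
    """Another presentation built from the same annotated source."""
    parts = []
    for el in root:
        tag = HTML_TAGS[el.tag]
        parts.append(f"<{tag}>{escape((el.text or '').strip())}</{tag}>")
    return "\n".join(parts)


root = ET.fromstring(SAMPLE)  # raises ParseError if the tags are not well formed
if not unknown_tags(root):
    print(to_text(root))
    print(to_html(root))
```

Adding a new output format means writing one more function like `to_html`; the annotated book itself does not change.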

Given the existing support for parsing and processing XML, it would be straightforward (although perhaps not easy) to create a copy editing tool which sucked in a book, added its best guess at what the semantics were (and there is great work to leverage from the ZML work here) and then generated an annotated result. One might hope that all copy editors/proof readers can agree that something "Is a heading" without having to agree on how headings should be presented, or treated in the book presentation.

--Chuck

jbenny
11-04-2007, 10:27 PM
> :rofl:
>
> I hope you aren't really comparing yourself to Mobipocket.

Ah, but he is. Delusions of grandeur. Fits with his other delusions. Along with his repeated avoidance of responding to actual comments, but only responding to his twisted interpretation of what he thinks was said.

kovidgoyal
11-04-2007, 10:31 PM
Sigh, this is a discussion about the merits of light weight markup, not an attack on you or your pet markup system. It's about trying to figure out whether spending time and effort on creating apps that support light weight markup is worth it.

1) Features not supported by light weight markup
- CSS float, boxes with custom borders, boxes with background colors for emphasis. Drop caps. I could go on.

2) I care because I am trying to drill into your thick head that light weight markup is not the best solution for ebooks.

3) Ditto.

1) If your tools are not open source, you're not giving them to people; you're giving people the ability to use them. A subtle but important distinction.

2) Again the point of this discussion is to weigh the merits of light weight markup as a format for ebooks, not to decide whether you've spent your time wisely or not.

3) My concern was writing converters to zml, not from zml. If you want to push zml as an ebook format, then considering that there are currently no ebooks in zml, you'd better worry about writing converters to zml, not from zml.

Goshzilla
11-04-2007, 10:57 PM
There is already a converter for Project Gutenberg texts, it's called GutenMark. It takes the plain txt files and spits them out into html, the paragraphs are formatted correctly and certain things like chapter headings are given a formatted heading to make the text stand out from the paragraph text.

This will pretty much only work well on Gutenberg texts, because that is what the program was originally written for.

Maybe I just don't understand what the original poster was getting at?

jbenny
11-04-2007, 11:07 PM
> There is already a converter for Project Gutenberg texts, it's called GutenMark. It takes the plain txt files and spits them out into html, the paragraphs are formatted correctly and certain things like chapter headings are given a formatted heading to make the text stand out from the paragraph text.
>
> This pretty much will only work well on Gutenberg texts because that was what the program was originally written in mind for.
>
> Maybe I just don't understand what the original poster was getting at?

I already suggested GutenMark. He shot it down as inferior to his method. As for understanding what he is getting at, I don't think anyone can.

Goshzilla
11-04-2007, 11:49 PM
Well, after reading through the five pages, I'm extremely confused. Even more than when I naively suggested GutenMark.

After having to manually make my own ebooks for Palm Reader format, PDF, and Ebookwise, I think GutenMark is probably the best out there for doing that; otherwise it's a lot of time wasted in Microsoft Word removing double lines and replacing them with single ones, etc. etc.
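
For what it's worth, the mass-replace step can be scripted. A minimal sketch in Python, assuming the usual Project Gutenberg plain-text layout where every line is hard-wrapped and paragraphs are separated by one or more blank lines (the function name `unwrap` is made up, and this is not how GutenMark itself works):

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines so each paragraph becomes a single line.

    Assumes paragraphs are separated by one or more blank lines.
    """
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n".join(
        " ".join(line.strip() for line in paragraph.splitlines())
        for paragraph in paragraphs
    )


sample = "This is the first\nhard-wrapped paragraph.\n\n\nThis is the second."
print(unwrap(sample))
```

This is deliberately naive: it will also flatten poetry, tables, and anything else where the line breaks matter, so those parts of a book still need attention by hand or a smarter tool.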

Let me get this right though: the original poster is complaining that Gutenberg texts converted straight into Microsoft Reader format have no formatting? It just seems to me that it's a non-issue, since using GutenMark or doing the mass replace commands in Microsoft Word can allow for a readable Microsoft Reader edition.

There are even open source ebook readers that can do this job with no editing at all including the automatic line replacing and a quick table of contents.

Maybe I got all of this wrong too. This thread is getting more confusing with each post.

bowerbird
11-05-2007, 12:32 AM
goshzilla said:
> There is already a converter for Project Gutenberg texts,
> it's called GutenMark.

yes, gutenmark converts e-texts into .html.
if it serves your needs, that's fine with me...
you can just move along to the next thread.


> Maybe I just don't understand what the original poster was getting at?

i'm the original poster. the original post was
to elicit discussion on the various ways that
people make the ugly p.g. e-texts beautiful...

that's just a small slice of my own total aims,
so gutenmark doesn't do the job i want to do,
but as i've said, i respect both it and its creator.

i've written a lot of words here about my aims,
over and above a simple conversion to .html,
but if it's still unclear to people, then just pass,
because it does _not_ matter to my aims if you
or anyone else understands them at this time...

> Maybe I got all of this wrong too.
> This thread is getting more confusing with each post.

some people thrive on complexity, and
they will introduce it unnecessarily... :+)

-bowerbird

bowerbird
11-05-2007, 12:43 AM
for those who _do_ want to understand my aims, they're simple.

i've created a simple format authors can use to make e-books
which -- when displayed in my corresponding viewer-program --
have high-powered functionality and also render beautifully...

one example of the functionality that's provided _automatically_
is rich navigational links, beginning with the table of contents...

this same format will also be used for project gutenberg e-texts,
and i will convert the entire p.g. library to the format by myself.

one effect of this conversion of the p.g. library into my format
-- which is called "z.m.l.", short for "zen markup language" --
is that automatic conversions to other formats will be enabled,
specifically including .html, .pdf, and .ipod, and probably others.
to the greatest extent possible, these other formats will _also_
have the same high-powered functionality, like those auto-links.

another viewing option includes a web-viewer, now prototyped at:
> http://z-m-l.com/go/babelfish019.pl

-bowerbird

bowerbird
11-05-2007, 12:45 AM
written this morning, which now seems like ages ago...

***

since this conversation has ranged so widely, i'm gonna
fill in a few patch spots (or fill in a few spotty patches?)
before we put this thread to bed. i hope that's ok...

and if anyone has more comments on the original topic
-- things that are being done to beautify p.g. e-texts --
then do please feel quite free to throw them in as well...

-bowerbird

bowerbird
11-05-2007, 12:47 AM
first, a few things i forgot to mention on pagenumbers.

one very important aspect of pagenumber references
is that we need to consider them in our u.r.l. naming,
and the links there must have maximal transparency...

up above, i pointed you to these references:
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html

take the top one, and eliminate the first part, to get:
> myant/myantp001.html

you can see that the first 5 letters are repeated, so
eliminate those as well, and strip off the suffix, for:
> myantp001

in my naming, the first 5 letters reference one book.
in this case, it's "my antonia", the book by willa cather.

the "p001" part of the u.r.l. indicates this is page 1...

and just so you know, this u.r.l.:
> http://z-m-l.com/go/myant/myantp001.html
is based on the page-scan with this name:
> http://z-m-l.com/go/myant/myantp001.png
which, once again, is the page-scan for page 1.

and i rigorously follow this convention throughout.

so this is the u.r.l. for page 123:
> http://z-m-l.com/go/myant/myantp123.html

and it's based on the page-scan with this name:
> http://z-m-l.com/go/myant/myantp123.png

thus, any competent fourth-grader is capable of
figuring out the u.r.l. for _any_ page in this book.

furthermore, this means that when i encounter
some other p-book in the historical archive that
makes references to this edition of "my antonia",
i can relate those references to my e-book easily.

for instance, let's say that a passage runs like this:
> on page 189 and 198, cather ascribes qualities
> to antonia which seem to be inconsistent with
> those which were ascribed on page 15 and 83,
> and are completely contradictory to what cather
> clearly states on page 111. however, this could
> be due to the revelation which antonia has, that
> is described in detail on pages 144 and 157.

so, based on my transparent and consistent naming,
it's a simple exercise to create links for this passage:
> http://z-m-l.com/go/myant/myantp189.html
> http://z-m-l.com/go/myant/myantp198.html
> http://z-m-l.com/go/myant/myantp015.html
> http://z-m-l.com/go/myant/myantp083.html
> http://z-m-l.com/go/myant/myantp111.html
> http://z-m-l.com/go/myant/myantp144.html
> http://z-m-l.com/go/myant/myantp157.html
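
the plug-in-the-numbers step can be sketched in a few lines of python. this assumes only the naming pattern shown above (5-letter book code, "p", page number padded to 3 digits); the function names `page_url` and `links_for` are made up for the illustration, and the regular expression only catches simple phrasings like "page 15 and 83".

```python
import re

BASE = "http://z-m-l.com/go"


def page_url(book: str, page: int, ext: str = "html") -> str:
    """Build the address for one page from the book code and page number."""
    return f"{BASE}/{book}/{book}p{page:03d}.{ext}"


def links_for(passage: str, book: str) -> list[str]:
    """Find 'page N', 'page N and M', 'pages N, M' and return one url per number."""
    urls = []
    pattern = r"\bpages?\s+(\d+(?:\s*(?:,|and)\s*\d+)*)"
    for match in re.finditer(pattern, passage):
        for number in re.findall(r"\d+", match.group(1)):
            urls.append(page_url(book, int(number)))
    return urls


PASSAGE = """on page 189 and 198, cather ascribes qualities
to antonia which seem to be inconsistent with
those which were ascribed on page 15 and 83,
and are completely contradictory to what cather
clearly states on page 111. however, this could
be due to the revelation which antonia has, that
is described in detail on pages 144 and 157."""

for url in links_for(PASSAGE, "myant"):
    print(url)
```

calling `page_url("myant", 123, "png")` gives the matching page-scan address in the same way.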

you would be _astonished_ how many cyberlibraries
have messed up their naming-schemes, such that a
simple plug-in-the-numbers strategy doesn't work.

google gets it kind-of right, but almost everyone else
gets it wrong, wrong, utterly and completely _wrong_.

and because of their confusing naming conventions,
scholars will have to go back and muddle through
_each_and_every_ reference like this, to find out how
the exact link for each one is specified in the e-book.
this is nothing less than sheer and massive stupidity...

-bowerbird

p.s. and, for the record, notice how completely useless
a p.g. e-text -- which was stripped of pagenumbers --
will be for a person who encounters the above passage.

kovidgoyal
11-05-2007, 01:05 AM
@bowerbird
I notice you've once again produced a flood of verbiage and not bothered to answer any of my concrete points.

bowerbird
11-05-2007, 03:04 AM
concrete points? i guess i missed them. at any rate, the proof is in the pudding.
if you're right, my library won't work. so there's no point to any discussion here.

so, as they say, have a nice day... :+)

-bowerbird

bowerbird
11-05-2007, 03:30 AM
gee, it doesn't appear i've posted all the messages
that i've written. nonetheless, i'm sure it will seem
like i didn't address the "concrete points" anyway.

still, i'll send those messages some time.
maybe tomorrow. maybe the day after...
but we did enough back-and-forth today.

-bowerbird

kovidgoyal
11-05-2007, 03:32 AM
It feels nice to win an argument. You do bring out the child in me :-)

bowerbird
11-05-2007, 03:36 AM
i'm glad you feel that you won.

maybe it'll mean you back off...

-bowerbird

kovidgoyal
11-05-2007, 04:26 AM
I actually meant that as an explanation for why I was being so insistent, not a declaration of victory. I'm still looking forward to what you have to say in response to my last post.

Panurge
11-06-2007, 12:48 AM
[For XHTML markup, one thing that comes to mind (just off the top of my head) would be to enclose all the text that makes up an original page with a surrounding tag that uses the "id" attribute to hold the page number. This would not display, but could be accessed if needed. Also, by using "id", you could construct a special hyperlinked table of pages that would allow you to jump to specific pages in the ebook. I'll have to try this and see how it works.]

Some such solution might satisfy everyone. Current scholarly journal databases such as Project Muse give the page numbers in square brackets within the text--an "ugly" solution, I suppose, but a simple one. JSTOR, the dominant archive of scholarly journals, takes a different tack. It uses searchable PDF files and presents a scanned graphic representation of the original journal page, so the pagination problem is not an issue. However, the downloaded PDFs don't look all that great on the Sony Reader, though they are usable.

Sorry to have caught up with the conversation so late; I don't get a chance to log on to the forums every day.

jbenny
11-06-2007, 01:41 PM
> Current scholarly journal databases such as Project Muse give the page numbers in square brackets within the text--an "ugly" solution, I suppose, but a simple one. JSTOR, the dominant archive of scholarly journals takes a different tack. It uses searchable PDF files and presents a scanned graphic representation of the original journal page, so the pagination problem is not an issue. However, the downloaded PDFs don't look all that great on the Sony Reader, though they are usable.


Although neither is ideal, both methods could easily be done in an epub ebook. The first would be very simple, but "ugly" as you say. Including a scanned image of each page (PDF, PNG, JPG, etc.) that is linked from the XHTML text is also possible. This would of course make the epub much larger and more work to construct.

I haven't had the time to think about other ways to do this, but there is probably a good way to do this strictly in XHTML, without having to include scans or put visible page numbers in the text. Perhaps someone else can suggest something?
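
As a starting point, here is a minimal sketch of the "id" idea in Python. It assumes each original page is wrapped in a `div` whose `id` is `page-` plus the page number (XML ids cannot start with a digit, hence the prefix). The id scheme, the file name `book.xhtml`, and the function name `page_table` are all hypothetical, chosen only for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one wrapper element per original printed page.
XHTML = """<div>
  <div id="page-001"><p>Text of the first printed page.</p></div>
  <div id="page-002"><p>Text of the second printed page.</p></div>
</div>"""


def page_table(xhtml: str, filename: str = "book.xhtml") -> str:
    """Build a hyperlinked list of pages from the page-NNN ids in the markup."""
    root = ET.fromstring(xhtml)
    items = []
    for el in root.iter("div"):
        page_id = el.get("id", "")
        if page_id.startswith("page-"):
            number = int(page_id[len("page-"):])
            items.append(f'<li><a href="{filename}#{page_id}">Page {number}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


print(page_table(XHTML))
```

One limitation to keep in mind: a page break that falls in the middle of a paragraph cannot be wrapped this way without splitting the paragraph, so an empty anchor element at the break point may be the more practical variant.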

BTW, this may be a good topic to split out into its own thread.

Edit: Never mind. I'll create a new topic for it myself.

bowerbird
11-06-2007, 01:44 PM
panurge, great to have you back. i was worried that
the temperature in here had driven you away... :+)

at any rate, i wrote another message on pagenumbers,
and will go dig it up to post it shortly...

in the meantime, here is a quick summary of various
projects of mine -- in various states of polish -- which
are available in some form online or by-request...

perhaps this will give people an idea of my scope...

i invite the skeptics to go find the flaws in my work,
and report them in great detail... ;+)

-bowerbird

======================================================
the proof is in the pudding.
======================================================

for the latest version of this pudding sampler at any time, please visit:
> http://z-m-l.com/go/pudding_sampler.html

======================================================
the z.m.l. tool-chain is now starting to cohere across the workflow,
so here's a reminder about the pudding samples available currently.
all of these are in-progress, so constructive criticism is welcomed...
======================================================

babelfish -- prototype web-app viewer-program for z.m.l.
> http://z-m-l.com/go/babelfish19.pl

verylovely -- canned online zml-to-html conversion demo
> http://www.z-m-l.com/go/vl3.pl

zmldingus -- live online zml-to-html conversion app
> http://www.z-m-l.com/go/zmldingus093.pl

"continuous proofreading" mode: various sample books
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/tolbk/tolbkp001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html
> http://z-m-l.com/go/ahmmw/ahmmwp001.html
> http://z-m-l.com/go/goann/goannc001.html

.pdf samples -- sample of the zml-to-pdf conversion process
> http://z-m-l.com/oyayr/oyayr.zml
> http://z-m-l.com/oyayr/oya-sunday.pdf
> http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01.zml
> http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01b.pdf

.html samples -- sample of the zml-to-html conversion process
> http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01.zml
> http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/alice01.html

show_scan-set -- web-viewer modified specifically for viewing otherwise-raw scan-sets
> http://z-m-l.com/go/sss.pl

iphone -- web-viewer modified specifically for the iphone
> http://z-m-l.com/go/babelfishi20.pl

iphone -- reading a scan-set (e.g., page images) on the iphone
> http://z-m-l.com/go/babelfishi20.pl

give -- cross-platform offline viewer-program for z.m.l. (dated now, but...)
> download from the "zml-talk" group at yahoogroups

zandbox -- cross-platform offline z.m.l. authoring-tool
> e-mail me for a copy

banana cream -- cross-platform offline proofreading engine
> e-mail me for a copy

scrape/clean -- cross-platform offline proofreading engine
> e-mail me for a copy

-bowerbird
======================================================
the proof is in the pudding.
======================================================

JSWolf
11-06-2007, 02:05 PM
How is ZML useful to get a ZML marked up text into LRF and PRC formats so we can read them on our 505s and Gen3s/iLiads?

bowerbird
11-06-2007, 02:42 PM
jon, right now, it's not. very shortly, however, the .html conversion will be
solid enough for you to use as the rosetta-stone to leapfrog to other formats.

-bowerbird

bowerbird
11-06-2007, 03:17 PM
jbenny said:
> You bring up a very valid point that most of us don't think of
> (me included). Can you suggest a way to handle this
> without having the page numbers in-line with the text?
> Most of us would find the visible page numbers too obnoxious.
> For XHTML markup, one thing that comes to mind
> (just off the top of my head) would be to enclose
> all the text that makes up an original page with
> a surrounding tag that uses the "id" attribute
> to hold the page number

i admire the initiative that makes you jump in on this
problem that you haven't really thought about before.

a 3.2k lorem ipsum example isn't really needed, though.

many other people _have_ thought about it, for a while,
so a little exploratory research can go a long way here...
as they've already made a pass at providing solutions...

i've described mine -- and will repeat the links here --
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html
> http://z-m-l.com/go/tolbk/tolbkp001.html
> http://z-m-l.com/go/goann/goannp001.html
these demo e-books let you link directly to _one_page_,
where the text is available in easily-copied digital form,
and the page-scan is presented for reference as well...
a comment-form at the bottom lets people report errors,
or even make annotations to the page for others to see...

and again, these are all being done with my .zml format.
you can view the .zml files underlying the above books:
> http://z-m-l.com/go/myant/myant.zml
> http://z-m-l.com/go/mabie/mabie.zml
> http://z-m-l.com/go/sgfhb/sgfhb.zml
> http://z-m-l.com/go/tolbk/tolbk.zml
> http://z-m-l.com/go/goann/goann.zml

so, in spite of the people who would like to convince you
otherwise, here's some pudding as proof that light-markup
is quite capable of generating an e-book that works well...

so that's _my_ particular take on pagenumber retention...

***

i can point to other work too, and i am happy to do so...

i might as well start at the top, with la creme de la creme.

jose menendez has created "digital reprints" which _rock_.

you can download one here:
> http://www.ibiblio.org/ebooks/Einstein/Einstein_Relativity.pdf

that .pdf might _look_ unremarkable, upon first viewing,
but you'll find that the pagenumbers are actually _links_
that will open up the _page-scan_ for that specific page.

originally they opened up the exact page in the scan-set
at google, but it seems google changed their interface,
and now jose's nice links merely go to the first page.
there's a lesson there against depending on other sites...

so, as a more convenient option, you can use my scans.
using the number actually printed on the original page,
plug it into the following u.r.l. template to see the scan:
> http://z-m-l.com/go/einst/einstp001.jpg
in place of the "001", put the page you want. for example:
> http://z-m-l.com/go/einst/einstp089.jpg
will pull up the page-scan for page 89 from the p-book...

if you closely examine any page-scan, you'll observe that
jose's .pdf page is a very accurate replica of that page-scan.
the linebreaks are retained, down to end-line hyphenates.
the leading is almost exactly the same. so are the margins.
jose is an obsessive-compulsive guy; he gets the details right.

here's another digital reprint, this time geronimo's life story:
> http://www.ibiblio.org/ebooks/Geronimo/GerStory.pdf
compare any .pdf page with its scan by using this template:
> http://z-m-l.com/go/geron/geronp001.jpg
(as before, replace "001" with the page-number you want.)
by the way, google's scan-set from this book is the _worst_
job of scanning a book that i have ever seen from them...
it's worth downloading just for its humor as a bad example.

and finally, here's a third from jose, willa cather's "my antonia":
> http://www.ibiblio.org/ebooks/Cather/Antonia/Antonia.pdf
again, you can see the pagescan for any page on my site:
> http://z-m-l.com/go/myant/myantp001.jpg
(as before, replace "001" with the page-number you want.)

for the first two digital reprints, you can step through the
scan-sets more easily using my "show scan-set" viewer:
> http://z-m-l.com/go/sss.pl
"geronimo's story" is the one that comes up by default,
but you can choose the einstein book or the cather book
with the book-selection menu you will find on the page...
(and "my antonia" was also listed above in my examples.)

the quality of each of jose's "digital reprints", as a reprint,
is fantastic. you immediately see the pages are immensely
cleaner than the scans of those old library books, some of
which were subjected to careless markings by borrowers
who evidently were never taught to respect library books.
(then again, i guess that, over the course of 100 years,
there's gonna be _one_ borrower who simply _forgets_
that this was a library book, and not one of his own books.)

jose's tremendous quality gets _more_ remarkable as we
realize the digital reprint -- as opposed to the scan-set --
is _digital_text_, and thus can be _searched_ and _copied_,
meaning that it's infinitely more flexible than the scan-set.

and this all becomes truly mind-boggling when you further
realize the .pdf is 10-30 times _smaller_ than the scan-set,
which means it will run faster and use far fewer resources...

and yes, it takes some work to convert a scan-set into
digital text -- o.c.r. and proofing and formatting -- but
considering the huge benefits that result, it's worth it.

this, truly, is the direction our digital library should follow...

store a copy of the scans online, so people can refer to 'em,
to confirm for themselves that the digitization was accurate.
but give them, for their actual use, a file that's _digital_text_
-- for maximal convenience in our 21st-century cyberspace --
yet is capable of _replicating_ the original p-book _exactly_,
for the scholar-valued touchstone with previous centuries...

(that doesn't mean we have to _leave_ it in that form; we can
always remix it to our customization if we want to, since that
_remixing_ is part of the magic of a _digital_text_... but still,
we know if we want to replicate the p-book exactly, we can.
and there are times when we _do_ want exact replications...
it makes it much easier to know we're all on the same page.
sorry, but i can't ever resist throwing in that good old cliche.)

indeed, the biggest thing wrong with jose's digital reprints
is the reliance on .pdf, which is the "roach motel" of formats.
(that is, documents can go in, but they cannot come out...)

another problem is that jose builds his files using ms-word,
and doesn't make that original file available for us to remix.

in spite of these faults, though, jose's work is outstanding...

(and, just to connect the dots for you, my z.m.l. work is
designed to give the benefits while overcoming the faults.)

***

there's been other work done on retaining pagenumbers too.

here's yet another version of our good old standby, "my antonia",
which uses an x.m.l. approach to store pagenumber information:
> http://www.openreader.org/myantonia/basic-design/myantonia.html

by the way, this is the strategy that led me to make point #14
about not putting pagenumbers in-line inside the body-text...

but, on the _positive_ side, note that this document also
allows a person to click out to view each scan for reference.

also of interest, although i'd hope this degree of markup
becomes unnecessary in the future, with better browsers,
observe that each paragraph has its own "i.d." reference,
thus allowing a link to be made to a specific _paragraph_...

(should we next expect an i.d. reference on every _word_?)

***

and last but not least, because they've actually done _the_most_
work on retaining pagenumber information, you need to look at
the .html versions of the books _distributed_proofreaders_ does
for project gutenberg. over the course of the last couple of years,
most of the postprocessors there have moved to the position that
they believe pagenumbers _should_ be both saved and displayed,
so nearly all of the .html versions posted to p.g. lately have them...

unfortunately, the p.g. version of "my antonia" does not have an
.html version -- sad, the absence of automatic conversion, eh?,
perhaps someone could use gutenmark to make one for them --
so we can't compare their version of it straight across the board...

so let's take p.g. e-text #22222, as a demo, to pick a fun number:
> http://www.gutenberg.org/files/22222/22222-h/22222-h.htm

you'll see that, yes indeed, they've retained the pagenumber info.
and, unlike the x.m.l. example above, they have used their c.s.s.
to move the pagenumber out into the margin, and turned it gray,
so it's less conspicuous and distracting. so those are good moves.

moreover, if you really want a very good idea of exactly where the
pagebreak occurred, you can drag your cursor across the line and
observe exactly where in the line the pagenumber gets highlighted.
for example, if you scroll down to page 20, and do this little trick,
you'll find the pagebreak occurs between "practitioners" and "is".

(you could "view source" if you want, of course, but that's clumsy.)

what that _means_ is that -- in spite of where it is being displayed --
the pagenumber actually exists in-line, right in the body of the text.

unfortunately, what _that_ means is that, when you _copy_ the text,
the pagenumbers are mixed in, which we already said is a bad thing.

for instance, if you copy out the text around pagebreak 20, you get:
> and although applied to all graduate medical practitioners [20]is,
> in all other realms of learning, a degree awarded for graduate work
eewh! see that pagenumber in the middle? that's not what we want!

however, the problem isn't limited to a hassle when doing remixing.
these pagenumbers intermingled in the actual body-text can _also_
cause problems when the end-user performs a _search_ on the text.

so, for instance, if you do a search for "practitioners is", you will _not_
get a hit on that sentence that straddles page 20, because there is a
pagenumber between those two words.

(ironically, if you search for "practitioners [20]is", you _do_ get a hit;
but of course if you knew that that text is at pagebreak 20, then you
didn't need to search for it, did you? you'd just go right to page 20.)
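
one possible workaround, sketched in python: strip the bracketed pagenumbers from the copied text before searching or remixing it. this assumes the markers always look like "[20]" (digits in square brackets), which is what the copied text above shows; the name `strip_page_markers` is made up for the illustration.

```python
import re

# assumption: an in-line pagenumber is digits inside square brackets, e.g. "[20]"
PAGE_MARKER = re.compile(r"\[\d+\]")


def strip_page_markers(text: str) -> str:
    """Remove in-line pagenumber markers such as [20] from copied text."""
    return PAGE_MARKER.sub("", text)


copied = (
    "and although applied to all graduate medical practitioners [20]is,\n"
    "in all other realms of learning, a degree awarded for graduate work"
)
clean = strip_page_markers(copied)

print("practitioners is" in copied)  # search on the text as copied
print("practitioners is" in clean)   # search after stripping the markers
```

the catch is that the same pattern would also eat a genuine bracketed number, such as a footnote reference, so it is a patch on the symptom rather than a fix for keeping the pagenumbers in-line.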

i googled to see if a search on "practitioners is" would
bring up the .html version of e-text #22222. it didn't.
but more experimentation revealed that i couldn't do
_anything_ to fetch the .html version. the .txt version
came up just fine. but no search would find the .html...
so that's a mystery to me...

these twin usability problems aren't _showstoppers_, but they _are_
"glitches" that should be cleared up, if someone has an idea _how_...
if you are that someone, hustle over to d.p. and help them out, ok?

***

at any rate, here we have some ways to give scholars pagenumbers...

if you have any feedback on any of these systems, i'd love to hear it...

-bowerbird

bowerbird
11-06-2007, 03:28 PM
in that x.m.l.-based version of "my antonia" i discussed above,
i forgot to provide an example of a link direct to a paragraph.

here's one:
> http://www.openreader.org/myantonia/basic-design/myantonia.html#p0251
you should read the paragraph directly after that one as well...

-bowerbird

kovidgoyal
11-06-2007, 03:32 PM
bowerbird said:
> so, in spite of the people who would like to convince you
> otherwise, here's some pudding as proof that light-markup
> is quite capable of generating an e-book that works well...



Nobody says that lightweight markup cannot generate *an* ebook that "works well". The question is whether lightweight markup is suitable for *all* ebooks. A question you have still failed to address.

bowerbird
11-06-2007, 03:40 PM
i expect to handle 99% of the books in the p.g. library.

and handle them well. indeed, i expect my viewer-app
will give performance that is surpassed by _no_ others,
and which is _far_superior_ to most... of course, i also
hope those other viewers improve, to the point where
they are no longer surpassed by my app, or any other.
the world of e-books only suffers when viewers are bad...

-bowerbird

bowerbird
11-06-2007, 03:44 PM
kovidgoyal, i have substantial replies to your previous posts,
which i would like to post, but i don't want to _monopolize_
the conversation here. i'd like to give other people a chance.
when two people overtake a thread, it can get boring fast...

so if you resist the urge to address every point right away,
it would be good. i promise you'll have lots of chances later.

-bowerbird

kovidgoyal
11-06-2007, 03:55 PM
> but i don't want to _monopolize_
> the conversation here.

Are you sure? ;-)

bowerbird
11-08-2007, 04:31 AM
kovidgoyal said:
> 2) I care because I am trying to drill into your thick head
> that light weight markup is not the best solution for ebooks.

first off, let me say that i really enjoy language like "drill into your thick head".

that kind of cartoonish imagery is a tip-off -- to me anyway -- that there is
a good sense of humor that's operative in this discussion. which is important.

although some people might tend to interpret stuff like that _seriously_,
that's a mistake. way back, i'm glad i learned the 3 rules of cyberspace:
> 1. don't offend people.
> 2. don't _be_ offended.
> 3. of the first two rules, the second is much more important.

i _could_ become offended instead, but what would that net me?
just some high blood pressure, and a pissy attitude towards life.
who needs that? i would _much_ rather stay cool as a cucumber.
especially since, when you get emotional, you say stupid stuff,
because you're not thinking straight, and you lose the argument.

and the _best_ part of all of this is that, even if kovidgoyal _was_
trying to dish out an insult that he wanted me to take personally,
i've beaten him at his own little game by deftly sidestepping it...

there's a lesson in there for all of you...

-bowerbird

bowerbird
11-08-2007, 04:38 AM
i held up one response i'd written to kovidgoyal, to let others
have a chance in this conversation, without realizing that i'd
included a few responses to other people in that same post,
so here they are, separated out now...

***

nate said:
> I hope you aren't really comparing yourself to Mobipocket.

well, i'm not _french_, so i don't have the cool _accent_...

plus, in case you didn't notice, there _is_ a small difference
between _6_ figures (which is my minimum asking price)
and _7_ figures (which mobipocket was actually sold for)...

but if there's something you think mobi programmers did
that another programmer cannot do, well, then what is it?

(you can forget the d.r.m., because i'm religiously opposed.)

besides, the absence of a mobi mac version means i'm unimpressed.

***

chuck said:
> If the semantics of the book are automatically added into
> the PG text as XML tag pairs then three benefits will result:
> 1) An XML schema checker can validate that the semantics are valid.
> 2) An XSLT style sheet can easily, and on the fly, convert the book
> to ASCII, PostScript, HTML, Etc.
> 3) New style sheets can leverage existing annotated books to
> support new formats.

these are pretty much the standard arguments for x.m.l. markup...

and yeah, the main problem is step #3, in that there aren't any
"existing annotated books", and no easy way to create them...


> Given the existing support for parsing and processing XML

if you are aware of tools that actually provide "existing support",
you might want to go over to the distributed proofreaders forums
and let them know how they can use them, because they want 'em.


> it would be straightforward (although perhaps not easy), to
> create a copy editing tool which sucked in a book, added its
> best guess at what the semantics were (and there is great work
> to leverage from the ZML work here) and then generate
> an annotated result.

and this is the main reason why i'm _not_ turning my source code loose.
i don't want people using my hard-won routines to create x.m.l. markup.


> One might hope that all copy editors/proof readers can agree
> that something "Is a heading" without having to agree on how
> headings should be presented, or treated in the book presentation.

there's little disagreement on what the structures are in a given book.
human readers have figured that out fine for a few hundred years now,
thanks to the expertise of our whip-smart typographers along the way.

the difficulty is in programming this "intelligence" into a conversion tool.

the first tactic my antagonists over on the p.g. listserves tried to use was
that this required "artificial intelligence" that was too complex to program.

what i told them, and what i'll tell you as well, is that it's not all that hard...
you just have to work at it, and work at it some more, and then even more.

but it _can_ be done. i did it. and, you know, i'm not even mobipocket...

of course, you _could_ always just wait until i've mounted my mirror...

because then you'll be able to take advantage of the z.m.l. labeling of
every structure in every book, and use it to apply your heavy-markup.

oh, but then your heavy-markup won't be able to do anything more
than my light-markup, it'll just be more complicated to maintain and
more complicated for developers to add value. but, you know, you'll
have the library in the heavy-markup state you prefer, which is nice,
for you, i guess... maybe it'll help you sleep better at night... :+)

but chuck, thank you, sincerely, for staying constructive in your post.

-bowerbird

kovidgoyal
11-08-2007, 05:29 AM
bowerbird said:
> kovidgoyal said:
> > 2) I care because I am trying to drill into your thick head
> > that light weight markup is not the best solution for ebooks.
>
> first off, let me say that i really enjoy language like "drill into your thick head".
>
> that kind of cartoonish imagery is a tip-off -- to me anyway -- that there is
> a good sense of humor that's operative in this discussion. which is important.
>
> although some people might tend to interpret stuff like that _seriously_,
> that's a mistake. way back, i'm glad i learned the 3 rules of cyberspace:
> > 1. don't offend people.
> > 2. don't _be_ offended.
> > 3. of the first two rules, the second is much more important.
>
> i _could_ become offended instead, but what would that net me?
> just some high blood pressure, and a pissy attitude towards life.
> who needs that? i would _much_ rather stay cool as a cucumber.
> especially since, when you get emotional, you say stupid stuff,
> because you're not thinking straight, and you lose the argument.
>
> and the _best_ part of all of this is that, even if kovidgoyal _was_
> trying to dish out an insult that he wanted me to take personally,
> i've beaten him at his own little game by deftly sidestepping it...
>
> there's a lesson in there for all of you...
>
> -bowerbird

Sigh, are you ever going to actually answer any of my points, or just keep producing more meaningless verbiage?

Robert Marquard
11-08-2007, 10:49 AM
kovidgoyal said:
> Sigh, are you ever going to actually answer any of my points, or just keep producing more meaningless verbiage?

No. He has done that on the gutvol-d mailing list for years.

bowerbird
11-08-2007, 01:25 PM
i've said it's coming, so just be patient, ok?

as was remarked on the thread split from here,
some people "stopped reading" because of the
"direction" in which this thread has progressed.

that's what happens when a thread devolves...

is it an effect you _want_ to make happen?
(like robert here?)

let the conversation breathe a little bit, and
let others have their chance to make posts.

-bowerbird

RWood
11-08-2007, 01:50 PM
I did a little research on the web and found some posts of bowerbird from 3 to 4 years ago where he said that he was almost there and said to just be patient. The same words he uses today.

He has submitted a book of his sayings to Gutenberg and it seems that many of the phrases and constructions he posts here are nearly the same as what he put in the book.

In short, this one-trick pony keeps doing the same thing over and over again.

bowerbird
11-08-2007, 02:16 PM
rwood said:
> He has submitted a book of his sayings to Gutenberg

um, no.

those were collected by one of my "fans" there.

it does give me a good chuckle to go read 'em
every so often. why not give people the u.r.l.?

i strung along the p.g. listserve for a long time
because people there were willing to risk betting
their credibility arguing against my light-markup
as long as it was "theoretical". so i was happy to
let them raise the stakes in our little poker game.

because i knew that once i started laying out the
actual _evidence_ sitting there on my hard-drive
-- "showing my hand", as it were -- i would _win_
the entire pot. and indeed, now my "critics" there
have nothing left to bet, because they've lost _all_
credibility they had, such that now when i deliver
regular updates on how my work is progressing,
there's no longer a peep of an argument from 'em.

so i suggest that you not lose _your_ credibility too.

i've already laid out plenty of evidence of my work:
> http://z-m-l.com/go/pudding_sampler.html
> http://www.mobileread.com/forums/showpost.php?p=112923&postcount=83

if you have criticism of it -- constructive or not --
i will love to hear it. but don't make the mistake
of thinking you can beat my hand until you check
the cards that i've already laid out on the table...

-bowerbird

bowerbird
11-08-2007, 02:17 PM
but hey, i love it when people have so much _passion_ about
what i'm saying that they start "researching" me on the web...

-bowerbird

bowerbird
11-08-2007, 03:44 PM
kovidgoyal said:
> Sigh this is a discussion about the merits of light weight markup,

no, it was a discussion about how to make p.g. e-texts beautiful.
but it got re-routed into something else, so i went with the flow...

but a discussion of light-markup in general is too far off the mark.
i'll respond to your points, because you've been so very impatient,
and insistent, but it's time to draw some lines to bound the topic...


> It's about trying to figure out whether spending time and effort
> on creating apps that support light weight markup is worth it.

it is? that seems kind of silly to me. no, _very_ silly.

you think it's _not_ worth time, so you _won't_ do it.

and i think it _is_ worth time, so i _will_ do it. so there.

everyone's happy. end of discussion. everyone's happy.

and down the line, i'll have a library in light-markup format,
and the world-at-large will then decide if that is worthwhile.

i suggest that you prepare a heavy-markup library to
compete with mine, because i'd hate to win by default.


> 1) Features not supported by light weight markup

are you trying to tell me what my system can and cannot do?
because it always makes me laugh when someone does that.


> 2) I care because I am trying to drill into your thick head
> that light weight markup is not the best solution for ebooks.

and i'm trying to share with people that i've found that it _is_.

and that's all i want to do, to _tell_ them. to share information.

because i'm way past the "discussion" stage on this little topic.
if you wanted to take part in that, you should have been on the
project gutenberg listserves for the last 4 years. because _now_
i'm at the "proof is in the pudding, and here's my pudding" stage.

and i don't particularly care if the message penetrates through
"your thick head" or not. it won't be decided here... or by us...
it'll be decided by the real people who actually use my library and
either (1) like it and continue to use it, or (2) don't like it and stop.
so your general opinion on the value of light-markup means nothing.
as does mine. this issue will be decided by real users in the real world.


> 1) If your tools are not open source you're not giving them to people you're
> giving people the ability to use them. A subtle, but important distinction.

that's exactly right. i'm giving them the ability to use the compiled apps,
and i'm not giving them the source-code. and that's exactly what i intend.

if you want the source-code to programs that do what mine do, write it...

i won't give you fish. i will teach you how to fish. but i won't give you fish.
and i couldn't care less if that bothers you or not. might even hope it does.


> 2) Again the point of this discussion is
> to weigh the merits of light weight markup
> as a format for ebooks, not to decide
> whether you've spent your time wisely or not.

no, the thread was created to talk about the various ways that people
bring typographical beauty to the ugly e-texts from project gutenberg.

i shared a list that i had made, and invited other people to add to it...

if you want to start a discussion about the merits of light markup,
go and start _that_ thread. but, like i said, i'm past that talk stage...
i'm creating pudding, and giving people samples so they can taste it.

but, please, if you have any questions about what z.m.l. can handle
-- any structure that is typically found in books, even only rarely --
then do feel free to ask me about it, and i'll tell you how i'd work it...

a hypothetical discussion of the general merits, though? no thanks.
i'm sure you know -- as a coder -- that after chewing on something
well enough to explain it in the detail required by a compiler, there is
something terribly unsatisfying about vague and general handwaving.

if you want to show me books from the project gutenberg library
you think i can't digitize, fine, bring 'em on. (they exist. about 1%.
i'm leaving those to the heavy-markup crowd.) but if you want to
throw out a claim that there are _many_ that i can't do, who cares?
i'm gonna prove you wrong with the pudding of the 99.2% i can do.


> 3) My concern was writing converters to zml not from zml.
> If you want to push zml as an ebook format, considering that
> there are currently no ebooks in zml you'd better worry about
> writing converters to zml not from zml.

maybe you didn't hear me say i will convert the p.g. library myself.
there will be approximately 15,000 books in z.m.l. format soon...

i've also created post-o.c.r. clean-up programs geared toward z.m.l.,
which people can use to turn google's scan-sets into nice zml-books.

and once authors realize how easy it is to make a kick-ass e-book
with z.m.l., the number of _new_ books in the format will explode.

so, while i'm certainly touched by your "concern" about writing
converters to z.m.l., i'd suggest to you that it's misplaced, and
perhaps you could find a more appropriate cause to care about.

-bowerbird

kovidgoyal
11-08-2007, 04:22 PM
@bowerbird
IOW you cannot respond to my concerns. Good bye and good luck.

bowerbird
11-08-2007, 04:28 PM
kovidgoyal said:
> CSS float, boxes with custom borders, boxes with
> background colors for emphasis. Drop caps. I could go on.

you know, i didn't even respond to this initially.
but i get the feeling that you think these points
are doing some kind of damage to my argument.
(otherwise, why _have_ you been so insistent that
i haven't responded to your points. i don't get it.)

the reason i didn't respond is because i hate
to rain on your parade with the big n-slash-a,
but "not applicable" is the only honest answer.

the big tip-off is that you went to the c.s.s. pile.

every one of your points here is _presentational_.

float? custom borders/colors? even _drop_caps_?

presentational. and _unimportant_ presentational.
nothing more than doodads, and almost trivially so.

i'm concerned with the _structural_ aspects of books.
so that's what my system puts into the file-format...

the _structures_ of a book are things like headers,
and whether a blob of text is a table, or a poem,
or a block-quote, or an epigram, or a dedication,
things like that. what it _is_. not what it looks like.

(some people call these "semantic" entities, but
that's a slight misuse of the term, in my opinion.
a chapter heading doesn't _mean_ anything --
which is how the word "semantic" is defined --
it simply _is_. so i use the term "structural"...
but where i _do_ agree with those other people
is that _presentational_ aspects are arbitrary,
and therefore do not need to be hard-coded.
i do not buy into their emphatic religion that
the "semantic" and presentational _must_ be
separated to the point of complete exclusion,
but agree any specific presentation is arbitrary.)

your issues here are completely presentational...

so, no, there's no way to code them in a .zml file.

further, that's because presentational options are
under the control of the _reader_, not the _author_.

that is, the author doesn't get to declare drop-caps.
or the coloring of boxes, or the corners on boxes, or
any presentational stuff like that. sorry about that.
(ok, not really. because, truth be told, my authors
won't even want to be bothered with stuff like that,
or they wouldn't start using z.m.l. in the first place.)

if the reader wants drop-caps, and the viewer-app
gives that choice to the reader, then it's the reader
who'll specify that choice, and have it displayed so.

and yes, even though drop-caps are a _doodad_,
i probably _will_ make 'em optional for the reader.
(they are not in any of my viewer-apps up to now,
and they are not high on the priority list of to-do's,
but they ain't on the bottom either, mostly because
they're gonna be really simple to write the code for.)

but as for custom-boxes and custom-backgrounds,
those _are_ indeed on the bottom of my priority list.

personally, i'm quite delighted by the sunken look of
quoted passages on many forum boards, like this one, and
i guess it would be very easy to program, but i'll still
hold off on it, because it seems to be complete fluff,
and i do not want to give the impression that i have
stooped to the low level of coding the complete fluff.

anyway, so yeah, if you have anything _structural_
that you think my z.m.l. cannot do, please say so...

but presentational doo-dads, i have no time for...
(but sure, i'll put them somewhere on the to-do list
if an honest-to-goodness z.m.l. user requests them.)

-bowerbird

bowerbird
11-08-2007, 04:35 PM
kovidgoyal said:
> IOW you cannot respond to my concerns.

once again, you're just a little bit too impatient...


> Good bye and good luck.

that might be the best for everyone concerned... :+)

and i'm sure you'll take away the impression that
i "failed to respond" to your points. so be it...

the fact of the matter is i can't _find_ any points
that you've made that merit much response. sorry.

so if anyone else out there can see such points,
do please draw my attention to them. thanks...

because i'm quite confident i can answer them all.
after all, i was subjected to _years_ of questioning
over on the project gutenberg listserves, until now,
these days, i'm the only one left standing over there.

as i put it in a recent recounting of the history there,
in the early days, i was surrounded by a pack of dogs
that would bark and bark and bark at every post i made.

but lately, like in that sherlock holmes story where the
fact that the dog _didn't_ bark at the murderer became
the tip that led sherlock to solve the crime, the dogs
over on the p.g. listserve don't bark at me any more...

-bowerbird

kovidgoyal
11-08-2007, 04:37 PM
To summarize your response, zml cannot support setting presentational aspects. And you don't intend it to, ever. And since you aren't open sourcing it, there's no chance it will pick up features like that in the future.

HTML+CSS can support both structural and presentational aspects. They give the author more control and more freedom. As such, IMO they are a much better match for a *general* ebook format.

Look at it this way:

zml is forcing restrictions on authors. Indeed, your whole attitude is that authors don't know what's good for them and you're going to tell them.

HTML+CSS encourages authors to represent things semantically, but if they really want to add presentational aspects, it allows them to do so.

To me, the second approach is simply superior. If you want to encourage authors/digitizers to use only structural markup, a better approach would have been to write an authoring tool that supports only structural elements via the GUI, and allows authors to "edit the source" for advanced features. Something like what LyX does for LaTeX.

DaleDe
11-08-2007, 04:58 PM
Well, to get back to the original theme: I just finished reading a gutenberg book that was actually in fairly good shape. But even so, it had some annoying problems still in it after I had gone through and beautified it once.

These included punctuation without spaces: two sentences run together with a period and no space after the period. Spelling checkers are a great tool for finding problems in scanned books, but some of them won't find this, since they have been taught (programmed) to ignore words of this kind because they might be filenames.

The second problem was paragraph splits where they didn't belong. The sentence was not over and the new paragraph started with a small letter. It should not have been a paragraph split.

Hopefully a program could detect this sort of thing.

Dale
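
As a rough illustration of the second problem Dale describes, here is a minimal Python sketch that flags paragraph breaks that look accidental: the new paragraph starts with a lowercase letter and the previous one ends without closing punctuation. It assumes paragraphs are separated by blank lines, as in PG plain-text files. The function name and the set of "terminal" characters are illustrative choices, not taken from any existing tool, and verse or lists would produce false positives, so a human still has to review the flags.

```python
import re

# Characters that plausibly end a finished paragraph (an assumption; tune per text).
TERMINAL = ".!?:;\"')]"

def suspicious_paragraph_breaks(text):
    """Return (paragraph_index, tail_of_previous, head_of_current) for each suspect break."""
    # Split on blank lines and drop empty chunks.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    flagged = []
    for i in range(1, len(paragraphs)):
        prev, cur = paragraphs[i - 1], paragraphs[i]
        starts_lower = cur[0].islower()
        prev_unfinished = prev[-1] not in TERMINAL
        if starts_lower and prev_unfinished:
            flagged.append((i, prev[-30:], cur[:30]))
    return flagged
```

Each flagged entry shows the end of one paragraph and the start of the next, so the two can be checked and rejoined by hand.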

jbenny
11-08-2007, 05:12 PM
DaleDe said:
> Well to get back to the original theme I just finished reading a gutenberg book that was actually in fairly good shape. But even so it had some annoying problems still in it after I have gone through and beautified it once.
>
> These included: punctuation without spaces. two sentences run together with a period and no spaces after the period. Spelling checkers are a great tool to find problems in scanned books but some of them won't find this since they have been taught (programmed) to ignore words of this kind since they might be filenames.
>
> The second problem was paragraph splits where they didn't belong. The sentence was not over and the new paragraph started with a small letter. It should not have been a paragraph split.
>
> Hopefully a program could detect this sort of thing.
>
> Dale

Dale, GutenMark will take care of a lot of these types of problems with PG texts. Some of them can be fixed with a decent text editor with search/replace capability (regular expressions would work even better for some issues). No matter which software you use, a human being will still have to proof the result, if you want it perfect.
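
To make the regular-expression suggestion concrete, here is a small Python sketch of the kind of search that could locate run-together sentences and punctuation with no following space. The two patterns and the function name are assumptions made for illustration; they are not GutenMark's rules. The first pattern requires a capital letter after the period, so lowercase filenames such as "readme.txt" are left alone, at the cost of missing some genuine errors.

```python
import re

# Sentence-ending mark glued to the next sentence, e.g. "door.The"
RUN_TOGETHER = re.compile(r"[a-z][.!?][A-Z]")
# Comma, semicolon or colon with no space after it, e.g. "However,he"
NO_SPACE = re.compile(r"[A-Za-z][,;:][A-Za-z]")

def find_spacing_errors(text):
    """Yield (line_number, column, snippet) for each suspect spot, 1-based."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in (RUN_TOGETHER, NO_SPACE):
            for m in pattern.finditer(line):
                snippet = line[max(0, m.start() - 10):m.end() + 10]
                yield lineno, m.start() + 1, snippet
```

The output is only a list of places to look at; whether each one is a real error is still a proofreader's call.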

bowerbird
11-08-2007, 05:15 PM
kovidgoyal said:
> To summarize your response,
> zml cannot support setting presentational aspects.

ok, _you_ don't get it. i'm sorry about that.

but i think it's sufficiently clear to _other_ people.

so i won't prolong the discussion. i'll explain it again,
one more time, but after that, i'll leave you in the dark.
because it's not really important if _you_ get it or not...

the _philosophy_ of z.m.l. puts the _locus_of_control_
for _presentational_matters_ into the hands of the reader.

so, the things that would fall within the purview of c.s.s.
-- in an xml/css world -- are found in the _zml-viewer_,
_not_ the file-format. if you look for things like drop-caps
in the _file-format_, you're just looking in the wrong place.
(and, because of that, you won't find them there. surprise!)

this is part of a much bigger _philosophy_ that it is far more
efficient -- in the long-run -- and a much better _strategy_
to put intelligence into our _applications_, not our _formats_.
the problem with putting smarts in the _format_ is that you
have to then mold the content to the format, whereas if we
put the smarts in the _apps_, they'll parse the raw content...
as before, this is far too big a concept for us to _discuss_,
so i'm only just laying it out, because we cannot "decide"
the issue here, that's for the real-world to do, but i thought
some lurkers might be interested in the "big picture" of that.


> HTML+CSS can support both structural
> and presentational aspects.

so can zen markup language.

the structural aspects are in the file-format, and
the presentational options are in the viewer-app.
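
The split described here can be sketched in a few lines of Python. This is only an illustration of the general idea (structure stored with the text, presentation chosen by the reader at display time). It is not z.m.l. syntax and not the code of any actual viewer-app; the `Block` kinds, the `ReaderPrefs` fields and the bracketed stand-in for a drop-cap are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str   # structural label only: "heading", "paragraph", "blockquote"
    text: str

@dataclass
class ReaderPrefs:
    drop_caps: bool = False   # reader's choice, not stored in the file
    indent: str = "  "        # paragraph indent chosen by the reader

def render(blocks, prefs):
    """Turn structural blocks into display text using the reader's preferences."""
    out = []
    after_heading = False
    for b in blocks:
        if b.kind == "heading":
            out.append(b.text.upper())
            after_heading = True
            continue
        if b.kind == "blockquote":
            out.append("    " + b.text)
        else:
            text = b.text
            # Mark the first letter after a heading if the reader asked for drop-caps.
            if prefs.drop_caps and after_heading and text:
                text = "[" + text[0] + "]" + text[1:]
            out.append(prefs.indent + text)
        after_heading = False
    return "\n".join(out)
```

The same list of blocks renders differently for two readers with different `ReaderPrefs`, without any change to the stored text.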


> They give the author more control and more freedom.

they give more control. they don't give more "freedom".

some people will say z.m.l. gives them the freedom to
avoid doing the unpleasant (to them) task of markup...
it is those simplicity-loving people i wish to empower.
but control-lovers who prefer xml/css can still use that.

there are certainly some authors out there who want to
control the reading experience of their audience. fine!
i have no beef with 'em. really! if you will kindly notice,
i have said that here on these boards, i am one of them!
i want to control the linebreaks that people see when they
read my posts. so i make it so they don't have a choice...
but you will also kindly notice that lots of people resent it.
(ok, maybe only _some_ people, but they resent it _loudly_.)

this divide -- between how much control an author wants
to exert over the experience of the product of their art --
already exists in the world of e-books today. some authors
are happy to make their text available so readers can mold it
into whatever form the readers want. other authors _insist_
on using .pdf, so they can control what every page looks like.

i don't tell authors which way is wrong or right. i don't care!

what i _am_ saying is that, if you're one of those authors who
is willing to hand control over to the reader, i've got a format
that makes your job of being an author _much_ easier for you.
if some authors like that, fine. if a _lot_ of authors like it, fine.
if no authors like it, fine. it doesn't make any difference to me.
my paycheck will be the same either way.


> zml is forcing restrictions on authors.

wrong. it is true that authors cannot use z.m.l. to deliver
custom-formatted books. but many authors do not care.

if an author feels that the "standard look" of a zml-book
crimps their style and "forces restrictions" on them, fine,
they're totally free to go elsewhere and use another method.


> Indeed your whole attitude is that authors dont know
> whats good for them and you're going to tell them that.

no, my attitude is that some authors don't want to do markup,
so i'm gonna give them a simple format so they don't have to,
but can nonetheless provide their readers with e-books that are
both powerful and beautiful.


> If you want to encourage authors/digitizers to use
> only structural markup, a better approach would have been

well, thanks for the suggestion. but as you can probably tell,
i already have some very firm ideas about what i want to do...

so i'm not really soliciting your suggestions... :+)

-bowerbird

bowerbird
11-08-2007, 05:27 PM
dalede said:
> Well to get back to the original theme

hallelujah! :+)


> I just finished reading a gutenberg book
> that was actually in fairly good shape.
> But even so it had some annoying problems still in it
> after I have gone through and beautified it once.

that happens...


> These included: punctuation without spaces.
> two sentences run together with a period and
> no spaces after the period.

yeah, those are pretty common problems,
especially in e-texts that were done early on.


> The second problem was paragraph splits where
> they didn't belong. The sentence was not over and
> the new paragraph started with a small letter.
> It should not have been a paragraph split.

although i haven't had very good luck from doing it,
the standard suggestion is that you report the errors.
maybe they'll get back to you, or maybe they won't...
and maybe they'll fix the errors, or maybe they won't.
the e-mail address for reports is "errata@pglaf.com".

i built a public error-reporting capacity right into
every _page_ of my library. i believe it's important.
i offered it to p.g., but they weren't interested. ok.


> Hopefully a program could detect this sort of thing.

punctuation without spaces? sure thing.
two sentences run together with a period? yep.
no spaces after the period? easy to locate.

all these checks -- and a lot more -- are in the
programs that i've written to do o.c.r. clean-up.

-bowerbird

bowerbird
11-08-2007, 05:29 PM
jbenny said:
> No matter which software you use,
> a human being will still have to proof the result,
> if you want it perfect.

on the other hand, if you want it "perfect",
it's best not to rely on a human being...

-bowerbird

TadW
11-08-2007, 05:40 PM
bowerbird said:
> kovidgoyal said:
> > To summarize your response,
> > zml cannot support setting presentational aspects.
>
> ok, _you_ don't get it. i'm sorry about that.
>
> but i think it's sufficiently clear to _other_ people.
>
> so i won't prolong the discussion. i'll explain it again,
> one more time, but after that, i'll leave you in the dark.
> because it's not really important if _you_ get it or not..

Why are you like this? Sorry, I don't get it. Just continue with your aggressive attitude towards highly respected MobileRead members, and for sure you'll be on everyone's ignore list - any time soon. :wall:

kovidgoyal
11-08-2007, 05:40 PM
bowerbird said:
> kovidgoyal said:
> > To summarize your response,
> > zml cannot support setting presentational aspects.
>
> ok, _you_ don't get it. i'm sorry about that.
>
> the _philosophy_ of z.m.l. puts the _locus_of_control_
> for _presentational_matters_ into the hands of the reader.

Surely you can make the leap to the next logical step. If a file defines only structural elements, the only elements that the viewer app has control over are those structural elements. Not all elements in an ebook are structural. Occasionally, there is a need for special formatting for isolated instances. zml + viewer will NOT handle this.

> this is part of a much bigger _philosophy_ that it is far more
> efficient -- in the long-run -- and a much better _strategy_
> to put intelligence into our _applications_, not our _formats_.
> the problem with putting smarts in the _format_ is that you
> have to then mold the content to the format, whereas if we
> put the smarts in the _apps_, they'll parse the raw content...
> as before, this is far too big a concept for us to _discuss_,
> so i'm only just laying it out, because we cannot "decide"
> the issue here, that's for the real-world to do, but i thought
> some lurkers might be interested in the "big picture" of that.

Umm, so you're stating a philosophy as a motivation for the use of lightweight markup and then refusing to discuss its merits?

> > Indeed your whole attitude is that authors dont know
> > whats good for them and you're going to tell them that.
>
> no, my attitude is that some authors don't want to do markup,
> so i'm gonna give them a simple format so they don't have to,
> but can nonetheless provide their readers with e-books that are
> both powerful and beautiful.

Again, a better way to give authors this power is to create an authoring tool, not a file format.

bowerbird
11-08-2007, 06:09 PM
tadw said:
> Why are you like this?

like _what_? i said he doesn't get it. because he _doesn't_. what _should_ i do?

and i said i don't care if _he_ gets it or not. because i don't. why should i?

because it's senseless to explain it over and over. it just drives everyone away.

-bowerbird

kovidgoyal
11-08-2007, 06:11 PM
bowerbird said:
> tadw said:
> > Why are you like this?
>
> like _what_? i said he doesn't get it. because he _doesn't_. what _should_ i do?
>
> and i said i don't care if _he_ gets it or not. because i don't. why should i?
>
> because it's senseless to explain it over and over. it just drives everyone away.
>
> -bowerbird

Be polite. And if you don't care whether people get what you're saying, don't post on public forums.

bowerbird
11-08-2007, 06:24 PM
kovidgoyal said:
> If a file defines only structural elements,
> the only elements that the viewer app
> has control over are those structural elements.

maybe viewer-apps in the way that _you_ conceive them
are such that they only follow directions given by the file.

but apps of the type that _i_ am building are not so dumb.
they will know a lot more than whatever the file tells them.


> Not all elements in an ebook are structural.
> Occasionally, there is a need for
> special formatting for isolated instances.

and here we are, once again, with the vague handwaving
about something that _might_ be needed _sometime_...

come up with something concrete in an actual p.g. e-text.

i've looked at those e-texts, lots and lots and lots of them,
and everything that i've seen in them, i know that i can do...

but, you know, i haven't exhaustively examined every one,
so if you can find something that my system cannot handle,
one way or another, i'll be quite happy to say "thank you"...
and then i'll go modify my system so it _can_ handle that...

but until you can do that, though, stop the vague handwave.


> zml + viewer will NOT handle this.

will not handle _what_? your imaginary boogieman? so what?


> Umm so you're stating a philosophy as a motivation
> for the use of lightweight markup and then
> refusing to discuss its merits?

look, i'm not asking you to _buy_ anything.
so there's no need to "discuss the merits"...
i just laid it out in case people were curious.

as i have said before, and will surely say again,
the proof is in the pudding. it's totally senseless
to debate _whether_ something will work or not.
build it, and if it works, it will be obvious to all...
and if you can't build it, or it doesn't work, then
_that_ will be equally obvious to all. talk is cheap.
working code is the standard i need for convincing.

and i'm writing that code myself, not asking you.
so don't waste my time "discussing the merits..."


> Again a better way to give authors this power is
> to create an authoring tool, not a file format.

i am creating _both_. and a whole lot more to boot.
i've pointed you and others to all kinds of my work.
if you have any criticism of _that_, i am all ears...
but i'm completely done with the vague handwaving.

-bowerbird

kovidgoyal
11-08-2007, 06:31 PM
@bowerbird

To remind you, I actually said that zml would be a good fit for p.g. txt files. But, for the umpteenth time, not for a general ebook format.

As for specific examples, I gave you specific examples which you dismissed as "being from CSS". So your attitude seems to be, if you're given a specific example you say "zml won't handle that because it's specific". If you're then told that support for custom formatting of ebook elements is a feature missing from zml you say "I don't want to listen to you because you're not being specific".

bowerbird
11-08-2007, 06:38 PM
you're making this boring for the lurkers again. and being dishonest to boot.

if you want to say, "it's not good for a general e-book format", you'll have to
give a concrete reason why, or i'm not even going to bother to make a reply.

and i didn't "dismiss" your examples, i told you exactly how those things will
be handled in the z.m.l. environment, i.e., as user-options in the viewer-app.

custom-formatting by the _author_ is expressly _not_ supported by z.m.l.,
because the z.m.l. philosophy expressly gives presentation to the _reader_.

and i'm sure i've said all of that _clearly_enough_ so that a rational person
understands it perfectly well. and if i haven't, then maybe i just can't do it.
in which case people will have to pick it up intuitively when they use my apps.

-bowerbird

GregS
11-08-2007, 06:47 PM
I have just found this thread, and have only skimmed through a portion of it - I will read it more carefully this afternoon. Forgive these comments that may be off-topic.

Clearly marking up is the answer, but should it be dictated by ebook formats at all?

Gutenberg (the biggest project of its kind), is not and should not be seen simply as a resource for current ebooks. It is a resource of incredible value for many things yet to be seen. But the problem is that it is anchored in its past. Other collections are in html, but the variety of application proves problematical.

If you think novels are a problem, think about plays and poetry collections. Think also of the need to transform text into voice-synthesised readings, the problem of reference quoting, and so on. The list of what may be wanted to be read, heard or otherwise used only gets more complex and unpredictable as readers become more widespread, and other means of dealing with literature are developed.

I would propose that the Gutenberg problem does not lie in marking up for ebooks, but rather a markup that allows easy translation to things like epub (a very good move).

It is not a matter of light vs heavy markup.

It is a matter of finding a light markup that can be transformed coherently and consistently into heavy markup, which may include voice markup, reference markup, and complete structural markup that is potentially well beyond what any present reader can handle.

Yet at the same time can be used in a minimalist fashion and allow greater complexity to be added by future editors.

I would suggest, that TEI (text Encoding Initiative) is the only candidate.

However, anyone looking at it would faint from apparent complexity of what could be done.

TEI.lite is only lite from a scholar's perspective.

However, it should be possible to prepare a consistent subset standard that is compatible with translation to epub, for instance.

So why bother? Why not just use something like epub?

The reason is that as a document is edited over time and more and more elements are placed in it, the thing has to be consistent. It is easy to substitute the main element names etc. with those of, say, epub; it is just as easy to ignore all else (element-wise) by simple filtering.

It is not so simple to add elements into a more restrictive scheme - that is the primary problem. It must be a system that allows for growing complexity over time.

I believe there is only one candidate. However, it needs to have simply implemented templates and there is no reason why the base markup should not be designed specifically for translation into existing ebook formats, or indeed good formats not yet used.

Now if this is done well there is no reason why source text markup cannot be translated on site as part of the download process. So instead of projects like Gutenberg keeping multiple file types, they keep one file type (TEI ultralite) and translate on the fly into whatever a reader may like to use (including varieties of plain text).

kovidgoyal
11-08-2007, 06:47 PM
@bowerbird
As I've stated repeatedly, it is not good for a general ebook format because it does not support custom formatting of individual elements. As you stated quite clearly in your last post.

Again, giving control of presentation to the reader software is in general a good thing. Forcing all presentation to be done only by the reader software, is not. It is limiting and short sighted.

@GregS

I agree, for archival purposes, of books that need to be digitized, a lightweight system that has standard structural elements is the way to go. Like you, I think it should be rooted in some sort of system that can be simplified considerably, but that is extensible, to allow for the bells and whistles that authors like to add.

bowerbird
11-08-2007, 07:15 PM
kovidgoyal said:
> it does not support custom formatting of individual elements

stop misrepresenting the facts. it's dishonest.

z.m.l. supports custom-formatting of individual elements by _the_reader_.
you know, all those people who are actually _reading_ the book...

it does _not_ support customization by the _author_, nor does it _require_ it,
which might well appeal to writers who want to _write_ and not do _markup_.

the choice as to whether or not to use z.m.l. will be made by the author.
not me. not you. not kovidgoyal. nobody, except the author. thank you.

-bowerbird

RWood
11-08-2007, 07:20 PM
You say the discussion period is over, bowerbird; you said almost the same thing several years ago. The time for "putting up" or "putting out" is now or never. The few examples you presented to us in HTML 4 on your web site do not convince us of the power of ZML; if anything, they make me believe that ZML is just a pipe dream in your mind.

You stated some years ago that you would open the sources and even claimed that ZML was developed under an open source license. That would lead one to assume that since you now will not release the source, a working source does not exist.

You have prattled on for a while here at MobileRead claiming victim status because everyone is picking on you. I can honestly tell you that everyone is not picking on you. Many have just set their defaults to ignore your posts because they have better things to do with their lives than listen to your unsupported boasts about technical, moral, and artistic superiority.

Now kovid, who you claim has little knowledge of formatting, conversions, or even ebooks, has developed a set of programs called libprs500. I have used it for many months with great results. Fictionwise has also adopted it and uses it for all of the Sony LRF formatted books they offer -- over 6,000 at last count and growing daily.

If, as you say, "the proof is in the pudding," then your several-year-old pudding has gone rancid.

kovidgoyal
11-08-2007, 07:26 PM
Sigh again with the accusations. zml does not support custom formatting of individual elements. When I make that statement it means that the *markup language* zml does not have support for specifying custom formatting of individual elements. Reader software will have support for individual formatting of *structural* elements not individual formatting of *arbitrary* elements.

Please read that paragraph three times before replying.

bowerbird
11-08-2007, 07:44 PM
greg said:
> I would propose that the Gutenberg problem
> does not lie in marking up for ebooks, but
> rather a markup that allows easy translation
> to things like epub (a very good move).

well, gee, that would be _nice_.

but the problem is that, in order to get .epub,
you _do_ have to do markup. quite a lot of it.
epub is xhtml/css underneath. (and not far...)

so yeah, it would certainly be lovely if we could
jump directly to epub without any markup, but
it's not really possible.

(you could also do what hadrien does at feedbooks,
which is to put the book into a structured database,
and _then_ churn out the .epub. he's doing markup,
he's just doing it another kind of way, via database.)


> It is matter of finding a light markup that can be
> transformed coherently and consistently into
> heavy markup, they may include voice markup,
> reference markup, and complete structural markup,
> that is potentially well beyond what
> any present reader can handle.

so now you want light-markup as a middleman.
at first glance, that's an appealing position too...
you avoid the high costs of doing much markup,
but get the benefits that heavy markup "promises".

and once again, it would be nice if you could get it.

but you can't...

well -- to be completely frank -- you kind of can...

my routines can turn a light-markup file _into_
a heavy-markup file, and do a fairly good job...

but let me tell you why i think that's a dead-end.

consider the whole set of routines that will successfully
convert the light-markup file into a heavy-markup file,
which is then input to another app (call it "program p")
for "purposes of presentation" (whatever form it takes).

_instead_, put that set of routines right in "program p",
so it inputs the light-markup file, does the conversion
_itself_, and then goes on to act on the converted data.

that's better, isn't it? you didn't have to convert it yourself,
because "program p" did it for you. you avoided the mess
of the intermediate file. (because, really, were you going to
keep both the light-markup _and_ the heavy-markup files?
because that's just a bunch of unnecessary file-overhead.)


> I would suggest, that TEI (text Encoding Initiative)
> is the only candidate.

oh sheesh! you want to jump _directly_
to the heaviest of the heavy, don't you? :+)

good luck with that. that's been the plan of
the technoid faction over at p.g. for... well,
going on 6 years now... going _nowhere_...

-bowerbird

bowerbird
11-08-2007, 07:46 PM
i could read it 100 times and i will not reply,
because i've gotten off that merry-go-round.

-bowerbird

bowerbird
11-08-2007, 07:49 PM
rwood, i'm not sure where you're getting your "facts",
but i'm not interested in the fight you want to pick.

i'm not even gonna correct all the falsehoods you stated.

-bowerbird

kovidgoyal
11-08-2007, 08:01 PM
bowerbird said:
> i could read it 100 times and i will not reply,
> because i've gotten off that merry-go-round.
>
> -bowerbird

bye bye :)

bowerbird
11-08-2007, 09:37 PM
good.

now, for anyone who wants to know _if_ z.m.l. _can_ do something
-- something specific, something they _need_ -- feel free to ask me,
and i'll be happy to tell you if it can, and how you would accomplish it.

there are lots of people who seemingly want to tell you what z.m.l.
can _not_ do, but i don't suggest you ask them, because they simply
don't know my system like i know it. which only makes sense, yeah?

there are lots of things that z.m.l. cannot do. if you are an author
who wants to dictate the font(s) used in your book, you can't do it.
you can't dictate the fontsize -- not even the _relative_ font-size --
or the color of any of the text, or background color(s), or margins,
or the leading, or the pagesize, none of it, absolutely _none_ of it.
can't even make _suggestions_ about the settings of those things...

so, you know, if you _need_ those things, z.m.l. isn't right for you.

because all of those variables are controlled _solely_ by the reader.

oh, well, for the _record_, i have _considered_ a mechanism whereby
the author could make "suggestions" about some of those dimensions,
but i haven't made the decision whether i will actually _implement_ it.
of course, the final say in the matter will always rest in _the_reader_.

that is -- just for anyone who has been mistaken about it all along --
they'll be controlled by the _human_being_ who is _reading_ the book,
who i call "the reader". (when i'm talking about the viewer-program,
i call it "the viewer-program", the "viewer-app", or just "the viewer".)
but when i say "the reader", i'm talking about the breathing human...
and it's that breathing human -- the one who is absorbing the words --
who makes the decisions about presentational aspects of a z.m.l. text.

-bowerbird

DaleDe
11-09-2007, 02:14 AM
While I applaud user choice there should be guidance in what the author intended. Bold, italics, font size and even switching font can be a useful mechanism to let the user know they are now reading a letter, or a sign, or some other special effect that needs to be communicated.

I really like the way html started out back in version 3. It was great. The author hinted about the weight and import of the data and the user controlled the presentation. No CSS where the author attempts to control everything and makes things too complicated. What happened? (My web site is still built by hand with html.)

Well, the source of the documents had to take over control of the documents and had to publish the page rather than present the data. Too bad, IMHO. And, for the user, the flashiness of the presentation overruled the accuracy or the content. I am amazed, in the business community, how much people will believe data presented in PowerPoint when they would challenge it on a typewritten page. I am in the minority, it seems, and you are even further from the mainstream than I am, it would seem.

Sorry, your post triggered a rant. I am better now.

Dale

bowerbird
11-09-2007, 06:03 AM
dalede said:
> While I applaud user choice there should be
> guidance in what the author intended.
> Bold, italics, font size and even swithing font
> can be a useful mechanism to let the user know

bold and italics are indeed things that the author indicates in z.m.l.
bold is represented with *asterisks*, and italics with _underscores_.
(now you know why i'm always using the underscores in posts.)

of course, the user can exercise an option to change the way that
bold and italics are _rendered_. *bold* might be rendered in red,
and _italics_ might be rendered instead with green underlined text.
the author can also use other characters to indicate special marking;
> $this$might$be$the$signal$to$indicate$computer$code.$
> `and`this`might`indicate`a`monospaced`font`should`be`used.`
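
The split described above (the author marks *what* is emphasized, the reader chooses *how* it looks) can be sketched roughly as follows. This is only an illustration, not bowerbird's actual code: the function name `render_inline`, the `READER_STYLES` table and the choice of HTML spans as output are all assumptions made for the example.

```python
import html
import re

# Hypothetical reader preferences: the reader, not the author, picks these.
READER_STYLES = {
    "bold": "color: red",
    "italic": "color: green; text-decoration: underline",
}

# *word* marks bold and _word_ marks italics, as described in the post.
# The lookarounds keep a bare "2 * 3 * 4" from being treated as markup.
BOLD = re.compile(r"(?<!\w)\*(?=\S)([^*]+?)(?<=\S)\*(?!\w)")
ITALIC = re.compile(r"(?<!\w)_(?=\S)([^_]+?)(?<=\S)_(?!\w)")


def render_inline(text, styles=READER_STYLES):
    """Turn author-marked emphasis into spans styled by the reader's choices."""
    out = html.escape(text, quote=False)  # keep literal <, > and & safe
    out = BOLD.sub(
        lambda m: f'<span style="{styles["bold"]}">{m.group(1)}</span>', out
    )
    out = ITALIC.sub(
        lambda m: f'<span style="{styles["italic"]}">{m.group(1)}</span>', out
    )
    return out


print(render_inline("*bold* might be red, and _italics_ might be green."))
```

Swapping in a different `styles` dictionary changes the rendering without touching the text itself, which is the point being made about reader-side control.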


> can be a useful mechanism to let the user know
> they are now reading a letter, or a sign, or some
> other special effect that needs to be communicated.

a letter or a sign would be set off specifically as a _block_.

here's an example, from the first page of p.g. e-text #22589:


The sign said:
~tab~~tab~ JUBILATION, U.S.A.!! ~tab~~tab~
~tab~~tab~ The doggondest, cheeriest ~tab~~tab~
~tab~~tab~ little town in America! ~tab~~tab~

The two aliens smiled at each other. Unaccustomed to oral conversation,
they exchanged thoughts.


> http://www.gutenberg.org/files/22589/22589-h/22589-h.htm

the "~tab~" thingee indicates a tab, just so you can see it there.

in z.m.l., if you have two tabs at the start of a line, and two at the end,
it means that line is supposed to be centered. further, when you have
several such successive lines, it means you've got a _block_... voila!
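
A minimal sketch of that rule, assuming real tab characters in the file (the "~tab~" above is only a visible stand-in). The names `is_centered` and `find_blocks`, and the `min_lines=2` threshold for "several successive lines", are assumptions for illustration and are not taken from the z.m.l. routines themselves.

```python
def is_centered(line):
    """True when a line starts and ends with two tabs and has text in between."""
    return (
        line.startswith("\t\t")
        and line.endswith("\t\t")
        and line.strip() != ""
    )


def find_blocks(lines, min_lines=2):
    """Return (start_index, [text, ...]) for each run of centered lines.

    min_lines=2 is an assumption: a single centered line is treated as
    just a centered line, while two or more in a row count as a block.
    """
    runs = []
    current = None
    for index, line in enumerate(lines):
        if is_centered(line):
            if current is None:
                current = (index, [])
                runs.append(current)
            current[1].append(line.strip())
        else:
            current = None  # any other line ends the current run
    return [run for run in runs if len(run[1]) >= min_lines]


SAMPLE = (
    "The sign said:\n"
    "\t\t JUBILATION, U.S.A.!! \t\t\n"
    "\t\t The doggondest, cheeriest \t\t\n"
    "\t\t little town in America! \t\t\n"
    "\n"
    "The two aliens smiled at each other."
)

for start, texts in find_blocks(SAMPLE.split("\n")):
    print(start, texts)
```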

z.m.l. doesn't know what _kind_ of block it is, and it doesn't really care.

i've built routines that look for certain words in the text around a block,
to ascertain what _kind_ of block, words like "invitation" and "sign" and
"letter" and "note" and "warning" and "figure" and "table" and so on...

and the routines actually work very well, which _amazed_ me at first,
until i realized that authors will _generally_ inform their readers about
something out of the ordinary like this. it's not merely the typography
that indicates what it is, it's the author explicitly _telling_ the reader.
just like the author did in the example above. check it out for yourself,
across a number of books, and you will see that it's actually the case...
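
For anyone curious what such a keyword lookup might look like, here is a rough sketch. It is a guess at the general idea only: the word list comes from the post above, but the function name `guess_block_kind`, the three-line look-back window, and the choice to scan only the text *before* the block are assumptions.

```python
import re

# Words listed in the post as hints about what kind of block follows.
BLOCK_KINDS = ("invitation", "sign", "letter", "note", "warning", "figure", "table")


def guess_block_kind(lines, block_start, window=3):
    """Guess a block's kind from the few lines of text just before it.

    Returns the hint word nearest to the block, or None if none is found.
    Only exact word matches count, so a plural like "letters" is missed.
    """
    context = " ".join(lines[max(0, block_start - window):block_start]).lower()
    words = re.findall(r"[a-z]+", context)
    for word in reversed(words):  # nearest word to the block first
        if word in BLOCK_KINDS:
            return word
    return None


lines = [
    "They drove on for another mile.",
    "The sign said:",
    "\t\t JUBILATION, U.S.A.!! \t\t",
]
print(guess_block_kind(lines, 2))
```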

so i have no plans to include these routines in my viewer-app presently.
if later on, there arises some _need_ for the program to _identify_ certain
types of blocks, i'll put it in. but for the time being, i don't see that need.

but yeah, good point, and i think i've got that covered well enough...

-bowerbird

GregS
11-09-2007, 07:25 AM
We may have a misunderstanding here. As I said, I have only skimmed the thread; I know nothing whatsoever of the system you are suggesting, and I have no opinion on it - I won't express an opinion until I am acquainted with ZML. I have had a goodish look at epub, that is why I mentioned it.

bowerbird said:
> greg said:
> > I would propose that the Gutenberg problem
> > does not lie in marking up for ebooks, but
> > rather a markup that allows easy translation
> > to things like epub (a very good move).
>
> well, gee, that would be _nice_.
>
> but the problem is that, in order to get .epub,
> you _do_ have to do markup. quite a lot of it.
> epub is xhtml/css underneath. (and not far...)
>
> so yeah, it would certainly be lovely if we could
> jump directly to epub without any markup, but
> it's not really possible.
>
> (you could also do what hadrien does at feedbooks,
> which is to put the book into a structured database,
> and _then_ churn out the .epub. he's doing markup,
> he's just doing it another kind of way, via database.)


The reason I am suggesting this particular approach to large literature repositories has nothing to do with epublishing or readers per se, though they have a natural place in any such digital library.

It is all about the text, how it is used is all about translation. The primary thing is that the text be properly structured for the widest possible uses now and in the future.


bowerbird said:
> > It is matter of finding a light markup that can be
> > transformed coherently and consistently into
> > heavy markup, they may include voice markup,
> > reference markup, and complete structural markup,
> > that is potentially well beyond what
> > any present reader can handle.
>
> so now you want light-markup as a middleman.
> at first glance, that's an appealing position too...
> you avoid the high costs of doing much markup,
> but get the benefits that heavy markup "promises".
>
> and once again, it would be nice if you could get it.
>
> but you can't...
>
> well -- to be completely frank -- you kind of can...
>
> my routines can turn a light-markup file _into_
> a heavy-markup file, and do a fairly good job...
>
> but let me tell you why i think that's a dead-end.
>
> consider the whole set of routines that will successfully
> convert the light-markup file into a heavy-markup file,
> which is then input to another app (call it "program p")
> for "purposes of presentation" (whatever form it takes).
>
> _instead_, put that set of routines right in "program p",
> so it inputs the light-markup file, does the conversion
> _itself_, and then goes on to act on the converted data.
>
> that's better, isn't it? you didn't have to convert it yourself,
> because "program p" did it for you. you avoided the mess
> of the intermediate file. (because, really, were you going to
> keep both the light-markup _and_ the heavy-markup files?
> because that's just a bunch of unnecessary file-overhead.)

Again, I would have to look carefully at ZML before offering any kind of opinion on what you say. I don't understand how light markup can be converted into heavy, as I see most markup (especially heavy) to be a human interpretation of the text, helped by programs but not within their ken to accurately create.

I don't understand the intermediate file thing. What I was suggesting was a standard, very light (ultralight) use of TEI, because in terms of Literature it is the most developed markup in existence. As editions of the same text are made from the source text, more markup for different purposes is added to it.

A fully marked-up TEI text has a huge amount of element tags in proportion to the text; add in Voice Synthesis tagging (TEI 5) and the thing is almost all tags.

There is no electronic efficiency in this, it is most inefficient. 99% of the element tags and attributes are not needed for any particular use, for ebook reading (against the concept of a fully marked-up TEI text) only a tiny proportion of the markup is of any use at all.

The virtue is that they are just tags and can be easily filtered for particular purposes. What is more the text becomes a multi-use resource, which is my point (databasing cannot do this).

However, with TEI it is possible to reduce it to just a handful of tags, just enough in fact to translate into something as simple as epub, or for that matter ZML. Moreover, translating into PDF for printing etc., while not trivial at this stage, poses no insurmountable problem.

The idea is not applicable to people selling ebooks, or making them. However, Gutenberg is much more than this; potentially it and others like it are a new Alexandrian Library. And that requires a scholarly approach to how to keep the texts in their most useful form for all sorts of predictable and unpredictable uses. If we are talking of marked-up text for a purpose like that, TEI is it (the only system developed enough to thoroughly mark up literature, from manuscripts and scientific articles to novels, plays, corpus collections, dictionaries etc.).

I am not however talking about some mammoth operation to apply TEI, just the marking out of a simple cut down version compatible with building more into the markup as time goes by.

Making no reference to ZML, but to epub, with which I have a slight acquaintance: it could well be the model of what such a cut-down version should be. A stage-one markup could well be nearly a one-to-one conversion, simply changing the element and attribute names and filtering out anything else.
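
To make the "rename the elements, filter out the rest" idea concrete, here is a small sketch. It is an illustration under stated assumptions, not an actual TEI-to-epub converter: the `TAG_MAP` pairing of TEI element names with XHTML ones is invented for the example, attributes are simply dropped, and the TEI namespace is ignored to keep things short.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a handful of TEI elements to XHTML elements.
TAG_MAP = {"text": "body", "div": "div", "head": "h2", "p": "p", "hi": "em"}


def _append_text(parent, text):
    """Add text after the last child of `parent`, or to its own text."""
    if not text:
        return
    children = list(parent)
    if children:
        children[-1].tail = (children[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def convert(src, dst_parent):
    """Copy `src` under `dst_parent`: rename known tags, unwrap unknown ones."""
    new_tag = TAG_MAP.get(src.tag)
    if new_tag is not None:
        target = ET.SubElement(dst_parent, new_tag)
    else:
        target = dst_parent  # unknown element: drop the tag, keep its words
    _append_text(target, src.text)
    for child in src:
        convert(child, target)
        _append_text(target, child.tail)


def tei_to_xhtml(tei_source):
    """Return an XHTML-like string built from a small TEI-like fragment."""
    holder = ET.Element("html")
    convert(ET.fromstring(tei_source), holder)
    return ET.tostring(holder, encoding="unicode")


TEI_SAMPLE = (
    "<text><div><head>Act I</head>"
    "<p>Enter <hi>Hamlet</hi>, <persName>Horatio</persName> following.</p>"
    "</div></text>"
)
print(tei_to_xhtml(TEI_SAMPLE))
```

The richer tag (`persName` here) is filtered out for this one use but stays in the stored source, so another translation could still use it.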

I am saying this approach is the most suitable for storing literature as a long term cultural asset. Not that it helps in any way eink readers or anything similar (catering for their use, a thousand times yes, but not designing the text markup for simply reading them on current devices).



bowerbird said:
> > I would suggest, that TEI (text Encoding Initiative)
> > is the only candidate.
>
> oh sheesh! you want to jump _directly_
> to the heaviest of the heavy, don't you? :+)
>
> good luck with that. that's been the plan of
> the technoid faction over at p.g. for... well,
> going on 6 years now... going _nowhere_...
>
> -bowerbird

I am no technoid, but an academically inclined teacher, eager to have good Literature made available in a flexible and future-proofed form. At least at the start of the thread this seemed to be the main concern: how to adapt Gutenberg's resources for this Second digital revolution.

I am the first to agree that coding fully in TEI is a nightmare, that the editors for this are nowhere near developed enough, and that trying to apply this form of markup by non-scholars is a recipe for disaster as it stands.

However, TEI 5, which is nearly fully developed, is a huge improvement on and far more extensive than anything before it, and solves the dire problem of multiple different and incompatible markup schemes being applied to the same text and in the same file.

There are ways it can be used in a very limited and cut-down, easy-to-apply fashion; in fact this could be done quickly by just a handful of people familiar with TEI and the needs of such things as ebook reading. There are also ways of using markup externally to the text file (a structural markup stylesheet).

The real virtue of TEI is its thoroughly developed element structure, and that it has been designed to cope with the most diverse textual material to a scholarly level.

I think we might therefore be at cross purposes. I will have a look at ZML when I get the chance, it is possible it might just be the thing, but I find it incredibly hard to imagine a simple solution to such a complex problem of marking up literature, storing it, and developing its structural analysis and use by applying more and more markup over time.

I have a little experience in dealing with quality voice synthesis, which in the near future may well be put into handheld readers. I can say with some authority that TEI's markup solution is superior to any other approach I have come across (on a number of grounds).

Voice direction is not compatible to textual structure as one might assume. Speakers sometimes speak together (two different structures combined), voices may blend behind, sometimes a SFX may play behind any number of speakers, dialogues. What I am saying is that the nested nature of markup is necessary, yet adding voice markup can violate that. TEI has a compatible solution, I know of no other system that can mix the two.

Besides which, there is a huge potential variety of text sources to mark up: ancient epigraphs, parallel commentary, embedded translations of odd terms, musical notation, dramatic pieces etc. These things need a very elaborate system to do justice to the content.

Several basic standard types of severely reduced TEI markup would be an ideal solution for Gutenberg - HTML and epub just cannot cut the mustard in the long term, nor should we expect them to.

GregS
11-09-2007, 07:37 AM
PS to add one aside - the problem of Unicode.

Because we must deal with the languages of the world, I very much favour moving away from ASCII.

However, the problem is not the scripts whatever the language, but page decoration and punctuation.

In XML/XHTML/TEI whatever, entities solve a good deal of this because they are unambiguous. Fonts do not cover everything and even curly quotes pose problems when rendered into direct code.

I am suggesting that entity sets need to be applied heavily, to maintain the long term integrity of texts. It is at the cost of size, but an apostrophe is not a closing single quote, though the glyph is.

We cannot get the method of glyphs mixed up with the system of original writing. How something might be rendered should not become a substitute for the mark being rendered.

GregS
11-09-2007, 08:42 AM
PPS

Is this the ZML?

http://rx4rdf.liminalzone.org/ZMLMarkupRules

If it is I have some preliminary opinions.

"Like the Wiki and SLiP formats its goal is to be a human-friendly markup language: simple, clear, and concise."

What I like about the bulkiness of XML is that it is passive and robust, and simple to repair - how simple, clear and concise it is, is from this point of view largely irrelevant. Beginning and end tags are bulky and inefficient, but they make things robust. A typo, an accidental deletion etc., can leave clues behind - more efficient marking up also means more fragile.

But no one in their right mind wants to directly work with XML tags ( or should not, I suspect that many that do are not in their right minds, or just trapped by current technology).

As a processing and composing language ZML has some virtues, but for me this is not enough: things have to be future proofed and robustly constructed. That means redundancies (such as element tags for closure) that are in fact a good thing for storage, stability, flexibility and preservation. In terms of rendering, transmission etc., the overhead of preprocessing XML into another form is well worth the cost.

ZML as a method no doubt has its uses.

Two other approaches I prefer, REBOL and LUA read XML directly into their data structures, which can then be manipulated by the script language like any other data. Both can save out as XML at any time.

I don't know if this makes sense in this context, but I favour lean applications and fat data. I have been with computers since the Apple II, from when everything was a squeeze. I don't really miss the old applications, but the loss of data still hurts. I have long since thrown out a lot of things I had written (half-written books, notes, articles etc.) that simply could not be read.

Digital Preservation is a critically important issue. The solution is robust redundancies, rich data, and most of all simplicity. The cost is fatter files, a little extra processing overhead - and it is all well worth the price, if we can preserve unambiguously what we already have.

Writing on paper disappeared only when the paper itself rotted away. No new improvements in the press, bindings or reading glasses affected what had been preserved.

What was written before did not disappear when a new pair of glasses was purchased. Until recently new digital glasses made previous words disappear. XML and XHTML preserve some aspects well, but other standards are needed to preserve it better (TEI in the end, I suggest, for Literature).

ZML has its place, but not as a method of storage.

bowerbird
11-09-2007, 01:36 PM
greg said:
> Is this the ZML?
> http://rx4rdf.liminalzone.org/ZMLMarkupRules
> If it is I have some preliminary opinions.

um, no. most emphatically not.

i've pointed to my site several times:
> http://z-m-l.com

for the latest summary of the work available
-- almost all of it demos, proof-of-concept, etc. -- see:
> http://z-m-l.com/go/pudding_sampler.html

-bowerbird

bowerbird
11-09-2007, 02:08 PM
greg said:
> The primary thing is that the text be properly structured
> for the widest possible uses now and in the future.

that _sounds_ good. until you realize that -- depending on
how one defines "properly structured", and how one considers
"the widest possible uses", not to mention the crystal-ball on
"the future" -- doing heavy markup might be _very_ expensive.

so expensive -- quite literally -- that we cannot afford to do it.

heck, did you notice that -- until google decided to step in --
we couldn't even find funds to _scan_ the books in our libraries.
and scanning is dirt-cheap compared to applying heavy-markup.

and the other thing to keep in mind is that society is generating
new content at a numbing rate, a rate that's even ever-increasing!
and precious little of that content is marked up, not even in .html.

so, you know, in _my_ humble opinion (as people say), the idea
that we can make the _assumption_ that our data is marked up
-- and marked up with something intense like .tei -- is _silly_...
to the point of -- in my _humble_ opinion -- being ridiculous...
(a strong word, even extreme, but i think it is fully appropriate,
since -- as far as i can see -- this assumption has zero reality.)

would it be _nice_ if all our text could be extensively marked up,
such that it could magically be transformed any way we wanted?
well, _sure_ it would. it'd be _great_.

but, considered from the cost-benefit perspective that everything
must work under in our world, the benefits don't even come _close_
to justifying the very high costs of applying that extensive markup.

so we need something else, which gives us _most_ of the benefits,
at a _much_ lower cost. and that "something else" is light-markup.


> At least at the start of the thread this seem to be the main concern,
> how to adapt Gutenberg's resources for this Second digital revolution.

actually, the thread started out only as an attempt to
make a checklist of techniques that people have used
to make p.g. e-texts look _typographically_beautiful_...


> I will have a look at ZML when I get the chance,
> it is possible it might just be the thing

you're welcome to look at it, but i can pretty much tell you now that it
won't be a good fit, because your head wants an "ideal" markup system
-- which anticipates "any possible use, now or in the future" -- whereas
z.m.l. is fully grounded in the tradeoffs that a cost-benefit ratio demands.


> I find it incredibly hard to imagine a simple solution
> to such a complex problem of marking up literature, storing it

well, once you find out how little it costs, and the huge benefits it returns,
you might be surprised. but it's because you see the problem as "complex"
that i said you wouldn't be a good fit with z.m.l. you're one of those people
who love complexity. that's ok. doesn't mean that you're a bad person... :+)

-bowerbird

bob_ninja
11-09-2007, 05:49 PM
This discussion is too long to read. Can someone summarize for me if you actually came up with any software for "cleaning up" G. text files?
I started writing my own tool and would like to avoid reinventing the wheel.
thanks

kovidgoyal
11-09-2007, 05:55 PM
Short answer: no
Long answer: bowerbird claims to have a tool, but (as far as I understand) he's not going to release it, only use it to create his own mirror of p.g.

GregS
11-09-2007, 07:34 PM
bowerbird
"that _sounds_ good. until you realize that -- depending on
how one defines "properly structured", and how one considers
"the widest possible uses", not to mention the crystal-ball on
"the future" -- doing heavy markup might be _very_ expensive."

I do not understand this at all.

Full marking up in TEI is not being suggested, I am only suggesting a lightweight standard no more difficult than Xhtml or epub, but adapted for text repositories rather than display (though with CSS this poses no problem whatsoever - epub requires a trivial translation).

It was after having a look at Gutenberg Marker (which solves a lot of problems really very well) that I realised a couple of extra steps would make such software very useful for establishing a standard ultra light TEI.

As for future proofing, you seem to miss the point altogether. Scholars have already developed concepts of textual structural analysis. If the structure can be unambiguously marked the text is future proofed, because one way or another it is the structure that has been the most elusive aspect of text and handling it for different purposes.

No magic involved, just progressively adding in tags by editors who do know what they are doing (TEI is, like the texts themselves, inherently hard to do, as the complexity of the markup meets the complexity of the text itself).

I have taken at least seven or eight Shakespeare plays from Gutenberg, turned them into word processing documents, cleaned them and then through stylesheets reedited them and finally after a lot of effort (plays are really hard to do compared to novels) produce a pdf to print out copies for my students.

In short, though the tools are wrong for the purpose I have been doing just sort of thing because I had no choice - but the end result did not give anyone else useful. I could just as well have been properly marking up the play (using TEI derived tags) and placed back in Gutenberg something useful to others.

This is what I mean by progressively tagging repository texts. Academics occasionally resort to Gutenberg, but whatever they do the text is lost to the repository as well. Students no doubt use the texts for study in their literature degrees, whatever they do is lost.

I have dealt with HTML and text versions of a variety of literature. For some purposes just being in HTML makes things very easy, but it also can make things a lot harder as well.

The other thing is that just reading texts (or printing them) is only one aspect of digitizing texts. Storing them as virtual texts (with their structure preserved and readable) is vitally important for the preservation of literature. This is not just academic prejudice, it is what makes the texts adjustable to unpredictable future usage.

A chapter is not a title (a small criticism of Gutenberg Marker) it is a division that may or may not have a title. Hence I may in the future for whatever reason, desire to quickly retrieve Chapter Seven of "Pride and Prejudice" how can this be done unless the computer has a means of finding exactly what I asked it to find?

You don't need a crystal ball just an understanding of text itself from a scholarly point of view. These people have not been wasting their time, their precision is not useless but vital, and their knowledge (a part of which resides in the very code of TEI) cannot be ignored.

And I repeat an ultra-light version of TEI need be no more difficult than what we are already using, but it is not closed off like XHTML/epub or any other display technology. Being XML it is probably just as displayable in most contexts anyhow.

I looked at your references and had seen them, but as I could only see html markup in the source I went looking elsewhere. Sorry for the mistake.

My description of this thread as being about the Second Digital Revolution was not misplaced. The whole problem with Gutenberg at the moment is that it is rooted in the First, hence the compounding problems and the variety of solutions being proposed.

I have no prejudice against your system, except of course that until I trawl through this long thread I have no clear idea of what it is, and, based on other readers' comments, I am not too sure I will be that much clearer if I do.

"you're welcome to look at it, but i can pretty much tell you now that it
won't be a good fit, because your head wants an "ideal" markup system
-- which anticipates "any possible use, now or in the future" -- whereas
z.m.l. is fully grounded in the tradeoffs that a cost-benefit ratio demands."

For me this just places it amongst display technologies, which is no bad place to be. The problems of text repositories is a different thing altogether.

Consider that a text is fully marked up in TEI (ie more tags than text). A huge labour but one that can accumulate over time in a systematic and reliable way. What is needed to translate it? "Find these tags and change them thus...." "Ignore every other tag" - the end result could be anything. As I wrote (with some help) something very similar in REBOL, I know that such a script can be less than a page of code and adds only a tiny delay to simply copying the file to a new location.
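
As a rough illustration of the "find these tags and change them thus, ignore every other tag" idea, here is a minimal Python sketch (not the REBOL script mentioned above, which is not shown in this thread). The mapping in TAG_MAP and the function name translate are assumptions made up for the example; attributes are simply dropped, and any tag not in the map is removed while its text content is kept.

```python
import re

# Hypothetical mapping from a few TEI element names to display tags.
# Anything not listed here is ignored (the tag is dropped, its text stays).
TAG_MAP = {"head": "h2", "p": "p", "hi": "em"}

# Matches an opening, closing, or self-closing tag and captures its name.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^<>]*)>")

def translate(tei):
    """Rename the tags listed in TAG_MAP and strip every other tag."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        if name in TAG_MAP:
            # Attributes are discarded in this sketch.
            return "<" + slash + TAG_MAP[name] + ">"
        return ""
    return TAG.sub(swap, tei)

sample = ('<div type="chapter"><head>Chapter 1</head>'
          '<p>It is a <hi rend="italic">truth</hi> universally acknowledged<lb/></p></div>')
print(translate(sample))
```

A real converter would want a proper XML parser and some thought about attributes and self-closing elements; the point here is only that the translation step itself can be very small.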

I also hold out some hope that students around the world may in the mid-term future look forward to having a device that is a notebook/reader capable of displaying TEI encoded documents in an academically useful way. However, the technology has to develop and for that it needs to establish a good market, for that the most important aspect is to establish standards such as epub, which may not be the most efficient or versatile, but make it possible to buy and keep literature with some assurance that on future devices it is either directly readable or can be made so.

My opinion is that efficient display codes as you propose are not the real problem. I don't doubt anything you say about it, but I have serious reservations that it answers the right question.

However, as a display technology it may well have a place. I would add the proviso that if it can easily translate from epub and to epub then this would be a vital attribute in its acceptance, especially if it is as easy as you say to code with it.


To everyone else, please carefully consider the idea, though it be far removed from mobileread, of the separate problem of text repositories.

For my part, when time permits, I intend to look carefully at epub and try and make a version of TEI to fit it, and then write a small program to translate it one way and the other.

If this looks any good in the end, I will set up a site, make people aware of it here, at various text repositories I know of and of course the TEI consortium. However, it is at least a month before I can seriously sit down with it, and maybe not even then.

If anyone is in a better position to do the same thing, I will help in any way I can. Ideally if it works well, epub software might easily be adapted to display it as well and thus solve the problem in one blow.

Greg Schofield
Perth Australia
(An English High School teacher)

GregS
11-09-2007, 07:45 PM
bob_ninja I found reference to at least two when I first looked at this thread (somewhere). I looked at one http://www.sandroid.org/GutenMark/ which I have mistakenly referred to as Gutenberg Mark.

Having been dealing with Gutenberg texts for a good many years now, I wish I had found it earlier. For many purposes it looks really useful. I will certainly be using it.

I unfortunately did not look at the other software that was mentioned, and now cannot remember them. I will look again at the thread and if I can read the whole thing through (time is scarce) I will try and write a small summary.

But have a look at GutenMark, I have not downloaded it yet, but it seems to do all the most boring bits really well. I cannot say much more until I use it.

bowerbird
11-09-2007, 07:54 PM
ninja bob said:
> I started writing my own tool
> and would like to avoid reinventing the wheel.

bob, i highly recommend that you proceed...

i wrote such a tool, so i can tell you that it was
one of the greatest programming experiences
that i've had in over 25 years of for/next loops.

it's not an easy task, but if you have persistence
-- ok, "tenacity" might be a more accurate word --
the resistance _will_ crumble from your onslaught.

the secret formula is to cumulate your successes,
not to seek one magic bullet. that's all i can say...

on the other hand, if you just want the p.g. library
to be _consistent_, so you can then add value to it,
and you have no particular appetite for the challenge
of coding the application to produce that consistency,
just wait until i release my mirror of the p.g. library...

it will be a consistent, structured version of the library,
in a format that makes it easy to write the routines to
recognize the structural elements within each e-text...

as just one example, if a line is preceded by more than
three blank lines, that line is a header for a new section.
the header is terminated by two successive blank lines.
if there is one blank line within the header, you have a
two-part header (e.g., chapter x / the lobster quadrille).
so there's the pseudo-code for finding headers in z.m.l.
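
A minimal Python sketch of that header rule, for anyone who wants to try it. This is not bowerbird's code; the function name find_headers and the decision to return each header as a list of its parts are assumptions made for the illustration. It also assumes blank lines are empty or whitespace-only, and it does not treat the very first line of a file as a header.

```python
def find_headers(text):
    """Return headers found by the blank-line rule; each header is a list of parts."""
    lines = text.split("\n")
    headers = []
    blank_run = 0  # number of consecutive blank lines seen so far
    i = 0
    while i < len(lines):
        if lines[i].strip() == "":
            blank_run += 1
            i += 1
            continue
        if blank_run > 3:
            # This line starts a header. Read until two successive blank lines.
            parts, current = [], []
            while i < len(lines):
                if lines[i].strip() == "":
                    if i + 1 >= len(lines) or lines[i + 1].strip() == "":
                        break  # two blank lines in a row (or end of file) end the header
                    # A single blank line separates the parts of a multi-part header.
                    parts.append(" ".join(current))
                    current = []
                else:
                    current.append(lines[i].strip())
                i += 1
            if current:
                parts.append(" ".join(current))
            headers.append(parts)
            blank_run = 0
            continue
        blank_run = 0
        i += 1
    return headers

sample = ("title line\n\n\n\n\nchapter x\n\nthe lobster quadrille\n\n\n"
          "body text here\nmore body\n\n\n\n\nchapter xi\n\n\nmore text")
print(find_headers(sample))
```

With the sample above, the first header comes back in two parts (chapter x / the lobster quadrille) and the second as a single part.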

-bowerbird

kovidgoyal
11-09-2007, 08:23 PM
bowerbird said:
> just wait until i release my mirror of the p.g. library...


You wouldn't happen to have a timeline on this would you.

bowerbird
11-09-2007, 09:04 PM
> a timeline

i sure do. when the mirror is available, i'll let people know.

-bowerbird

bowerbird
11-09-2007, 09:16 PM
greg said:
> I do not understand this at all.

i'm sorry. but that's ok. :+)


> I am only suggesting a lightweight standard
> no more difficult than Xhtml or epub

we differ on what constitutes "lightweight".


> no more difficult than Xhtml or epub

let me tell you the way i am framing this matter...

actually, let me just send you to the place directly:
> http://pgdp.net

that's the website for _distributed_proofreaders_.

they are the _volunteers_ who actually _digitize_
most of the project gutenberg e-texts these days.

they scan books, or find scan-sets from elsewhere,
do o.c.r. (or get that from elsewhere too), and then
subject the o.c.r. results to proofing and formatting
rounds administered right there in a web-based system.
after which results are assembled by a "postprocessor"
into the files that are submitted to and posted by p.g.

unless you've got an idea about somebody else doing it,
_these_ are the _volunteers_ who would do your markup.

i emphasized that these people are _volunteers_ because
they aren't getting paid to do this. they walk in the door
and are put to work. they don't necessarily have training.
they're motivated to work, but you can't order 'em around.
you can't force 'em to do something they don't want to do.

although a technoid faction has tried to entice them into
doing markup in tei-light, it's been some very slow going.
they spent about 6 years in various stages of "planning".
(yes, you did indeed read that right, i said _six_years_...)
over the last year, a few people did some 200 .tei e-texts.

the thought of treating the backlog of 15,000+ e-texts
hasn't even considered moving off of the back-burner...
(p.g.'s count is much higher, but they have _duplicates_
and an ever-increasing percentage of non-book items,
most recently .mp3 audio-books done by librivox.org.)

the person who developed the brand of .tei they're using
-- pgtei -- isn't interested in building any tools for them,
so there hasn't been much interest from the volunteers...

that's the situation as it exists today. that's how i frame it.
if you frame it differently, we won't understand each other...
perhaps you still consider this to be merely "hypothetical"?
that's not a bad thing. but there's a reality that's here now.


> As for future proofing, you seem to miss the point altogether.

i'm sorry. but that's ok... :+)


> I have taken at least seven or eight Shakespeare plays from Gutenberg,
> turned them into word processing documents, cleaned them and
> then through stylesheets reedited them and finally after a lot of effort
> (plays are really hard to do compared to novels) produce a pdf
> to print out copies for my students.

ok, now _that_ i understand. perfectly well.

my intention is to take those same plays -- and everything else --
from project gutenberg, run them through my app to convert them
into z.m.l., after which they can be printed out to .pdf immediately.

in other words, my converter-program does all the clean-up and
formatting work that took "a lot of effort" for you to do manually.

you saw -- from gutenmark -- how such a program can save time.
i intend for my program to be even better than gutenmark at that.

(and i _sincerely_ wish ron burkey was still developing gutenmark,
because i believe a healthy competition between us would be fun,
and push the state-of-the-art to a high level good for all of us...)


> In short, though the tools are wrong for the purpose I have
> been doing just sort of thing because I had no choice -
> but the end result did not give anyone else useful.

so you're aware of the same thing confronting the d.p. people,
that there are no tools which can help you do what you want...

i'm sorry about that.

and i don't have any suggestions for you, either.

except maybe to consider _why_ you don't have any tools, and
how that ramifies on the wisdom of the path you're choosing...

i _do_ think you are too hard on yourself when you say
"the end result did not give anyone else useful", because
your students _did_ get those plays in .pdf format, right?
(but you might want to consider rewriting that sentence.)


> I could just as well have been properly marking up the play
> (using TEI derived tags) and placed back in Gutenberg
> something useful to others.

i'm not altogether sure how "useful" a .tei file is these days.
but i'm sure you'll inform me that it _will_be_ useful later on.


> I have dealt with HTML and text versions of a variety of literature.
> For some purposes just being in HTML makes things very easy,
> but it also can make things a lot harder as well.

if you care to expand on that, i'm sure i would find it interesting. :+)


> Hence I may in the future for whatever reason,
> desire to quickly retrieve Chapter Seven of "Pride and Prejudice"
> how can this be done unless the computer has
> a means of finding exactly what I asked it to find?

well, in z.m.l. it would be simple for you to specify such a request...

i just made a post telling ninja bob how to find headers in a z.m.l. file.
so he'd point his program at pride and prejudice, and fish out chapter 7,
i.e., the chapter heading that said "chapter 7", or -- failing that -- just "7".
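
A small Python sketch of that kind of lookup, under the header rule described earlier in the thread (a header is preceded by more than three blank lines and terminated by two). The function name get_section is hypothetical, and the sketch assumes blank lines are truly empty, with no stray spaces.

```python
import re

def get_section(text, candidates):
    """Return the body of the first section whose heading matches a candidate.

    Candidates are tried in order, so ["chapter 7", "7"] prefers the full
    heading and falls back to the bare number.
    """
    sections = []
    # Four or more blank lines in a row means five or more newlines.
    for chunk in re.split(r"\n{5,}", text):
        # Two blank lines (three newlines) terminate the header.
        header, _, body = chunk.partition("\n\n\n")
        first_part = header.split("\n")[0].strip().lower()
        sections.append((first_part, body.strip("\n")))
    for wanted in candidates:
        for first_part, body in sections:
            if first_part == wanted.lower():
                return body
    return None

book = ("pride and prejudice\n\n\n\n\nchapter 6\n\n\nsix body\n\n\n\n\n"
        "chapter 7\n\n\nseven body\nline two\n\n\n\n\nchapter 8\n\n\neight body")
print(get_section(book, ["chapter 7", "7"]))
```

The match is on the exact heading text, so a book that spells its headings "chapter vii" would need a different candidate list.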


> These people have not been wasting their time,
> their precision is not useless but vital,
> and their knowledge (a part of which resides
> in the very code of TEI) cannot be ignored.

i don't ignore their knowledge.
i don't say they're "wasting their time".
and i certainly don't say their precision is "useless".
i think it would be _very_ useful. if only we could afford it...


> And I repeat an ultra-light version of TEI
> need be no more difficult than what we are already using

well, "we" -- as in you and i -- aren't "already using" the same thing.

.tei-light (or your "ultra-light") might not be more "difficult" than what
_you_ are already using, but it's _way_ more difficult than what i'm using.

but i'm not your target-audience anyway...

and neither are the people here at mobileread.

your target-audience is the volunteers over at distributed proofreaders.

and they don't even need to be "convinced" -- they already _agree_,
or at least they haven't mounted an outright revolt against pgtei --
but in order for them to actually start doing .tei, they need _tools_...

so if you want them to act, find some tools, and they'll be very happy...

if you don't _have_ any tools, you're just crying out in the wilderness...


> For me this just places it amongst display technologies,
> which is no bad place to be. The problems of text repositories
> is a different thing altogether.

cost-benefit ratio. that's all i can say: cost-benefit ratio.


> A huge labour but one that can accumulate
> over time in a systematic and reliable way.

go build such a library, and prove it has a superior cost-benefit ratio.

i predict you'll go broke before you ever start returning any benefits...
prove me wrong.


> I have serious reservations that it answers the right question.

i say that you're answering the wrong question, and
you're saying that i'm answering the wrong question.

that's what makes a horse-race.


> However, as a display technology it may well have a place.

thanks and all, but this horse-race is not for second-place.
it's to see who can _win_. i tell you my horse is gonna win.
your horse? certainly won't finish, and might not even get
out of the gate in the first place. so, does my saying that
give you motivation to prove me wrong? fine. then do it...


> I would add the proviso that if it can easily translate
> from epub and to epub then this would be a vital attribute
> in its acceptance, especially if it is as easy as you say to code with it.

i'm not even gonna turn on that capacity. i don't have to.
plus doing so would only help to help out _your_horse_...
do your own markup. it's important that you feel its pain.
that cost is why your cost-benefit ratio will never be worth it.


> For my part, when time permits, I intend to look carefully
> at epub and try and make a version of TEI to fit it, and then
> write a small program to translate it one way and the other.

there are people doing that, if you don't want to duplicate effort.
they might love to have your help. do the research to find them.
of course, if you want to do it yourself, go ahead. i always do...
(however, i still do the research, because that's the smart thing.)

-bowerbird

p.s. if your target is project gutenberg, you should
_not_ develop your own "ultra-light" .tei, because
the guy who built their pgtei (who is their webmaster)
is _extremely_ protective of it, and will _attack_ you
if you try to suggest any alternative. (because, after all,
he's spent all these years developing it, so of course
he'll always believe he knows more about .tei than you.
he's a nasty character. don't cross him unless you dare.)

kovidgoyal
11-09-2007, 10:02 PM
bowerbird said:
> > a timeline
>
> i sure do. when the mirror is available, i'll let people know.
>
> -bowerbird

I meant when do you expect to be able to make the p.g. mirror available to the public, as in a date, or a length of time. If you dont have an estimate, that's fine too :-)

GregS
11-09-2007, 10:59 PM
bowerbird, many thanks for the reference to PGTEI, which I had somehow missed (not surprising, I miss a lot of things).

It looks like much of the most important stuff has been done. This has been a great reference for me, and although you disagree with the direction, it is just what I was hoping for.

Given what seems to be there, a subset definition (which requires no change to PGTEI) especially for epub, or whatever, is not hard to do, and a script to produce it is not fundamentally difficult.

I have a little long term project of my own all about marking up text graphically, and have done just a little preliminary coding for it. I am waiting on the release of REBOL 3 to move that particular project forward; I cannot make even vague promises about it though.

Suffice it to say, the program tool/application problem is the biggy. However, it does not necessarily mean big applications to solve it, but a slightly different approach to how tags are in fact applied.

In short you have made my day.

In terms of your own project, I am still very hazy. I will later today or tomorrow go through the thread carefully, because you have given examples, and see what I can make of them.

bob_ninja
11-09-2007, 11:55 PM
Greg,
Thanks for the link. I examined its description. I actually have much more modest goals to simply adjust text format content and not upgrade it to HTML or any other richer format. I just want to get a better screen use for smaller reader screens.

For instance, I want to remove the annoying end-of-line markers that break up a paragraph into many segments and cause my reader to waste a lot of space:

http://img.villagephotos.com/p/2007-11/1286065/100_9839.jpg

After the line markers are removed:

http://img.villagephotos.com/p/2007-11/1286065/SCAN0031.JPG

Now the screen is filled and used very well, hence less scrolling.
I'll check the requirements list from the initial post and try to add more options. I'll post it as freeware in a new post.

bowerbird,
I use Java which has regular expressions capability. I plan to simply build some search/replace regexp patterns and allow a user to enter his/her own to customize according to individual preferences. Shouldn't be too bad. Actually most of the work is for interfaces, GUI and CLI.
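
The rule-list approach described here can be sketched in a few lines. This example is in Python rather than Java, and the names DEFAULT_RULES and apply_rules, as well as the two default patterns, are assumptions for illustration only, not the tool being written.

```python
import re

# (pattern, replacement) pairs, applied in order. Hypothetical defaults.
DEFAULT_RULES = [
    (r"(?<!\n)\n(?!\n)", " "),  # a lone newline inside a paragraph becomes a space
    (r"[ \t]{2,}", " "),        # collapse runs of spaces or tabs
]

def apply_rules(text, rules=DEFAULT_RULES):
    """Apply each search/replace rule to the whole text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

wrapped = "It is a truth\nuniversally  acknowledged\n\nnext para\nhere"
print(apply_rules(wrapped))

# A user-supplied rule list, e.g. stripping _underscore_ emphasis:
print(apply_rules("a _very_ good idea", [(r"_(\w+)_", r"\1")]))
```

Note that the first default rule joins every single line-break, so verse, tables, and similar material get run together as well.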

GregS
11-10-2007, 12:41 AM
bob_ninja no worries and thanks.

I habitually use OpenOffice to clean up text, more for general convenience because its layout tools are reasonable for my purposes and its pdf creation reliable.

I always forget the correct grep code for end of line markers, and have to experiment a little each time in find/replace to get it right (I swear I always seem to forget the most used things - it is an annoying habit).

Hopefully in the not too distant future we may see a plethora of light weight gui tools that really do make lots of little jobs much much easier to perform.

bowerbird
11-10-2007, 12:41 AM
ninja bob, when all you want to do is unwrap the hard line-breaks, try this:
> http://z-m-l.com/unwrap.pl

it works, for the most part, except it _also_ unwraps tables, poetry, and
other things which should _not_ be unwrapped. this is because of one of
the _biggest_ problems with the project gutenberg e-texts, namely that
these lines which should not be unwrapped are not unequivocally marked.

so one of the changes that i make when i convert a p.g. e-text to z.m.l. is to
_detect_ these lines, and then _mark_ them by giving them a leading space.
later, my unwrapping routines for z.m.l. _respect_ a leading space in a line
as a signal that that line should not be unwrapped. mission accomplished...
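
A minimal Python sketch of unwrapping that respects the leading-space convention. This is not the unwrap.pl script linked above or the z.m.l. routines themselves; the function name unwrap is hypothetical, and the sketch assumes paragraphs are separated by blank lines.

```python
def unwrap(text):
    """Join hard-wrapped paragraph lines, leaving lines with a leading space alone."""
    out = []
    paragraph = []

    def flush():
        # Emit the paragraph collected so far as one long line.
        if paragraph:
            out.append(" ".join(paragraph))
            paragraph.clear()

    for line in text.split("\n"):
        if line.strip() == "":
            flush()
            out.append("")    # keep the blank line between paragraphs
        elif line.startswith(" "):
            flush()
            out.append(line)  # marked line (verse, table row): keep as-is
        else:
            paragraph.append(line.strip())
    flush()
    return "\n".join(out)

sample = ("this is a\nwrapped paragraph.\n\n a line of verse\n another line\n\n"
          "next para\ncontinues")
print(unwrap(sample))
```

The hard part, as the post says, is deciding which lines get the leading space in the first place; this sketch only covers the easy half.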

this is just one example of one change that needs to be done to a p.g. e-text
in order to make it more functional. z.m.l. as a whole is a _collection_ of
_all_ of those changes my focused research has deemed to be necessary.

some people will tell you a _human_ has to go through the e-text to "decide"
which lines should be marked as immune from rewrapping, that the decision
takes human intelligence, and cannot be programmed into a computer. well,
i won't tell you that my routines never make any mistakes, because they do.
but i _can_ inform you that they make _most_ of the decisions correctly, and
that's because i worked, and worked some more, then worked even _more_,
so that they _would_ make most of the decisions correctly.

so one part of the better functionality my z.m.l. mirror will give developers
will be the ability to unwrap the text at will, without introducing problems...

-bowerbird

bowerbird
11-10-2007, 12:58 AM
> I meant when do you expect to be able to make the p.g. mirror available to the public,
> as in a date, or a length of time. If you dont have an estimate, that's fine too :-)

yeah, i know that's what you meant. and that's why i answered like i did...

i learned a long time ago not to make estimates. don't expect it until it actually arrives...

having said that, i'll also say "it depends". (which, yeah, isn't any more informative.)

the body of each e-text is pretty much already in z.m.l. format.
to the extent that it's not, the changes are pretty much automatic.
if that was all i was concerned about, i could do it in a week or two.

the problem area for each e-text is the front-matter: the title-page,
table of contents, dedication, list of illustrations, all that type of stuff.
what i _want_ to do is edit all of that to an extremely high standard...

but it's pretty slow going. even at 5 minutes per e-text, that's 12/hour,
or about 100 for an 8-hour day. and slackers like me don't work 8-hour days.
so when you've got 15,000 of the suckers, even 5 minutes per adds up.
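
Spelled out, the arithmetic looks like this. The figures are the ones given in the post (5 minutes per e-text, 15,000 e-texts, an 8-hour day); the variable names are just for illustration.

```python
MINUTES_PER_ETEXT = 5
ETEXTS = 15_000
HOURS_PER_DAY = 8

per_hour = 60 // MINUTES_PER_ETEXT             # 12 e-texts per hour
per_day = per_hour * HOURS_PER_DAY             # 96 e-texts per 8-hour day
total_hours = ETEXTS * MINUTES_PER_ETEXT / 60  # 1250.0 hours
total_days = total_hours / HOURS_PER_DAY       # 156.25 eight-hour days

print(per_hour, per_day, total_hours, total_days)
```

So the front-matter alone would come to roughly 1,250 hours, or about 156 full working days.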

eventually, after enough time goes on and i keep avoiding this task,
i'll undoubtedly drop my desire to hit that high standard, and go with
something more quick-and-dirty. i've noticed that hadrien settles for
the title and the author on the title-page and then jumps into the book.
if i did that, i could pull the info from the catalog, and it'd be very quick.
if i decided to try and write some code to rework the front-matter that
is actually present in each e-text, then that might or might not be quick,
depending on how well the programming went. could even be very slow.
i can't even do an estimate on that until i've hand-edited enough e-texts
to get a handle on what the typical edits are, and how to automate 'em...

i've also considered building a wiki and asking the public to go at it...

so, depending on how all of this shakes out, it could be relatively soon,
or it could drag on for a little while, or it could drag on for a long while.

but i certainly don't advise that anyone hold their breath waiting for it...

indeed, don't even _expect_ it until it has actually arrived...

-bowerbird

bowerbird
11-10-2007, 01:07 AM
greg said:
> However, it does not necessarily mean big applications to solve it,
> but a slightly different approach to how tags are in fact applied.

ok.

i'm not sure _exactly_ what that means, but ok... :+)

do keep in mind that, in the real-world of project gutenberg today,
the tags are being applied by distributed proofreader volunteers...

now maybe you have something completely different in mind...

but if in your mind those volunteers would be applying .tei tags,
then you really need to go over and introduce yourself to them.
they pride themselves on being very friendly -- they'll tell you
that over and over and over -- but another truth is that they
don't take kindly to strangers telling them how to do their job.
so you will need to bow down and ingratiate yourself to them
before you should even whisper a suggestion about what to do.
_especially_ about .tei, because it's been a "plan" for so long...


> In short you have made my day.

great.

not everyone else feels the same, but
that's the way life is in the honest lane.

glad i could be of some help...

-bowerbird

kovidgoyal
11-10-2007, 01:08 AM
Doesn't gutenmark already handle front matter? You could just lift the routines from there.

bowerbird
11-10-2007, 01:23 AM
> Doesn't gutenmark already handle front matter?

we have different definitions of "handle".

the title-page in a z.m.l. file is highly structured,
because its info is collected into a library catalog.

many of the other parts of z.m.l. front-matter are
expected to conform to a certain framework too...

on the other hand, front-matter in p.g. e-texts is
probably _the_ most wildly inconsistent element
in the entire catalog, which is not surprising when
you consider that it is coming from a wide range
of different publishers, so taming it is difficult...

-bowerbird

GregS
11-10-2007, 02:13 AM
bowerbird No worries, I am not about to burst in and tell people what to do. I will watch for a while, make a few small suggestions, and if other things work out, maybe present them with some tools to make life easier.

The thing is what they are doing is almost exactly what I had in mind. There is no telling them what they should be doing, though I might have something to say on IDs and xml-namespace use.

I use Gutenberg a lot, but I usually go there to find something specific, I totally missed the tei side of things. Ie I usually get to the text without dallying elsewhere on the site (I have been using it for well over four years).

You have been of genuine help, and I thank you. I am still confused over Z.M.L. If you like, under a promise of strict confidentiality, maybe you could send something to my email address and I could discuss it privately with you. I am intrigued, and I don't doubt the efficiencies you claim. The reference I previously followed was an error (server problems probably). I am however no fan of python, and I don't see myself ever using it again (too many special syntaxes).

bowerbird
11-10-2007, 02:27 AM
greg said:
> You have been of genuine help, and I thank you.

that's what dialog is all about.

it can only happen if both sides hold up their half,
so you deserve as much credit as you're giving me.

another note, if you wanna mush pgtei and epub together, is
that some of the personalities in those various camps do not
get along particularly well with each other, just so you know.
the only thing they seem to agree on is they both hate me... :+)

there are other people bridging the gap, so it's not a gulf
which is impassable, but feel your way carefully at first...

-bowerbird

p.s. i'm using perl (*.pl). that's just for web stuff though.
my fav language is basic, specifically the realbasic compiler.

GregS
11-10-2007, 04:32 AM
Whoops I misread .pl as .py, sorry for the confusion (ps. if I find Python horrible, you can imagine how I feel about Perl).

I very much see epub as a handy standard for eink displays etc., much better than the pig's breakfast of different formats, and at least predictable enough for translation to other things.

Epub can't possibly do what TEI can, but it also does not have the overheads, everything is a trade-off.

We do need a standard, it does not have to be incredibly good, but it will do for ebook reading - epub fits the bill, it is more important at this time to get the HW vendors all on the same track in order to pull publishers and public together.

Anyhow, it leaves plenty of room for other solutions. At the moment I see epub as an easy route for traditional publishers to move into the digital market, without a steep learning curve and able to consolidate sales in a single format. I see TEI as having a key role in text repositories of all kinds (text in the sense of literature). However, this does not stop other forms being adapted: there are things badly displayed by HTML that may well benefit from Z.M.L. If it is easy to code in, compact and efficient, that could really be a killer in many areas being badly served by HTML. Just a suggestion.

HarryT
11-10-2007, 09:23 AM
bowerbird said:
> kovidgoyal, i have substantial replies to your previous posts,
> which i would like to post, but i don't want to _monopolize_
> the conversation here. i'd like to give other people a chance.
> when two people overtake a thread, it can get boring fast...


Go ahead and post; even though it's Kovid who asked the questions, I'm sure that the answers will be of interest to others. Anyone who doesn't wish to read this thread can simply ignore it.

If your response is a personal reply to Kovid only, and of no relevance to anyone else, using PMs is probably best.

[Moderator]

bowerbird
11-10-2007, 03:24 PM
greg said:
> ps. if I find Python horrible, you can imagine how I feel about Perl).

to me, they're all just tools. i had to go to a scripting language because
realbasic only creates offline apps, and i wanted to do web equivalents...

i use only the most basic of programming concepts -- mostly for/next
and split routines -- so any language is capable of doing what i want...
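
As a rough illustration of how far "for/next and split routines" can go, here is a minimal Python sketch that splits a plain-text file into paragraphs on blank lines and rejoins each paragraph's hard-wrapped lines. The function name `unwrap_paragraphs` and the sample text are made up for this example; it is not taken from any of the tools discussed in this thread.

```python
def unwrap_paragraphs(text):
    """Split on blank lines, then rejoin each paragraph's hard-wrapped lines."""
    paragraphs = []
    for block in text.split("\n\n"):
        # Break the block into its physical lines and trim each one.
        lines = [line.strip() for line in block.split("\n")]
        # Glue the non-empty lines back together with single spaces.
        joined = " ".join(line for line in lines if line)
        if joined:
            paragraphs.append(joined)
    return paragraphs


# Hypothetical sample: two paragraphs with forced line breaks.
sample = "It was the best of times,\nit was the worst of times.\n\nA second\nparagraph."

for number, paragraph in enumerate(unwrap_paragraphs(sample), start=1):
    print(number, paragraph)
```

Anything that relies on line breaks for meaning (verse, tables, the legal header) would need to be recognised and skipped before a pass like this.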


> We do need a standard, it does not have to be incredibly good,
> but it will do for ebook reading - epub fits the bill, it is
> more important at this time to get the HW vendors all on
> the same track in order to pull publishers and public together.

i think those are all shibboleths. but there's no need to discuss that.

***

harry said:
> Go ahead and post

i did, eventually, after giving other people some breathing room... :+)

-bowerbird

GregS
11-10-2007, 09:29 PM
bowerbird "to me, they're all just tools. i had to go to a scripting language because
realbasic only creates offline apps, and i wanted to do web equivalents..."

I understand this position. However, the potential for interpretive scripting to give users control over their digital lives, while off-topic in this thread, is worth considering from other points of view. In that context at least, the particular nature of each language is important. The fact is that if you put the effort in, all things appear equal, but elegance, consistency, ease of learning and editing transparency become critical if that potential is ever to be realised. Have a look at Lua and REBOL, which I consider new generation languages (Python is distinctly old generation, and as for Perl - never mind).

I suggest this as another possibility for your own project.

bowerbird
11-11-2007, 02:52 AM
greg-

thanks for the suggestion. :+)

but i'm too old a dog to be learning new languages.

youngsters familiar with a language can take my pseudocode
and churn out versions of my programs in a matter of weeks,
if not less. so i have no need to take on that burden...

-bowerbird

HarryT
11-11-2007, 06:53 AM
Bowerbird,

I think that many of us would like to see a practical example of what your tools can do to a PG book.

Could you please, using your tools, produce an HTML version of:

http://www.gutenberg.org/dirs/etext01/wrnpc12.txt

That's the plain text PG version of "War and Peace", and contains lots of interesting "structural elements" such as footnotes. I'd like to see what your tools make of it, if you'd be so kind.

Thank you,

kacir
11-11-2007, 08:53 AM
bowerbird,
I use Java which has regular expressions capability. I plan to simply build some search/replace regexp patterns and allow a user to enter his/her own to customize according to individual preferences. Shouldn't be too bad. Actually most of the work is for interfaces, GUI and CLI.

I personally use the Vim text editor for heavy-duty work with Regular Expressions.
Vim - www.Vim.org - has Regular Expressions that are more powerful and better documented than Regular Expressions in Perl.

Vim has many advantages over the use of a programming language with RE support. The most notable advantage is that playing with REs in Vim (or its graphical version Gvim) is interactive - you can try a replacement like this:
<esc>:%substitute/string_to_replace/replacement_string/g<enter>
Then you inspect the results, use Undo if needed and try again.
If it works as intended you simply create a Vim macro with that substitute command.
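
For comparison, the "built-in patterns plus user-supplied patterns" idea quoted above can be sketched in a few lines of Python with the standard `re` module. The rule list, the function name `apply_rules` and the sample text are all assumptions made for this illustration, not anyone's actual tool.

```python
import re

# Hypothetical default rules: (pattern, replacement) pairs applied in order.
DEFAULT_RULES = [
    (r"[ \t]+$", ""),                # strip trailing spaces and tabs
    (r"\n{3,}", "\n\n"),             # collapse runs of blank lines
    (r"_([^_\n]+)_", r"<i>\1</i>"),  # _word_ emphasis to HTML italics
]


def apply_rules(text, user_rules=()):
    """Apply the default rules, then any user-supplied ones, in order."""
    for pattern, replacement in list(DEFAULT_RULES) + list(user_rules):
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text


sample = "He was _very_ tired.   \n\n\n\nNext paragraph."
cleaned = apply_rules(sample, user_rules=[(r"\btired\b", "weary")])
print(cleaned)
```

Because `apply_rules` returns a new string and leaves `sample` untouched, you can inspect the result and simply rerun with a different pattern, which is a crude stand-in for the try, inspect, undo cycle in Vim.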

Please send a personal message to me if you are interested in using Vim.
It does not matter what operating system you use, there is a version of Vim for it. Even if you use such an obscure and/or specialized OS as Minix, QNX, NetBSD, Amiga OS (yes, the operating system for Commodore Amiga), or even Vista :D


There is also a very nice formatter for text called par.
See http://vim.wikia.com/wiki/Awesome_text_formatter for a description.

bowerbird
11-11-2007, 12:55 PM
harry, my voice is filtered here, so i've answered you here: http://z-m-l.com/mr/harry_wants_an_example.zml

HarryT
11-11-2007, 01:24 PM
bowerbird said:
> harry, my voice is filtered here, so i've answered you here: http://z-m-l.com/mr/harry_wants_an_example.zml

Thank you for that reply. You ask "why War and Peace"? It's not because of its length, but because it makes rather complex use of footnotes; it's by far the most technically complex book that I've done a "hand" conversion of. It's because I've processed it "by hand" and am very familiar with the difficulties it poses that I'd be extremely interested to see what can be made of it by automated processing tools.

bowerbird
11-11-2007, 01:54 PM
i can handle the footnotes automatically. i'll convert it if you'll do quality-control on it.

HarryT
11-12-2007, 03:11 AM
I'd be happy to offer comments on it.

bowerbird
11-12-2007, 04:22 AM
"offering comments" isn't quite the same as doing a quality-control pass. which will you do?

HarryT
11-12-2007, 04:25 AM
I'll do whatever you'd like me to do, within reason. Tell me what you'd like! As I say, I'm very interested to see how far automated processing can go with a "real world" example such as this.

HarryT
11-12-2007, 05:50 AM
I must congratulate you - that's pretty impressive! Not quite perfect, but that just underlines the difficulty of machine recognition of the "meaning" of text.

The errors I've found are:

Footnotes 70, 71: The footnote just translates the first line of the verse, with the remaining lines being left in the text. The footnote should be the entire verse.

Ch 9-XIX: There's a "grid" of letters of the alphabet and equivalent numbers. All the lines should be left aligned; the bottom lines have been indented.

Footnote 133: Same as 70, 71 - the footnote is a translation of a verse, and only the first line of it has ended up as a footnote, with the rest left in the main text.

Also - and I don't know if this is deliberate or not - the table of contents appears to be duplicated at the start. Perhaps this is due to the division into "Books"?

These would be very minor issues to correct by hand. Excellent job!

Alexander Turcic
11-12-2007, 06:01 AM
bowerbird decided to leave. If you'd like to talk further about formatting PG texts, you're welcome to start a new thread.