Page numbers in ebooks for scholarly research?


jbenny
11-06-2007, 02:32 PM
Something that Panurge brought up in another thread needs further discussion. It was getting lost in the other thread, so I am copying the relevant parts here, in a new thread (see below).

Although the average ebook reader may not care about Panurge's question, remember that ebooks aren't just for entertainment. As more of the world's libraries become digitized, ebooks will be used by professionals, as well as casual readers. In fact, the digitized version will make it easier for everyone to access books that would otherwise be difficult or impossible to obtain.

If you have a suggestion on how to handle this issue in an ebook, without also intruding too much on the casual reader, please speak up. Also, please let's not wander off into left field about irrelevant topics (yes bowerbird, I mean you). This discussion is about using currently popular ebook formats, not your zml. I would appreciate it if you would limit your suggestions appropriately.

My responses to Panurge were focused on XHTML as it is used in epub, but if you have a suggestion on how to do things in Mobipocket, LIT, PDF, LRF, or other popular ebook formats, suggest away.

-------------------

Panurge:

> 14. don't put pagenumbers inside the text/paragraphs.

For the casual reader, this may not be an important point, but for someone who publishes scholarly texts, which require documentation, it is. The page numbers of the original text do matter, as does the exact text that lies between them.

[snip]

But for scholars, who have to show in their footnotes where to locate the authority for the text they cite, a lack of representation of the pagination of the original renders the e-text useless.

[snip]

At the same time, we who are scholars have to decide whether the original print text-source is what we're going to refer to, or the e-text facsimile. If the latter, do we regard it as a new edition or as a faithful representation of the print copy? If we don't account for these needs in our re-encoding now, we'll simply have to redo the e-texts in the future if we expect electronic texts to gain much of a foothold in the world of scholarship and education.
-------------------

jbenny:

You bring up a very valid point that most of us don't think of (me included). Can you suggest a way to handle this without having the page numbers in-line with the text? Most of us would find the visible page numbers too obnoxious.

[snip]

For XHTML markup, one thing that comes to mind (just off the top of my head) would be to enclose all the text that makes up an original page with a surrounding tag that uses the "id" attribute to hold the page number.
-------------------

jbenny:

http://www.mobileread.com/forums/attachment.php?attachmentid=7041&d=1194151963

The content is totally bogus. I just made it up for this test. I used a <span> tag to mark the beginning few words of each page. Since a physical page is likely to fall mid-sentence, you can't use a block-level tag like <div>. Well, you could, but that would also break a sentence in the ebook, which is not what you want.
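A minimal sketch of the span approach described above, written in Python so the markup can be generated and checked. The class name `pagestart`, the `page-` id prefix and the function name are assumptions made for illustration; they are not taken from the attached test file. The prefix is there because an XHTML id cannot begin with a digit.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_page_starts(pages, words=3):
    """Join page texts into one flow, wrapping the first few words of each
    original page in an inline <span> whose id records the page number.

    `pages` is a list of (page_label, plain_text) tuples in reading order.
    An inline span is used so a page that starts mid-sentence does not
    break the sentence in the ebook.
    """
    parts = []
    for label, text in pages:
        tokens = text.split(" ")
        head = " ".join(tokens[:words])
        tail = " ".join(tokens[words:])
        span = "<span class=\"pagestart\" id=%s>%s</span>" % (
            quoteattr("page-%s" % label),
            escape(head),
        )
        parts.append(span + (" " + escape(tail) if tail else ""))
    return " ".join(parts)


if __name__ == "__main__":
    demo = [(12, "ends a sentence."), (13, "A new page starts here")]
    print(mark_page_starts(demo, words=2))
```

Whether a given reader lets you jump to such an id, or style the span, depends entirely on the reader; the markup only records where each page began.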

[snip]

This is far from an ideal method, but it was the first thing that I tried. Perhaps someone has a better suggestion? How to delimit the page breaks for those who need them, while not being in-your-face for the average ebook reader? In a web browser, some javascript could make this a lot easier. However, I don't know of any ebook readers that do javascript (not counting PDAs).
-------------------

Panurge:

Some such solution might satisfy everyone. Current scholarly journal databases such as Project Muse give the page numbers in square brackets within the text--an "ugly" solution, I suppose, but a simple one. JSTOR, the dominant archive of scholarly journals takes a different tack. It uses searchable PDF files and presents a scanned graphic representation of the original journal page, so the pagination problem is not an issue. However, the downloaded PDFs don't look all that great on the Sony Reader, though they are usable.
-------------------

jbenny:

Although neither is ideal, both methods could easily be done in an epub ebook. The first would be very simple, but "ugly" as you say. Including a scanned image of each page (PDF, PNG, JPG, etc.) that is linked from the XHTML text is also possible. This would of course make the epub much larger and more work to construct.

I haven't had the time to think about other ways to do this, but there is probably a good way to do this strictly in XHTML, without having to include scans or put visible page numbers in the text. Perhaps someone else can suggest something?

kovidgoyal
11-06-2007, 02:36 PM
Just to make sure I understand the issue. You want a way of locating an arbitrary object in an ebook file (it could be a sentence, a table, a figure etc) unambiguously.

Now we come to the question of resolution. What is the smallest object you are satisfied with being able to reference? A paragraph? A page at some rendering resolution? Note that using page numbers from printed versions is not good enough, as in the future there may not be printed versions.

EDIT: An example from physics research articles. A resolution of sections is usually sufficient. i.e. people refer to section so-and-so of paper so and so.
I don't know if that is sufficient resolution in general though.

jbenny
11-06-2007, 02:43 PM
Actually, Panurge was asking for something a bit simpler. He just wanted a way to correlate the text of an ebook to what page it originally came from in the scanned book. You see the problem of citing a reference in the conventional manner, which uses page numbers.

Going further and being able to reference any text or object in an ebook in a standard manner might be even better for some purposes.

The real issue is how to do any of this without having extra text or markup visible to the casual reader? As I suggested in my previous comments, using javascript would make this a lot easier, but ebook readers don't usually have that capability.

jbenny
11-06-2007, 02:45 PM
Your point about page numbers perhaps disappearing in the future is well taken and very likely. However, for quite a while, we would still have to deal with page numbers and references to them.

kovidgoyal
11-06-2007, 02:58 PM
Umm, is that really important? Assuming the scans are available online, if someone comes across a reference like "in some book on page 234" and wants to look it up, can't they just access the scanned book from, say, Google Books and find it there?

My point is that this is a rare usage scenario so perhaps the better solution is no solution. Just making scanned copies available freely online should be sufficient.

Now, addressing actual ways of doing this if you disagree with me: perhaps this should be left up to the reader apps, i.e. use some semantic tagging of content to indicate which page it comes from; then, if the user selects "reference mode" in the reader app, the reader should be responsible for displaying the reference information.

jbenny
11-06-2007, 03:25 PM
If indeed scans of the original are always available, then yes, this may not be an issue. However, that may be a big if. And what if the scans were available, but later were not?

And what of new content that is purely digital? How do you make reference to a particular passage in an ebook that is unambiguous? I guess you could say "chapter xx, paragraph yy, sentence zz", but that is cumbersome.

It probably would be better to let the reader software/hardware handle this. However, I don't know of any readers that do. That's why I was wondering about ways to do it within the limits of existing ebook formats.

As ebooks become more prevalent and possibly replace p-books, some standard method of dealing with references should be available. Perhaps this is something that should be addressed in a future standard (epub or otherwise). In fact, the current focus of ebooks seems to be exclusively casual reading. If ebooks are to be used to replace textbooks and other scholarly works, then the current standards need improving. I know that currently PDF is used to solve some of these problems, but we all know the problems of using PDF on other than full-screen devices.

bowerbird
11-06-2007, 03:36 PM
i've responded in the original thread.

first here:
> http://www.mobileread.com/forums/showthread.php?p=112217#post112217

then here:
> http://www.mobileread.com/forums/showthread.php?p=112451#post112451

and just recently, here:
> http://www.mobileread.com/forums/showthread.php?p=112962#post112962

i see no reason for a new thread, and won't repeat my posts here:

the links above are one example of solving the problem
in a digital environment, of course. but they only work
because the target-file had "i.d." references coded into it.
without such referents, it's difficult to attack this task...

however, there's no reason browsers can't be improved
with a simple mechanism that let you _link_ to a page,
adding a "search phrase" which the browser acted upon.

that is, you could link to this page:
> http://z-m-l.com/go/myant/myantp111.html
but also append the search-phrase after it:
> http://z-m-l.com/go/myant/myantp111.html?sp="don't see how"
and have the browser:
(1) load the page, and
(2) execute the search,
(3) locate you right around an intended spot.
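a rough sketch of those three steps in Python, assuming the page text has already been fetched. The `sp` parameter name comes from the example link above; the function name and the choice to return a character offset are assumptions, and this is not how any existing browser behaves.

```python
from urllib.parse import urlsplit, parse_qs


def locate(url, page_text):
    """Return the character offset of the ?sp= search phrase in page_text.

    Returns -1 if the link carries no search phrase or the phrase is not
    found. Surrounding double quotes in the phrase are dropped.
    """
    query = parse_qs(urlsplit(url).query)
    phrases = query.get("sp")
    if not phrases:
        return -1
    phrase = phrases[0].strip('"')
    return page_text.find(phrase)


if __name__ == "__main__":
    text = "i really don't see how that works"
    link = "http://z-m-l.com/go/myant/myantp111.html?sp=%22don%27t+see+how%22"
    print(locate(link, text))
```

a phrase that occurs more than once on the page would need extra context to disambiguate; this sketch simply takes the first match.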

indeed, it might even be child's play for a web-browser
plug-in programmer to code a plug-in to do this now...

the advantage -- that you could link to any phrase
on any page, even if the author of that page hadn't
coded any i.d. references -- is huge, seems to me...

otherwise, your system depends on other people
having done the work you wish them to have done,
and that's never gonna prove to be a tenable solution.

-bowerbird

NatCh
11-06-2007, 03:50 PM
I thought about this issue some years ago, because my wife is a literary scholar, so I tend to consider that side of things, even though it doesn't impact me directly. :shrug:

Going forward, if this e-book thing really does take off, some way of absolute reckoning within a text that isn't dependent on pages is going to have to emerge. For books where original or scanned files exist, page references will continue to work indefinitely, but they may or may not, depending on the method, for things which are never published physically.

My thought is probably a paragraph or line numbering approach, either from the beginning or from chapters or whatever other type of sectioning makes sense, would work well enough, but it would need to be present and the same in all versions of the e-publication. Preferably, it could be toggled on and off in the reading software. :wink:

I think, however, that this is yet another example of something that's dependent on a "standard" e-book format, and would need to be built into both files and viewing software in order to be at all viable.

Just my thoughts, salt to taste. :nice:

jbenny
11-06-2007, 03:59 PM
Natch, very good thoughts on this.

You are correct that this needs to be standardized in both the ebook format itself and the reading device or software. Yes, some way to toggle this information off and on would be very much desired, so as not to interfere with normal reading. I just don't see this happening without some additional functionality built into the reader, however. This needs to be addressed by standards as well.

Edit: I hope that some of the folks at IDPF are listening in and taking notes.

jbenny
11-06-2007, 04:04 PM
Just to follow up the comments on standards - although things like this would be best addressed by some future standard, I am still curious as to ways that this might be dealt with today, within existing standards and using popular ebook formats.

Patricia
11-06-2007, 04:05 PM
My experience is that while scientists often write short but pithy papers where section references are sufficient, things are different in the Humanities. Literature specialists have to refer to author, title and page and often want to use a particular scholarly edition. Philosophers do the same when referring to the work of contemporaries.

I teach a fair amount of Plato and Aristotle and find online texts are a problem. When referring to Plato, it is essential to use Stephanus numbers, which will identify any sentence in his entire oeuvre. These appear as marginal numbers and letters in most print versions in both English and Greek. But the numbers simply don't appear in the online versions of Plato (except for the Perseus Project version). So I can't recommend them to students and don't use online versions myself.

(This is why I've never uploaded a Plato dialogue: without the Stephanus numbers it is useless to me. But with them it is irritating to general readers.)

kovidgoyal
11-06-2007, 04:09 PM
As a practical matter I often find that unambiguous identifiers in the source *text* itself work "well enough". For instance, "the paragraph after figure 3" or the "introductory text of section X".

The problem with line numbering is that the number of lines depends on screen size/font size with a reflowable format. Paragraph numbers might work though assuming the referenced document semantically identifies paragraphs.

jbenny
11-06-2007, 04:20 PM
> My experience is that while scientists often write short but pithy papers where section references are sufficient, things are different in the Humanities. Literature specialists have to refer to author, title and page and often want to use a particular scholarly edition. Philosophers do the same when referring to the work of contemporaries.
>
> I teach a fair amount of Plato and Aristotle and find online texts are a problem. When referring to Plato, it is essential to use Stephanus numbers, which will identify any sentence in his entire oeuvre. These appear as marginal numbers and letters in most print versions in both English and Greek. But the numbers simply don't appear in the online versions of Plato (except for the Perseus Project version). So I can't recommend them to students and don't use online versions myself.
>
> (This is why I've never uploaded a Plato dialogue: without the Stephanus numbers it is useless to me. But with them it is irritating to general readers.)

In works like this, where the numbers are expected to be present, there is no reason that the online versions couldn't include them. An example epub that I posted in another thread showed one way to do paragraph numbers in a separate column. Something similar could probably be done with the Stephanus numbers (I'm sure there are other ways to do this as well).

As for the numbers being irritating to the general reader, they could easily be toggled with some javascript and an additional stylesheet in a web browser. You just couldn't toggle them in an ebook version, as far as I can see.

DaleDe
11-06-2007, 04:22 PM
> Your point about page numbers perhaps disappearing in the future is well taken and very likely. However, for quite a while, we would still have to deal with page numbers and references to them.

Page numbers are not enough. You also need more data to use a book in a bibliography or other scholarly reference. Different editions have different numbering and even a hardback vs. paperback have different numbering. Often the scanned book data is not detailed enough to define these differences. This is why, in business, where references are needed the sections and sometimes even the paragraphs are numbered. Page numbers are not really enough.

Dale

jbenny
11-06-2007, 04:30 PM
> The problem with line numbering is that the number of lines depends on screen size/font size with a reflowable format. Paragraph numbers might work though assuming the referenced document semantically identifies paragraphs.

Yes, line numbering isn't going to work for ebooks, which are reflowable. Maybe paragraph numbers would be fine-grained enough?

NatCh
11-06-2007, 04:32 PM
> As a practical matter I often find that unambiguous identifiers in the source *text* itself work "well enough". For instance, "the paragraph after figure 3" or the "introductory text of section X".

True, but as Patricia alluded to, the Humanities usually require more precise references, and rarely have images. The method you're describing would be ... problematic when sourcing a quote from, say, Great Expectations, for instance. :shrug:

> The problem with line numbering is that the number of lines depends on screen size/font size with a reflowable format. Paragraph numbers might work though assuming the referenced document semantically identifies paragraphs.

You make an excellent point. Line numbers may continue to be useful for things with short, fixed line lengths, such as poetry, but will probably be less useful for more flexible formats. They might could still work out if they were parsed from a fixed format, and applied wherever they showed up in the reflowed one. They usually only mark every fifth or tenth line, anyway, rather than putting an explicit number on each line. Still it's problematic.

Paragraph numbers would seem to be the better choice for text reference, as paragraph 33 will always be paragraph 33, no matter how many "screens" it takes to get to it or display it. :shrug: Most of the things that would need to be referenced this way would have identifiable paragraphs.

I suppose that it could be as simple as establishing an e-referencing standard such as formatting the e-text to an established page size, font and margin parameters, i.e. A4, 2.5 cm margins all around, in Arial 10 pt font. Then your references would be to the text in that "standard" formatting.

A cleverly programmed reading software could even internalize that parsing and report the "official" page number of the current page or a selected piece of text upon request. But we're talking not the currently designed hardware here, of course. Well, except the iLiad, it could likely be programmed to do this.

jbenny
11-06-2007, 04:34 PM
> Page numbers are not enough. You also need more data to use a book in a bibliography or other scholarly reference. Different editions have different numbering and even a hardback vs. paperback have different numbering. Often the scanned book data is not detailed enough to define these differences. This is why, in business, where references are needed the sections and sometimes even the paragraphs are numbered. Page numbers are not really enough.
>
> Dale

I was just responding to Panurge's question about this, which mentioned page numbers. Yes, for others, page numbers would not be fine-grained enough.

I think it is good to get input from people with different needs. This is valuable information for those who may want to try to provide an interim solution, as well as those who may set future standards.

DaleDe
11-06-2007, 04:35 PM
> Yes, line numbering isn't going to work for ebooks, which are reflowable. Maybe paragraph numbers would be fine-grained enough?

Certainly much better than page numbers. Likely better to include a chapter number and then paragraph number to keep the numbers a bit smaller.

Dale

jbenny
11-06-2007, 04:40 PM
> I suppose that it could be as simple as establishing an e-referencing standard such as formatting the e-text to an established page size, font and margin parameters, i.e. A4, 2.5 cm margins all around, in Arial 10 pt font. Then your references would be to the text in that "standard" formatting.


But, do we want to be tied to some paper-based standard of reference for ebooks? There should be some standard way of doing this without having to send the human reader off elsewhere to find another copy of the same work. Also, this would entail the creation and maintenance of two different copies, just for one ebook.

jbenny
11-06-2007, 04:41 PM
> Certainly much better than page numbers. Likely better to include a chapter number and then paragraph number to keep the numbers a bit smaller.
>
> Dale

Yes, I was thinking of paragraph numbers within chapters or sections. Anything else would be too cumbersome.
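As a rough illustration of paragraph numbers within chapters, here is a small Python sketch. It assumes the book is already split into chapters and paragraphs; the "chapter.paragraph" reference format and the function name are assumptions for illustration only, not part of any ebook standard.

```python
def number_paragraphs(chapters):
    """Map 'chapter.paragraph' references to paragraph text.

    `chapters` is a list of chapters, each a list of paragraph strings.
    Numbering restarts at 1 in every chapter, so the numbers stay small
    and do not depend on screen size or font size.
    """
    refs = {}
    for chapter_no, paragraphs in enumerate(chapters, start=1):
        for para_no, text in enumerate(paragraphs, start=1):
            refs["%d.%d" % (chapter_no, para_no)] = text
    return refs


if __name__ == "__main__":
    book = [["Opening paragraph.", "Second paragraph."], ["Chapter two begins."]]
    for ref, text in number_paragraphs(book).items():
        print(ref, text)
```

The references only stay stable if every version of the e-text splits paragraphs the same way, which is the point made earlier in the thread about needing the same numbering in all versions.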

NatCh
11-06-2007, 04:56 PM
> But, do we want to be tied to some paper-based standard of reference for ebooks? There should be some standard way of doing this without having to send the human reader off elsewhere to find another copy of the same work. Also, this would entail the creation and maintenance of two different copies, just for one ebook.

I was thinking of something more along the lines of:

copy text into Word (or whatever)
set page size as A4
set margins as 2.5 cm all around
set font as Arial 10 pt
find your passage (text search or whatever)
take your reference from here


Or even better:


select passage with stylus
click the "standard reference" button/menu item
use the reference the cleverly programmed reading software calculates for you based on the "standard" page size and format
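The second list above could be approximated along these lines. This is only a sketch in Python: it stands in for a real A4 / 2.5 cm / Arial 10 pt layout with a fixed number of characters per line and lines per page, and both default values are placeholders, not measurements. A proportional font cannot really be modeled by counting characters, so a real reader would have to do actual text layout.

```python
import textwrap


def standard_page(paragraphs, passage, chars_per_line=90, lines_per_page=55):
    """Return the 1-based "standard" page on which the paragraph that
    contains `passage` begins, or None if the passage is not found.

    Each paragraph is wrapped to `chars_per_line`, and one blank line is
    counted between paragraphs. Both layout numbers are assumed values.
    """
    lines_so_far = 0
    for para in paragraphs:
        if passage in para:
            return lines_so_far // lines_per_page + 1
        lines_so_far += len(textwrap.wrap(para, chars_per_line)) + 1
    return None


if __name__ == "__main__":
    text = ["aaaa bbbb cccc dddd", "find me here"]
    print(standard_page(text, "find me", chars_per_line=10, lines_per_page=2))
```

Because the result is computed on the fly from the text and the agreed layout, nothing extra has to be stored in the ebook file, but every reader would have to agree on exactly the same wrapping rules.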

bowerbird
11-06-2007, 04:57 PM
natch said:
> if this e-book thing really does take off,
> some way of absolute reckoning within a text that
> isn't dependent on pages is going to have to emerge
> For books where original or scanned files exist,
> page references will continue to work indefinitely,
> but they may or may not, depending on the method,
> for things which are never published physically.

we don't need to have a document "published physically"
to paginate it, so you can use pagenumber references.

a digital document _can_ be paginated (a la .pdf)...

indeed, i believe that it's incumbent on a text's author to
create a "canonical paginated version" for referencing...

but before all that, we have to recognize that it is vital
-- actually, _imperative_ -- that we have an "official"
version of every document, one located in cyberspace
from now until eternity (yeah, i know it's a long time),
in a form that's _never_ changed. (if you wanna edit it,
then the edited version becomes a _new_ document,
located at its own never-changing place in cyberspace,
and _that_version_ can never ever be edited either...)

if you don't insist on this, there's no way you can build
a system that will never break. it's simply _impossible_.
you can build ones that are robust, to varying degrees,
but you can't know _how_ robust, and some problems
will -- to large and also unknown degrees -- be invisible.
and that's unacceptable to everyone, except big brother.

there are lots of people who'll try to sell you snake-oil
which they purport will "solve" the problem. it won't...
and you would be a fool if you were to believe them...

you absolutely need to build the system on concrete...
with a "canonical version" of each and every document,
which is easily referenced by every person at any time.

also, for "scholarly" stuff, these texts will be embedded
in their own separate infrastructure. thus, for instance,
_every_single_article_ from jama -- the journal of the
american medical association -- will be "put together",
in the _exact_ same way that their bound volumes are
sitting right next to each other in your library stacks...
you will be able to click from the last february article to
the first march article, as if they were a seamless whole.

anyway...

the linebreaks and pagebreaks in each canonical version
will be the "official" ones... that will not mean that you
have to live with them; once you've got the digital text,
you can remix it to your delight. but your remix is not
"official", _only_ the "canonical version" is, so any links
-- obviously -- will be targeted at the canonical version.
(because, really, why would they point anywhere else?)

-bowerbird

p.s. it's very good to re-read ted nelson every so often.

jbenny
11-06-2007, 05:01 PM
> I was thinking of something more along the lines of:
>
> copy text into Word (or whatever)
> set page size as A4
> set margins as 2.5 cm all around
> set font as Arial 10 pt
> find your passage (text search or whatever)
> take your reference from here
>
> Or even better:
>
> select passage with stylus
> click the "standard reference" button/menu item
> use the reference the cleverly programmed reading software calculates for you based on the "standard" page size and format


Given a choice, I'd much rather use the second method :)

NatCh
11-06-2007, 05:05 PM
@bowerbird: Excellent points, probably some sort of e-Library of Congress is the most likely evolution, I guess.

> Given a choice, I'd much rather use the second method :)

That's most likely because you have something between your ears. :grin:

DaleDe
11-06-2007, 05:09 PM
> Given a choice, I'd much rather use the second method :)

Actually calculating based on the hardbook is pretty simple to do. For a while now I have been discussing, in other threads, the need to pre-paginate a document even for eBook reading. (Not all readers do this.) If you are pre-paginating anyway you can keep the hardbook pagination in the same auxiliary file that is used to keep the pagination of the document. Once this information is available it can be used to report the page number.

Dale

NatCh
11-06-2007, 05:12 PM
True, but if you established a set format as a standard, then you could just calculate it on the fly for any book against the standard, you wouldn't have to store it. There's lots of ways to do this, surely one of them will suffice. :grin:

jbenny
11-06-2007, 05:16 PM
DaleDe and NatCh - but how would you accomplish either of these in existing ebook formats? As I said, they are more limited in capability than a modern web browser.

kovidgoyal
11-06-2007, 05:24 PM
I vote for paragraph numbering. It's the easiest to implement, the most robust across different renderers and should have enough resolution to satisfy almost anybody. For applications that require individual line based addressing, a reflowable format is simply not the way to go.

For referencing individual lines, it would probably be better to reference the paragraph the line comes from and quote the line.

DaleDe
11-06-2007, 05:31 PM
> DaleDe and NatCh - but how would you accomplish either of these in existing ebook formats? As I said, they are more limited in capability than a modern web browser.

Actually, as I noted, most eBook readers already prepaginate the documents. My old eb1150 does this at the time the file is built, for the two font sizes that it was built for. Sony does this as part of the connect download for the 3 font sizes, and on the fly when a new font is selected from a file that has not been prepaginated. The pagination stuff is already in the code. It just needs to know the page boundaries and font sizes. Some of the other formats like MobiPocket do not do this well, if at all. PDF, of course, always does since it is a print format.

If we can get rid of the requirement to match existing hard books then paragraph number is the way to go.

Dale

NatCh
11-06-2007, 05:45 PM
> DaleDe and NatCh - but how would you accomplish either of these in existing ebook formats? As I said, they are more limited in capability than a modern web browser.

You're right, we've kinda evolved two parallel discussions, how to do it presently as an add on, and how it might best be done as a considered system. :shrug:

DaleDe
11-06-2007, 05:51 PM
> You're right, we've kinda evolved two parallel discussions, how to do it presently as an add on, and how it might best be done as a considered system. :shrug:

The considered system and the present are colliding fast. We do need to fix this in epub, I believe. I think the epub standard could have a meta file with the data in it to match page numbers for the real book and some standard proposed eBook sizes. This would allow scholarly research and improve the reading experience for eBook readers who want to know how many pages are in the book, what page they are on and when the next chapter starts. All can be accomplished with a pre-pagination file that is in the mix of files in the standard already. This would be an index-like file that cross-referenced page boundaries to xhtml files and locations in the files.
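To make the idea of an index-like file concrete, here is a hedged Python sketch of what a reader could do with it. The table layout (position of the XHTML file in reading order, character offset, printed page label) and all the sample values are invented for illustration; nothing here is defined by the epub standard.

```python
from bisect import bisect_right

# Hypothetical page map: one entry per printed page boundary, sorted in
# reading order as (file_index, char_offset, printed_page_label).
PAGE_MAP = [
    (0, 0, "1"),
    (0, 1800, "2"),
    (1, 0, "3"),
    (1, 950, "4"),
]


def print_page_at(page_map, file_index, char_offset):
    """Return the printed page label covering a position in the ebook,
    or None if the position lies before the first mapped boundary."""
    keys = [(f, o) for f, o, _ in page_map]
    i = bisect_right(keys, (file_index, char_offset)) - 1
    return page_map[i][2] if i >= 0 else None


if __name__ == "__main__":
    print(print_page_at(PAGE_MAP, 0, 1799))
    print(print_page_at(PAGE_MAP, 1, 500))
```

The same table read the other way (label to position) would give a "go to page" function, and its length gives the page count of the printed book.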

Dale

jbenny
11-06-2007, 05:59 PM
> You're right, we've kinda evolved two parallel discussions, how to do it presently as an add on, and how it might best be done as a considered system. :shrug:

That was my intent. How can we deal with this now and how could it be done better in the future? If we can get the standards setting folks (IDPF and others) thinking about these issues now, perhaps it won't be too terribly long before we have a better, standard way to do these things.

Ebooks need to grow in capability beyond just recreational reading. PDF is being used now, because it is essentially a paper representation of what we already have. We need other solutions for use with reflowable formats.

Since it seems we all agree that paragraph numbers are probably the best we can do with today's reflowable ebook formats, we still have the question of how to use them without making them too obnoxious and in-your-face to the average reader.

jbenny
11-06-2007, 06:13 PM
> The considered system and the present are colliding fast. We do need to fix this in epub, I believe. I think the epub standard could have a meta file with the data in it to match page numbers for the real book and some standard proposed eBook sizes. This would allow scholarly research and improve the reading experience for eBook readers who want to know how many pages are in the book, what page they are on and when the next chapter starts. All can be accomplished with a pre-pagination file that is in the mix of files in the standard already. This would be an index-like file that cross-referenced page boundaries to xhtml files and locations in the files.
>
> Dale

I agree that the epub folks need to address this and other issues. I like your idea of a meta file. There is nothing preventing its incorporation into an existing epub. However, as the standard doesn't address this issue, the use of this data would be entirely dependent on how or if the reader made use of it.

As to predefined page sizes, that somewhat negates the benefits of a reflowable format and still ties us to the archaic concept of a page. Besides, you might read a particular epub on anything from a smart phone, to a 22 inch widescreen monitor. When you add in the different font sizes that might be used for reading, it adds up to a lot of combinations. And if only a certain subset of these combinations was in the specification, that also limits what you can do with a reflowable format.

DaleDe
11-06-2007, 06:15 PM
> Since it seems we all agree that paragraph numbers are probably the best we can do with today's reflowable ebook formats, we still have the question of how to use them without making them too obnoxious and in-your-face to the average reader.

In business use the number looks something like 1.1.1 where the dots are used to separate the levels with a variable number of dots and the least significant (one on the right) number is the paragraph number. Section numbers have less depth.

To find such a paragraph there needs to be a goto menu item same as used today to goto a page number.
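A small Python sketch of how a goto on dotted numbers such as 1.1.1 might work. The function names and the idea of mapping each reference to a location are assumptions for illustration; the one real point is that the parts must be compared as numbers, otherwise "1.10" would sort before "1.2".

```python
def parse_ref(ref):
    """Turn a dotted reference like '2.3.14' into a tuple of ints,
    so that references of any depth compare in document order."""
    return tuple(int(part) for part in ref.split("."))


def goto(index, ref):
    """Look up a dotted reference in `index`, a dict that maps parsed
    reference tuples to a location (any value the reader understands).
    Returns None when the reference does not exist."""
    return index.get(parse_ref(ref))


if __name__ == "__main__":
    index = {(1, 1, 1): "ch1.xhtml#p1", (1, 2): "ch1.xhtml#p7"}
    print(goto(index, "1.1.1"))
    print(sorted(["1.10", "1.2", "1.1.1"], key=parse_ref))
```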

To display this data really is a reader issue and there can be no predefined method to do it. In some documents it is sitting at the start of the paragraph as a lead-in. Professional tools like FrameMaker can auto-generate a document like this. It could be displayed as an optional setting in the control panel and turned on or off as needed by the user. If the device has a touch screen it can be implemented as a tap or gesture. For menu driven systems it could be done by selecting a menu item which would then display the number based on the cursor location, and up/down arrows could be used to move the cursor location by paragraph.

As you can see there is no best way to implement the viewing, but the idea that it exists is what has to be sold so that the tool makers can figure out that they need to do it. The database is likely to need to include a reference similar to the pre-pagination I have mentioned before, so it does need to be in the standard.

Dale

bowerbird
11-06-2007, 06:17 PM
i'm sure the standards people would just love to
hear from someone who hadn't thought about this
up until a few days ago. fresh perspective, and all.

because i get the feeling the people at adobe have
never thought about this issue before, you know?

-bowerbird

jbenny
11-06-2007, 06:26 PM
Using epub as an example, one relatively simple way to do this would be to have two different stylesheets. One that displayed the chapter/section/paragraph numbering and one that hid it. The numbering information could be included in the text, but only displayed when using the proper stylesheet.

The problem with this is that I don't see how you can change stylesheets, as the epub standard doesn't support any scripting or programmability. Using the standard as-is, you could provide a top-level item in the TOC that allowed selecting either version. But without the ability to switch stylesheets, you would also have to provide two copies of the content, with the only difference being which stylesheet each one links to. Not ideal, but workable and relatively simple to implement.
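
A rough sketch of the two-stylesheet idea, written in Python only so the example can be run and checked. The class name `para-num`, the file names `scholarly.css` and `casual.css`, and the helpers `number_paragraphs` and `build_page` are all made up for illustration; none of them come from the epub spec.

```python
from html import escape

# Hypothetical stylesheets: the only difference is whether the numbers show.
SCHOLARLY_CSS = "span.para-num { display: inline; font-size: 0.8em; color: gray; }"
CASUAL_CSS = "span.para-num { display: none; }"


def number_paragraphs(chapter, paragraphs):
    """Wrap each paragraph in <p>, prefixed by a chapter.paragraph number span."""
    lines = []
    for n, text in enumerate(paragraphs, start=1):
        ref = f"{chapter}.{n}"
        lines.append(
            f'<p id="p{chapter}-{n}">'
            f'<span class="para-num">{ref} </span>{escape(text)}</p>'
        )
    return "\n".join(lines)


def build_page(body, stylesheet_href):
    """Same body, different stylesheet link: the 'two copies' workaround."""
    return (
        "<html><head>"
        f'<link rel="stylesheet" type="text/css" href="{stylesheet_href}"/>'
        "</head><body>\n" + body + "\n</body></html>"
    )


body = number_paragraphs(3, ["First paragraph.", "Second paragraph."])
scholarly_page = build_page(body, "scholarly.css")
casual_page = build_page(body, "casual.css")
```

The numbering lives in the text once; the two generated pages differ only in the stylesheet they link to, which is the duplication described above.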

DaleDe
11-06-2007, 06:28 PM
i'm sure the standards people would just love to
hear from someone who hadn't thought about this
up until a few days ago. fresh perspective, and all.

because i get the feeling the people at adobe have
never thought about this issue before, you know?

-bowerbird

Funny you should mention Adobe, which had good foresight a long time ago and seems to have lost its way. PDF was originally a print format and, like PageMaker and the earlier PostScript, is a page-oriented format. Keeping page numbers is practically automatic, but the reader software that tries to report the page you are on still gets it wrong compared to the paper document, since they weren't foresighted enough to realize that the first page of the electronic document is not necessarily page one of the paper document. It is frustrating to try to reference a page number from a PDF without looking on the page itself to see if it is listed.

Dale

jbenny
11-06-2007, 06:29 PM
i'm sure the standards people would just love to
hear from someone who hadn't thought about this
up until a few days ago. fresh perspective, and all.

because i get the feeling the people at adobe have
never thought about this issue before, you know?

-bowerbird

Bowerbird, if you can't do anything but criticize, then take your comments elsewhere. This is just a friendly discussion and whether the standards-setting folks take any notice or not is up to them. Unlike you, we don't think we know everything and have all the answers.

wallcraft
11-06-2007, 06:45 PM
I was under the impression that .epub can already handle "physical pages" via its toc.ncx file. From The NCX (http://www.niso.org/standards/resources/Z39-86-2005.html): The user will also have the option of navigating to items that do not fit easily into the hierarchical structure of a document, e.g., pages, footnotes, or sidebars. This function is provided by pageList (for pages) and navList (for all other non-hierarchical objects).

This does not solve how to come up with pages, but it does provide a standard way to reference them in the e-book.

I would really like e-book readers to use physical (normative?) page numbers, e.g. in their navigation bar, when they are available, since on small screens the e-book page count gets too large to be useful.
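
For anyone who wants to see what such a pageList might look like, here is a minimal sketch that builds one with Python's standard library. The element and attribute names (pageList, pageTarget, navLabel, text, content, playOrder) follow the NCX structure as I understand it from the Z39.86-2005 spec; the file names, ids, and the helper `build_page_list` are invented for the example, and namespaces are left out to keep it short.

```python
import xml.etree.ElementTree as ET


def build_page_list(page_targets):
    """Build an NCX-style <pageList> from (page_number, content_src) pairs."""
    page_list = ET.Element("pageList")
    for order, (page, src) in enumerate(page_targets, start=1):
        # Simplification: in a full toc.ncx, playOrder is shared with the
        # navMap entries; here the pages are just numbered from 1.
        target = ET.SubElement(
            page_list,
            "pageTarget",
            {
                "id": f"page{page}",
                "type": "normal",
                "value": str(page),
                "playOrder": str(order),
            },
        )
        label = ET.SubElement(target, "navLabel")
        ET.SubElement(label, "text").text = str(page)
        ET.SubElement(target, "content", {"src": src})
    return page_list


# Hypothetical anchors marking where each print page starts in the content.
pages = [(1, "chapter1.html#page1"), (2, "chapter1.html#page2")]
xml_text = ET.tostring(build_page_list(pages), encoding="unicode")
print(xml_text)
```

This only covers the referencing side; deciding where each print page begins in the text still has to be done by whoever prepares the book.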

jbenny
11-06-2007, 06:50 PM
I was under the impression that .epub can already handle "physical pages" via its toc.ncx file. From The NCX (http://www.niso.org/standards/resources/Z39-86-2005.html):

This does not solve how to come up with pages, but it does provide a standard way to reference them in the e-book.

I would really like e-book readers to use physical (normative?) page numbers, e.g. in their navigation bar, when they are available. Since on small screens the e-book page count gets too large to be useful.

I must have missed that on my initial read-through of the spec. There is actually quite a bit in there. I'll have to take a closer look. Thanks for pointing this out.

OK, I see that you were referencing the DTBook spec, which is incorporated into the epub spec. The documentation on the IDPF site glosses over some of this. Thanks again for pointing to the full DTBook spec.

sartori
11-06-2007, 07:14 PM
I've been playing around with representing print versions online as faithfully as possible see sample (http://britdesigner.com/sample.html). Unfortunately I can't see any way this would translate into a reflowable page size.

(This is just a sample and was more of an experiment to see how it could be done)

jbenny
11-06-2007, 07:23 PM
I've been playing around with representing print versions online as faithfully as possible see sample (http://britdesigner.com/sample.html). Unfortunately I can't see any way this would translate into a reflowable page size.

(This is just a sample and was more of an experiment to see how it could be done)

Very nice! I'm impressed. That must have taken a lot of work to duplicate the original.

sartori
11-06-2007, 08:00 PM
Ok, been playing around with adding paragraph markers to my sample (http://britdesigner.com/sample2.html) as suggested earlier in this thread. Just a quick question - do any of the current html->lrf converters respect css hidden properties? If so it wouldn't be too hard to create a library of books that display paged as in my example but then you could easily convert them to lrf and ignore page numbers, etc. (It would be time consuming but not difficult).

This could almost become a master library that looks good online for people doing research and referencing certain sections/pages but also great for those who want to just read them on their portable device.

bowerbird
11-06-2007, 08:08 PM
jbenny said:
> Bowerbird, if you can't do anything but criticize,
> then take your comments elsewhere.

gee, i hope you're not _criticizing_ me, or you can "take it elsewhere"...

besides, i think "criticizing" is a _lot_ better way to get to the bottom
of a topic than blowing sunshine up someone's behind. don't you?

plus, as if it is the case that "the only thing" i am doing is "criticizing".
i invite anyone to take a look at the 3 posts of mine that i linked to.
you will find more meat in them than in this "new thread" combined...

or look at the post of mine in this "new thread" that mentions ted nelson
and you'll find more meat there than all the other posts here combined.

but, you know, have a nice day, and all... :+)

-bowerbird

bowerbird
11-06-2007, 08:18 PM
sartori said:
> I've been playing around with representing print versions
> online as faithfully as possible see sample.

nice example. good work. thank you very much.

is it time-consuming? could you do all 612 pages?

-bowerbird

DaleDe
11-06-2007, 08:24 PM
I've been playing around with representing print versions online as faithfully as possible see sample (http://britdesigner.com/sample.html). Unfortunately I can't see any way this would translate into a reflowable page size.

(This is just a sample and was more of an experiment to see how it could be done)

Really nice work at making it look like a book. Very close to a PDF. It would translate to a smaller page just fine but, of course, would not look the same. The text would all wrap differently and the TOC would have to be formatted a little different. There is nothing magic about a particular page size except that we get used to looking at it in that size. If you first saw this document formatted for a 6x9 paper back book then you would likely think that was how it would always look.

Dale

sartori
11-06-2007, 08:26 PM
Those pages I added were time consuming but mainly because I was figuring out the layout. I do plan on working through the whole book but I haven't found a plain text version available so I am ocr'ing the pdf from archive.org. This is currently the slowest part as I am proofing and converting quotes and dashes over.

Right now it's more the challenge on seeing how it could be done and figuring out any of the quirks that may crop up.

For example, if you increase the display font size in your browser, the pages expand lengthwise to accommodate it. It just runs into problems with items that are specifically positioned, such as the table of contents. I think I'll continue playing with this and see what I can come up with.

kovidgoyal
11-06-2007, 08:48 PM
Ok, been playing around with adding paragraph markers to my sample (http://britdesigner.com/sample2.html) as suggested earlier in this thread. Just a quick question - do any of the current html->lrf converters respect css hidden properties? If so it wouldn't be too hard to create a library of books that display paged as in my example but then you could easily convert them to lrf and ignore page numbers, etc. (It would be time consuming but not difficult).

This could almost become a master library that looks good online for people doing research and referencing certain sections/pages but also great for those who want to just read them on their portable device.

html2lrf will ignore tags that have display=none set

jbenny
11-06-2007, 08:49 PM
Those pages I added were time consuming but mainly because I was figuring out the layout. I do plan on working through the whole book but I haven't found a plain text version available so I am ocr'ing the pdf from archive.org. This is currently the slowest part as I am proofing and converting quotes and dashes over.

Right now it's more the challenge on seeing how it could be done and figuring out any of the quirks that may crop up.

For example, if you increase the display font size in your browser, the pages expand lengthwise to accommodate it. It just runs into problems with items that are specifically positioned, such as the table of contents. I think I'll continue playing with this and see what I can come up with.

There is also a PDF copy at Google Books:
http://books.google.com/books?id=j-sNAAAAYAAJ&printsec=titlepage&dq=library+of+the+worlds+best+literature

They have apparently OCRed the text, as you can "view text" for each individual page. Sadly, the downloadable PDF doesn't include the OCRed text. That would have saved you some effort.

jbenny
11-06-2007, 08:53 PM
html2lrf will ignore tags that have display=none set

That's good to know. Being based on XHTML and CSS, epub should also respect the CSS "display: none" property. I'll have to see if Digital Editions honors this. The Lector plugin most certainly should.

sartori
11-06-2007, 08:54 PM
kovidgoyal,

So if I was to create a secondary css file that hides all the page breaks and page numbers and just displays the text with simple formatting (ie justified, centered, different sizes) html2lrf would be able to create a decent looking lrf from the file?

jbenny
11-06-2007, 08:56 PM
Hey, did you check Gutenberg? I just saw that they have six volumes.

http://www.gutenberg.org/browse/authors/w#a993

sartori
11-06-2007, 09:00 PM
Hey, did you check Gutenberg? I just saw that they have six volumes.

http://www.gutenberg.org/browse/authors/w#a993

Thanks, for that - I just checked those out and they appear to be from a slightly different version than the ones on archive.org (and they have all 31 volumes). As my goal is to represent the printed version, the differences may become a problem with page numbers being different.

jbenny
11-06-2007, 09:04 PM
Thanks, for that - I just checked those out and they appear to be from a slightly different version than the ones on archive.org (and they have all 31 volumes). As my goal is to represent the printed version, the differences may become a problem with page numbers being different.

Too bad it is a different version. It would have saved you a lot of work with the OCR part on at least those six volumes.

Well, good luck with the project. What you have so far looks very nice.

Panurge
11-06-2007, 10:16 PM
I'm rather surprised that my (admittedly minor) point has generated such a discussion, so allow me to make one or two more:
Scholarly citation is meant to serve two main purposes:
1. establish the authority for a reference so that if someone cares to check your accuracy or honesty, the location of the quotation or reference can be pinpointed and verified;
2. provide a context for a quotation or reference so that the reader can understand the total argument or occasion to which it belongs.
I am convinced that electronic forms of delivery will ultimately prevail; if future readers can locate the exact source with ease (perhaps even greater ease than was possible in the print world--hyperlinks, search engines, whatever works), then we don't need page numbers. We do need to know how closely the electronic version resembles its print source.
However, there is sometimes more information in a print or handwritten source than can be easily captured in its digitized version. Medieval manuscripts, an English scholar realized recently, can sometimes be dated and associated more precisely by using DNA information from its parchment (aka, sheepskin) and ink media. Yet, as the digitization of the Beowulf manuscript also showed, high-resolution and other scanning techniques can also reveal aspects of the original that would otherwise be impossible to recognize. When you've got only one copy (like the Beowulf manuscript), you need all the help you can get.
So the original is irreplaceable for the scholar, in many cases, because its verbal content is only part of the information it contains.
Perhaps in the future we will find a way to capture all the information we are likely to need for the foreseeable future, but then there are always surprises, as the identification of parchment provenance using DNA analysis illustrates. At some point we'll simply have to draw the line and admit that we can't do everything; some information will have to be lost. The goal of the user of a particular document will determine if that loss is critical, incidental, or trivial.
For most of us, it won't matter. But for archeologists of the text, it will.

bowerbird
11-06-2007, 10:26 PM
panurge said:
> then we don't need page numbers.

we still need them, because prior aspects of the record
use them. we cannot forfeit all those earlier pointers...


> We do need to know how closely
> the electronic version resembles its print source.

and, for that, we need to sync the two. by page number.
(because, realistically, what else are we going to use?)


> there is sometimes more information
> in a print or handwritten source
> than can be easily captured in its digitized version.

that's a different problem. but we always had that one.
there's no substitute for access to the original, at least
for some things. still, for a good many _other_ things,
access to a digital copy is better than nothing, _much_
better than we used to have (i.e., which was nothing...)

if you have feedback on the numerous examples i gave,
i'd love to hear it. if not, that's fine too...

-bowerbird

kovidgoyal
11-06-2007, 10:36 PM
kovidgoyal,

So if I was to create a secondary css file that hides all the page breaks and page numbers and just displays the text with simple formatting (ie justified, centered, different sizes) html2lrf would be able to create a decent looking lrf from the file?

It won't display the hidden elements. Whether the resulting LRF will look good or not depends on the kind of HTML you use. But I'm always willing to add support for more esoteric HTML to html2lrf, within reason :-)

sartori
11-06-2007, 10:48 PM
It won't display the hidden elements. Whether the resulting LRF will look good or not depends on the kind of HTML you use. But I'm always willing to add support for more esoteric HTML to html2lrf, within reason :-)

Ok, thanks. I think I'll play around with this tomorrow and see if I can come up with a 'plain' css version of the same page.

Panurge
11-07-2007, 12:06 AM
[> then we don't need page numbers.

we still need them, because prior aspects of the record
use them. we cannot forfeit all those earlier pointers...


> We do need to know how closely
> the electronic version resembles its print source.

and, for that, we need to sync the two. by page number.
(because, realistically, what else are we going to use?)]

Page numbers are simply a way of keeping track of pages. The earliest printed books don't have them. For incunabula, the books published in the second half of the 15th century, there were numbers, not of pages but of groups of pages, so that when the book was put together for binding the sections would not be out of order. Manuscripts may or may not have page numbers. Sometimes the first word of the following page was printed (or written) at the bottom of the preceding page to establish sequence.
What really counts, for the most part, is textual accuracy--that is, identity of the two texts. For routine purposes, one wouldn't have to refer to the original if the electronic copy were certifiably accurate. But there's the rub, perhaps. When I edit an older text, say an unprinted manuscript, I'm not usually obliged to give its original page numbers. I just need to identify the original source and signal each time I depart from its authority (for example, to correct an obvious error in spelling or printing).
The scholarly world has had many ways of ensuring synchronization between two texts; page numbers are one but not the only one. Of course they are helpful, but historically printers have sometimes ignored them. In the case of Greek and Latin texts, individual passages were identified by paragraph and sentence numbering, and that is still used among classicists today, as was observed above.
So, yes, I agree that page numbers are useful for synchronizing two versions of a text; in the case of verse, however, we go by line numbers and larger divisions or sections of the poem. So the physical page isn't always what matters.
My only intention in bringing up this matter was to point out that digitization of books in the future may not be as simple a matter as we would like and that there is no one solution that will fit some of these odd cases. Nor will past practice always be a reliable guide to what will work in the future. At some point electronic texts will be recognized as the accepted authority, and page numbers will no longer matter; for us, in a time of transition, they still do on occasion, depending on our relationship to what we're reading.

Let me say that as someone who guards, keeps track of, and preserves books from harm, I'm delighted to see such a vigorous discussion about how to address the problem and find solutions. We are in a time of tremendous change that will have at least as much impact on the distribution of information as resulted from the invention of moveable type, and groups like this one are at the forefront because they include not simply programmers and designers but regular readers and enthusiasts who understand the users' needs. More power and glory to them.

Panurge
11-07-2007, 01:16 AM
Perhaps I should have also said "because they include not simply regular readers and enthusiasts but also programmers and designers." I'm looking forward to examining all the examples that have been posted in this thread as soon as I can get the time to do so.

bowerbird
11-07-2007, 05:47 AM
my guess is that, of the 7 million volumes google will scan at umichigan,
just as one example, 99.98% of them will have pagenumbers in them...

those are the books that will form the cyberlibrary of the future, and thus
those are the books that we need to find a way to make _pointers_ into...

as pagenumbers have been the pointer-system used on them up until now,
we'll need to create digital means so that we can continue to support that,
and that infrastructure will allow us to continue using pagenumber pointers.

yes, we'll have other means too, but we'll need to make pagenumbers work.
luckily, as i believe i've shown in the examples i've posted, it's not too hard.

-bowerbird

HarryT
11-07-2007, 08:16 AM
I teach a fair amount of Plato and Aristotle and find online texts are a problem. When referring to Plato, it is essential to use Stephanus numbers, which will identify any sentence in his entire oeuvre. These appear as marginal numbers and letters in most print versions in both English and Greek. But the numbers simply don't appear in the online versions of Plato (except for the Perseus Project version). So I can't recommend them to students and don't use online versions myself.

(This is why I've never uploaded a Plato dialogue: without the Stephanus numbers it is useless to me. But with them it is irritating to general readers.)

The same problem exists with Latin and Greek poetry, Patricia. One always refers, for example, to Iliad, Book 8, line 204 and without the line numbers of the original a text is of very limited value.

NatCh
11-07-2007, 11:15 AM
As to predefined page sizes, that somewhat negates the benefits of a reflowable format and still ties us to the archaic concept of a page. Besides, you might read a particular epub on anything from a smart phone, to a 22 inch widescreen monitor. When you add in the different font sizes that might be used for reading, it adds up to a lot of combinations. And if only a certain subset of these combinations was in the specification, that also limits what you can do with a reflowable format.

I see what you're saying, jbenny, and I agree. However, I was suggesting that we define a single page size/margin/font combination and use it for all references, which would get around having to handle multiple ones. :shrug:

This morning, however, some of the overnight comments have got me thinking maybe we're making this too complicated.

We're talking about computers here, and computers do boring, repetitive functions fast and without complaint. Why not have the reading application generate some sort of text index? It could be as simple as a straight character count (which would get ... rather large), or it could be some sort of graduated count by chapter and then paragraph and then character. For instance, 10.3.400-475 would be chapter 10, paragraph 3 starting at character 400, running to character 475.

I'm not pushing for that specifically, just making a "top of my head" example.

The important bit is that it be an agreed upon standard, and that it be repeatable. The reading app can generate the reference and locate the point in the text from the reference. Of course, those needs will have to be met whatever the eventual system ends up being. :shrug:
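
To make the "top of my head" scheme concrete, here is a small sketch of how a reading app might generate and resolve such references. Everything beyond the 10.3.400-475 shape is an assumption: chapters, paragraphs, and characters are counted from 1, the end character is inclusive, the book is modeled as a plain list of chapters holding paragraph strings, and the function names are invented for the example.

```python
import re

# chapter.paragraph.start-end, e.g. "10.3.400-475"
REF_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)$")


def make_reference(chapter, paragraph, start, end):
    """Format a reference in the chapter.paragraph.start-end style."""
    return f"{chapter}.{paragraph}.{start}-{end}"


def parse_reference(ref):
    """Split a reference into four integers, rejecting malformed input."""
    match = REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"not a valid reference: {ref!r}")
    chapter, paragraph, start, end = (int(g) for g in match.groups())
    if min(chapter, paragraph, start) < 1 or end < start:
        raise ValueError(f"reference out of order or below 1: {ref!r}")
    return chapter, paragraph, start, end


def locate(book, ref):
    """Return the cited characters.

    `book` is a list of chapters, each a list of paragraph strings
    (an assumed in-memory model, not any real e-book format).
    """
    chapter, paragraph, start, end = parse_reference(ref)
    text = book[chapter - 1][paragraph - 1]
    return text[start - 1:end]


book = [["Call me Ishmael. Some years ago, never mind how long precisely."]]
ref = make_reference(1, 1, 6, 15)
print(ref, "->", locate(book, ref))
```

For this to be repeatable across readers, everyone would also have to agree on what counts as a character and a paragraph (whitespace, markup, footnotes), which the sketch sidesteps by working on plain strings.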

NatCh
11-07-2007, 11:47 AM
besides, i think "criticizing" is a _lot_ better way to get to the bottom of a topic than blowing sunshine up someone's behind. don't you?

When the comment is little more than distilled sarcasm, with no actual content, it's not criticism in my book; it comes closer to sniping. :shrug:

But then, I don't regard discussing and exploring solutions in a respectful manner to be "blowing sunshine up someone's behind" either. I guess I gave up the personal illusion that I could give the Final, Infallible, and Only Answer on sweeping matters some time ago.

One of the side-effects of discussing things politely and respectfully, even when the discussers disagree, is that people continue to consider what's being said, and don't skip, blow off, or otherwise ignore comments by people who discuss things in such a fashion.

Having the best point in the world, or being absolutely right, is pretty worthless if no one will listen. And it is really rather sad if no one listens because they're tired of the tone the commenter takes with those who disagree with him. :shrug:

plus, as if it is the case that "the only thing" i am doing is "criticizing". i invite anyone to take a look at the 3 posts of mine that i linked to. you will find more meat in them than in this "new thread" combined...

You're referring to these, I believe:
panurge, i feel where you're coming from. but let me run through a few thoughts.

so first, point #14 is about the embedding of pagenumbers inside of the text flow.
that's not a good idea, because they're a distraction that just needs to be removed
when we want to copy the text out for remixing. that's why point #14 is there.

my next comment -- which i say because it must be said -- is that it's not our job
to do your job. if the pagenumbers are valuable to you, it's your job to save them.
i'm sorry if that sounds cold, but that's the way it is.

having said that, however, let me move on to my next comment, which is that
i am in 100% agreement with you. even though pagenumbers are _irrelevant_,
in many senses, when we move a book to the digital sphere, i'm convinced that
we still need to retain pagenumber information, simply because so much of our
archival history uses pagenumbers as pointer-information. we cannot afford to
sacrifice that. indeed, i go one step further and argue that we should also be
retaining the _linebreak_information_ from all the paper-books that we digitize.
i won't go into all the arguments here, but in my mind, the answer is now clear.

furthermore, i put my money where my mouth is. in my digitization examples,
i maintain linebreaks and pagebreaks, and put the image-scan up next to the text,
so the end-user can verify the accuracy of my digitization if they want to do that.
i consider this checking by end-users to be the last fine line of the proofing process,
and i want them to feel like a part of the "march to perfection" that the text makes,
because i believe we need to make the public feel like "joint owners" of these books.
"the public domain belongs to _you_, the public, and you have responsibility for them,
so if there are errors here, you need to fill out an error-report so they are corrected."

to see some of my examples, check these out:
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html

you can thumb through these e-books just like they were the p-books,
and verify that the linebreaks and pagebreaks are exactly as they were.
and if you find an error, you can fill out an error-report right on the page.
and once someone has made a report, it's immediately visible to everyone,
even if it might take an administrator a little bit of time to fix the error...

now examine the plain-text versions of the files that created those books above:
> http://z-m-l.com/go/myant/myant.zml
> http://z-m-l.com/go/mabie/mabie.zml
> http://z-m-l.com/go/sgfhb/sgfhb.zml

you'll see how the pagebreak information was recorded in those plain-text files.
i think you'll also see how easily that pagebreak information can be eliminated,
for the situations where an end-user doesn't care about the original pagebreaks.

this is the kind of flexibility we want from our digitization efforts, so each group
gets the information they like, without inconveniencing what another group gets.

what is also useful about this format is that it's extremely close to what we get
_naturally_ when we scan a book, so it's not hard to go from scan output to final.

now, having said all _that_, let me proceed to my final point, which is a variant
on the "don't expect us to do your job for you". and it is _not_ our job to make
"a faithful representation of the print copy". we don't even _want_ to do that --
even if we could -- and we _cannot_, because any time you move a document
from one medium to a completely different one, you're creating a new edition.
whether you mean to do it or not. and like i said, at least from my perspective,
i don't even think twice about things like the correcting of typos. heck, i'll even
rework headers -- or even the _body_ of the text -- if that is what it takes to
make this _digital_version_ a _good_ digital version. i'm a republisher, who is
moving this book into a new medium for a new world in a new century, and
i'm going to do justice to the new. it's simply not my job to snapshot the old.
if you want to see what the old pages looked like, you can look at the scans.

so, anyway, there's some feedback for you to think about... :+)

-bowerbird

first, a few things i forgot to mention on pagenumbers.

one very important aspect of pagenumber references
is that we need to consider them in our u.r.l. naming,
and the links there must have maximal transparency...

up above, i pointed you to these references:
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html

take the top one, and eliminate the first part, to get:
> myant/myantp001.html

you can see that the first 5 letters are repeated, so
eliminate those as well, and strip off the suffix, for:
> myantp001

in my naming, the first 5 letters reference one book.
in this case, it's "my antonia", the book by willa cather.

the "p001" part of the u.r.l. indicates this is page 1...

and just so you know, this u.r.l.:
> http://z-m-l.com/go/myant/myantp001.html
is based on the page-scan with this name:
> http://z-m-l.com/go/myant/myantp001.png
which, once again, is the page-scan for page 1.

and i rigorously follow this convention throughout.

so this is the u.r.l. for page 123:
> http://z-m-l.com/go/myant/myantp123.html

and it's based on the page-scan with this name:
> http://z-m-l.com/go/myant/myantp123.png

thus, any competent fourth-grader is capable of
figuring out the u.r.l. for _any_ page in this book.

furthermore, this means that when i encounter
some other p-book in the historical archive that
makes references to this edition of "my antonia",
i can relate those references to my e-book easily.

for instance, let's say that a passage runs like this:
> on page 189 and 198, cather ascribes qualities
> to antonia which seem to be inconsistent with
> those which were ascribed on page 15 and 83,
> and are completely contradictory to what cather
> clearly states on page 111. however, this could
> be due to the revelation which antonia has, that
> is described in detail on pages 144 and 157.

so, based on my transparent and consistent naming,
it's a simple exercise to create links for this passage:
> http://z-m-l.com/go/myant/myantp189.html
> http://z-m-l.com/go/myant/myantp198.html
> http://z-m-l.com/go/myant/myantp015.html
> http://z-m-l.com/go/myant/myantp083.html
> http://z-m-l.com/go/myant/myantp111.html
> http://z-m-l.com/go/myant/myantp144.html
> http://z-m-l.com/go/myant/myantp157.html
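
the plug-in-the-numbers step can be sketched in a few lines of python. the base address, the 5-letter book code and the 3-digit zero-padded page number are read off the examples above; the function name `page_url` is invented here, and how pages above 999 or roman-numeral front matter would be named is not stated in the post, so the sketch does not try to handle them.

```python
BASE_URL = "http://z-m-l.com/go"


def page_url(book_code, page, suffix="html"):
    """Build the address for one page: <base>/<code>/<code>pNNN.<suffix>."""
    # Three-digit zero padding is taken from the examples (p001, p015, p083).
    return f"{BASE_URL}/{book_code}/{book_code}p{page:03d}.{suffix}"


# Page numbers cited in the sample passage, in the order they appear.
cited_pages = [189, 198, 15, 83, 111, 144, 157]
links = [page_url("myant", p) for p in cited_pages]

for link in links:
    print(link)
```

passing suffix="png" gives the matching page-scan address under the same naming convention.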

you would be _astonished_ how many cyberlibraries
have messed up their naming-schemes, such that a
simple plug-in-the-numbers strategy doesn't work.

google gets it kind-of right, but almost everyone else
gets it wrong, wrong, utterly and completely _wrong_.

and because of their confusing naming conventions,
scholars will have to go back and muddle through
_each_and_every_ reference like this, to find out how
the exact link for each one is specified in the e-book.
this is nothing less than sheer and massive stupidity...

-bowerbird

p.s. and, for the record, notice how completely useless
a p.g. e-text -- which was stripped of pagenumbers --
will be for a person who encounters the above passage.

in that x.m.l.-based version of "my antonia" i discussed above,
i forgot to provide an example of a link direct to a paragraph.

here's one:
> http://www.openreader.org/myantonia/basic-design/myantonia.html#p0251
you should read the paragraph directly after that one as well...

-bowerbird

Part of the reason those posts are not getting much exposure here may be this:

i see no reason for a new thread, and won't repeat my posts here:

I find the sentiment expressed in that comment particularly ironic, knowing, as I do, that the thread you referenced is one created specifically for the purpose of pulling an interesting topic that you brought up in yet another thread out where it could get the exposure it seemed to deserve. I'd've done the same thing for Panurge's topic (even though it involves a bit of trouble to do), if he hadn't beaten me to the punch, so to speak.

In any case, now that the posts in question are here where the discussion is continuing, others may find in them points worth responding to.

nekokami
11-07-2007, 01:15 PM
As a doctoral student, I'm pretty much stuck with having to reference printed page numbers, but I'd like to see a transition to paragraph numbers in the future, to better support electronic reflowable documents. I think we'll have to support both for the foreseeable future, to allow references to pre-electronic documents, even those that have been converted to digital form. Some kind of embedded semantic tagging for each of these methods of identifying text location that can be shown or hidden at will would be great.

sartori
11-07-2007, 01:15 PM
Natch, thanks for bringing over the info from the other thread (I kind of gave up reading that thread after the 'debates' started.) After reading through the post above I have a question.

Bowerbird, can i ask the reasoning behind splitting the document into individual pages? Couldn't you point to the page content using http://z-m-l.com/go/myant/myantp.html#189 as opposed to http://z-m-l.com/go/myant/myantp189.html. That way the whole content of the book is in one file and conversion to other formats would be easier. For example how do you recognize when a paragraph splits across two pages and how do you join them back together when converting? You might have a good reason that I haven't considered so I would like to hear your take on it.

bowerbird
11-07-2007, 03:01 PM
natch said:
> When the comment is little more than distilled sarcasm,
> with no actual content, it's not criticism, in my book,
> It comes closer to sniping.

"with no actual content"?

did you not get the content in that post of mine?

if so, then let me explain it to you a little bit more directly...

_lots_ of people have already spent _lots_ of time and energy
thinking about these questions, running up solutions, and
actually putting _even_more_ of their own time and energy
to code experimental solutions so that they could be tested.

the results have largely confirmed what most of us suspected,
namely that there is no reliable way to point to a piece of info
if someone (else) has the ability to change that info any time,
up to and including the option of completely _removing_ it...

because, hey, it's hard to point to something that ain't there.

a fact which -- in retrospect -- seems to be fairly "obvious",
and which might have been a tip-off from the very beginning
that maybe this was one of those problems with no solution...

because, _realistically_, that _is_ the situation which we're in.
someone (else) _is_ going to have control over the info that
we want to point to. it's called copyright, and it's our burden.

furthermore, when someone here suggests that the people
over at i.p.d.f. should pay some attention to this question,
that implies that i.p.d.f. has _not_ paid any attention to it...

when the fact of the matter is that they _have_. they've paid
more attention than you know, including enough attention
to understand (which y'all here don't seem to have grasped)
that this is one of those problems with no solution, or at least
no "really good solution".

so to imply that they "need to consider this" is _stupid_...

so here's my choice. i can either use a little bit of sarcasm,
which -- last i checked -- is considered a form of _humor_
(albeit not as happy-go-lucky and feel-good as slapstick),
or i can instead go for the "explain everything to them like
they were a bunch of second-graders, and let the fact that
they've ignored some basic reality give the solid impression
that they're not just second-graders, but kinda stupid ones,
even though that ain't the impression i _want_ to leave...".

i went for the form of humor. was that a mistake?

-bowerbird

bowerbird
11-07-2007, 04:23 PM
sartori said:
> Bowerbird, can i ask the reasoning behind
> splitting the document into individual pages?

first of all, my e-books _can_ exist in several forms.
the individual-pages form is just one of those forms.
but i can (and do) spin out "whole-book" forms too...
(plus chapter-by-chapter forms, for some purposes.)

i pointed to the page-by-page form because this topic
-- page-based referencing within scholarly situations --
is one whose basic requirements call out for that form...

to see the "master file" for "my antonia", look here:
> http://z-m-l.com/go/myant/myant.zml
(as you see, the master itself is in whole-book form.)

that "master" _generated_ the page-by-page form...

the page-by-page form has many intended purposes.

its first major purpose is to facilitate _proofreading_...
you want to do proofreading on a page-by-page basis;
you want the page-scan to be shown alongside the text;
and you want the text to contain the original linebreaks.
this format is geared toward those proofreading needs...
(this is a "final-stage" proofing interface, where errors are
"reported", because there are very few. for earlier stages
of proofreading, where there might be many more errors,
we'll want an interface that lets us fix them more directly.)

the next major purpose of it is for _confirming_accuracy_.
we want to give people an ability to confirm our digitization,
to satisfy themselves we did that conversion job correctly...
to do so, we show them our text and the original page-scan,
so they can do a direct comparison and see for themselves...

the third major purpose is the one we're discussing here --
the ability for people to make a pointer to a specific page...
and -- as i have said -- the reason we need to facilitate that
is because our cultural heritage is full of page-based pointers.
and again, we _could_ point them to a place with just the text,
but everyone knows that text can be easily "edited", so we also
put the original page-scan up so as to increase the trust factor.
(of course, scans could _also_ be doctored, but at some point,
there's only so much you can do.)


> Couldn't you point to the page content using
> http://z-m-l.com/go/myant/myantp.html#189
> as opposed to
> http://z-m-l.com/go/myant/myantp189.html.

sure.

and sometimes that's what you'll want to do instead.

but let me show you something. stopwatch this link:
> http://z-m-l.com/go/myant/myantp189.html

now check the length of time it takes to go to this one:
> http://www.openreader.org/myantonia/basic-design/myantonia.html#page189

unless that second page was already in your cache or
you have a superfast connection, it took _lots_ longer
to load, because you're loading in some 500k of text
-- the whole book -- instead of 1k of text and a scan.
(for the dialup users, the second file will be _painful_.)
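
One reason the "#189" style forces the whole file to load: the part of a URL after "#" is resolved by the browser and is not sent to the server. A small sketch of that, plus a rough transfer-time estimate. The 500k size is the figure from this post; the 56 kbit/s dialup speed and both function names are assumptions for illustration.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (and query) a browser would actually request.

    The fragment after '#' is dropped, because it is handled client-side.
    """
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def transfer_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Idealized download time, ignoring latency and protocol overhead."""
    return size_bytes * 8 / bits_per_second


print(request_target("http://z-m-l.com/go/myant/myantp.html#189"))
print(round(transfer_seconds(500_000, 56_000)), "seconds for ~500k on dialup")
```

So a single-file book with anchors is simpler to convert, but every pointer into it costs a full download unless the file is already cached.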

so it depends on what you need your readers to load...
if you only need them to load one page of text, do that.
if you need them to load the whole book, then do _that_.

you'll notice that the second link doesn't include the scans
in-line in the file; you have to click a link to view each one.
(the scans run to 30 megs, so it'd be suicide to load 'em all.)

so it depends on what you need.

if you wanted to point to one page in each of 50 books,
you wouldn't want to force your reader to load each of the
50 books in full just to see that one page. and this is often
the essence of a scholarly reference section. so it depends.

this is why we need the flexibility to quickly and easily
auto-generate whatever format is needed at the time...


> That way the whole content of the book is in one file
> and conversion to other formats would be easier.

in sum, i pointed to a page-based form because of this discussion...
i can also create book-based forms when _that_ is more appropriate.

(such flexibility is one reason i invented my z.m.l. format, which is
a sidetrack topic in that other thread from which this one came...)


> For example how do you recognize when a paragraph splits across
> two pages and how do you join them back together when converting?

good question. but easy answer.

in a "master" file which has pagebreaks marked, like this one:
> http://z-m-l.com/go/myant/myant.zml
the formula for generating a version _without_ the pagebreak info is to:
1. delete the _one_ blank line _above_ the [[doublebracketed]] pagenumber,
and delete the _one_ blank line _below_ the {{doublebraced}} scan-filename...
2. if there were _two_ blank lines above and below, respectively, then that
was a paragraph break, so you should insert a blank line in the output file.

if you follow that rule, you'll find that paragraphs which cross pagebreaks
get joined together, while the ones that ended on the pagebreak still do...
for instance, in the .zml master, compare the breaks between these pages:
> http://z-m-l.com/go/myant/myantp040.html
> http://z-m-l.com/go/myant/myantp041.html
versus:
> http://z-m-l.com/go/myant/myantp061.html
> http://z-m-l.com/go/myant/myantp062.html
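
The two-step rule above can be sketched roughly as follows. This is only an illustration under stated assumptions: that each pagebreak in the master is a `[[...]]` line followed directly by a `{{...}}` line, and that "two blank lines" means two on each side. The marker contents in the example and the function name are made up; check the real master before relying on any of this.

```python
import re

PAGE_MARK = re.compile(r"^\[\[.*\]\]$")   # assumed form of the pagenumber line
SCAN_MARK = re.compile(r"^\{\{.*\}\}$")   # assumed form of the scan-filename line


def strip_pagebreaks(lines):
    """Remove pagebreak markers, joining paragraphs that cross a page."""
    out = []
    i = 0
    while i < len(lines):
        if PAGE_MARK.match(lines[i].strip()):
            # Drop the blank line(s) already copied just above the marker.
            above = 0
            while out and out[-1].strip() == "":
                out.pop()
                above += 1
            i += 1  # skip the [[...]] line
            if i < len(lines) and SCAN_MARK.match(lines[i].strip()):
                i += 1  # skip the {{...}} line
            # Drop the blank line(s) just below the marker.
            below = 0
            while i < len(lines) and lines[i].strip() == "":
                i += 1
                below += 1
            # Two blank lines on each side meant a real paragraph break.
            if above >= 2 and below >= 2:
                out.append("")
            continue
        out.append(lines[i])
        i += 1
    return out


crossing = ["she walked", "", "[[041]]", "{{scan041}}", "", "across the field."]
ending = ["The day was over.", "", "", "[[062]]", "{{scan062}}", "", "", "Next morning"]
print(strip_pagebreaks(crossing))
print(strip_pagebreaks(ending))
```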

see how easy it was for me to point you to those pages specifically?
and also the _usefulness_ of being able to see both text _and_ scan?

-bowerbird

NatCh
11-07-2007, 04:46 PM
> so here's my choice. i can either use a little bit of sarcasm, ... or i can instead go for the "explain everything to them like they were a bunch of second-graders, and let the fact that they've ignored some basic reality give the solid impression that they're not just second-graders, but kinda stupid ones, even though that ain't the impression i _want_ to leave...".

There's another choice, bowerbird. Talk to people like they're actually functionally intelligent, and point out the point you feel they're missing, without sarcasm or abrasive phrasing, and describe the implications of that point as you see them in a similarly non-sarcastic, non-abrasive manner.

Sarcasm is indeed used humorously on the forum a great deal, but it has to be well telegraphed as humor, because things like tone of voice don't come through in text without a good deal of effort, and they can easily be taken the wrong way. Because of that it also requires a willingness to step back from it and clarify what was meant when it doesn't come across as funny, even to the point of apologizing for giving offense that was never intended.

You come across as seeming to consider anyone who doesn't see things your way to be an imbecile, and people are starting to assume that you mean to be abrasive even when you don't. I've noticed this, but if you have, you have given no sign of it.

You have managed to get more folks to put you on ignore in a week than I've seen happen in the preceding almost two years that I've been around MR.

These are the results of the absence of the respect for which you have expressed such scorn: you are driving folks away even as you claim to wish to persuade them.

I, and several others, have put significant amounts of effort into attempting to communicate this to you, but you seem to regard those efforts as aimed at getting you to shut up -- if the moderators here wanted to stifle you as you seem to believe we do, we wouldn't have resorted to talking to you to do so. The fact that we have ought to tell you something all by itself.

I've reached the point where I simply don't know what else to say to you.

bowerbird
11-07-2007, 05:07 PM
natch said:
> Talk to people like they're actually functionally intelligent

i do! whenever people strike me as being "functionally intelligent".

in addition, if they strike me as being stupid, i talk to them like that.

but people who want me to talk to them as "functionally intelligent"
when they're holding up their end of the conversation with stupidity,
they get _sarcasm_ from me. because that's the best they _deserve_.


> even to the point of apologizing
> for giving offense that was never intended.

did you _intend_ to offend me with this sentence?

or with your post as a whole?

more importantly are you ready to _apologize_ for doing so?


> you are driving folks away even as you claim to wish to persuade them.

hold it there. i never said i "wish to persuade" _anyone_ of _anything_.
in fact, i expressly disclaim that as an intention, wholly and completely.
frankly, i don't care what anyone thinks, if they disagree or agree with me.
i speak my mind, and you can make of it whatever you wish, fine by me...


> You have managed to get more folks to put you on ignore in a week than
> I've seen happen in the preceding almost two years that I've been around MR.

some people don't want to hear anyone else speak frankly. so what?
others take offense much too easily, especially the insecure. so what?

i too ignore a lot of what i read here, because it has very little truth value.
it doesn't make sense. when i weigh it as evidence, it registers no mass...
i don't bother to filter out what people say, because i've found that it's not
generally a good idea to stick my head in the sand, but if other people want
to stick their head in the sand, i'm totally fine with that. indeed, i would prefer
that people put me on "ignore" than try to chastise me for speaking my truth.
i'm not "rude". i'm a gentle soul who believes in truth, and has enough respect
for my fellow human beings to be honest with them when they're being stupid,
honest enough to tell them directly. if you think that's a bad thing, i suggest
that you too put me on "ignore", so my words will magically be turned into
white space and you live in ignorant bliss. sincerely, i want you to be happy.

-bowerbird

sartori
11-07-2007, 05:18 PM
Bowerbird,

Your reasoning makes sense to me (In response to my question). So some more questions if you don't mind

When you receive an error notification do you just update the master file then regenerate the paged version? or vice-versa? or for small updates do you just make the change in both versions?

On page 61 (http://z-m-l.com/go/myant/myantp061.html) I noticed that a few words are hyphenated across lines. On your master view the words are correctly joined (tea-kettle & followed). Were these manually corrected or automated? If automated, did it correctly catch that tea-kettle should keep its hyphen?

I'm not sure if z.m.l. is the way I want to go with my formatting but I'm still at the early stages of formatting so I'm just checking out options (googling for ebook markup languages is hopeless as you just get a ton of responses that are actual ebooks).

Thanks,

rob

bowerbird
11-07-2007, 05:57 PM
sartori said:
> So some more questions if you don't mind

i don't mind a bit. that's why i'm here, to discuss...


> When you receive an error notification do you just
> update the master file then regenerate the paged version?
> or vice-versa? or for small updates
> do you just make the change in both versions?

if you go to the directory now, you'll see a bunch of files:
> http://z-m-l.com/go/myant/
including all of the .html files to which i've been linking...
the .html files were generated in a batch from the master.

but eventually, all the separate .html files will disappear.

they'll be replaced by a script which intercepts links like this:
> http://z-m-l.com/go/myant/myantp061.html
and creates that .html file on-the-fly...

so yes, any correction will be made to the master, after which
the script will include it when it builds the .html file next time.
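
As a rough idea of what the front end of such a script might do, here is a sketch that recognizes a page-style link and pulls out the book id and page number. It assumes the /go/<book>/<book>pNNN.html pattern of the links in this thread; the function name is hypothetical, and the step that actually builds the page from the master is left out.

```python
import re

# Assumed link shape: /go/<book>/<book>pNNN.html (book id repeated in the filename).
PAGE_PATH = re.compile(r"^/go/(?P<book>\w+)/(?P=book)p(?P<page>\d{3})\.html$")


def parse_page_request(path: str):
    """Return (book, page) for a page-style path, or None if it doesn't match."""
    m = PAGE_PATH.match(path)
    if m is None:
        return None
    return m.group("book"), int(m.group("page"))


print(parse_page_request("/go/myant/myantp061.html"))
print(parse_page_request("/go/myant/index.html"))
```

Because every page is regenerated from the master on request, a correction only ever has to be made in one place.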


> On page 61 (http://z-m-l.com/go/myant/myantp061.html)
> I noticed that a few words are hyphenated across lines.
> On your master view the words are correctly joined
> (tea-kettle & followed).

um, as far as i can tell, you're mistaken. here's the master:
> http://z-m-l.com/go/myant/myant.zml

what i see there, in the master, is this:
> Peter shuffled to his feet, caught up the tea-
> kettle and mixed him some hot water and
> whiskey. The sharp smell of spirits went
> through the room.
>
> Pavel snatched the cup and drank, then
> made Peter give him the bottle and slipped
> it under his pillow, grinning disagreeably,
> as if he had outwitted some one. His eyes fol-
> lowed Peter about the room with a contempt-
> uous, unfriendly expression. It seemed to
> me that he despised him for being so simple
> and docile.

do you really see something different? if so, that's a mystery...


> Were these manually corrected or automated?
> If automated, did it correctly catch that tea-kettle should keep its hyphen?

not all of the example-files that i have up are correct on this point yet,
but they'll be marked as to whether an end-line hyphen is kept or not...

so, if "tea-kettle" -- with the dash -- is the form used in this book
(when the word is mid-sentence), then the master will look like this:
> Peter shuffled to his feet, caught up the tea-@
> kettle and mixed him some hot water and
(i haven't decided if we'll use the at-sign, but you get the idea.)

on the other hand, if this book uses "teakettle", the master will say:
> Peter shuffled to his feet, caught up the tea-
> kettle and mixed him some hot water and

(for the record, this book does indeed use "tea-kettle" in the one
other instance where the word occurs. in the cases where there is
no other use of an end-line hyphenate, we consult the dictionary.
when there is inconsistency within a book, we edit to consistency.)
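
To make the marking concrete, here is a sketch of how lines could be rejoined under that convention. It assumes the tentative "-@" marker described above (which may well change) and treats a plain end-line "-" as a hyphen to drop; the function name is illustrative. A line ending in "--" is treated as a dash and left alone, which is a guess on my part rather than anything stated in the post.

```python
def unwrap(lines):
    """Join the lines of one paragraph, resolving end-line hyphens.

    "tea-@" + "kettle" -> "tea-kettle"   (marker says: keep the hyphen)
    "fol-"  + "lowed"  -> "followed"     (unmarked: drop the hyphen)
    """
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("-@"):
            text = text[:-1] + line          # remove "@", keep "-"
        elif text.endswith("-") and not text.endswith("--"):
            text = text[:-1] + line          # remove the soft hyphen
        elif text:
            text = text + " " + line         # ordinary line break
        else:
            text = line
    return text


print(unwrap(["Peter shuffled to his feet, caught up the tea-@",
              "kettle and mixed him some hot water and"]))
print(unwrap(["as if he had outwitted some one. His eyes fol-",
              "lowed Peter about the room"]))
```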


> I'm not sure if z.m.l. is the way I want to go with my formatting but
> I'm still at the early stages of formatting so I'm just checking out options

i definitely suggest light-markup. "markdown" is the current favorite,
if you want broad support. my tool-chain is approaching coherence,
so you could get the job done, but markdown gives you more reliability.
google "showdown" and "markdown" for an interesting real-time demo:
> http://www.attacklab.net/showdown-gui.html

-bowerbird

sartori
11-07-2007, 06:09 PM
Bowerbird - sorry didn't mean the master I meant the html view that you listed.

I like the showdown stuff - seems a little limited as far as layout but it looks really easy to use.

Thanks.

bowerbird
11-07-2007, 06:29 PM
sartori said:
> I like the showdown stuff - seems a little limited as far as layout

depends on what you want to do with a book,
and what platforms you want to put it out to...

you can often exercise tight control in _one_ setting,
but then it blows up on you when you try to move it...

a good rule of thumb is that if you cannot do it with
light-markup, then you shouldn't be doing it anyway,
because it's not gonna convert well to other settings.

so living with some "limitations" from the beginning
can save you a truckload of heartburn down the road.

but, you know, your demo showed you've got chops...
so i'd encourage you to let your mind experiment fully.

-bowerbird

bowerbird
11-07-2007, 06:32 PM
sartori said:
> sorry didn't mean the master I meant the html view that you listed.

except i still don't follow.
the individual-page .html file shows end-line hyphenates just like the scan:
> http://z-m-l.com/go/myant/myantp061.html

-bowerbird

Panurge
11-07-2007, 11:09 PM
Kovidgoyal:
> [EDIT: An example from physics research articles. A resolution of sections is usually sufficient, i.e. people refer to section so-and-so of paper so-and-so.
> I don't know if that is sufficient resolution in general though.]

Yes, I think that "resolution" is the problem. Paragraph numbers would probably work well for everything but poetry, though in some cases--such as the one you mention--larger units might be more practical. Page numbers work if one can pinpoint the exact edition (publisher, place, date, in addition to title and author) being referenced; that was the contribution of printing. For manuscript copies, logical divisions such as sections or paragraphs or line numbers (for verse) were the only alternative. But are such things needed for electronic documents that can be searched for exact phrases? Presumably not. So long as one can identify the electronic source one is referring to, searching would suffice. But there's the rub. There is no system of cataloguing material that is purely electronic in origin. The URL of a web site, for instance, is an unstable identifier, as we have learnt very quickly in the last decade or so. Printed books have that data, but what kind of unique identifier do electronic documents offer? There's no central clearing house, no Library of Congress or OCLC (the online cataloguing authority for books) or ISBN number as of yet.
When Michael Hart (an academic) started Project Gutenberg, he seems to have encouraged embedded page numbers in ASCII text for the reasons we've already discussed. So far electronic documents are a sort of free-floating, indistinct mass of various kinds of information. Without some standards of granularity or resolution, research will become too unwieldy; the Internet search engine demonstrates the problem all too well.
As a librarian (rather than as a programmer, who finds useful and efficient ways of designing specific solutions), I have to worry about this sort of thing increasingly. Page numbers are, in a manner of speaking, the tip of the iceberg.
Speaking of Google books (which BowerBird mentions above), shouldn't someone point out to them that the scanning is being rather carelessly executed? I keep running into instances of books that are so poorly positioned that part of the text is cut off, to say nothing of the page numbers.
--------------------------------------------------------------------------------

bowerbird
11-08-2007, 04:17 AM
sartori said:
> those pages I added were time consuming
> but mainly because I was figuring out the layout.
> I do plan on working through the whole book
> but I haven't found a plain text version available
> so I am ocr'ing the pdf from archive.org.
> This is currently the slowest part as I am
> proofing and converting quotes and dashes over.

um, gee, you might be missing something very important.

if you got it from archive.org, then it was almost certainly
scanned by the o.c.a., which means that -- right alongside
the .pdf copy -- you should find the o.c.r. they did on it...

i couldn't find volume 1, but some volumes from this series
certainly have their text available. sometimes you need to
click on the "ftp" link to find _all_ of the files that they offer.
if you see nothing labeled as ".txt", seek the "djvu.text" file.

however, in a spectacular display of sheer incompetence,
sometimes the text files are burdened by severe problems,
some of which can even border on fatal. i won't bother to
go into the details here, but check the text _carefully_ first,
before going on to pour work into it, or you might regret it.

so you might well end up doing o.c.r. on the .pdf anyway.
but i'd still suggest you should check out their text first...


> For example, if you increase the display font size
> in your browser, the pages expand lengthwise
> to accommodate it. It just runs into problems with items
> that are specifically positioned, such as the table of contents.

another problem that you need to be aware of -- which might
or might not be something you consider serious -- is when a
paragraph is split across a pagebreak -- as they usually are --
because then the text won't fill out the bottom line of the page,
which is what people expect to see in that situation. the reflow
(will often) end in the middle of the line, so the impression is that
the paragraph has ended, which can be disconcerting to people.


> If so it wouldn't be too hard to create a library of books
> that display paged as in my example but then you could
> easily convert them to lrf and ignore page numbers, etc.

but if the page as displayed doesn't fit correctly on the screen,
then you'll have "pagebreaks" occurring mid-screen, correct?
which kind of defeats the whole purpose of a paged display...

***

jbenny said:
> They have apparently OCRed the text

um, well of course google does o.c.r. on the scans.
how else would they be able to do searches on it?


> as you can "view text" for each individual page.

they do that so as to provide access to the visually-impaired.


> Sadly, the downloadable PDF doesn't include the OCRed text.

that's because they don't really want you to have the text.
well, they probably don't care if _you_ have it, but they
don't want all of the _other_ search engines to have it...

***

sartori said:
> I just checked those out and they appear to be from
> a slightly different version than the ones on archive.org
> (and they have all 31 volumes). As my goal is to represent
> the printed version, the differences may become a problem
> with page numbers being different.

so strange. did this series with 31 volumes _really_ go through
several editions? i guess it's not impossible, but it'd surprise me.
are you sure that it's not just _flakiness_ in the p.g. digitization?

because one appealing aspect of the p.g. versions in general is
that they've been subjected to some proofreading, which means
-- if nothing less -- that you can compare them to your output,
because the differences between the two versions will point to
errors in one (or both) of them. indeed, this has proven to be
one of the _most_ effective ways of doing "proofing" on a text...
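
A small sketch of that compare-two-digitizations approach, using Python's standard `difflib`. It compares word by word and reports only the places where the two texts disagree; which side is wrong still has to be checked against the scan. The function name and the sample strings are illustrative.

```python
from difflib import SequenceMatcher


def disagreements(text_a: str, text_b: str):
    """Return (words_from_a, words_from_b) pairs where the two texts differ."""
    a, b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[a1:a2]), " ".join(b[b1:b2])))
    return diffs


ours = "The sharp smell of spirits went through the room."
theirs = "The sharp smcll of spirits went through the room."
print(disagreements(ours, theirs))
```

Splitting on whitespace also makes the comparison indifferent to where each version happens to break its lines.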

-bowerbird

DaleDe
11-08-2007, 12:35 PM
Kovidgoyal:
> Yes, I think that "resolution" is the problem. Paragraph numbers would probably work well for everything but poetry, though in some cases--such as the one you mention--larger units might be more practical. Page numbers work if one can pinpoint the exact edition (publisher, place, date, in addition to title and author) being referenced; that was the contribution of printing. For manuscript copies, logical divisions such as sections or paragraphs or line numbers (for verse) were the only alternative. But are such things needed for electronic documents that can be searched for exact phrases? Presumably not. So long as one can identify the electronic source one is referring to, searching would suffice. But there's the rub. There is no system of cataloguing material that is purely electronic in origin. The URL of a web site, for instance, is an unstable identifier, as we have learnt very quickly in the last decade or so. Printed books have that data, but what kind of unique identifier do electronic documents offer? There's no central clearing house, no Library of Congress or OCLC (the online cataloguing authority for books) or ISBN number as of yet.


I think paragraphs work for everything. After all a Stanza of poetry is really a paragraph in effect. The idea is that the logical unit of the author is the paragraph. It is a cohesive thought and can often be determined easily even when the editions change. Paperback vs. hardback makes page number useless again.

you are correct that text can be searched but sometimes there are duplicates if you fail to type enough and searches sometimes fail on word wraps. A specific reference is what is needed when referencing someone else's work.
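
Both failure modes can be shown in a few lines. This sketch flattens line wraps before searching and counts the hits, so a caller can tell "not found" from "found once" from "ambiguous". The function name and sample text are made up for illustration.

```python
def count_matches(text: str, phrase: str) -> int:
    """Count occurrences of phrase in text, ignoring how lines are wrapped."""
    flat = " ".join(text.split())       # collapse newlines and runs of spaces
    needle = " ".join(phrase.split())
    return flat.count(needle)


sample = "The smell went\nthrough the room. He went through\nthe door."
print(sample.count("went through the room"))          # naive search misses the wrap
print(count_matches(sample, "went through the room")) # found once after flattening
print(count_matches(sample, "went through"))          # too short: ambiguous
```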

You also raise a good point about electronic text not being able to be identified. This is particularly interesting when web pages are inherently copyrighted. It would seem that copyrights are not enforceable if you can't produce the original.

Dale

nekokami
11-08-2007, 12:39 PM
> You also raise a good point about electronic text not being able to be identified. This is particularly interesting when web pages are inherently copyrighted. It would seem that copyrights are not enforceable if you can't produce the original.

What about the Internet Archive?

DaleDe
11-08-2007, 12:46 PM
> What about the Internet Archive?

Are you talking about google? I think it is not guaranteed over the long haul.

Dale

bowerbird
11-08-2007, 01:48 PM
panurge said:
> Yes, I think that "resolution" is the problem.
> Paragraph numbers would probably work well
> for everything but poetry, though in some cases
> --such as the one you mention--larger units might be
> more practical.

ok, i'll try one more time.

_nothing_ will work if you don't have stable documents.
_nothing_. so stable documents is a necessary condition.

fortunately, stable documents is also a _sufficient_ condition.
once you have stable documents, just about _any_ system will
work, and work just fine, so you don't need to worry about it...


> Page numbers work if one can pinpoint the exact edition
> (publisher, place, date, in addition to title and author)
> being referenced; that was the contribution of printing.

assuming that you have an infrastructure of stable documents,
the u.r.l. to a document is the "pinpoint" to "the exact edition."

every document points to its "official" u.r.l., so you can compare it
with the document that appears at that u.r.l., and if it is the same,
it hasn't been tampered with, and you know it's a "legitimate" copy.
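
One way such a comparison could be done is with a cryptographic digest, sketched below. This is an assumption about mechanism, not something described in the post: it presumes you already hold the bytes of your copy and the bytes served at the official u.r.l., and it only works if the official copy never changes by so much as a byte.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of a document's bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_legitimate_copy(local_copy: bytes, official_copy: bytes) -> bool:
    """True if the local copy is byte-for-byte identical to the official one."""
    return fingerprint(local_copy) == fingerprint(official_copy)


official = b"Peter shuffled to his feet, caught up the tea-kettle"
print(is_legitimate_copy(official, official))
print(is_legitimate_copy(official.replace(b"Peter", b"Pavel"), official))
```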

as with everything else, a system with stable documents makes it
_easy_, whereas it's difficult -- often to the point of impossibility --
in a system without stable documents.


> For manuscript copies, logical divisions such as sections or
> paragraphs or line numbers (for verse) were the only alternative.

and we need to "update" all those archival pointers for the new system.
whatever pointer that one document used to point to another document
needs to be "converted" so the electronic version of the first document
points to the correct place in the electronic version of the second one...


> But are such things needed for electronic documents that
> can be searched for exact phrases? Presumably not.

unless your infrastructure is explicitly using "search" as its methodology,
in which case it's automatic, you don't want to force users to do search
just to activate a pointer. they'll wanna be able to click directly to a point,
and that's a reasonable expectation about a capability we should give 'em.


> So long as one can identify the electronic source one is referring to,
> searching would suffice.

again, the source is unequivocally identified by virtue of its u.r.l.
and even if searching would "suffice", it's not convenient enough.


> But there's the rub. There is no system of cataloguing material
> that is purely electronic in origin. The URL of a web site, for instance,
> is an unstable identifier, as we have learnt very quickly in the last decade

the current system, one which permits unstable documents, won't work.

we need another system -- it could be built on top of the current one --
that has _only_ stable documents in it. this means we can still have the
unstable system -- there's no need to replace it, as it works fine for a
good many purposes -- it just means we have to create another system
that's fully intended to be a permanent archive for dependable reference.

as i said, this stable system could even be built on top of the current one.
if we incorporated a "datestamp" into the u.r.l., and then made sure that
we archived _everything_ that was _ever_ put on the web (which is not
as absurd as it sounds, since we're _almost_ doing it already), then we
will essentially _have_ the stable infrastructure that's required, at no cost.
(the wayback machine at internet archive is the best example of this now.)
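
For a concrete picture of a datestamp in the u.r.l., here is a sketch that builds an address in the style the wayback machine uses: a 14-digit timestamp followed by the original address. The layout is written from memory of how its public links look, so treat it as an assumption, and note that building such a link says nothing about whether a capture exists for that moment.

```python
from datetime import datetime


def datestamped_url(url: str, when: datetime) -> str:
    """Prefix a URL with a YYYYMMDDhhmmss stamp, wayback-machine style."""
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{url}"


print(datestamped_url("http://z-m-l.com/go/myant/myantp189.html",
                      datetime(2007, 11, 8, 13, 48, 0)))
```

The point is that the pair (address, date) names one fixed version, which a bare address cannot do.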


> Printed books have that data, but what kind of unique identifier
> do electronic documents offer?

none. until, that is, we give them one. which isn't difficult to do at all...


> There's no central clearing house, no Library of Congress or OCLC
> (the online cataloguing authority for books) or ISBN number as of yet.

don't need that. wouldn't want that. this is an easy problem to solve.
it just requires always-getting-cheaper diskspace, and the commitment.


> Speaking of Google books (which BowerBird mentions above),
> shouldn't someone point out to them that the scanning is being
> rather carelessly executed?

oh, it's been pointed out. over and over and over and over and over.
even by some of its big supporters, like me. repeatedly. problem is,
it just doesn't seem to be sinking in, not quite as deeply as it should.
(they _have_ improved. but quality, and quality-control, is still awful.)

-bowerbird

bowerbird
11-08-2007, 02:00 PM
dalede said:
> I think paragraphs work for everything.

well, almost _anything_ will "work for everything"
if everyone agrees on how it will be implemented.


> After all a Stanza of poetry is really a paragraph in effect.

i know some people who would argue with you about that.
for a long time. they'd call you bad names for saying that.


> The idea is that the logical unit of the author is the paragraph.

maybe in your mind. but other authors could be very different.


> It is a cohesive thought and can often be determined
> easily even when the editions change.

i can show you edition-changes with changed paragraphs.

(but that's really neither here nor there, because any system
has to consider different editions to be different documents.
every pointer has to be relative to a specific edition, or else
you start getting into all kinds of very confusing messiness.)


> Paperback vs. hardback makes page number useless again.

not really. even if the pagination is different between the two
-- and sometimes it's not, but that's beside the point here --
when you're making a link, you simply link to one or the other...


> you are correct that text can be searched but sometimes
> there are duplicates if you fail to type enough and
> searches sometimes fail on word wraps. A specific reference
> is what is needed when referencing someone else's work.

there are some people who say that, because this issue is so
thorny right now, in our world of unstable documents, that
any text that you want to quote should just be included in
your own document. it's easy enough with copy-and-paste.

("what if you want to cite a whole article or book?", you ask.
then a system based on _search_ won't work for you anyway,
which is part of the problem with specifying such a system...)

-bowerbird

nekokami
11-08-2007, 04:49 PM
> Are you talking about google? I think it is not guaranteed over the long haul.
>
> Dale

No: http://www.archive.org

DaleDe
11-08-2007, 05:12 PM
> No: http://www.archive.org

Interesting. I didn't know that it existed. I searched it for my name and got zero hits but if I search google I get almost 10,000 hits so I think their search engine isn't too good. Thanks for the information.

Dale

nekokami
11-08-2007, 07:29 PM
I think it works best if you have an actual website to search for. I've managed to use it to dig up all kinds of pages that have disappeared over the years.

DaleDe
11-08-2007, 08:09 PM
> I think it works best if you have an actual website to search for. I've managed to use it to dig up all kinds of pages that have disappeared over the years.

Thanks, that works for me. I can see this is valuable. Looks like my site has about 8 years of history stored there.

Dale

Panurge
11-08-2007, 11:03 PM
DaleDe: "I think paragraphs work for everything. After all a Stanza of poetry is really a paragraph in effect. The idea is that the logical unit of the author is the paragraph. It is a cohesive thought and can often be determined easily even when the editions change. Paperback vs. hardback makes page number useless again."

Unfortunately, much poetry is not in stanzas, especially when it is written in blank verse.

kovidgoyal
11-08-2007, 11:10 PM
Well, there's no real reason why you can't have paragraphs that are a single line long for blank verse.

Panurge
11-08-2007, 11:14 PM
BowerBird: 'it is _not_ our job to make "a faithful representation of the print copy". we don't even _want_ to do that -- even if we could -- and we _cannot_, because any time you move a document from one medium to a completely different one, you're creating a new edition.'

I don't think a representation of the print copy in its visual layout is necessary, just an exact transcription of the content. I want to know that no words (or other linguistic elements, such as punctuation and paragraphing, for instance) have been added or subtracted without some sort of indication that something was changed from the original source. That has been solid scholarly practice from the beginning, which is not the same as a facsimile. If the electronic version is guaranteed identical in that sense, then I would rarely need page numbers.

By the way, I find BowerBird's examples very interesting.

sartori
11-09-2007, 01:10 AM
> I don't think a representation of the print copy in its visual layout is necessary, just an exact transcription of the content.

Panurge,

While I agree with the above statement when applied to research purposes, I feel that a decent representation of the source layout can add to the enjoyment when reading for pleasure. The attached image from Alice In Wonderland is one example where the layout of the text adds something to the book. I think if you start out with a solid facsimile of the original it's pretty easy to convert that to plain text or any other format for research purposes.

Of course, I agree with you that laying this out can be done pretty easily for a web page, but as soon as you try to convert it for different devices like the Sony Reader it's pretty much a lost cause.

Ultimately I would love it if you could start with a master document that is a facsimile of the original print version for viewing online and then export it for individual devices and have the server 'automatically' remove any markup that is not supported for that device.

Just my 2 cents.

Rob

http://www.britdesigner.com/0054.jpg

kovidgoyal
11-09-2007, 01:20 AM
The problem is that you simply cannot reproduce a static layout faithfully in a reflowable format that will work at different sizes. They are two different things. I think we just have to accept that.

sartori
11-09-2007, 01:27 AM
I've been thinking about the issue of identifying which version of a document you may be looking at while researching. For example:

Say I quote chapter 3, paragraph 11 from a book listed on site 1 that is listed as Alice In Wonderland.epub. Somebody looking at my work decides to look up the quote from a document called Alice In Wonderland.epub on site 2. The only problem is site 2 has marked the paragraph starting point incorrectly, so my reference makes no sense.

I have read that each epub document (and probably most others) requires an ID number. Could this ID number be a 10-digit checksum generated from the actual content of the html source? That way, even if one character is changed in the source, the checksum would change.

Then when I reference my quote it could be something like Chapter 3, Paragraph 11 - Alice In Wonderland.epub [5684937643]. It should be pretty easy to create a tool that would verify the checksum I typed. Now I could verify any document as being the same one originally referenced no matter where the file was obtained.
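
A rough sketch of how such a checksum citation could be produced and verified. It assumes the 10 digits are taken from a SHA-256 digest of the raw file bytes; that choice, and the function names, are illustrative only and not part of any epub specification.

```python
import hashlib


def ten_digit_checksum(content: bytes) -> str:
    """Reduce a SHA-256 digest of the content to a zero-padded 10-digit decimal string."""
    digest = hashlib.sha256(content).digest()
    return f"{int.from_bytes(digest, 'big') % 10**10:010d}"


def cite(chapter: int, paragraph: int, filename: str, content: bytes) -> str:
    """Build a citation in the 'Chapter 3, Paragraph 11 - file.epub [checksum]' form."""
    return (
        f"Chapter {chapter}, Paragraph {paragraph} - {filename} "
        f"[{ten_digit_checksum(content)}]"
    )


def verify(citation: str, content: bytes) -> bool:
    """Check that the bracketed checksum in a citation matches this copy of the file."""
    quoted = citation.rsplit("[", 1)[1].rstrip("]")
    return quoted == ten_digit_checksum(content)


source = b"<p>Alice was beginning to get very tired.</p>"
citation = cite(3, 11, "Alice In Wonderland.epub", source)
print(citation)
print(verify(citation, source))         # the same bytes verify
print(verify(citation, source + b" "))  # a one-character change is expected to fail
```

Ten digits are enough to catch accidental differences between copies, but far too short to resist deliberate tampering; a full-length hash would be the safer choice there.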

Edit: Of course this does nothing to help verify that the document I quoted from was correct in the first place.

Rob

jbenny
11-09-2007, 03:44 AM
> I've been thinking about the issue of identifying which version of a document you may be looking at while researching. For example:
>
> Say I quote chapter 3, paragraph 11 from a book listed on site 1 that is listed as Alice In Wonderland.epub. Somebody looking at my work decides to look up the quote from a document called Alice In Wonderland.epub on site 2. The only problem is site 2 has marked the paragraph starting point incorrectly, so my reference makes no sense.
>
> I have read that each epub document (and probably most others) requires an ID number. Could this ID number be a 10-digit checksum generated from the actual content of the html source? That way, even if one character is changed in the source, the checksum would change.
>
> Then when I reference my quote it could be something like Chapter 3, Paragraph 11 - Alice In Wonderland.epub [5684937643]. It should be pretty easy to create a tool that would verify the checksum I typed. Now I could verify any document as being the same one originally referenced no matter where the file was obtained.
>
> Edit: Of course this does nothing to help verify that the document I quoted from was correct in the first place.
>
> Rob

You are referring to the "identifier" in an epub, which is one of three required metadata elements (title and language are the other two). There are several other metadata elements which are optional. In the following example, the place where I have put x's is where the identifier would go. This example is from the "content.opf". The same identifier also goes in the "toc.ncx", using a different statement.

<dc:identifier id="BookID">urn:uuid:xxxxxxxxxxxxxxxxxxx</dc:identifier>

Note that the identifier is required to be unique, such that no other epub should have the same ID.

For a commercial ebook, the identifier would be the ISBN. For ebooks without an assigned ISBN, some other means of identifying the ebook is needed. Unless I missed it in the OPS specification, I don't see that it recommends any particular method. However, a UUID (GUID) seems to be the most logical solution, as discussed elsewhere on this forum (and the format of the above statement even implies the use of a UUID). Feedbooks is using a UUID for epubs, according to Hadrien.
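
As an illustration of the UUID approach, here is a small sketch that fills in the identifier statement shown above. It uses Python's standard `uuid` module; the function name `identifier_element` is hypothetical, and the `BookID` value simply mirrors the example.

```python
import uuid


def identifier_element(book_uuid: uuid.UUID, element_id: str = "BookID") -> str:
    """Render a dc:identifier statement carrying the UUID in urn:uuid: form."""
    # UUID.urn yields e.g. 'urn:uuid:12345678-1234-5678-1234-567812345678'
    return f'<dc:identifier id="{element_id}">{book_uuid.urn}</dc:identifier>'


# A random (version 4) UUID for a newly created or newly edited epub.
new_edition = uuid.uuid4()
print(identifier_element(new_edition))
```

Generating a fresh UUID for every edited or updated version is what makes the identifier usable for telling editions apart.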

Assuming that a new ISBN or UUID is used whenever an edited or updated version of the original epub is created, this would take care of identifying a particular edition.

The identifier would seem to preclude using it as a checksum, due to the need for uniqueness. However, one of the optional metadata fields may be useable for such use. In fact, I don't see anything that says you can't use your own unique metadata element for this purpose. Of course, getting everyone to use such a method is another issue.

Adding a checksum (or better, a hash) would be a useful addition to the epub specification. You could certainly use it to verify that the contents haven't changed, as you suggested. Again, this may not be important for the casual reader, but people need to think about and find ways to accommodate ebook use by the academic community as well.
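
One way this could look, as a sketch only: hash the content documents in reading order and carry the result in a generic `<meta name="..." content="..."/>` statement in the "content.opf". The metadata name `content-sha256` is invented here for illustration; nothing in the specification defines it, and reading systems would simply ignore it.

```python
import hashlib
from xml.sax.saxutils import quoteattr


def content_hash(documents: list[bytes]) -> str:
    """SHA-256 over all content documents, taken in reading order."""
    h = hashlib.sha256()
    for doc in documents:
        # A length prefix keeps the boundaries between documents unambiguous.
        h.update(len(doc).to_bytes(8, "big"))
        h.update(doc)
    return h.hexdigest()


def hash_meta(documents: list[bytes]) -> str:
    """Render the hash as a meta statement (the name is a made-up convention)."""
    return f'<meta name="content-sha256" content={quoteattr(content_hash(documents))}/>'


chapters = [b"<p>Chapter one.</p>", b"<p>Chapter two.</p>"]
print(hash_meta(chapters))
```

Because the hash covers only the content documents and not the metadata file itself, storing it inside the epub does not change the value being stored.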

bowerbird
11-09-2007, 04:21 AM
panurge said:
> I don't think a representation of
> the print copy in its visual layout is necessary,
> just an exact transcription of the content.

well, i can understand a desire for that kind of product.

but i have absolutely no interest in making such a thing.

_my_ target is the human reader in the 21st century, so
i see my job as bringing that old p-book into cyberspace.

which can mean making a million little changes, and it's
not a good use of my time to keep track of all of them...

i mean, if you wanted to _pay_ me to do that job for you,
then i might consider it. (or not, since it'd be too boring.)

but i'm certainly not gonna use my volunteer time to do it,
because none of my target (readers) care about that stuff.

so you're just adding (immensely) to the _cost_ of the thing,
without providing any _benefit_ to my audience. so no deal.

-bowerbird

bowerbird
11-09-2007, 04:29 AM
sartori said:
> Now I could verify any document as being the same one
> originally referenced no matter where the file was obtained.

why not just point to a stable u.r.l. with the document you referenced,
so there is no need to jump through all of these verification hoops?

consider that you have to give such a u.r.l. to people _anyway_, for
those people who don't have a copy of the document to begin with...

-bowerbird

nekokami
11-09-2007, 09:29 AM
I agree with sartori and jbenny, adding a checksum/hash to the epub standard would be helpful. How would we formally suggest that to the committee?

bowerbird
11-09-2007, 12:31 PM
nekokami said:
> How would we formally suggest that to the committee?

i'm sure they have an e-mail address.

but you might want to spend just a _little_ bit of time
finding out how tenable your constructions really are,
and what that committee has already done, and the
research that's been performed in this field up to now
-- you know, just _educating_yourself_ on the topic --
before you consider making any "formal suggestions".

or maybe you won't want to do that, i don't know...

i'm not the boss of you, so i don't tell you what to do.

-bowerbird

jbenny
11-09-2007, 01:06 PM
> I agree with sartori and jbenny, adding a checksum/hash to the epub standard would be helpful. How would we formally suggest that to the committee?

There is a forum at the IDPF web site. Not very active, but someone may pay attention to a posting there.

Alexander Turcic
11-09-2007, 01:18 PM
> i mean, if you wanted to _pay_ me to do that job for you,
> then i might consider it. (or not, since it'd be too boring.)

bowerbird,

This is the only time I am going to say it, and at this point it's not a subject for debate: We do not tolerate personal attacks, flaming, disruptive behavior, or even insensitive remarks. You may have a different sense of what should be regarded as insensitive, and that's your personal right, but if you want to continue participating in our forums, I ask you to change your attitude and maintain the spirit of mutual respect within our welcoming and cordial community.

This has never been an issue at MobileRead before. I am tired of receiving messages from users who feel irritated and offended by your constantly denigrating behavior.

It's up to you.

bowerbird
11-09-2007, 01:50 PM
alexander-

i have one question. it's a serious question.
i would genuinely appreciate an answer from you.

are there really people here who consider
a suggestion to "educate yourself" to be
a "personal attack" or "flaming" or
"disruptive behavior" or even "insensitive",
people who send you messages saying they
"feel irritated and offended" by that?

because, you know, maybe i'm in the wrong place.

-bowerbird

Alexander Turcic
11-09-2007, 02:42 PM
There is a distinct difference between what you say and how you say it. If you don't understand this, then you may be right with your assessment that MobileRead isn't the right place.

bowerbird
11-09-2007, 03:15 PM
alexander-

since only the words themselves -- the "what" -- are there,
how can you -- or anyone -- tell me "how" i am saying them?

and when i inform you -- quite directly, with no uncertainty --
that i am _not_ "attacking" or "flaming" or "disrupting" or even
"being insensitive", do you mean to tell me that you understand
my internal motivations better than i do, and that i'm _wrong_,
because i really do intend to be doing all those negative things?

do you really believe you can honestly say you know me so well?

my messages are rorschach blots. each person needs to
take responsibility for the way that _you_ interpret them.

i am a gentle soul who writes posts from a good heart,
with a soft sweet voice, and the very best of intentions.
i speak the truth as i see it, because people deserve it.

i always stay as cool as a cucumber, never get heated,
and am willing -- indeed _proud_ -- to "own my words"
-- both here and now, and for the many decades to come.
i've never written a single post, anywhere, i'd take back...

so i'll repeat that, because i want it to sink in:
i am a gentle soul who writes posts from a good heart,
with a soft sweet voice, and the very best of intentions.

although i cannot force people to interpret them that way,
i can -- and _do_ -- take umbrage when they attempt to
force their own interpretation as being reflective of _me_.

because -- from my perspective -- that's extremely rude.
it's dishonest, and disrespectful of me, as a human being.

so yeah, maybe i _am_ in the wrong place...

-bowerbird

Alexander Turcic
11-09-2007, 03:19 PM
That's it. Already too many threads got side-tracked by talking about you rather than about the subject of the thread. Either stay and change your attitude or find another place where people have a better understanding of what you'd like to achieve.

bowerbird
11-09-2007, 03:29 PM
alexander said:
> Already too many threads got side-tracked by
> talking about you rather than about the subject of the thread.

i agree with you.

-bowerbird

bowerbird
11-09-2007, 05:00 PM
panurge said:
> Page numbers are simply a way of keeping track of pages.
> The earliest printed books don't have them. For incunabula,
> the books published in the second half of the 15th century,
> there were numbers, not of pages but of groups of pages,
> so that when the book was put together for binding the sections
> would not be out of order. Manuscripts may or may not have
> page numbers. Sometimes the first word of the following page
> was printed (or written) at the bottom of the preceding page
> to establish sequence.

i responded to this earlier by saying that my guess would be that
99.98% of the 7 million volumes that google scans at umichigan
would have pagenumbers, and that's where we needed to focus...

however, i wanted to get back to this, and additionally note that
the cyberlibrary of the future will (hopefully!) consist of more than
the books that were sitting on library shelves in our universities...

there is _so_ much more content out there than just those books...

there are pictures, and maps, and genealogical charts, and books
on local history that were published in very small runs and didn't
spread much farther than a few city libraries down the road, and
city council minutes, and local newspapers, and school calendars,
and blueprints of buildings, and diagrams of public sewer systems,
and aerial photos of the coast and roads and farms and villages,
and books of poetry, and correspondence (both public and private),
diaries and dirty magazines, and more and more and so much more.

in today's e-mail is a notice that the archives of ernest hemingway
have been made available, having been donated to the j.f.k. library:
> http://www.jfklibrary.org/Historical+Resources/Hemingway+Archive/

also today, a very interesting notice about the scrapbooks of suffragettes:
> Miller NAWSA Suffrage Scrapbooks, 1897-1911
> http://memory.loc.gov/ammem/collections/suffrage/millerscrapbooks/

just two examples of some fascinating aspects of our cultural heritage
that are _not_ bound within the pages of the books in our universities.

or here's a story about the new library director at harvard university:
> http://www.thecrimson.com/article.aspx?ref=520414
he intends to write a book on book-smuggling across the french border
during the 18th century, a book that he says was inspired by an archive
of some 50,000 unpublished letters that he found in an old swiss town.
putting those letters online, so each one of them could be accessed by
a person reading this book, is one of the things that becomes possible
when we've made the commitment to put our cultural heritage online.

it's very unclear whether society will have the intelligence to _fund_
the digitization of all these other elements of our cultural heritage.
i can't even lie to you and say that i think we will. but nonetheless,
we need to make a referencing system that can point to these _other_
things just as efficiently as it points to the _pages_ in library books...

however, the system that i've built for those pages in library books
also proves to be well-suited for those other purposes, fortunately.

if you look closely, you'll see that i built a very tight binding between
the _page_ of each book and the _u.r.l._ which "houses" that page...

for instance, here's the u.r.l. for page 123 of "my antonia":
> http://z-m-l.com/go/myant/myantp123.html

ignore the first part -- http://z-m-l.com/go/ -- referring to my site.
and the last part -- p123.html -- that's pointing to one specific page.

the secret-sauce here lies in the _middle_part_ -- "myant/myant"...

that secret-sauce middle-part helps create a unique website address,
and that tells us _exactly_ what book we are in, with no uncertainty,
since one -- and only one! -- file can be present at this exact u.r.l.,
so there's no question about "which" version of "my antonia" this is
-- it's the version which is located at this webpage. by definition...
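
A small sketch of that page-to-u.r.l. binding, generalized from the single example address above. The pattern (book key as folder name, book key repeated as filename prefix, then "p" and the page number) is inferred from that one address, so treat it as an assumption; the function names are invented for illustration.

```python
import re

BASE = "http://z-m-l.com/go/"

# The book key must appear twice: once as the folder, once as the filename prefix.
PAGE_URL = re.compile(
    r"^http://z-m-l\.com/go/(?P<book>[a-z0-9]+)/(?P=book)p(?P<page>\d+)\.html$"
)


def page_url(book: str, page: int) -> str:
    """Build the address for one page of one specific version of a book."""
    return f"{BASE}{book}/{book}p{page}.html"


def parse_page_url(url: str) -> tuple[str, int]:
    """Recover (book key, page number) from an address, or raise ValueError."""
    match = PAGE_URL.match(url)
    if match is None:
        raise ValueError(f"not a page address in the expected form: {url}")
    return match.group("book"), int(match.group("page"))


print(page_url("myant", 123))  # http://z-m-l.com/go/myant/myantp123.html
print(parse_page_url("http://z-m-l.com/go/myant/myantp123.html"))  # ('myant', 123)
```

Because the mapping works in both directions, a citation can be stored either as the address itself or as the (book key, page) pair without losing anything.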

we will have _other_ versions of "my antonia" -- the second printing
of this same edition, or a completely different edition, or a version
to which we have added a complex set of annotations, or whatever --
but each of those _other_ versions will have to be at _another_ u.r.l.,
because the version at _this_ particular u.r.l. is this particular version.

so, by pointing to this u.r.l., we're indicating one-and-one-only version,
and there's absolutely no ambiguity on which version we're referencing.

and since you'll remember that i've stipulated that we have stable u.r.l.'s,
there's also no uncertainty that a pointer to this u.r.l. will _always_ point
to this specific version, and _only_ it, now and _forever_ into the future...
so anyone who wants to know, exactly and explicitly, what you reference
can simply go there and see it, immediately. so we're on the same page.
(sorry, i can never resist using that one.) ;+)

the same logic applies to the other items we have in our stable archive...

each one of those 50,000 letters mentioned above would be located at
its own page, so we know we can identify and reference every single one,
unequivocally and unmistakably.

in the example u.r.l. given above -- a page from a book -- it was _vital_
that the u.r.l. give some kind of clue about the contents of what it held...

amazingly, a lot of cyberspace libraries get this wrong, wrong, all wrong.

not only is there an absence of a bond between the content and its u.r.l.,
sometimes there is an _inability_ to link a filename to _specific_ content,
because different files have actually been given the very same filename!

trot on over to "mirlyn", the online library for the university of michigan,
for example, and poke around in their electronic-holdings. try here:
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?id=mdp.39015003659078&seq=123

now, save that image to your hard-drive, and you'll find its name was:
> 00000147.tif.100.0.png

the "100" and the "0" and the stuff at the end are viewing options.

the _meat_ of this filename is this:
> 00000147.tif

first of all, notice that this page was _actually_ page _123_ in the book.
so giving it a filename of "00000147.tif" is... well, it's kind of ridiculous.

but it gets worse. much, much worse...

explore other books, and you'll see many have a file named "00000147.tif".

for instance, here's one, this time from "books and culture", by mabie:
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?id=mdp.39015016881628&seq=147
(this is actually page 141, so its filename isn't bonded to its content either.)

indeed, _every_ book that has more than 147 pages (counting frontmatter)
will have a file named "00000147.tif".

if you've got lots of books -- and umichigan has, quite literally, _millions_ --
giving a file from each one the name of "00000147.tif" is outright stupid...

that means they have to depend on the _folders_ (or _subdirectories_)
to tell them apart, which is a very bad accident just waiting to happen.
if files get written to the wrong directory, how would you ever know?
if a foldername accidentally gets changed, how would you ever know?
only when some user comes and says, "hey, i was supposed to get
"my antonia" by willa cather, but i got "books and culture" by mabie,
so what's up with that?"

cyberspace libraries need to follow some common-sense rules:
1. a file's name _must_ identify the content it contains, unequivocally.
2. the same file _should_ have the same name, no matter where it is.
3. different files _must_ have different names (cannot have the same name).
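
The three rules can be sketched in a few lines. This assumes a naming pattern of book key plus printed page number, as in the "my antonia" address earlier; the key `booksculture` and both function names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def content_name(book_key: str, printed_page: int, ext: str = "tif") -> str:
    """Rule 1: the name alone says which book and which printed page this is."""
    return f"{book_key}p{printed_page}.{ext}"


def name_collisions(paths: list[str]) -> dict[str, list[str]]:
    """Rule 3 check: list every bare filename that occurs in more than one folder."""
    seen = defaultdict(set)
    for raw in paths:
        path = PurePosixPath(raw)
        seen[path.name].add(str(path.parent))
    return {name: sorted(dirs) for name, dirs in seen.items() if len(dirs) > 1}


# Sequence-numbered scans: two different books, one shared filename.
scans = ["39015003659078/00000147.tif", "39015016881628/00000147.tif"]
print(name_collisions(scans))

# Content-based names stay distinct even if the folders are lost or renamed.
named = [
    "myant/" + content_name("myant", 123),
    "booksculture/" + content_name("booksculture", 141),
]
print(name_collisions(named))  # {}
```

Rule 2 follows from the same function: the name is computed from the content's identity, never from where the file happens to be stored.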

there's more rules than that, but let's just stick with those for now.

but, as we've seen above, even a huge and sophisticated library like
the university of michigan cannot get even this simple thing right...
it's sad, i tell you, it's really sad...

on the bright side, however, once you follow this simple naming convention,
all of a sudden every document in your library has a unique name that can be
linked to a matching unique u.r.l., and referencing just became very simple...

for example, in the case of those 50,000 letters, i might give them names
that would correspond to their dates, or maybe their sender or recipient,
or some combination. (it's hard to say without first perusing their content.)
but whatever the names i gave them, they would then bond to their web u.r.l.

and this, ultimately, is the way that you can deal with "unnumbered pages".
you _give_ them a number, or a name, and that name becomes their u.r.l.

-bowerbird

Patricia
11-09-2007, 05:29 PM
Truly, it is difficult to believe Bowerbird's description of himself as a 'gentle soul' as credited beneath his name.

bowerbird
11-09-2007, 05:32 PM
patricia said:
> Truly, it is difficult to believe Bowerbird's description of himself
> as a 'gentle soul' as credited beneath his name.

***

alexander said:
> Already too many threads got side-tracked by
> talking about you rather than about the subject of the thread.

i agree with you, alexander...

-bowerbird

NatCh
11-09-2007, 05:42 PM
> Truly, it is difficult to believe Bowerbird's description of himself as a 'gentle soul' as credited beneath his name.

We can all edit our titles from the User Control Panel (http://www.mobileread.com/forums/usercp.php), Patricia, though the system does seem to have a stock of standard ones that it uses based on posting numbers, near as I can tell.

Panurge
11-11-2007, 12:35 AM
Bowerbird: "panurge said:
> I don't think a representation of
> the print copy in its visual layout is necessary,
> just an exact transcription of the content.

well, i can understand a desire for that kind of product.

but i have absolutely no interest in making such a thing.

_my_ target is the human reader in the 21st century, so
i see my job as bringing that old p-book into cyberspace."

I think that the recent posts have been getting back to the subject at hand, so let's clarify and sum up the main points. The questions under discussion seem to be 1) How do we identify the 21st century e-book as a unique object so that something on the order of an ISBN number can be assigned to it (there have been several suggestions so far)? 2) in the case of what we might call a "virtual reprint" or "electronic re-edition" of a book, which features are essential to preserve? and 3) what kind of markup language, format, or programming might accomplish these things?
In the past when books were reprinted, they were reset typographically for the most part, so that pagination was different. In some cases, the American and British editions of the same book from the same publisher were different--to the point of exhibiting fresh typos in one or the other. Nevertheless, if you referred to a page number and gave the relevant publication data (place, publisher, edition--where different from the 1st edition--date of publication), you could at least locate the citation. If the text of a print book is accurately reproduced in an electronic edition, that should suffice for most purposes. One needn't preserve the original page numbers unless it is important to refer back to the original print edition.
But in some cases, that becomes important. Allow me to offer an example. Lewis Carroll was fastidious about the placement of illustrations in relation to text in Alice in Wonderland. Nevertheless, few reprints respect his intentions. There have been a few facsimile editions or photo-offset printings, including an excellent PDF e-book version a number of years ago that preserved the exact page layout of the original editions. Most don't. Here the PDF format served a useful purpose. The ability to reflow text would not be appropriate in this instance.
So one must discriminate. Most books have illustrations that are placed as closely as practical to the text which refers to them, but often as not they have to be located at some distance elsewhere in the volume (particularly if they are in color). Hyperlinks surely improve on the original in that case. Although for Alice in Wonderland, hyperlinked images might be an interesting alternative mode of presentation, they wouldn't represent the author's original intention. In some cases that would be difficult to accomplish because the image is tied to the format of a print book. Example: in Through the Looking-Glass, when Alice goes through the mirror, the two before and after images in the original edition were printed in exact register on the front and back of the same sheet, so that when you turned the page, she "passed" through the mirror, as it were. Again, few of the many subsequent print editions of the book after the author's lifetime respect that feature.
Of course, I am speaking of a small number of exceptions. Most books do not offer these kinds of peculiar problems of reproduction. In the vast majority of cases, even the page numbers are not important in an electronic edition, unless we regard the e-book as a mere ghost of the original, not as an edition in its own right. What makes it an edition in its own right is some way of identifying or cataloguing it as distinctive or unique, just like a reprint or new edition of a print book.
I hope that I am being clear about this point and not merely verbose; it has already taken far too many words to explain the nature of the problem. But if we want electronic texts to be universally acceptable and appreciated, especially among future scholars or academics, then we have to treat them at least as seriously as we have treated reliable print editions in the past. I think that the discussion has generated some very good ideas so far, and I hope that they can be further elaborated and put to use in a fruitful way, especially if we want all these wonderful volunteer efforts to make free e-books available to all and have them receive the respect they deserve--as I think we all do.

aapezzuto
11-11-2007, 06:41 AM
I'm sorry to be jumping in this so late...

This is an interesting problem, but I think that it is important to accept that the current system is a hack-job as well. The current system is the equivalent to "gotten from page 57 of the book downloaded from www.something.com version 52.01b on sept 15 2005, viewed on a reb 1150, at medium font size, using default fonts." The exact printing, and page number have to be referenced, as well as enough information to distinguish the work. All this mess to get what level of granularity? You only get within a few paragraphs! So the first question in looking forward is what level of granularity do you want to have, and what should it be based on. Looking back the only real option is to have a page by page reference (paper or electronic) of every thing that has ever been printed... It is a broken system!

As far as reference goes, I think that a system of successively smaller chunks, narrowed until the desired resolution is reached, would be ideal. Choices like what sorts of content are "valid" at each level would be important, but in normal works chapter, paragraph, and then character range would be possible. Section, sentence, word would be similarly reasonable. The most important part of this would be having a standard. Having a standard edition to pull from is similarly important.

If a standard was adopted, the first big sign that it was being used would be that academic, printed journals would adopt the numbering in their publications. I think this would be just amazing... as the current system was accepted based on its retrospective convenience, and little else.

If such a system was adopted, I think the next question would be, how would you update all of the archaic references in previously printed works?

Life and times of Bilbo Baggins, page 54
might become
Life and times of Bilbo Baggins, v1.00: chapter 2: paragraph:16-19

That's not all bad, is it?
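
A sketch of how a reference in that form could be read and written by software. The exact syntax is taken from the one example line above; the class and field names are invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    title: str
    version: str
    chapter: int
    first_paragraph: int
    last_paragraph: int

    def __str__(self) -> str:
        if self.first_paragraph == self.last_paragraph:
            span = str(self.first_paragraph)
        else:
            span = f"{self.first_paragraph}-{self.last_paragraph}"
        return f"{self.title}, v{self.version}: chapter {self.chapter}: paragraph:{span}"


PATTERN = re.compile(
    r"^(?P<title>.+), v(?P<version>[\w.]+): chapter (?P<chapter>\d+): "
    r"paragraph:(?P<first>\d+)(?:-(?P<last>\d+))?$"
)


def parse_reference(text: str) -> Reference:
    """Split a reference into title, version, chapter and paragraph range."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized reference: {text}")
    first = int(match["first"])
    last = int(match["last"] or first)  # a single paragraph has no '-last' part
    return Reference(match["title"], match["version"], int(match["chapter"]), first, last)


ref = parse_reference("Life and times of Bilbo Baggins, v1.00: chapter 2: paragraph:16-19")
print(ref.chapter, ref.first_paragraph, ref.last_paragraph)  # 2 16 19
print(ref)
```

Finer levels (sentence, word, character range) could be added as further optional fields in the same way.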

nekokami
11-11-2007, 10:05 AM
What all this makes me think of is that a tool like the iLiad (which seems to be becoming preferred hardware for scholarly use of ebooks) ought to be able to tell the reader more granular information about where the reader is in the book, e.g. what paragraph the reader clicks on. Probably that would need to be an enhancement to FBReader and ipdf.

This doesn't in any way disagree with what's been said above about the importance of a unique identifier per edition, etc., I'm just noting that our ebook reading software currently doesn't display the relevant information even if we address those issues.

bowerbird
11-11-2007, 11:09 AM
first decide how to count paragraphs. (it's not that easy.)

kovidgoyal
11-11-2007, 11:44 AM
> first decide how to count paragraphs. (it's not that easy.)

Only in txt files, apart from that it's not that hard ;-)

bowerbird
11-11-2007, 11:48 AM
provide the program to do it for (1) html, (2) pdf, and (3) rtf, and i'll believe you...

kovidgoyal
11-11-2007, 11:59 AM
See pdf2lrf, html2lrf and rtf2lrf. LRF has a well defined notion of what a paragraph is, so my mapping implicitly identifies paragraphs.

It's not perfect in that it will treat headings as paragraphs as well, but I don't see that being a problem for reference work.

And I can't really take the credit for pdf and rtf, as I use other people's converters to convert them to html first.

And also note that these mappings are not infallible. You can produce files that humans would think contain paragraphs but the converters don't. However, as demonstrated by the wide use of these tools, they are largely successful on real world files.

bowerbird
11-11-2007, 12:52 PM
so the counting isn't "perfect", and the mappings are "not infallible". that's my point.

kovidgoyal
11-11-2007, 12:59 PM
And they never will be. If you think different, you're going to be disappointed. And really information isn't that fragile. A little bit of fuzziness doesn't matter.

bowerbird
11-11-2007, 01:06 PM
how much "fuzziness" do we want to allow in our referencing system? (i can't answer that.)

kovidgoyal
11-11-2007, 01:08 PM
As far as referencing is concerned, the fuzziness doesn't matter at all, since whatever fuzziness is there, it is consistent.

bowerbird
11-11-2007, 01:41 PM
huh? if a pointer points to the wrong place, because of miscounting, is that acceptable?

kovidgoyal
11-11-2007, 01:50 PM
Suppose you prepare an ebook in html, say. Now you want to add a reference to a particular paragraph in the ebook. In order to do this you run some algorithm to find out what the number of the paragraph is (say by previewing the book in a viewer app, in reference mode). Then you add the reference. Now if a user of the book tries to follow that reference, all he need do is use the same viewer app/algorithm to decode what that reference points to.
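
As a rough illustration of "some algorithm", here is a minimal sketch of one possible paragraph-numbering rule for html, using only Python's standard `html.parser`. It is not the code behind html2lrf or any viewer; the choice of which tags count as paragraphs (`BLOCK_TAGS`) and the rule of skipping empty blocks are assumptions made for the sketch. What matters for referencing is that author and reader apply the same rule.

```python
from html.parser import HTMLParser

# Assumption: these tags count as "paragraphs". Headings are included,
# matching the behaviour described for the LRF mapping earlier in the thread.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphNumberer(HTMLParser):
    """Collects the text of each block-level element in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # text of each counted block
        self._open = None      # tag of the block currently being collected
        self._buffer = []      # text fragments of that block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS and self._open is None:
            self._open = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == self._open:
            # Collapse runs of whitespace so formatting changes don't alter the text.
            text = " ".join("".join(self._buffer).split())
            if text:  # assumption: empty blocks are not numbered
                self.paragraphs.append(text)
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._buffer.append(data)


def number_paragraphs(html):
    """Return {paragraph number: text}, numbering from 1."""
    parser = ParagraphNumberer()
    parser.feed(html)
    parser.close()
    return {n: text for n, text in enumerate(parser.paragraphs, start=1)}


sample = "<h1>Chapter 1</h1><p>First  paragraph.</p><p></p><p>Second <em>one</em>.</p>"
numbered = number_paragraphs(sample)
for n, text in numbered.items():
    print(n, text)
```

Whatever fuzziness this rule has (headings counted, empty blocks skipped), it produces the same numbers every time it is run on the same file, which is the consistency the argument relies on.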

bowerbird
11-11-2007, 02:10 PM
right. we need to agree on a methodology. that's my point. so invent one everyone likes.

kovidgoyal
11-11-2007, 02:13 PM
As long as the author of the reference includes a link to the methodology she uses, there is no need for everyone to use the same methodology.

The solution you propose, inventing an ebook format that everyone likes, is far, far less likely.

bowerbird
11-11-2007, 02:28 PM
and if paragraph 81 means one thing to one author, and another thing to another, no big deal?

kovidgoyal
11-11-2007, 02:32 PM
As long as they work on separate texts, yes. If they're working on the same text, I'm sure they can agree on what app/algorithm to use for para numbering.

bowerbird
11-11-2007, 03:32 PM
you just said agreement wasn't necessary! now you say it is. ok, i'm off this merry-go-round...

kovidgoyal
11-11-2007, 03:39 PM
Sigh, do you ever actually bother to think about what I say? I said agreement between authors and readers is not necessary. Agreement between two authors working on the same text is obviously necessary.

bowerbird
11-11-2007, 03:52 PM
if other people get on your ride, they can go in circles... but i'm off the merry-go-round.

kovidgoyal
11-11-2007, 04:31 PM
> if other people get on your ride, they can go in circles... but i'm off the merry-go-round.

You are now on my ignore list.

bowerbird
11-11-2007, 04:46 PM
that's great news. i'm glad to hear it!

GregS
11-11-2007, 05:48 PM
What a great thread.

I only have a few minutes before work. Legacy page numbers are very important, and so is the fact that different printed editions had different page numbers. For epub and other light-weight encodings, the idea of having a multitude of "Milestones/page numbers" scattered within the text is not very elegant.

The idea of having these in a meta file seems a good, elegant solution. But the references, being external, also need a way of identifying the exact positioning without planting targets in the body of the work (why not then put the references directly into the body?).

IDs are in a sense already there. And there is no predicting what is the most important structural unit of every text, nor the particular numbering system that should be used.

But IDs are definitely the right thing to be using for the future.

All established numbering systems create unique references (with xmlnamespace implied): II.iii.486 gets me to the same place in Hamlet (it isn't broken, so why fix it?).

But in a meta-file, how do I specify that in the folio edition of Hamlet page x ended not only mid-line, but mid-word?

II.iii.486/4/3 - word 4 letter 3.

That is possible, but not catered for in any scheme I know of.

Legacy page numbers, and in some instances multiple legacy page references, can be essential. They should be within the etext, ideally, but there is much to be said for having them external: after all, they need only surrender their positions on request; they do not have to be displayed. Hence, given the particular IDs I am interested in and the edition I am referencing, it is a trivial lookup to get the page numbers (without the problem of which of the editions should display their numbers).
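
A minimal sketch of what such an external lookup might look like follows. Everything in it is hypothetical: the element IDs, the edition names and the page numbers are invented, and the table only records page breaks that fall on element boundaries. A break in mid-line or mid-word would need a finer offset than this sketch stores.

```python
from bisect import bisect_right

# Document order of element IDs in the etext (hypothetical IDs).
ID_ORDER = ["ch1.p1", "ch1.p2", "ch1.p3", "ch2.p1", "ch2.p2"]

# External table kept outside the text: for each print edition, the element
# at which each printed page starts. Edition names and page numbers are made up.
PAGE_STARTS = {
    "edition-A": [("ch1.p1", 1), ("ch1.p3", 2), ("ch2.p2", 3)],
    "edition-B": [("ch1.p1", 5), ("ch2.p1", 6)],
}


def page_for(edition, element_id):
    """Return the printed page of `edition` on which `element_id` begins."""
    position = ID_ORDER.index(element_id)
    starts = PAGE_STARTS[edition]
    # Positions (in document order) at which each page of this edition starts.
    positions = [ID_ORDER.index(start_id) for start_id, _ in starts]
    # The page containing the element is the last page starting at or before it.
    slot = bisect_right(positions, position) - 1
    if slot < 0:
        raise LookupError(f"{element_id} precedes the first recorded page of {edition}")
    return starts[slot][1]


print(page_for("edition-A", "ch2.p1"))
print(page_for("edition-B", "ch2.p1"))
```

The text itself carries only the IDs; any number of editions can be added to the table without touching it, and none of their page numbers has to be displayed unless asked for.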

Fragmenting the CSS style sheet, and putting everything into the etext might also be possible, hence I specify for my usage, paragraph numbers to be displayed along with the page numbers (in the text) for a specific edition. It could overburden the standard and that has to be considered (ie making epub a poor substitute for TEI, which really it cannot do).

XMLnamespace has to be better solved than it is if this technology is to remain useful for texts. Establishing a numbering system via IDs is critical and should be a mandatory part of any standard (with or without legacy page number references; I am not dismissing those, they are very important).

bowerbird
11-11-2007, 06:11 PM
different editions are stored at different u.r.l., so there's no conflict with differing pagenumbers.

GregS
11-12-2007, 03:07 AM
bowerbird, it gets complex when dealing with academic texts.

A reference says page X of such and such edition, the text is the same in all editions, but finding it can be near impossible sometimes.

Hence a single edition of, say, a major work like Darwin's Origin of the Species will have radically different page numbers, while the text is unchanged.

Ideally in a single serious digital edition, several important paper editions' page numbers should be included (for works of academic, study, educational interest) along with absolute references (to chapter and paragraph) via element IDs and xmlnamespace.

I have problems with URLs because of their impermanence.

bowerbird
11-12-2007, 03:19 AM
as i said, you need to remove the impermanence, or nothing will work. _nothing_.

when you make a reference, it's reasonable that it's to a specific page in a specific edition.

GregS
11-12-2007, 03:56 AM
PS In terms of numbering systems (lines, paragraphs, etc.), I think we need a few recommended models, but literature is just too diverse to use the same system of numbering for every possible work (what about musical scores, for instance?).

Element IDs presume nothing. Sequential numbering runs from the biggest divisions through to the smallest useful structure, which will often be the paragraph, but not always.

As it is a unique sequential reference, so long as you can get there unambiguously and refer to it usefully, that is all that matters.

Take the Communist Manifesto that has numerous prefaces, but consistent enough text (bearing in mind different translations/editions).

German.Preface.1872.23 (I made the date up) consists of just two IDs: a division, "German.Preface.1872", and a paragraph.

The text itself has, since publication, used chapter divisions and subchapter divisions within which paragraphs reside and, if memory serves, numbered lists.

Treating the list as a paragraph is one option; treating it as another object is the other, hence Chapter.2.Section.3.list.1.item.3 works.

Same with illustrations, examples, quotes and other parts which may present problems being classified as paragraphs. A plate may be placed in the text in a digital edition, but in the original it belonged within its own section. Hence listing them as Plate.3 makes sense.

Horses for courses.

What would be nice is mandating early that important structural divisions should be sequentially numbered, and perhaps a number of ideal examples.

In the end, no matter how many examples are made, there will be some literature that, through former usage or particularities of its composition, engenders its own form.

The convention I would like to see implemented is inferred subdivisions using a slash, with IDs using the dot convention (allowing joined reference points as used above).

The last convention would be the use of the emdash to indicate a continuance from one place to the next: Act.II.Scene.iii.45/3 -- 48/6.

Implied positioning means the subdivisions within the lowest numbered structure.

Paragraph.23/3 means the third sentence (the sentence, marked or implied, being the second possible structure of a paragraph), while 23/3/4 is the fourth word in that sentence.
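
The conventions above are regular enough to be read mechanically. Below is a minimal sketch of a parser for them, not part of any existing scheme. The names `Pointer`, `parse_pointer` and `parse_range` are invented, the range separator is taken to be the "--" used in the example, and the reading of a shortened end point ("48/6" replacing only the last dotted component of the starting ID) is an assumption about what the notation intends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    element_id: str      # dotted ID, e.g. "Paragraph.23"
    subdivisions: tuple  # implied positions after each slash, e.g. (3, 4)


def parse_pointer(text):
    """Split 'Paragraph.23/3/4' into its ID and its implied subdivisions."""
    element_id, *rest = text.strip().split("/")
    return Pointer(element_id, tuple(int(part) for part in rest))


def parse_range(text):
    """Parse 'start -- end' into two pointers; a lone pointer is a range of one."""
    start_text, separator, end_text = text.partition("--")
    start = parse_pointer(start_text)
    if not separator:
        return start, start
    end = parse_pointer(end_text)
    # Assumption: a shortened end point such as "48/6" replaces only the
    # last dotted component of the starting ID.
    if "." not in end.element_id and "." in start.element_id:
        prefix = start.element_id.rsplit(".", 1)[0]
        end = Pointer(f"{prefix}.{end.element_id}", end.subdivisions)
    return start, end


print(parse_pointer("Paragraph.23/3/4"))
print(parse_range("Act.II.Scene.iii.45/3 -- 48/6"))
```

What each subdivision level means (sentence then word for a paragraph, word then letter for a line of verse) is left to the work itself, as described above; the parser only separates the stored ID from the implied positions.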

Shakespeare lines have only words, etc.

We need to have normal commercial fiction using IDs for paragraph numbers; the form, as long as it is consistent within the work, does not really have to follow any particular scheme (readability is important).