MobileRead Forums - View Single Post - Page numbers in ebooks for scholarly research?

bowerbird · 11-09-2007, 05:00 PM

panurge said:
> Page numbers are simply a way of keeping track of pages.
> The earliest printed books don't have them. For incunabula,
> the books published in the second half of the 15th century,
> there were numbers, not of pages but of groups of pages,
> so that when the book was put together for binding the sections
> would not be out of order. Manuscripts may or may not have
> page numbers. Sometimes the first word of the following page
> was printed (or written) at the bottom of the preceding page
> to establish sequence.

i responded to this earlier by saying that my guess would be that
99.98% of the 7 million volumes that google scans at umichigan
would have pagenumbers, and that's where we needed to focus...

however, i wanted to get back to this, and additionally note that
the cyberlibrary of the future will (hopefully!) consist of more than
the books that were sitting on library shelves in our universities...

there is _so_ much more content out there than just those books...

there are pictures, and maps, and genealogical charts, and books
on local history that were published in very small runs and didn't
spread much farther than a few city libraries down the road, and
city council minutes, and local newspapers, and school calendars,
and blueprints of buildings, and diagrams of public sewer systems,
and aerial photos of the coast and roads and farms and villages,
and books of poetry, and correspondence (both public and private),
diaries and dirty magazines, and more and more and so much more.

in today's e-mail is a notice that the archives of ernest hemingway
have been made available, having been donated to the j.f.k. library:
> http://www.jfklibrary.org/Historical...ngway+Archive/

also today, a very interesting notice about the scrapbooks of suffragettes:
> Miller NAWSA Suffrage Scrapbooks, 1897-1911
> http://memory.loc.gov/ammem/collecti...lerscrapbooks/

just two examples of some fascinating aspects of our cultural heritage
that are _not_ bound within the pages of the books in our universities.

or here's a story about the new library director at harvard university:
> http://www.thecrimson.com/article.aspx?ref=520414
he intends to write a book on book-smuggling across the french border
during the 18th century, a book that he says was inspired by an archive
of some 50,000 unpublished letters that he found in an old swiss town.
putting those letters online, so each one of them could be accessed by
a person reading this book, is one of the things that becomes possible
when we've made the commitment to put our cultural heritage online.

it's very unclear whether society will have the intelligence to _fund_
the digitization of all these other elements of our cultural heritage.
i can't even lie to you and say that i think we will. but nonetheless,
we need to make a referencing system that can point to these _other_
things just as efficiently as it points to the _pages_ in library books...

however, the system that i've built for those pages in library books
also proves to be well-suited for those other purposes, fortunately.

if you look closely, you'll see that i built a very tight binding between
the _page_ of each book and the _u.r.l._ which "houses" that page...

for instance, here's the u.r.l. for page 123 of "my antonia":
> http://z-m-l.com/go/myant/myantp123.html

ignore the first part -- http://z-m-l.com/go/ -- which refers to my site,
and the last part -- p123.html -- which points to one specific page.

the secret-sauce here lies in the _middle_part_ -- "myant/myant"...

that secret-sauce middle-part helps create a unique website address,
and that tells us _exactly_ what book we are in, with no uncertainty,
since one -- and only one! -- file can be present at this exact u.r.l.,
so there's no question about "which" version of "my antonia" this is
-- it's the version which is located at this webpage. by definition...
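to make that three-part structure concrete, here's a minimal python sketch. it is an illustration only, not the code that runs z-m-l.com. the names page_url and parse_page_url are made up for this example, and it assumes the page number is written without zero-padding, which the single example above can't confirm.

```python
import re

# site prefix taken from the example u.r.l. above
BASE = "http://z-m-l.com/go/"


def page_url(book_id: str, page: int) -> str:
    """build <base><book>/<book>p<page>.html (no zero-padding: an assumption)."""
    return f"{BASE}{book_id}/{book_id}p{page}.html"


# the backreference (?P=book) insists that the folder name and the
# filename prefix are the same string -- the "myant/myant" middle part
_PAGE_URL = re.compile(
    re.escape(BASE) + r"(?P<book>[^/]+)/(?P=book)p(?P<page>\d+)\.html"
)


def parse_page_url(url: str) -> tuple[str, int]:
    """recover (book_id, page) from a page u.r.l., or raise ValueError."""
    match = _PAGE_URL.fullmatch(url)
    if match is None:
        raise ValueError(f"not a page u.r.l.: {url}")
    return match["book"], int(match["page"])


print(page_url("myant", 123))
# -> http://z-m-l.com/go/myant/myantp123.html
print(parse_page_url("http://z-m-l.com/go/myant/myantp123.html"))
# -> ('myant', 123)
```

because the book identifier appears in both the folder and the filename, a file that lands in the wrong folder fails to parse instead of silently passing for a page of some other book.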

we will have _other_ versions of "my antonia" -- the second printing
of this same edition, or a completely different edition, or a version
to which we have added a complex set of annotations, or whatever --
but each of those _other_ versions will have to be at _another_ u.r.l.,
because the version at _this_ particular u.r.l. is this particular version.

so, by pointing to this u.r.l., we're indicating one-and-one-only version,
and there's absolutely no ambiguity on which version we're referencing.

and since you'll remember that i've stipulated that we have stable u.r.l.'s,
there's also no uncertainty that a pointer to this u.r.l. will _always_ point
to this specific version, and _only_ it, now and _forever_ into the future...
so anyone who wants to know, exactly and explicitly, what you reference
can simply go there and see it, immediately. so we're on the same page.
(sorry, i can never resist using that one.) ;+)

the same logic applies to the other items we have in our stable archive...

each one of those 50,000 letters mentioned above would be located at
its own page, so we know we can identify and reference every single one,
unequivocally and unmistakably.

in the example u.r.l. given above -- a page from a book -- it was _vital_
that the u.r.l. give some kind of clue about the contents of what it held...

amazingly, a lot of cyberspace libraries get this wrong, wrong, all wrong.

not only is there an absence of a bond between the content and its u.r.l.,
sometimes there is an _inability_ to link a filename to _specific_ content,
because different files have actually been given the very same filename!

trot on over to "mirlyn", the online library for the university of michigan,
for example, and poke around in their electronic-holdings. try here:
> http://mdp.lib.umich.edu/cgi/m/mdp/p...659078&seq=123

now, save that image to your hard-drive, and you'll find its name was:
> 00000147.tif.100.0.png

the "100" and the "0" and the stuff at the end are viewing options.

the _meat_ of this filename is this:
> 00000147.tif

first of all, notice that this page was _actually_ page _123_ in the book.
so giving it a filename of "00000147.tif" is... well, it's kind of ridiculous.

but it gets worse. much, much worse...

explore other books, and you'll see many have a file named "00000147.tif".

for instance, here's one, this time from "books and culture", by mabie:
> http://mdp.lib.umich.edu/cgi/m/mdp/p...881628&seq=147
(this is actually page 141, so its filename isn't bonded to its content either.)

indeed, _every_ book that has more than 147 pages (counting frontmatter)
will have a file named "00000147.tif".

if you've got lots of books -- and umichigan has, quite literally, _millions_ --
giving a file from each one the name of "00000147.tif" is outright stupid...

that means they have to depend on the _folders_ (or _subdirectories_)
to tell them apart, which is a very bad accident just waiting to happen.
if files get written to the wrong directory, how would you ever know?
if a foldername accidentally gets changed, how would you ever know?
only when some user comes and says, "hey, i was supposed to get
'my antonia' by willa cather, but i got 'books and culture' by mabie,
so what's up with that?"
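here's a small python sketch of how you could spot that problem in a list of file paths. the folder names (scans/book_a, scans/book_b) and the function name colliding_names are hypothetical; nothing here reflects how umichigan actually lays out its storage.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def colliding_names(paths):
    """map each filename to the folders it appears in, keeping only
    the filenames that show up in more than one folder."""
    folders_by_name = defaultdict(set)
    for raw in paths:
        path = PurePosixPath(raw)
        folders_by_name[path.name].add(str(path.parent))
    return {
        name: sorted(folders)
        for name, folders in folders_by_name.items()
        if len(folders) > 1
    }


# hypothetical layout: two books, each with its own "page 147" scan
paths = [
    "scans/book_a/00000147.tif",
    "scans/book_a/00000148.tif",
    "scans/book_b/00000147.tif",
]
print(colliding_names(paths))
# -> {'00000147.tif': ['scans/book_a', 'scans/book_b']}
```

any filename that comes back from this check is one that only its folder can tell apart, which is exactly the accident waiting to happen.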

cyberspace libraries need to follow some common-sense rules:
1. a file's name _must_ identify the content it contains, unequivocally.
2. the same file _should_ have the same name, no matter where it is.
3. different files _must_ have different names (cannot have the same name).

there's more rules than that, but let's just stick with those for now.
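rules 2 and 3 can be checked mechanically, at least in a rough way. the python sketch below treats "same file" as "same bytes", compared by sha-256 digest; that is an assumption of this example, not something the rules spell out. rule 1 needs human judgment, so it isn't checked. the name check_names and the sample data are invented.

```python
import hashlib
from collections import defaultdict


def check_names(files):
    """files: iterable of (filename, content_bytes) pairs.

    returns (rule2, rule3):
      rule2 -- groups of different names given to identical content
      rule3 -- names that were given to more than one distinct content
    """
    names_by_digest = defaultdict(set)
    digests_by_name = defaultdict(set)
    for name, content in files:
        digest = hashlib.sha256(content).hexdigest()
        names_by_digest[digest].add(name)
        digests_by_name[name].add(digest)
    rule2 = sorted(
        sorted(names) for names in names_by_digest.values() if len(names) > 1
    )
    rule3 = sorted(
        name for name, digests in digests_by_name.items() if len(digests) > 1
    )
    return rule2, rule3


# invented sample data: placeholder bytes stand in for real scans
files = [
    ("myantp123.png", b"scan of page 123"),
    ("copy.png", b"scan of page 123"),       # same bytes, different name
    ("00000147.tif", b"scan from book a"),
    ("00000147.tif", b"scan from book b"),   # same name, different bytes
]
print(check_names(files))
# -> ([['copy.png', 'myantp123.png']], ['00000147.tif'])
```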

but, as we've seen above, even a huge and sophisticated library like
the university of michigan cannot get even this simple thing right...
it's sad, i tell you, it's really sad...

on the bright side, however, once you follow this simple naming convention,
all of a sudden every document in your library has a unique name that can be
linked to a matching unique u.r.l., and referencing just became very simple...

for example, in the case of those 50,000 letters, i might give them names
that would correspond to their dates, or maybe their sender or recipient,
or some combination. (it's hard to say without first perusing their content.)
but whatever the names i gave them, they would then bond to their web u.r.l.
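as one possible shape for such names, here's a python sketch that combines the date, sender, and recipient, and adds a counter when two letters would otherwise get the same name. everything in it is hypothetical: the naming pattern, the function names, the placeholder correspondents, and the example.org base address. a real scheme would depend on what the letters actually contain.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """lowercase the text and collapse anything non-alphanumeric to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def letter_name(written: date, sender: str, recipient: str, taken: set) -> str:
    """build a name from date, sender, and recipient; if that name is
    already taken, append _2, _3, ... so every letter stays unique."""
    base = f"{written.isoformat()}_{slug(sender)}_to_{slug(recipient)}"
    name, n = base, 1
    while name in taken:
        n += 1
        name = f"{base}_{n}"
    taken.add(name)
    return name


def letter_url(name: str, base_url: str = "http://example.org/letters/") -> str:
    """the name becomes the u.r.l.: one letter, one address."""
    return f"{base_url}{name}.html"


taken = set()
first = letter_name(date(1775, 3, 4), "A. Sender", "B. Recipient", taken)
second = letter_name(date(1775, 3, 4), "A. Sender", "B. Recipient", taken)
print(first)               # -> 1775-03-04_a-sender_to_b-recipient
print(second)              # -> 1775-03-04_a-sender_to_b-recipient_2
print(letter_url(first))
# -> http://example.org/letters/1775-03-04_a-sender_to_b-recipient.html
```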

and this, ultimately, is the way that you can deal with "unnumbered pages".
you _give_ them a number, or a name, and that name becomes their u.r.l.

-bowerbird
