Page numbers in ebooks for scholarly research? - Page 4

DaleDe · 11-06-2007, 08:24 PM

Quote:

Originally Posted by sartori

I've been playing around with representing print versions online as faithfully as possible (see sample). Unfortunately I can't see any way this would translate into a reflowable page size.

(This is just a sample and was more of an experiment to see how it could be done)

Really nice work at making it look like a book. Very close to a PDF. It would translate to a smaller page just fine but, of course, would not look the same. The text would all wrap differently and the TOC would have to be formatted a little differently. There is nothing magic about a particular page size except that we get used to looking at it in that size. If you first saw this document formatted for a 6x9 paperback book then you would likely think that was how it would always look.

Dale

sartori · 11-06-2007, 08:26 PM

Those pages I added were time consuming but mainly because I was figuring out the layout. I do plan on working through the whole book but I haven't found a plain text version available so I am OCRing the PDF from archive.org. This is currently the slowest part as I am proofing and converting quotes and dashes over.

Right now it's more the challenge of seeing how it could be done and figuring out any of the quirks that may crop up.

For example, if you increase the display font size in your browser, the pages expand lengthwise to accommodate it. It just runs into problems with items that are specifically positioned, such as the table of contents. I think I'll continue playing with this and see what I can come up with.

kovidgoyal · 11-06-2007, 08:48 PM

Quote:

Originally Posted by sartori

Ok, been playing around with adding paragraph markers to my sample as suggested earlier in this thread. Just a quick question - do any of the current html->lrf converters respect css hidden properties? If so it wouldn't be too hard to create a library of books that display paged as in my example but then you could easily convert them to lrf and ignore page numbers, etc. (It would be time consuming but not difficult).

This could almost become a master library that looks good online for people doing research and referencing certain sections/pages but also great for those who want to just read them on their portable device.

html2lrf will ignore tags that have "display: none" set
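
To make the idea concrete, here is a minimal Python sketch of what "ignoring tags with display: none" could look like. It is not html2lrf's actual code, and the names `HiddenStripper` and `visible_text` are made up for illustration. It also assumes the rule sits in an inline `style` attribute; rules coming from a separate stylesheet would need real CSS resolution, which this sketch does not attempt.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HiddenStripper(HTMLParser):
    """Collect text, skipping elements whose inline style sets display:none."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside a hidden element

    @staticmethod
    def _is_hidden(attrs):
        style = dict(attrs).get("style") or ""
        return "display:none" in style.replace(" ", "").lower()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a hidden element, every nested element is hidden too.
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the text a reader would see, with whitespace normalised."""
    parser = HiddenStripper()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = ('<p>It was a dark night.'
          '<span class="pagenum" style="display: none">[p. 12]</span>'
          ' The wind howled.</p>')
print(visible_text(sample))
```

The page number stays in the source for anyone who wants the paged view, while the converted text reads straight through.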

jbenny · 11-06-2007, 08:49 PM

Quote:

Originally Posted by sartori

Those pages I added were time consuming but mainly because I was figuring out the layout. I do plan on working through the whole book but I haven't found a plain text version available so I am OCRing the PDF from archive.org. This is currently the slowest part as I am proofing and converting quotes and dashes over.

Right now it's more the challenge of seeing how it could be done and figuring out any of the quirks that may crop up.

For example, if you increase the display font size in your browser, the pages expand lengthwise to accommodate it. It just runs into problems with items that are specifically positioned, such as the table of contents. I think I'll continue playing with this and see what I can come up with.

There is also a PDF copy at Google Books:
http://books.google.com/books?id=j-s...est+literature

They have apparently OCRed the text, as you can "view text" for each individual page. Sadly, the downloadable PDF doesn't include the OCRed text. That would have saved you some effort.

jbenny · 11-06-2007, 08:53 PM

Quote:

Originally Posted by kovidgoyal

html2lrf will ignore tags that have "display: none" set

That's good to know. Since epub is based on XHTML and CSS, it should also respect "display: none". I'll have to see if Digital Editions honors this. The Lector plugin most certainly should.

sartori · 11-06-2007, 08:54 PM

kovidgoyal,

So if I were to create a secondary css file that hides all the page breaks and page numbers and just displays the text with simple formatting (i.e. justified, centered, different sizes), html2lrf would be able to create a decent looking lrf from the file?

jbenny · 11-06-2007, 08:56 PM

Hey, did you check Gutenberg? I just saw that they have six volumes.

http://www.gutenberg.org/browse/authors/w#a993

sartori · 11-06-2007, 09:00 PM

Quote:

Originally Posted by jbenny

Hey, did you check Gutenberg? I just saw that they have six volumes.

http://www.gutenberg.org/browse/authors/w#a993

Thanks for that - I just checked those out and they appear to be from a slightly different version than the ones on archive.org (and they have all 31 volumes). As my goal is to represent the printed version, the differences may become a problem with page numbers being different.

jbenny · 11-06-2007, 09:04 PM

Quote:

Originally Posted by sartori

Thanks for that - I just checked those out and they appear to be from a slightly different version than the ones on archive.org (and they have all 31 volumes). As my goal is to represent the printed version, the differences may become a problem with page numbers being different.

Too bad it is a different version. It would have saved you a lot of work with the OCR part on at least those six volumes.

Well, good luck with the project. What you have so far looks very nice.

Panurge · 11-06-2007, 10:16 PM

I'm rather surprised that my (admittedly minor) point has generated such a discussion, so allow me to make one or two more:
Scholarly citation is meant to serve two main purposes:
1. establish the authority for a reference so that if someone cares to check your accuracy or honesty, the location of the quotation or reference can be pinpointed and verified;
2. provide a context for a quotation or reference so that the reader can understand the total argument or occasion to which it belongs.
I am convinced that electronic forms of delivery will ultimately prevail; if future readers can locate the exact source with ease (perhaps even greater ease than was possible in the print world--hyperlinks, search engines, whatever works), then we don't need page numbers. We do need to know how closely the electronic version resembles its print source.
However, there is sometimes more information in a print or handwritten source than can be easily captured in its digitized version. Medieval manuscripts, an English scholar realized recently, can sometimes be dated and associated more precisely by using DNA information from their parchment (aka sheepskin) and ink media. Yet, as the digitization of the Beowulf manuscript also showed, high-resolution and other scanning techniques can also reveal aspects of the original that would otherwise be impossible to recognize. When you've got only one copy (like the Beowulf manuscript), you need all the help you can get.
So the original is irreplaceable for the scholar, in many cases, because its verbal content is only part of the information it contains.
Perhaps in the future we will find a way to capture all the information we are likely to need for the foreseeable future, but then there are always surprises, as the identification of parchment provenance using DNA analysis illustrates. At some point we'll simply have to draw the line and admit that we can't do everything; some information will have to be lost. The goal of the user of a particular document will determine if that loss is critical, incidental, or trivial.
For most of us, it won't matter. But for archeologists of the text, it will.

bowerbird · 11-06-2007, 10:26 PM

panurge said:
> then we don't need page numbers.

we still need them, because prior aspects of the record
use them. we cannot forfeit all those earlier pointers...

> We do need to know how closely
> the electronic version resembles its print source.

and, for that, we need to sync the two. by page number.
(because, realistically, what else are we going to use?)

> there is sometimes more information
> in a print or handwritten source
> than can be easily captured in its digitized version.

that's a different problem. but we always had that one.
there's no substitute for access to the original, at least
for some things. still, for a good many _other_ things,
access to a digital copy is better than nothing, _much_
better than we used to have (i.e., which was nothing...)

if you have feedback on the numerous examples i gave,
i'd love to hear it. if not, that's fine too...

-bowerbird
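
One way to picture syncing a reflowable text to its print source by page number is the small Python sketch below. It is only an illustration under stated assumptions: the `[[p.N]]` marker syntax and the function names `split_markers` and `page_at` are invented here and do not come from any tool mentioned in this thread. The markers are stripped for display, but their positions are kept so that any spot in the text can still be mapped back to a print page.

```python
import bisect
import re

# Hypothetical marker syntax: [[p.12]] marks where print page 12 begins.
PAGE_MARK = re.compile(r"\[\[p\.(\d+)\]\]")


def split_markers(marked_text):
    """Return (clean_text, starts, pages).

    clean_text is the text without markers; starts[i] is the offset in
    clean_text where print page pages[i] begins.
    """
    clean_parts, starts, pages = [], [], []
    pos = 0      # read position in marked_text
    offset = 0   # length of clean text emitted so far
    for match in PAGE_MARK.finditer(marked_text):
        piece = marked_text[pos:match.start()]
        clean_parts.append(piece)
        offset += len(piece)
        starts.append(offset)
        pages.append(int(match.group(1)))
        pos = match.end()
    clean_parts.append(marked_text[pos:])
    return "".join(clean_parts), starts, pages


def page_at(offset, starts, pages):
    """Print page containing a clean-text offset, or None before page one."""
    i = bisect.bisect_right(starts, offset) - 1
    return pages[i] if i >= 0 else None


clean, starts, pages = split_markers("[[p.1]]alpha [[p.2]]beta")
print(clean)                                       # marker-free text
print(page_at(clean.index("beta"), starts, pages))  # print page of "beta"
```

However the text is rewrapped on a small screen, a quotation found by search can still be cited by the page it occupied in print.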

kovidgoyal · 11-06-2007, 10:36 PM

Quote:

Originally Posted by sartori

kovidgoyal,

So if I were to create a secondary css file that hides all the page breaks and page numbers and just displays the text with simple formatting (i.e. justified, centered, different sizes), html2lrf would be able to create a decent looking lrf from the file?

It won't display the hidden elements. Whether the resulting LRF will look good or not depends on the kind of HTML you use. But I'm always willing to add support for more esoteric HTML to html2lrf, within reason :-)

sartori · 11-06-2007, 10:48 PM

Quote:

Originally Posted by kovidgoyal

It won't display the hidden elements. Whether the resulting LRF will look good or not depends on the kind of HTML you use. But I'm always willing to add support for more esoteric HTML to html2lrf, within reason :-)

Ok, thanks. I think I'll play around with this tomorrow and see if I can come up with a 'plain' css version of the same page.

Panurge · 11-07-2007, 12:06 AM

[> then we don't need page numbers.

we still need them, because prior aspects of the record
use them. we cannot forfeit all those earlier pointers...

> We do need to know how closely
> the electronic version resembles its print source.

and, for that, we need to sync the two. by page number.
(because, realistically, what else are we going to use?)]

Page numbers are simply a way of keeping track of pages. The earliest printed books don't have them. For incunabula, the books published in the second half of the 15th century, there were numbers, not of pages but of groups of pages, so that when the book was put together for binding the sections would not be out of order. Manuscripts may or may not have page numbers. Sometimes the first word of the following page was printed (or written) at the bottom of the preceding page to establish sequence.
What really counts, for the most part, is textual accuracy--that is, identity of the two texts. For routine purposes, one wouldn't have to refer to the original if the electronic copy were certifiably accurate. But there's the rub, perhaps. When I edit an older text, say an unprinted manuscript, I'm not usually obliged to give its original page numbers. I just need to identify the original source and signal each time I depart from its authority (for example, to correct an obvious error in spelling or printing).
The scholarly world has had many ways of ensuring synchronization between two texts; page numbers are one but not the only one. Of course they are helpful, but historically printers have sometimes ignored them. In the case of Greek and Latin texts, individual passages were identified by paragraph and sentence numbering, and that is still used among classicists today, as was observed above.
So, yes, I agree that page numbers are useful for synchronizing two versions of a text; in the case of verse, however, we go by line numbers and larger divisions or sections of the poem. So the physical page isn't always what matters.
My only intention in bringing up this matter was to point out that digitization of books in the future may not be as simple a matter as we would like and that there is no one solution that will fit some of these odd cases. Nor will past practice always be a reliable guide to what will work in the future. At some point electronic texts will be recognized as the accepted authority, and page numbers will no longer matter; for us, in a time of transition, they still do on occasion, depending on our relationship to what we're reading.
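
The section-and-line style of reference mentioned above can be sketched in a few lines of Python. This is a toy illustration only: the `section.line` reference format, the function names `build_index` and `cite`, and the placeholder verses are all assumptions made for the example, not any established scheme.

```python
import textwrap


def build_index(sections):
    """Map 'section.line' references (both 1-based) to lines of verse."""
    index = {}
    for s, lines in enumerate(sections, start=1):
        for n, line in enumerate(lines, start=1):
            index[f"{s}.{n}"] = line
    return index


def cite(index, ref, width=40):
    """Return the cited line, wrapped to whatever width the reader uses."""
    return textwrap.fill(index[ref], width=width)


# Placeholder text standing in for a poem divided into sections.
SECTIONS = [
    ["first line of the first section", "second line of the first section"],
    ["first line of the second section"],
]
INDEX = build_index(SECTIONS)

print(cite(INDEX, "1.2", width=12))  # narrow screen
print(cite(INDEX, "1.2", width=80))  # wide screen, same reference
```

The reference "1.2" resolves to the same words at any display width, which is why this kind of citation does not depend on the physical page.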

Let me say that as someone who guards, keeps track of, and preserves books from harm, I'm delighted to see such a vigorous discussion about how to address the problem and find solutions. We are in a time of tremendous change that will have at least as much impact on the distribution of information as resulted from the invention of moveable type, and groups like this one are at the forefront because they include not simply programmers and designers but regular readers and enthusiasts who understand the users' needs. More power and glory to them.

Panurge · 11-07-2007, 01:16 AM

Perhaps I should have also said "because they include not simply regular readers and enthusiasts but also programmers and designers." I'm looking forward to examining all the examples that have been posted in this thread as soon as I can get the time to do so.



