What "Cleaning Up" Do Project Gutenberg Texts Need [closed] - Page 6

bowerbird · 11-05-2007, 02:04 AM

concrete points? i guess i missed them. at any rate, the proof is in the pudding.
if you're right, my library won't work. so there's no point to any discussion here.

so, as they say, have a nice day... :+)

-bowerbird

bowerbird · 11-05-2007, 02:30 AM

gee, it doesn't appear i've posted all the messages
that i've written. nonetheless, i'm sure it will seem
like i didn't address the "concrete points" anyway.

still, i'll send those messages some time.
maybe tomorrow. maybe the day after...
but we did enough back-and-forth today.

-bowerbird

kovidgoyal · 11-05-2007, 02:32 AM

It feels nice to win an argument. You do bring out the child in me :-)

bowerbird · 11-05-2007, 02:36 AM

i'm glad you feel that you won.

maybe it'll mean you back off...

-bowerbird

kovidgoyal · 11-05-2007, 03:26 AM

I actually meant that as an explanation for why I was being so insistent, not a declaration of victory. I'm still looking forward to what you have to say in response to my last post.

Panurge · 11-05-2007, 11:48 PM

[For XHTML markup, one thing that comes to mind (just off the top of my head) would be to enclose all the text that makes up an original page with a surrounding tag that uses the "id" attribute to hold the page number. This would not display, but could be accessed if needed. Also, by using "id", you could construct a special hyperlinked table of pages that would allow you to jump to specific pages in the ebook. I'll have to try this and see how it works.]

Some such solution might satisfy everyone. Current scholarly journal databases such as Project Muse give the page numbers in square brackets within the text--an "ugly" solution, I suppose, but a simple one. JSTOR, the dominant archive of scholarly journals, takes a different tack. It uses searchable PDF files and presents a scanned graphic representation of the original journal page, so pagination is not an issue. However, the downloaded PDFs don't look all that great on the Sony Reader, though they are usable.
Sorry to have caught up with the conversation so late; I don't get a chance to log on to the forums every day.
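
To make the bracketed "id" idea at the top of this post a little more concrete, here is a minimal Python sketch that wraps each original page in an element carrying the page number and builds a linked table of pages. The function names and the "page-" prefix are assumptions for illustration only (XML ids cannot begin with a digit, so some prefix is needed), and the sketch ignores the awkward case of a page break falling in the middle of a paragraph.

```python
from html import escape


def wrap_pages(pages):
    """pages is a list of (page_number, text) pairs.

    Returns an XHTML fragment in which each original page sits inside
    a div whose id records the page number. The id itself is never
    displayed, but links can point at it.
    """
    parts = []
    for number, text in pages:
        parts.append(f'<div id="page-{number}"><p>{escape(text)}</p></div>')
    return "\n".join(parts)


def page_table(pages):
    """Build a hyperlinked list of pages that jumps to each wrapped page."""
    items = [f'<li><a href="#page-{n}">Page {n}</a></li>' for n, _ in pages]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


sample = [(19, "end of page nineteen"), (20, "start of page twenty")]
print(page_table(sample))
print(wrap_pages(sample))
```

Whether a given reading device honors such links is a separate question that would have to be tested per device.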

jbenny · 11-06-2007, 12:41 PM

Quote:

Originally Posted by Panurge

Current scholarly journal databases such as Project Muse give the page numbers in square brackets within the text--an "ugly" solution, I suppose, but a simple one. JSTOR, the dominant archive of scholarly journals, takes a different tack. It uses searchable PDF files and presents a scanned graphic representation of the original journal page, so the pagination problem is not an issue. However, the downloaded PDFs don't look all that great on the Sony Reader, though they are usable.

Although neither is ideal, both methods could easily be done in an epub ebook. The first would be very simple, but "ugly" as you say. Including a scanned image of each page (PDF, PNG, JPG, etc.) that is linked from the XHTML text is also possible. This would of course make the epub much larger and more work to construct.

I haven't had the time to think about other ways to do this, but there is probably a good way to do this strictly in XHTML, without having to include scans or put visible page numbers in the text. Perhaps someone else can suggest something?

BTW, this may be a good topic to split out into its own thread.

Edit: Never mind. I'll create a new topic for it myself.

bowerbird · 11-06-2007, 12:44 PM

panurge, great to have you back. i was worried that
the temperature in here had driven you away... :+)

at any rate, i wrote another message on pagenumbers,
and will go dig it up to post it shortly...

in the meantime, here is a quick summary of various
projects of mine -- in various states of polish -- which
are available in some form online or by-request...

perhaps this will give people an idea of my scope...

i invite the skeptics to go find the flaws in my work,
and report them in great detail... ;+)

-bowerbird

======================================================
the proof is in the pudding.
======================================================

for the latest version of this pudding sampler at any time, please visit:
> http://z-m-l.com/go/pudding_sampler.html

======================================================
the z.m.l. tool-chain is now starting to cohere across the workflow,
so here's a reminder about the pudding samples available currently.
all of these are in-progress, so constructive criticism is welcomed...
======================================================

babelfish -- prototype web-app viewer-program for z.m.l.
> http://z-m-l.com/go/babelfish19.pl

verylovely -- canned online zml-to-html conversion demo
> http://www.z-m-l.com/go/vl3.pl

zmldingus -- live online zml-to-html conversion app
> http://www.z-m-l.com/go/zmldingus093.pl

"continuous proofreading" mode: various sample books
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/tolbk/tolbkp001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html
> http://z-m-l.com/go/ahmmw/ahmmwp001.html
> http://z-m-l.com/go/goann/goannc001.html

.pdf samples -- sample of the zml-to-pdf conversion process
> http://z-m-l.com/oyayr/oyayr.zml
> http://z-m-l.com/oyayr/oya-sunday.pdf
> http://snowy.arsc.alaska.edu/bowerbi...01/alice01.zml
> http://snowy.arsc.alaska.edu/bowerbi...1/alice01b.pdf

.html samples -- sample of the zml-to-html conversion process
> http://snowy.arsc.alaska.edu/bowerbi...01/alice01.zml
> http://snowy.arsc.alaska.edu/bowerbi...1/alice01.html

show_scan-set -- web-viewer modified specifically for viewing otherwise-raw scan-sets
> http://z-m-l.com/go/sss.pl

iphone -- web-viewer modified specifically for the iphone
> http://z-m-l.com/go/babelfishi20.pl

iphone -- reading a scan-set (e.g., page images) on the iphone
> http://z-m-l.com/go/babelfishi20.pl

give -- cross-platform offline viewer-program for z.m.l. (dated now, but...)
> download from the "zml-talk" group at yahoogroups

zandbox -- cross-platform offline z.m.l. authoring-tool
> e-mail me for a copy

banana cream -- cross-platform offline proofreading engine
> e-mail me for a copy

scrape/clean -- cross-platform offline proofreading engine
> e-mail me for a copy

-bowerbird
======================================================
the proof is in the pudding.
======================================================

JSWolf · 11-06-2007, 01:05 PM

How does ZML help get a ZML-marked-up text into LRF and PRC formats so we can read it on our 505s and Gen3s/iLiads?

bowerbird · 11-06-2007, 01:42 PM

jon, right now, it's not. very shortly, however, the .html conversion will be
solid enough for you to use as the rosetta-stone to leapfrog to other formats.

-bowerbird

bowerbird · 11-06-2007, 02:17 PM

jbenny said:
> You bring up a very valid point that most of us don't think of
> (me included). Can you suggest a way to handle this
> without having the page numbers in-line with the text?
> Most of us would find the visible page numbers too obnoxious.
> For XHTML markup, one thing that comes to mind
> (just off the top of my head) would be to enclose
> all the text that makes up an original page with
> a surrounding tag that uses the "id" attribute
> to hold the page number

i admire the initiative that makes you jump in on this
problem that you haven't really thought about before.

a 3.2k lorem ipsum example isn't really needed, though.

many other people _have_ thought about it, for a while,
so a little exploratory research can go a long way here...
as they've already made a pass at providing solutions...

i've described mine -- and will repeat the links here --
> http://z-m-l.com/go/myant/myantp001.html
> http://z-m-l.com/go/mabie/mabiep001.html
> http://z-m-l.com/go/sgfhb/sgfhbp001.html
> http://z-m-l.com/go/tolbk/tolbkp001.html
> http://z-m-l.com/go/goann/goannp001.html
these demo e-books let you link directly to _one_page_,
where the text is available in easily-copied digital form,
and the page-scan is presented for reference as well...
a comment-form at the bottom lets people report errors,
or even make annotations to the page for others to see...

and again, these are all being done with my .zml format.
you can view the .zml files underlying the above books:
> http://z-m-l.com/go/myant/myant.zml
> http://z-m-l.com/go/mabie/mabie.zml
> http://z-m-l.com/go/sgfhb/sgfhb.zml
> http://z-m-l.com/go/tolbk/tolbk.zml
> http://z-m-l.com/go/goann/goann.zml

so, in spite of the people who would like to convince you
otherwise, here's some pudding as proof that light-markup
is quite capable of generating an e-book that works well...

so that's _my_ particular take on pagenumber retention...

***

i can point to other work too, and i am happy to do so...

i might as well start at the top, with la creme de la creme.

jose menendez has created "digital reprints" which _rock_.

you can download one here:
> http://www.ibiblio.org/ebooks/Einste...Relativity.pdf

that .pdf might _look_ unremarkable, upon first viewing,
but you'll find that the pagenumbers are actually _links_
that will open up the _page-scan_ for that specific page.

originally they opened up the exact page in the scan-set
at google, but it seems google changed their interface,
and now jose's nice links merely go to the first page.
there's a lesson there against depending on other sites...

so, as a more convenient option, you can use my scans.
using the number actually printed on the original page,
plug it into the following u.r.l. template to see the scan:
> http://z-m-l.com/go/einst/einstp001.jpg
in place of the "001", put the page you want. for example:
> http://z-m-l.com/go/einst/einstp089.jpg
will pull up the page-scan for page 89 from the p-book...
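
as a rough sketch of that template in python (nothing official): the function name scan_url is made up for illustration, and the 3-digit zero-padding is only my reading of the two example links above. front-matter pages with roman numerals are not handled here.

```python
def scan_url(book, page):
    """Build a page-scan link from a book code and a printed page number.

    book is the short code seen in the links above, e.g. "einst".
    page is the number printed on the original page, padded to 3 digits.
    """
    return f"http://z-m-l.com/go/{book}/{book}p{page:03d}.jpg"


print(scan_url("einst", 89))
# http://z-m-l.com/go/einst/einstp089.jpg
```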

if you closely examine any page-scan, you'll observe that
jose's .pdf page is a very accurate replica of that page-scan.
the linebreaks are retained, down to end-line hyphenates.
the leading is almost exactly the same. so are the margins.
jose is an obsessive-compulsive guy; he gets the details right.

here's another digital reprint, this time geronimo's life story:
> http://www.ibiblio.org/ebooks/Geronimo/GerStory.pdf
compare any .pdf page with its scan by using this template:
> http://z-m-l.com/go/geron/geronp001.jpg
(as before, replace "001" with the page-number you want.)
by the way, google's scan-set from this book is the _worst_
job of scanning a book that i have ever seen from them...
it's worth downloading just for its humor as a bad example.

and finally, here's a third from jose, willa cather's "my antonia":
> http://www.ibiblio.org/ebooks/Cather...ia/Antonia.pdf
again, you can see the pagescan for any page on my site:
> http://z-m-l.com/go/myant/myantp001.jpg
(as before, replace "001" with the page-number you want.)

for the first two digital reprints, you can step through the
scan-sets more easily using my "show scan-set" viewer:
> http://z-m-l.com/go/sss.pl
"geronimo's story" is the one that comes up by default,
but you can choose the einstein book or the cather book
with the book-selection menu you will find on the page...
(and "my antonia" was also listed above in my examples.)

the quality of each of jose's "digital reprints", as a reprint,
is fantastic. you immediately see the pages are immensely
cleaner than the scans of those old library books, some of
which were subjected to careless markings by borrowers
who evidently were never taught to respect library books.
(then again, i guess that, over the course of 100 years,
there's gonna be _one_ borrower who simply _forgets_
that this was a library book, and not one of his own books.)

jose's tremendous quality gets _more_ remarkable as we
realize the digital reprint -- as opposed to the scan-set --
is _digital_text_, and thus can be _searched_ and _copied_,
meaning that it's infinitely more flexible than the scan-set.

and this all becomes truly mind-boggling when you further
realize the .pdf is 10-30 times _smaller_ than the scan-set,
which means it will run faster and use far fewer resources...

and yes, it takes some work to convert a scan-set into
digital text -- o.c.r. and proofing and formatting -- but
considering the huge benefits that result, it's worth it.

this, truly, is the direction our digital library should follow...

store a copy of the scans online, so people can refer to 'em,
to confirm for themselves that the digitization was accurate.
but give them, for their actual use, a file that's _digital_text_
-- for maximal convenience in our 21st-century cyberspace --
yet is capable of _replicating_ the original p-book _exactly_,
for the scholar-valued touchstone with previous centuries...

(that doesn't mean we have to _leave_ it in that form; we can
always remix it to our customization if we want to, since that
_remixing_ is part of the magic of a _digital_text_... but still,
we know if we want to replicate the p-book exactly, we can.
and there are times when we _do_ want exact replications...
it makes it much easier to know we're all on the same page.
sorry, but i can't ever resist throwing in that good old cliche.)

indeed, the biggest thing wrong with jose's digital reprints
is the reliance on .pdf, which is the "roach motel" of formats.
(that is, documents can go in, but they cannot come out...)

another problem is that jose builds his files using ms-word,
and doesn't make that original file available for us to remix.

in spite of these faults, though, jose's work is outstanding...

(and, just to connect the dots for you, my z.m.l. work is
designed to give the benefits while overcoming the faults.)

***

there's been other work done on retaining pagenumbers too.

here's yet another version of our good old standby, "my antonia",
which uses an x.m.l. approach to store pagenumber information:
> http://www.openreader.org/myantonia/...myantonia.html

by the way, this is the strategy that led me to make point #14
about not putting pagenumbers in-line inside the body-text...

but, on the _positive_ side, note that this document also
allows a person to click out to view each scan for reference.

also of interest, although i'd hope this degree of markup
becomes unnecessary in the future, with better browsers,
observe that each paragraph has its own "i.d." reference,
thus allowing a link to be made to a specific _paragraph_...

(should we next expect an i.d. reference on every _word_?)

***

and last but not least, because they've actually done _the_most_
work on retaining pagenumber information, you need to look at
the .html versions of the books _distributed_proofreaders_ does
for project gutenberg. over the course of the last couple of years,
most of the postprocessors there have moved to the position that
they believe pagenumbers _should_ be both saved and displayed,
so nearly all of the .html versions posted to p.g. lately have them...

unfortunately, the p.g. version of "my antonia" does not have an
.html version -- sad, the absence of automatic conversion, eh?,
perhaps someone could use gutenmark to make one for them --
so we can't compare their version of it straight across the board...

so let's take p.g. e-text #22222, as a demo, to pick a fun number:
> http://www.gutenberg.org/files/22222...-h/22222-h.htm

you'll see that, yes indeed, they've retained the pagenumber info.
and, unlike the x.m.l. example above, they have used their c.s.s.
to move the pagenumber out into the margin, and turned it gray,
so it's less conspicuous and distracting. so those are good moves.

moreover, if you really want a very good idea of exactly where the
pagebreak occurred, you can drag your cursor across the line and
observe exactly where in the line the pagenumber gets highlighted.
for example, if you scroll down to page 20, and do this little trick,
you'll find the pagebreak occurs between "practitioners" and "is".

(you could "view source" if you want, of course, but that's clumsy.)

what that _means_ is that -- in spite of where it is being displayed --
the pagenumber actually exists in-line, right in the body of the text.

unfortunately, what _that_ means is that, when you _copy_ the text,
the pagenumbers are mixed in, which we already said is a bad thing.

for instance, if you copy out the text around pagebreak 20, you get:
> and although applied to all graduate medical practitioners [20]is,
> in all other realms of learning, a degree awarded for graduate work
eewh! see that pagenumber in the middle? that's not what we want!

however, the problem isn't limited to a hassle when doing remixing.
these pagenumbers intermingled in the actual body-text can _also_
cause problems when the end-user performs a _search_ on the text.

so, for instance, if you do a search for "practitioners is", you will _not_
get a hit on that sentence that straddles page 20, because there is a
pagenumber between those two words.

(ironically, if you search for "practitioners [20]is", you _do_ get a hit;
but of course if you knew that that text is at pagebreak 20, then you
didn't need to search for it, did you? you'd just go right to page 20.)
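
one possible way around both glitches, sketched in python under some loud assumptions: that pagenumbers always show up in copied text as a number in square brackets, and that nothing else in the book (footnote references, say) uses the same form. the names PAGE_MARK and split_pagemarks are invented for this example. the idea is to pull the markers out of the body-text and keep them on the side as offsets, so the text itself stays clean for copying and searching.

```python
import re

# assumed marker form: a number in square brackets, like [20]
PAGE_MARK = re.compile(r"\[(\d+)\]")


def split_pagemarks(text):
    """Return (clean_text, breaks).

    breaks is a list of (page, offset) pairs, where offset is the
    position in clean_text at which that page begins.
    """
    pieces = []
    breaks = []
    pos = 0      # read position in the original text
    length = 0   # length of the clean text built so far
    for match in PAGE_MARK.finditer(text):
        chunk = text[pos:match.start()]
        pieces.append(chunk)
        length += len(chunk)
        breaks.append((int(match.group(1)), length))
        pos = match.end()
    pieces.append(text[pos:])
    return "".join(pieces), breaks


copied = ("and although applied to all graduate medical practitioners [20]is, "
          "in all other realms of learning, a degree awarded for graduate work")
clean, breaks = split_pagemarks(copied)

print("practitioners is" in copied)  # False: the marker is in the way
print("practitioners is" in clean)   # True: the search now gets a hit
print(breaks[0][0])                  # 20: the pagebreak is still on record
```

the pagenumber is not lost; it just lives outside the text, and a viewer could still draw it in the margin from the stored offset.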

i googled to see if a search on "practitioners is" would
bring up the .html version of e-text #22222. it didn't.
but more experimentation revealed that i couldn't do
_anything_ to fetch the .html version. the .txt version
came up just fine. but no search would find the .html...
so that's a mystery to me...

these twin usability problems aren't _showstoppers_, but they _are_
"glitches" that should be cleared up, if someone has an idea _how_...
if you are that someone, hustle over to d.p. and help them out, ok?

***

at any rate, here we have some ways to give scholars pagenumbers...

if you have any feedback on any of these systems, i'd love to hear it...

-bowerbird

bowerbird · 11-06-2007, 02:28 PM

in that x.m.l.-based version of "my antonia" i discussed above,
i forgot to provide an example of a link direct to a paragraph.

here's one:
> http://www.openreader.org/myantonia/...nia.html#p0251
you should read the paragraph directly after that one as well...

-bowerbird

kovidgoyal · 11-06-2007, 02:32 PM

Quote:

Originally Posted by bowerbird

so, in spite of the people who would like to convince you
otherwise, here's some pudding as proof that light-markup
is quite capable of generating an e-book that works well...

Nobody says that lightweight markup cannot generate *an* ebook that "works well". The question is whether lightweight markup is suitable for *all* ebooks. A question you have still failed to address.

bowerbird · 11-06-2007, 02:40 PM

i expect to handle 99% of the books in the p.g. library.

and handle them well. indeed, i expect my viewer-app
will give performance that is surpassed by _no_ others,
and which is _far_superior_ to most... of course, i also
hope those other viewers improve, to the point where
they are no longer surpassed by my app, or any other.
the world of e-books only suffers when viewers are bad...

-bowerbird

bowerbird · 11-06-2007, 02:44 PM

kovidgoyal, i have substantial replies to your previous posts,
which i would like to post, but i don't want to _monopolize_
the conversation here. i'd like to give other people a chance.
when two people overtake a thread, it can get boring fast...

so if you resist the urge to address every point right away,
it would be good. i promise you'll have lots of chances later.

-bowerbird


11-05-2007, 02:04 AM	#76
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	concrete points? i guess i missed them. at any rate, the proof is in the pudding. if you're right, my library won't work. so there's no point to any discussion here. so, as they say, have a nice day... :+) -bowerbird

11-05-2007, 02:30 AM	#77
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	gee, it doesn't appear i've posted all the messages that i've written. nonetheless, i'm sure it will seem like i didn't address the "concrete points" anyway. still, i'll send those messages some time. maybe tomorrow. maybe the day after... but we did enough back-and-forth today. -bowerbird

11-05-2007, 02:32 AM	#78
kovidgoyal creator of calibre Posts: 46,357 Karma: 29630884 Join Date: Oct 2006 Location: Mumbai, India Device: Various	It feels nice to win an argument. You do bring out the child in me :-)

11-05-2007, 02:36 AM	#79
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	i'm glad you feel that you won. maybe it'll mean you back off... -bowerbird

11-05-2007, 03:26 AM	#80
kovidgoyal creator of calibre Posts: 46,357 Karma: 29630884 Join Date: Oct 2006 Location: Mumbai, India Device: Various	I actually meant that as an explanation for why I was being so insistent, not a declaration of victory. I'm still looking forward to what you have to say in response to my last post.

11-05-2007, 11:48 PM	#81
Panurge Enthusiast Posts: 34 Karma: 336 Join Date: Dec 2006 Location: Texas Device: Sony Reader	[For XHTML markup, one thing that comes to mind (just off the top of my head) would be to enclose all the text that makes up an original page with a surrounding tag that uses the "id" attribute to hold the page number. This would not display, but could be accessed if needed. Also, by using "id", you could construct a special hyperlinked table of pages that would allow you to jump to specific pages in the ebook. I'll have to try this and see how it works.] Some such solution might satisfy everyone. Current scholarly journal databases such as Project Muse give the page numbers in square brackets within the text--an "ugly" solution, I suppose, but a simple one. JSTOR, the dominant archive of scholarly journals takes a different tack. It uses searchable PDF files and presents a scanned graphic representation of the original journal page, so the pagination problem is not an issue. However, the downloaded PDFs don't look all that great on the Sony Reader, though they are usable. Sorry to have caught up with the conversation so late; I don't get a chance to log on to the forums every day.

11-06-2007, 12:44 PM	#83
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	panurge, great to have you back. i was worried that the temperature in here had driven you away... :+) at any rate, i wrote another message on pagenumbers, and will go dig it up to post it shortly... in the meantime, here is a quick summary of various projects of mine -- in various states of polish -- which are available in some form online or by-request... perhaps this will give people an idea of my scope... i invite the skeptics to go find the flaws in my work, and report them in great detail... ;+) -bowerbird ================================================== ==== the proof is in the pudding. ================================================== ==== for the latest version of this pudding sampler at any time, please visit: > http://z-m-l.com/go/pudding_sampler.html ================================================== ==== the z.m.l. tool-chain is now starting to cohere across the workflow, so here's a reminder about the pudding samples available currently. all of these are in-progress, so constructive criticism is welcomed... ================================================== ==== babelfish -- prototype web-app viewer-program for z.m.l. 
> http://z-m-l.com/go/babelfish19.pl verylovely -- canned online zml-to-html conversion demo > http://www.z-m-l.com/go/vl3.pl zmldingus -- live online zml-to-html conversion app > http://www.z-m-l.com/go/zmldingus093.pl "continuous proofreading" mode: various sample books > http://z-m-l.com/go/myant/myantp001.html > http://z-m-l.com/go/mabie/mabiep001.html > http://z-m-l.com/go/tolbk/tolbkp001.html > http://z-m-l.com/go/sgfhb/sgfhbp001.html > http://z-m-l.com/go/ahmmw/ahmmwp001.html > http://z-m-l.com/go/goann/goannc001.html .pdf samples -- sample of the zml-to-pdf conversion process > http://z-m-l.com/oyayr/oyayr.zml > http://z-m-l.com/oyayr/oya-sunday.pdf > http://snowy.arsc.alaska.edu/bowerbi...01/alice01.zml > http://snowy.arsc.alaska.edu/bowerbi...1/alice01b.pdf .html samples -- sample of the zml-to-html conversion process > http://snowy.arsc.alaska.edu/bowerbi...01/alice01.zml > http://snowy.arsc.alaska.edu/bowerbi...1/alice01.html show_scan-set -- web-viewer modified specifically for viewing otherwise-raw scan-sets > http://z-m-l.com/go/sss.pl iphone -- web-viewer modified specifically for the iphone > http://z-m-l.com/go/babelfishi20.pl iphone -- reading a scan-set (e.g., page images) on the iphone > http://z-m-l.com/go/babelfishi20.pl give -- cross-platform offline viewer-program for z.m.l. (dated now, but...) > download from the "zml-talk" group at yahoogroups zandbox -- cross-platform offline z.m.l. authoring-tool > e-mail me for a copy banana cream -- cross-platform offline proofreading engine > e-mail me for a copy scrape/clean -- cross-platform offline proofreading engine > e-mail me for a copy -bowerbird ================================================== ==== the proof is in the pudding. ================================================== ====

11-06-2007, 01:05 PM	#84
JSWolf Resident Curmudgeon Posts: 84,010 Karma: 153695583 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3	How is ZML useful to get a ZML marked up text into LRF and PRC formats so we can read them on our 505s and Gen3s/iLiads?

11-06-2007, 01:42 PM	#85
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	jon, right now, it's not. very shortly, however, the .html conversion will be solid enough for you to use as the rosetta-stone to leapfrog to other formats. -bowerbird

11-06-2007, 02:17 PM	#86
bowerbird Banned Posts: 269 Karma: -273 Join Date: Sep 2006 Location: los angeles	jbenny said: > You bring up a very valid point that most of us don't think of > (me included). Can you suggest a way to handle this > without having the page numbers in-line with the text? > Most of us would find the visible page numbers too obnoxious. > For XHTML markup, one thing that comes to mind > (just off the top of my head) would be to enclose > all the text that makes up an original page with > a surrounding tag that uses the "id" attribute > to hold the page number i admire the initiative that makes you jump in on this problem that you haven't really thought about before. a 3.2k lorem ipsum example isn't really needed, though. many other people _have_ thought about it, for a while, so a little exploratory research can go a long way here... as they've already made a pass at providing solutions... i've described mine -- and will repeat the links here -- > http://z-m-l.com/go/myant/myantp001.html > http://z-m-l.com/go/mabie/mabiep001.html > http://z-m-l.com/go/sgfhb/sgfhbp001.html > http://z-m-l.com/go/tolbk/tolbkp001.html > http://z-m-l.com/go/goann/goannp001.html these demo e-books let you link directly to _one_page_, where the text is available in easily-copied digital form, and the page-scan is presented for reference as well... a comment-form at the bottom lets people report errors, or even make annotations to the page for others to see... and again, these are all being done with my .zml format. you can view the .zml files underlying the above books: > http://z-m-l.com/go/myant/myant.zml > http://z-m-l.com/go/mabie/mabie.zml > http://z-m-l.com/go/sgfhb/sgfhb.zml > http://z-m-l.com/go/tolbk/tolbk.zml > http://z-m-l.com/go/goann/goann.zml so, in spite of the people who would like to convince you otherwise, here's some pudding as proof that light-markup is quite capable of generating an e-book that works well... so that's _my_ particular take on pagenumber retention... 
* i can point to other work too, and i am happy to do so... i might as well start at the top, with la creme de la creme. jose menedez has created "digital reprints" which _rock_. you can download one here: > http://www.ibiblio.org/ebooks/Einste...Relativity.pdf that .pdf might _look_ unremarkable, upon first viewing, but you'll find that the pagenumbers are actually _links_ that will open up the _page-scan_ for that specific page. originally they opened up the exact page in the scan-set at google, but it seems google changed their interface, and now jose's nice links merely go to the first page. there's a lesson there against depending on other sites... so, as a more convenient option, you can use my scans. using the number actually printed on the original page, plug it into the following u.r.l. template to see the scan: > http://z-m-l.com/go/einst/einstp001.jpg in place of the "001", put the page you want. for example: > http://z-m-l.com/go/einst/einstp089.jpg will pull up the page-scan for page 89 from the p-book... if you closely examine any page-scan, you'll observe that jose's .pdf page is a very accurate replica of that page-scan. the linebreaks are retained, down to end-line hyphenates. the leading is almost exactly the same. so are the margins. jose is an obsessive-compulsive guy; he gets the details right. here's another digital reprint, this time geronimo's life story: > http://www.ibiblio.org/ebooks/Geronimo/GerStory.pdf compare any .pdf page with its scan by using this template: > http://z-m-l.com/go/geron/geronp001.jpg (as before, replace "001" with the page-number you want.) by the way, google's scan-set from this book is the _worst_ job of scanning a book that i have ever seen from them... it's worth downloading just for its humor as a bad example. 
and finally, here's a third from jose, willa cather's "my antonia": > http://www.ibiblio.org/ebooks/Cather...ia/Antonia.pdf again, you can see the pagescan for any page on my site: > http://z-m-l.com/go/myant/myantp001.jpg (as before, replace "001" with the page-number you want.) for the first two digital reprints, you can step through the scan-sets more easily using my "show scan-set" viewer: > http://z-m-l.com/go/sss.pl "geronimo's story" is the one that comes up by default, but you can choose the einstein book or the cather book with the book-selection menu you will find on the page... (and "my antonia" was also listed above in my examples.) the quality of each of jose's "digital reprints", as a reprint, is fantastic. you immediately see the pages are immensely cleaner than the scans of those old library books, some of which were subjected to careless markings by borrowers who evidently were never taught to respect library books. (then again, i guess that, over the course of 100 years, there's gonna be _one_ borrower who simply _forgets_ that this was a library book, and not one of his own books.) jose's tremendous quality gets _more_ remarkable as we realize the digital reprint -- as opposed to the scan-set -- is _digital_text_, and thus can be _searched_ and _copied_, meaning that it's infinitely more flexible than the scan-set. and this all becomes truly mind-boggling when you further realize the .pdf is 10-30 times _smaller_ than the scan-set, which means it will run faster and use far fewer resources... and yes, it takes some work to convert a scan-set into digital text -- o.c.r. and proofing and formatting -- but considering the huge benefits that result, it's worth it. this, truly, is the direction our digital library should follow... store a copy of the scans online, so people can refer to 'em, to confirm for themselves that the digitization was accurate. 
but give them, for their actual use, a file that's _digital_text_ -- for maximal convenience in our 21st-century cyberspace -- yet is capable of _replicating_ the original p-book _exactly_, for the scholar-valued touchstone with previous centuries... (that doesn't mean we have to _leave_ it in that form; we can always remix it to our customization if we want to, since that _remixing_ is part of the magic of a _digital_text_... but still, we know if we want to replicate the p-book exactly, we can. and there are times when we _do_ want exact replications... it makes it much easier to know we're all on the same page. sorry, but i can't ever resist throwing in that good old cliche.) indeed, the biggest thing wrong with jose's digital reprints is the reliance on .pdf, which is the "roach motel" of formats. (that is, documents can go in, but they cannot come out...) another problem is that jose builds his files using ms-word, and doesn't make that original file available for us to remix. in spite of these faults, though, jose's work is outstanding... (and, just to connect the dots for you, my z.m.l. work is designed to give the benefits while overcoming the faults.) * there's been other work done on retaining pagenumbers too. here's yet another version of our good old standby, "my antonia", which uses an x.m.l. approach to store pagenumber information: > http://www.openreader.org/myantonia/...myantonia.html by the way, this is the strategy that led me to make point #14 about not putting pagenumbers in-line inside the body-text... but, on the _positive_ side, note that this document also allows a person to click out to view each scan for reference. also of interest, although i'd hope this degree of markup becomes unnecessary in the future, with better browsers, observe that each paragraph has its own "i.d." reference, thus allowing a link to be made to a specific _paragraph_... (should we next expect an i.d. reference on every _word_?) 
*

and last but not least, because they've actually done _the_most_ work on retaining pagenumber information, you need to look at the .html versions of the books _distributed_proofreaders_ does for project gutenberg.

over the course of the last couple of years, most of the postprocessors there have moved to the position that they believe pagenumbers _should_ be both saved and displayed, so nearly all of the .html versions posted to p.g. lately have them...

unfortunately, the p.g. version of "my antonia" does not have an .html version -- sad, the absence of automatic conversion, eh?, perhaps someone could use gutenmark to make one for them -- so we can't compare their version of it straight across the board...

so let's take p.g. e-text #22222, as a demo, to pick a fun number:
> http://www.gutenberg.org/files/22222...-h/22222-h.htm

you'll see that, yes indeed, they've retained the pagenumber info.

and, unlike the x.m.l. example above, they have used their c.s.s. to move the pagenumber out into the margin, and turned it gray, so it's less conspicuous and distracting. so those are good moves.

moreover, if you really want a very good idea of exactly where the pagebreak occurred, you can drag your cursor across the line and observe exactly where in the line the pagenumber gets highlighted.

for example, if you scroll down to page 20, and do this little trick, you'll find the pagebreak occurs between "practitioners" and "is". (you could "view source" if you want, of course, but that's clumsy.)

what that _means_ is that -- in spite of where it is being displayed -- the pagenumber actually exists in-line, right in the body of the text.

unfortunately, what _that_ means is that, when you _copy_ the text, the pagenumbers are mixed in, which we already said is a bad thing.
for instance, if you copy out the text around pagebreak 20, you get:
> and although applied to all graduate medical practitioners [20]is,
> in all other realms of learning, a degree awarded for graduate work

eewh! see that pagenumber in the middle? that's not what we want!

however, the problem isn't limited to a hassle when doing remixing. these pagenumbers intermingled in the actual body-text can _also_ cause problems when the end-user performs a _search_ on the text.

so, for instance, if you do a search for "practitioners is", you will _not_ get a hit on that sentence that straddles page 20, because there is a pagenumber between those two words.

(ironically, if you search for "practitioners [20]is", you _do_ get a hit; but of course if you knew that that text is at pagebreak 20, then you didn't need to search for it, did you? you'd just go right to page 20.)

i googled to see if a search on "practitioners is" would bring up the .html version of e-text #22222. it didn't. but more experimentation revealed that i couldn't do _anything_ to fetch the .html version. the .txt version came up just fine. but no search would find the .html... so that's a mystery to me...

these twin usability problems aren't _showstoppers_, but they _are_ "glitches" that should be cleared up, if someone has an idea _how_... if you are that someone, hustle over to d.p. and help them out, ok?

*

at any rate, here we have some ways to give scholars pagenumbers... if you have any feedback on any of these systems, i'd love to hear it...

-bowerbird
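as one possible way around the copy and search glitches described above, here is a minimal python sketch that pulls in-line page markers out of copied text and keeps them on the side instead. it assumes the markers always look like "[20]" (digits in square brackets), which is only what the copied sample shows; the function name split_page_markers is made up for illustration and is not part of any d.p. or p.g. tool. note that the same pattern would also swallow bracketed footnote numbers, so a real tool would need a stricter rule.

```python
import re

# assumption: a page marker is digits inside square brackets, e.g. "[20]"
PAGE_MARKER = re.compile(r"\[(\d+)\]")


def split_page_markers(text):
    """return (clean_text, breaks).

    breaks is a list of (offset, page) pairs, where offset is the
    position in clean_text at which that page begins.
    """
    clean_parts = []
    breaks = []
    clean_len = 0
    last = 0
    for match in PAGE_MARKER.finditer(text):
        chunk = text[last:match.start()]
        clean_parts.append(chunk)
        clean_len += len(chunk)
        breaks.append((clean_len, int(match.group(1))))
        last = match.end()
    clean_parts.append(text[last:])
    return "".join(clean_parts), breaks


# the text as it comes out when copied from around pagebreak 20
copied = (
    "and although applied to all graduate medical practitioners [20]is, "
    "in all other realms of learning, a degree awarded for graduate work"
)
clean, breaks = split_page_markers(copied)
```

with the markers held in `breaks` rather than in the body-text, a plain search for "practitioners is" matches `clean`, and the page-number information is still available by offset for anyone who wants it.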

bowerbird · 11-06-2007, 02:28 PM

in that x.m.l.-based version of "my antonia" i discussed above, i forgot to provide an example of a link direct to a paragraph. here's one:
> http://www.openreader.org/myantonia/...nia.html#p0251

you should read the paragraph directly after that one as well...

-bowerbird
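to make the paragraph-link idea concrete, here is a rough python sketch of how per-paragraph "id" attributes turn into direct links. it is a guess at the general mechanism, not a description of how the openreader document was actually built: the sample markup, the helper name paragraph_links, and the example.org address are all invented for illustration, and only the "p0251" style of i.d. comes from the link above.

```python
from html.parser import HTMLParser


class ParagraphIds(HTMLParser):
    """collect the id attribute of every <p> element that has one."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            pid = dict(attrs).get("id")
            if pid:
                self.ids.append(pid)


def paragraph_links(base_url, markup):
    """return one fragment link per paragraph that carries an id."""
    parser = ParagraphIds()
    parser.feed(markup)
    parser.close()
    return [f"{base_url}#{pid}" for pid in parser.ids]


# hypothetical markup; the third paragraph has no id, so it gets no link
sample = '<p id="p0250">first.</p><p id="p0251">second.</p><p>no id here.</p>'
links = paragraph_links("http://example.org/book.html", sample)
```

a browser that opens one of these links scrolls straight to the element whose id matches the part after the "#", which is all a paragraph-level citation needs.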

bowerbird · 11-06-2007, 02:40 PM

i expect to handle 99% of the books in the p.g. library. and handle them well.

indeed, i expect my viewer-app will give performance that is surpassed by _no_ others, and which is _far_superior_ to most...

of course, i also hope those other viewers improve, to the point where they are no longer surpassed by my app, or any other. the world of e-books only suffers when viewers are bad...

-bowerbird

bowerbird · 11-06-2007, 02:44 PM

kovidgoyal, i have substantial replies to your previous posts, which i would like to post, but i don't want to _monopolize_ the conversation here. i'd like to give other people a chance. when two people overtake a thread, it can get boring fast...

so if you resist the urge to address every point right away, it would be good. i promise you'll have lots of chances later.

-bowerbird