MobileRead Forums > E-Book Formats > Workshop

Old 11-07-2007, 11:09 PM   #76
Panurge
Enthusiast
Posts: 34
Karma: 336
Join Date: Dec 2006
Location: Texas
Device: Sony Reader
Kovidgoyal: [EDIT: An example from physics research articles. A resolution of sections is usually sufficient. i.e. people refer to section so-and-so of paper so and so.
I don't know if that is sufficient resolution in general though.]

Yes, I think that "resolution" is the problem. Paragraph numbers would probably work well for everything but poetry, though in some cases--such as the one you mention--larger units might be more practical. Page numbers work if one can pinpoint the exact edition (publisher, place, date, in addition to title and author) being referenced; that was the contribution of printing. For manuscript copies, logical divisions such as sections or paragraphs or line numbers (for verse) were the only alternative. But are such things needed for electronic documents that can be searched for exact phrases? Presumably not. So long as one can identify the electronic source one is referring to, searching would suffice. But there's the rub. There is no system of cataloguing material that is purely electronic in origin. The URL of a web site, for instance, is an unstable identifier, as we have learnt very quickly in the last decade or so. Printed books have that data, but what kind of unique identifier do electronic documents offer? There's no central clearing house, no Library of Congress or OCLC (the online cataloguing authority for books) or ISBN number as of yet.
When Michael Hart (an academic) started Project Gutenberg, he seems to have encouraged embedded page numbers in ASCII text for the reasons we've already discussed. So far electronic documents are a sort of free-floating, indistinct mass of various kinds of information. Without some standards of granularity or resolution, research will become too unwieldy; the Internet search engine demonstrates the problem all too well.
As a librarian (rather than as a programmer, who finds useful and efficient ways of designing specific solutions), I have to worry about this sort of thing increasingly. Page numbers are, in a manner of speaking, the tip of the iceberg.
Speaking of Google books (which BowerBird mentions above), shouldn't someone point out to them that the scanning is being rather carelessly executed? I keep running into instances of books that are so poorly positioned that part of the text is cut off, to say nothing of the page numbers.
Old 11-08-2007, 04:17 AM   #77
bowerbird
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
sartori said:
> those pages I added were time consuming
> but mainly because I was figuring out the layout.
> I do plan on working through the whole book
> but I haven't found a plain text version available
> so I am ocr'ing the pdf from archive.org.
> This is currently the slowest part as I am
> proofing and converting quotes and dashes over.

um, gee, you might be missing something very important.

if you got it from archive.org, then it was almost certainly
scanned by the o.c.a., which means that -- right alongside
the .pdf copy -- you should find the o.c.r. they did on it...

i couldn't find volume 1, but some volumes from this series
certainly have their text available. sometimes you need to
click on the "ftp" link to find _all_ of the files that they offer.
if you see nothing labeled as ".txt", seek the "djvu.text" file.

however, in a spectacular display of sheer incompetence,
sometimes the text files are burdened by severe problems,
some of which can even border on fatal. i won't bother to
go into the details here, but check the text _carefully_ first,
before going on to pour work into it, or you might regret it.

so you might well end up doing o.c.r. on the .pdf anyway.
but i'd still suggest you should check out their text first...


> For example, if you increase the display font size
> in your browser, the pages expand lengthwise
> to accommodate it. It just runs into problems with items
> that are specifically positioned, such as the table of contents.

another problem that you need to be aware of -- which might
or might not be something you consider serious -- is when a
paragraph is split across a pagebreak -- as they usually are --
because then the text won't fill out the bottom line of the page,
which is what people expect to see in that situation. since the
reflowed text will often end in the middle of the line, the impression
is that the paragraph has ended, which can be disconcerting to people.


> If so it wouldn't be too hard to create a library of books
> that display paged as in my example but then you could
> easily convert them to lrf and ignore page numbers, etc.

but if the page as displayed doesn't fit correctly on the screen,
then you'll have "pagebreaks" occurring mid-screen, correct?
which kind of defeats the whole purpose of a paged display...

***

jbenny said:
> They have apparently OCRed the text

um, well of course google does o.c.r. on the scans.
how else would they be able to do searches on it?


> as you can "view text" for each individual page.

they do that so as to provide access to the visually-impaired.


> Sadly, the downloadable PDF doesn't include the OCRed text.

that's because they don't really want you to have the text.
well, they probably don't care if _you_ have it, but they
don't want all of the _other_ search engines to have it...

***

sartori said:
> I just checked those out and they appear to be from
> a slightly different version than the ones on archive.org
> (and they have all 31 volumes). As my goal is to represent
> the printed version, the differences may become a problem
> with page numbers being different.

so strange. did this series with 31 volumes _really_ go through
several editions? i guess it's not impossible, but it'd surprise me.
are you sure that it's not just _flakiness_ in the p.g. digitization?

because one appealing aspect of the p.g. versions in general is
that they've been subjected to some proofreading, which means
-- if nothing else -- that you can compare them to your output,
because the differences between the two versions will point to
errors in one (or both) of them. indeed, this has proven to be
one of the _most_ effective ways of doing "proofing" on a text...
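
A minimal sketch of that compare-two-versions idea, in Python. It assumes both texts are already loaded as strings; the function name `word_differences` and the sample sentences are made up for illustration, and `difflib` is just one convenient way to do the comparison, not necessarily what anyone in this thread uses.

```python
import difflib


def word_differences(text_a, text_b):
    """Return (words_from_a, words_from_b) pairs wherever the two texts disagree."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False keeps common words from being ignored in long texts.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return differences


# Hypothetical sample data: one fresh o.c.r. result, one proofread copy.
ocr_text = "It was the hest of times, it was the worst of times"
proofed_text = "It was the best of times, it was the worst of times"

for mine, theirs in word_differences(ocr_text, proofed_text):
    print(f"my o.c.r.: {mine!r}   other version: {theirs!r}")
```

Each reported pair is only a place to look. The diff cannot say which side is wrong, so every difference still needs a check against the page scan.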

-bowerbird
Old 11-08-2007, 12:35 PM   #78
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Originally Posted by Panurge
Kovidgoyal:
Yes, I think that "resolution" is the problem. Paragraph numbers would probably work well for everything but poetry, though in some cases--such as the one you mention--larger units might be more practical. Page numbers work if one can pinpoint the exact edition (publisher, place, date, in addition to title and author) being referenced; that was the contribution of printing. For manuscript copies, logical divisions such as sections or paragraphs or line numbers (for verse) were the only alternative. But are such things needed for electronic documents that can be searched for exact phrases? Presumably not. So long as one can identify the electronic source one is referring to, searching would suffice. But there's the rub. There is no system of cataloguing material that is purely electronic in origin. The URL of a web site, for instance, is an unstable identifier, as we have learnt very quickly in the last decade or so. Printed books have that data, but what kind of unique identifier do electronic documents offer? There's no central clearing house, no Library of Congress or OCLC (the online cataloguing authority for books) or ISBN number as of yet.
I think paragraphs work for everything. After all, a stanza of poetry is really a paragraph in effect. The idea is that the logical unit of the author is the paragraph. It is a cohesive thought and can often be determined easily even when the editions change. Paperback vs. hardback makes page numbers useless again.

You are correct that text can be searched, but sometimes there are duplicates if you fail to type enough, and searches sometimes fail on word wraps. A specific reference is what is needed when referencing someone else's work.
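
To illustrate both failure modes, here is a small Python sketch of a phrase search that tolerates line wraps and reports every match, so duplicates become visible. The function name `find_phrase` and the sample text are invented for the example. It does not handle words hyphenated across a line break.

```python
import re


def find_phrase(text, phrase):
    """Return the start offset of every match, treating any run of
    whitespace (spaces, tabs, line breaks) between words as equivalent."""
    words = phrase.split()
    if not words:
        return []
    pattern = r"\s+".join(re.escape(word) for word in words)
    return [match.start() for match in re.finditer(pattern, text)]


# Hypothetical sample: the first occurrence is split across a line wrap.
sample = "the quick\nbrown fox and the quick brown dog"
hits = find_phrase(sample, "quick brown")
print(hits)
if len(hits) > 1:
    print("phrase is ambiguous; a longer phrase or a specific reference is needed")
```

A plain substring search for "quick brown" would miss the wrapped occurrence and find only the second one, which is the kind of silent failure described above.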

You also raise a good point about electronic text not being able to be identified. This is particularly interesting when web pages are inherently copyrighted. It would seem that copyrights are not enforceable if you can't produce the original.

Dale
Old 11-08-2007, 12:39 PM   #79
nekokami
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
Quote:
Originally Posted by DaleDe
You also raise a good point about electronic text not being able to be identified. This is particularly interesting when web pages are inherently copyrighted. It would seem that copyrights are not enforceable if you can't produce the original.
What about the Internet Archive?
Old 11-08-2007, 12:46 PM   #80
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Originally Posted by nekokami
What about the Internet Archive?
Are you talking about google? I think it is not guaranteed over the long haul.

Dale
Old 11-08-2007, 01:48 PM   #81
bowerbird
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
panurge said:
> Yes, I think that "resolution" is the problem.
> Paragraph numbers would probably work well
> for everything but poetry, though in some cases
> --such as the one you mention--larger units might be
> more practical.

ok, i'll try one more time.

_nothing_ will work if you don't have stable documents.
_nothing_. so stable documents is a necessary condition.

fortunately, stable documents is also a _sufficient_ condition.
once you have stable documents, just about _any_ system will
work, and work just fine, so you don't need to worry about it...


> Page numbers work if one can pinpoint the exact edition
> (publisher, place, date, in addition to title and author)
> being referenced; that was the contribution of printing.

assuming that you have an infrastructure of stable documents,
the u.r.l. to a document is the "pinpoint" to "the exact edition."

every document points to its "official" u.r.l., so you can compare it
with the document that appears at that u.r.l., and if it is the same,
it hasn't been tampered with, and you know it's a "legitimate" copy.
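
One way to picture that comparison, as a Python sketch. It assumes you already have the bytes of your local copy and the bytes fetched from the "official" u.r.l. (the fetching is left out). Using SHA-256 is a choice made for this example; the post only says "compare", and a byte-for-byte comparison would do the same job. The function names are hypothetical.

```python
import hashlib


def fingerprint(document_bytes):
    """Return a SHA-256 hex digest that changes if any byte of the document changes."""
    return hashlib.sha256(document_bytes).hexdigest()


def is_legitimate_copy(local_copy, official_copy):
    """True when the local copy is identical to the copy at the official location."""
    return fingerprint(local_copy) == fingerprint(official_copy)


# Hypothetical sample data standing in for the two downloads.
official = b"Call me Ishmael. Some years ago..."
untouched = b"Call me Ishmael. Some years ago..."
tampered = b"Call me Ishmael. Some days ago..."

print(is_legitimate_copy(untouched, official))
print(is_legitimate_copy(tampered, official))
```

The check is only meaningful if the document at the official u.r.l. never changes, which is the stable-documents condition argued for above.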

as with everything else, a system with stable documents makes it
_easy_, whereas it's difficult -- often to the point of impossibility --
in a system without stable documents.


> For manuscript copies, logical divisions such as sections or
> paragraphs or line numbers (for verse) were the only alternative.

and we need to "update" all those archival pointers for the new system.
whatever pointer that one document used to point to another document
needs to be "converted" so the electronic version of the first document
points to the correct place in the electronic version of the second one...


> But are such things needed for electronic documents that
> can be searched for exact phrases? Presumably not.

unless your infrastructure is explicitly using "search" as its methodology,
in which case it's automatic, you don't want to force users to do search
just to activate a pointer. they'll wanna be able to click directly to a point,
and that's a reasonable expectation about a capability we should give 'em.


> So long as one can identify the electronic source one is referring to,
> searching would suffice.

again, the source is unequivocally identified by virtue of its u.r.l.
and even if searching would "suffice", it's not convenient enough.


> But there's the rub. There is no system of cataloguing material
> that is purely electronic in origin. The URL of a web site, for instance,
> is an unstable identifier, as we have learnt very quickly in the last decade

the current system, one which permits unstable documents, won't work.

we need another system -- it could be built on top of the current one --
that has _only_ stable documents in it. this means we can still have the
unstable system -- there's no need to replace it, as it works fine for a
good many purposes -- it just means we have to create another system
that's fully intended to be a permanent archive for dependable reference.

as i said, this stable system could even be built on top of the current one.
if we incorporated a "datestamp" into the u.r.l., and then made sure that
we archived _everything_ that was _ever_ put on the web (which is not
as absurd as it sounds, since we're _almost_ doing it already), then we
will essentially _have_ the stable infrastructure that's required, at no cost.
(the wayback machine at internet archive is the best example of this now.)
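
A rough sketch of what a datestamped u.r.l. could look like, in Python. The `web.archive.org/web/<timestamp>/<original url>` layout follows the pattern the Wayback Machine uses for its captures as far as I know; treat the exact prefix as an assumption. The function name `datestamped_url` is invented for the example.

```python
from datetime import datetime


def datestamped_url(url, when, archive_prefix="https://web.archive.org/web"):
    """Combine an ordinary u.r.l. with a YYYYMMDDhhmmss datestamp so the
    reference names one specific version of the page."""
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"{archive_prefix}/{stamp}/{url}"


# Hypothetical example: cite a page as it stood when this post was written.
cited = datestamped_url("http://www.mobileread.com/", datetime(2007, 11, 8, 13, 48, 0))
print(cited)
```

Whether a capture actually exists for that exact moment depends on the archive. A pointer like this is only stable if every version that gets cited has been archived.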


> Printed books have that data, but what kind of unique identifier
> do electronic documents offer?

none. until, that is, we give them one. which isn't difficult to do at all...


> There's no central clearing house, no Library of Congress or OCLC
> (the online cataloguing authority for books) or ISBN number as of yet.

don't need that. wouldn't want that. this is an easy problem to solve.
it just requires always-getting-cheaper diskspace, and the commitment.


> Speaking of Google books (which BowerBird mentions above),
> shouldn't someone point out to them that the scanning is being
> rather carelessly executed?

oh, it's been pointed out. over and over and over and over and over.
even by some of its big supporters, like me. repeatedly. problem is,
it just doesn't seem to be sinking in, not quite as deeply as it should.
(they _have_ improved. but quality, and quality-control, is still awful.)

-bowerbird
Old 11-08-2007, 02:00 PM   #82
bowerbird
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
dalede said:
> I think paragraphs work for everything.

well, almost _anything_ will "work for everything"
if everyone agrees on how it will be implemented.


> After all a Stanza of poetry is really a paragraph in effect.

i know some people who would argue with you about that.
for a long time. they'd call you bad names for saying that.


> The idea is that the logical unit of the author is the paragraph.

maybe in your mind. but other authors could be very different.


> It is a cohesive thought and can often be determined
> easily even when the editions change.

i can show you edition-changes with changed paragraphs.

(but that's really neither here nor there, because any system
has to consider different editions to be different documents.
every pointer has to be relative to a specific edition, or else
you start getting into all kinds of very confusing messiness.)


> Paperback vs. hardback makes page number useless again.

not really. even if the pagination is different between the two
-- and sometimes it's not, but that's beside the point here --
when you're making a link, you simply link to one or the other...


> you are correct that text can be searched but sometimes
> there are duplicates if you fail to type enough and
> searches sometimes fail on word wraps. A specific reference
> is what is needed when referencing someone else's work.

there are some people who say that, because this issue is so
thorny right now, in our world of unstable documents, that
any text that you want to quote should just be included in
your own document. it's easy enough with copy-and-paste.

("what if you want to cite a whole article or book?", you ask.
then a system based on _search_ won't work for you anyway,
which is part of the problem with specifying such a system...)

-bowerbird
Old 11-08-2007, 04:49 PM   #83
nekokami
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
Quote:
Originally Posted by DaleDe
Are you talking about google? I think it is not guaranteed over the long haul.

Dale
No: http://www.archive.org
Old 11-08-2007, 05:12 PM   #84
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Originally Posted by nekokami
Interesting. I didn't know that it existed. I searched it for my name and got zero hits, but if I search Google I get almost 10,000 hits, so I think their search engine isn't too good. Thanks for the information.

Dale
Old 11-08-2007, 07:29 PM   #85
nekokami
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
I think it works best if you have an actual website to search for. I've managed to use it to dig up all kinds of pages that have disappeared over the years.
Old 11-08-2007, 08:09 PM   #86
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Originally Posted by nekokami
I think it works best if you have an actual website to search for. I've managed to use it to dig up all kinds of pages that have disappeared over the years.
Thanks, that works for me. I can see this is valuable. Looks like my site has about 8 years of history stored there.

Dale
Old 11-08-2007, 11:03 PM   #87
Panurge
Enthusiast
Posts: 34
Karma: 336
Join Date: Dec 2006
Location: Texas
Device: Sony Reader
DaleDe: "I think paragraphs work for everything. After all, a stanza of poetry is really a paragraph in effect. The idea is that the logical unit of the author is the paragraph. It is a cohesive thought and can often be determined easily even when the editions change. Paperback vs. hardback makes page numbers useless again."

Unfortunately, much poetry is not in stanzas, especially when it is written in blank verse.
Old 11-08-2007, 11:10 PM   #88
kovidgoyal
creator of calibre
Posts: 45,397
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Well, there's no real reason why you can't have paragraphs that are a single line long for blank verse.
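
A minimal Python sketch of that idea, assuming plain text where prose paragraphs are separated by blank lines. The function name `number_units` and the `one_line_per_unit` flag are made up for the example; no existing tool's behaviour is implied.

```python
import re


def number_units(text, one_line_per_unit=False):
    """Return (number, unit) pairs for citing a passage.

    Prose: each blank-line-separated paragraph is one unit.
    Verse: with one_line_per_unit=True, every non-empty line is its own unit.
    """
    if one_line_per_unit:
        units = [line.strip() for line in text.splitlines() if line.strip()]
    else:
        blocks = re.split(r"\n\s*\n", text)
        # Join wrapped lines so a paragraph is one unit however it was wrapped.
        units = [" ".join(block.split()) for block in blocks if block.strip()]
    return list(enumerate(units, start=1))


prose = "First paragraph,\nwrapped over two lines.\n\nSecond paragraph."
verse = "Of Man's first disobedience, and the fruit\nOf that forbidden tree"

print(number_units(prose))
print(number_units(verse, one_line_per_unit=True))
```

The numbers are only stable for one fixed text. If paragraphing changes between editions, the numbering changes too.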
Old 11-08-2007, 11:14 PM   #89
Panurge
Enthusiast
Posts: 34
Karma: 336
Join Date: Dec 2006
Location: Texas
Device: Sony Reader
BowerBird: 'it is _not_ our job to make "a faithful representation of the print copy". we don't even _want_ to do that -- even if we could -- and we _cannot_, because any time you move a document from one medium to a completely different one, you're creating a new edition.'

I don't think a representation of the print copy in its visual layout is necessary, just an exact transcription of the content. I want to know that no words (or other linguistic elements, such as punctuation and paragraphing, for instance) have been added or subtracted without some sort of indication that something was changed from the original source. That has been solid scholarly practice from the beginning, which is not the same as a facsimile. If the electronic version is guaranteed identical in that sense, then I would rarely need page numbers.

By the way, I find BowerBird's examples very interesting.

Last edited by Panurge; 11-08-2007 at 11:17 PM.
Old 11-09-2007, 01:10 AM   #90
sartori
Connoisseur
Posts: 54
Karma: 29
Join Date: Oct 2006
Quote:
Originally Posted by Panurge
I don't think a representation of the print copy in its visual layout is necessary, just an exact transcription of the content.
Panurge,

While I agree with the above statement when applied for research purposes, I feel that a decent representation of the source layout can add to the enjoyment when reading for pleasure. The attached image from Alice in Wonderland is one example where the layout of the text adds something to the book. I think if you start out with a solid facsimile of the original, it's pretty easy to convert that to plain text or any other format for research purposes.

Of course, I agree with you that laying this out can be done pretty easily for a web page, but as soon as you try to convert it for different devices like the Sony Reader, it's pretty much a lost cause.

Ultimately I would love it if you could start with a master document that is a facsimile of the original print version for viewing online and then export it for individual devices and have the server 'automatically' remove any markup that is not supported for that device.

Just my 2 cents.

Rob
