11-08-2007, 04:17 AM   #77
bowerbird
sartori said:
> those pages I added were time consuming
> but mainly because I was figuring out the layout.
> I do plan on working through the whole book
> but I haven't found a plain text version available
> so I am ocr'ing the pdf from archive.org.
> This is currently the slowest part as I am
> proofing and converting quotes and dashes over.

um, gee, you might be missing something very important.

if you got it from archive.org, then it was almost certainly
scanned by the o.c.a., which means that -- right alongside
the .pdf copy -- you should find the o.c.r. they did on it...

i couldn't find volume 1, but some volumes from this series
certainly have their text available. sometimes you need to
click on the "ftp" link to find _all_ of the files that they offer.
if you see nothing labeled as ".txt", seek the "djvu.text" file.

however, in a spectacular display of sheer incompetence,
sometimes the text files are burdened by severe problems,
some of which can even border on fatal. i won't bother to
go into the details here, but check the text _carefully_ first,
before going on to pour work into it, or you might regret it.

so you might well end up doing o.c.r. on the .pdf anyway.
but i'd still suggest you should check out their text first...


> For example, if you increase the display font size
> in your browser, the pages expand lengthwise
> to accommodate it. It just runs into problems with items
> that are specifically positioned, such as the table of contents.

another problem that you need to be aware of -- which might
or might not be something you consider serious -- is when a
paragraph is split across a pagebreak -- as they usually are --
because then the text won't fill out the bottom line of the page,
which is what people expect to see in that situation. when the reflow
ends in the middle of the line (as it often will), the impression is that
the paragraph has ended, which can be disconcerting to people.


> If so it wouldn't be too hard to created a library of books
> that display paged as in my example but then you could
> easily convert them to lrf and ignore page numbers, etc.

but if the page as displayed doesn't fit correctly on the screen,
then you'll have "pagebreaks" occurring mid-screen, correct?
which kind of defeats the whole purpose of a paged display...

***

jbenny said:
> They have apparently OCRed the text

um, well of course google does o.c.r. on the scans.
how else would they be able to do searches on it?


> as you can "view text" for each individual page.

they do that so as to provide access to the visually-impaired.


> Sadly, the downloadable PDF doesn't include the OCRed text.

that's because they don't really want you to have the text.
well, they probably don't care if _you_ have it, but they
don't want all of the _other_ search engines to have it...

***

sartori said:
> I just checked those out and they appear to be from
> a slightly different version than the ones on archive.org
> (and they have all 31 volumes). As my goal is to represent
> the printed version, the differences may become a problem
> with page numbers being different.

so strange. did this series with 31 volumes _really_ go through
several editions? i guess it's not impossible, but it'd surprise me.
are you sure that it's not just _flakiness_ in the p.g. digitization?

because one appealing aspect of the p.g. versions in general is
that they've been subjected to some proofreading, which means
-- if nothing else -- that you can compare them to your output,
because the differences between the two versions will point to
errors in one (or both) of them. indeed, this has proven to be
one of the _most_ effective ways of doing "proofing" on a text...
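
to make that comparison concrete, here is a minimal sketch in python,
using the standard difflib module. the function name and the sample
strings are made up for illustration, and it assumes both versions are
already loaded as plain strings. it only lists the spots where the two
versions disagree; deciding which side is wrong is still up to a human.

```python
import difflib


def word_differences(ocr_text, proofed_text):
    """Return (ocr_words, proofed_words) pairs wherever the two versions disagree."""
    # Compare word by word so that differing line breaks do not show up as differences.
    a = ocr_text.split()
    b = proofed_text.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Either side may be an empty string if words were inserted or dropped.
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


if __name__ == "__main__":
    # Hypothetical sample: "tirnes" is a typical o.c.r. misreading of "times".
    ocr = "It was the best of tirnes, it was the worst of times"
    proofed = "It was the best of times, it was the worst of times"
    for mine, theirs in word_differences(ocr, proofed):
        print(repr(mine), "<->", repr(theirs))
```

if the two texts really do come from different editions, expect a lot
of noise in the output, so it's worth trying this on a single chapter
before committing to it.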

-bowerbird