you don't need to join distributed proofreaders to snag some scans:
> http://www.pgdp.org/ols/
you can also find scan-sets in the same places that d.p. finds 'em:
> http://www.pgdp.net/wiki/Sources_for_Scan_Harvesting
***
nekokami said:
> Why reassemble the word images, instead of OCR?
well, believe it or not, that's one way of doing "reflow".
parc did a paper on it a while back. here's the info:
> Paper to PDA
> Thomas M. Breuel, William C. Janssen, Kris Popat, Henry S. Baird
> 11 August 2002
> TR-01-2
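
to make "reflow of word images" a bit more concrete, here's a rough python sketch. it is _not_ the method from the parc paper, just a greedy line-fill over word bounding boxes, which is the simplest thing that could work. the names (`WordBox`, `reflow`) and the spacing defaults are made up for illustration, and it assumes the page has already been cut into word images in reading order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordBox:
    """size in pixels of one cropped word image (hypothetical type)."""
    width: int
    height: int


def reflow(words: List[WordBox], page_width: int,
           space: int = 8, leading: int = 4) -> List[Tuple[int, int]]:
    """return the top-left (x, y) for each word image on a page
    that is `page_width` pixels wide, filling lines greedily.

    `space` (gap between words) and `leading` (gap between lines)
    are assumed values, not taken from any paper.
    """
    positions = []
    x, y, line_height = 0, 0, 0
    for w in words:
        # start a new line if this word would overflow the current one
        # (a word wider than the page still gets a line to itself)
        if x > 0 and x + w.width > page_width:
            y += line_height + leading
            x, line_height = 0, 0
        positions.append((x, y))
        x += w.width + space
        line_height = max(line_height, w.height)
    return positions


if __name__ == "__main__":
    words = [WordBox(50, 20), WordBox(60, 22), WordBox(40, 20)]
    print(reflow(words, page_width=120))
```

the point is that no character recognition happens anywhere: the word images get pasted at new positions to fit a narrower screen, so recognition errors can't creep in.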
***
nate said:
> Developing a decent OCR program from scratch
> is a Master's thesis or PhD level project
> (according to my professor).
um, he's pulling your leg. developing a decent o.c.r. program is
_immensely_ difficult. even with a head start they obtained from
adopting a project from elsewhere, google discovered it's hard...
take a look at their recent alpha of ocropus to get a rough idea:
> http://code.google.com/p/ocropus/
-bowerbird