you don't need to join distributed proofreaders to snag some scans:
you can also find scan-sets in the same places that d.p. finds 'em:
> Why reassemble the word images, instead of OCR?
well, believe it or not, that's one way of doing "reflow".
parc did a paper on it a while back. here's the info:
> Paper to PDA
> Thomas M. Breuel, William C. Janssen, Kris Popat, Henry S. Baird
> 11 August 2002
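
to make the idea concrete, here's a minimal sketch of reflowing word images. it is not the method from the parc paper, just a plain greedy line-filler, and it assumes you already have each word cropped out as its own image with a known width and height. the function name `reflow_words` and the `word_gap` and `line_gap` parameters are made up for illustration:

```python
from typing import List, Tuple


def reflow_words(
    word_sizes: List[Tuple[int, int]],
    page_width: int,
    word_gap: int = 4,   # hypothetical spacing between words, in pixels
    line_gap: int = 6,   # hypothetical spacing between lines, in pixels
) -> List[Tuple[int, int]]:
    """Greedy reflow: place word images left to right, wrapping when a line is full.

    word_sizes holds (width, height) for each word image, in reading order.
    Returns the (x, y) top-left position for each word on the new page.
    """
    positions = []
    x = 0            # left edge for the next word on the current line
    y = 0            # top edge of the current line
    line_height = 0  # tallest word placed on the current line so far

    for width, height in word_sizes:
        # wrap if the word would overflow, unless the line is still empty
        # (a word wider than the page gets a line to itself)
        if x > 0 and x + width > page_width:
            y += line_height + line_gap
            x = 0
            line_height = 0
        positions.append((x, y))
        x += width + word_gap
        line_height = max(line_height, height)

    return positions


if __name__ == "__main__":
    # three word images reflowed onto a 120-pixel-wide screen
    print(reflow_words([(50, 10), (60, 12), (40, 10)], page_width=120))
```

to actually build the reflowed page you'd paste each cropped word image at its returned position. no character recognition is involved, which is the whole point of doing it this way.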
> Developing a decent OCR program from scratch
> is a Master's thesis or PhD level project
> (according to my professor).
um, he's pulling your leg. developing a decent o.c.r. program is
_immensely_ difficult. even with the head start they got by
adopting a project from elsewhere, google discovered it's hard...
take a look at their recent alpha of ocropus to get a rough idea: