View Full Version : Scanning books
Nate the great 10-31-2007, 02:18 PM Does anyone have any scanned images of public domain books they could share? I am working on a project for my Computer Vision class, and need a diverse sample group. JPEG is preferred, but any common format is okay.
My project is aimed at typed manuscripts. My program will accept an image, correct the orientation, remove the pictures and black marks, and assemble the word images into a new image file.
If it works well, I will likely adapt it to work with PDFs and post it here.
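A minimal sketch of the "remove the pictures and black marks" step, in case it helps anyone following along. It is not Nate's program. It assumes the page is already binarized into a list of rows (1 = ink, 0 = background), and the names `connected_components` and `keep_text_like`, as well as the thresholds `min_pixels` and `max_side`, are made up for illustration. The idea is to label connected blobs of ink and keep only those whose size looks like a character.

```python
from collections import deque


def connected_components(img):
    """Return a list of blobs; each blob is a list of (row, col) ink pixels.

    img is a list of rows with 1 = ink and 0 = background.
    Pixels are joined using 8-connectivity.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append(pixels)
    return blobs


def keep_text_like(img, min_pixels=2, max_side=4):
    """Copy img, erasing blobs that are too small (specks) or too large
    (pictures, scanner marks). Thresholds are illustrative and would need
    tuning to the scan resolution."""
    out = [[0] * len(img[0]) for _ in img]
    for pixels in connected_components(img):
        ys = [p[0] for p in pixels]
        xs = [p[1] for p in pixels]
        height = max(ys) - min(ys) + 1
        width = max(xs) - min(xs) + 1
        if len(pixels) >= min_pixels and max(height, width) <= max_side:
            for y, x in pixels:
                out[y][x] = 1
    return out
```

On a real scan the thresholds would probably be set relative to the median blob height rather than fixed, since character size depends on the dpi.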
vivaldirules 10-31-2007, 02:49 PM Nate, I'd suggest you go here http://www.archive.org/details/texts and take your pick. You'll have a wide selection including some with images, some without, some with distorted text, different fonts, etc.
Oops. Finally really read your post. You wanted jpg not pdf. Sorry.
Nate the great 10-31-2007, 03:00 PM
> Oops. Finally really read your post. You wanted jpg not pdf. Sorry.
Correct. I currently don't have tools to extract the images from PDFs.
kovidgoyal 10-31-2007, 03:03 PM pdftohtml which is part of libprs500 extracts images from PDFs. It's not perfect, it may not get all images, but it might be good enough for your needs.
vivaldirules 11-02-2007, 05:27 PM Nate, would these do? These grayscale scans are from my 1200 dpi HP scanner. I tried to give you a variety of fonts, some typical foxing and other background problems, some with a bit of a tilt to the scan image, etc. I'm not sure what you want. If you want more or something different let us know and I'll try again.
nekokami 11-02-2007, 06:30 PM Why reassemble the word images, instead of OCR? I mean, beyond the obvious thought that this may fill a requirement for your course.
Nate the great 11-02-2007, 06:58 PM
> Why reassemble the word images, instead of OCR? I mean, beyond the obvious thought that this may fill a requirement for your course.
Developing a decent OCR program from scratch is a Master's thesis or PhD level project (according to my professor). I think I could do it now. But if it's worth that much I want to get the full value.
Robert Marquard 11-03-2007, 02:27 AM Definitely join Distributed Proofreaders. http://www.pgdp.net
They have scans either harvested or done by volunteers. They will also happily answer any question.
nekokami 11-03-2007, 11:12 PM
> Developing a decent OCR program from scratch is a Master's thesis or PhD level project (according to my professor). I think I could do it now. But if it's worth that much I want to get the full value.
Makes sense. It sounds like what you're working on could be a good clean-up phase before OCR, too. :)
Too bad you're not also a robotics student-- we really need a cheaper page-turning scanbot!
jbenny 11-04-2007, 01:16 AM
> Too bad you're not also a robotics student-- we really need a cheaper page-turning scanbot!
Just contract it out to Walmart. They'll have those little Chinese kids turning pages like you wouldn't believe. Oh, I'm bad... :)
bowerbird 11-04-2007, 01:20 AM you don't need to join distributed proofreaders to snag some scans:
> http://www.pgdp.org/ols/
you can also find scan-sets in the same places that d.p. finds 'em:
> http://www.pgdp.net/wiki/Sources_for_Scan_Harvesting
***
nekokami said:
> Why reassemble the word images, instead of OCR?
well, believe it or not, that's one way of doing "reflow".
parc did a paper on it a while back. here's the info:
> Paper to PDA
> Thomas M. Breuel, William C. Janssen, Kris Popat, Henry S. Baird
> 11 August 2002
> TR-01-2
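a rough sketch of what "reflow" by reassembling word images can look like, for the curious. this is not the method from the parc paper, just a plain greedy line-fill under simple assumptions: each word has already been cut out as a box with a known (width, height), the boxes are in reading order, and the function name `reflow` and the `gap` and `leading` parameters are made up for illustration.

```python
def reflow(word_sizes, page_width, gap=4, leading=2):
    """Greedy line-fill for word images.

    word_sizes: list of (width, height) tuples in reading order.
    page_width: width in pixels of the new, narrower page.
    gap:        horizontal space placed after each word.
    leading:    vertical space added between lines.

    Returns one (x, y) top-left position per word. A word wider than
    the page is still placed at x = 0 and simply overflows.
    """
    positions = []
    x = y = 0
    line_height = 0
    for w, h in word_sizes:
        # Wrap to a new line when this word would cross the right edge.
        if x > 0 and x + w > page_width:
            y += line_height + leading
            x = 0
            line_height = 0
        positions.append((x, y))
        x += w + gap
        # The tallest word on the line sets how far down the next line starts.
        line_height = max(line_height, h)
    return positions


words = [(30, 10), (40, 12), (50, 10), (20, 10)]
print(reflow(words, page_width=80))
```

with those positions in hand, each word image would be pasted onto a blank canvas at its (x, y). aligning words on a shared baseline instead of their top edges would look better but needs baseline detection, which is left out here.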
***
nate said:
> Developing a decent OCR program from scratch
> is a Master's thesis or PhD level project
> (according to my professor).
um, he's pulling your leg. developing a decent o.c.r. program is
_immensely_ difficult. even with a headstart they obtained from
adopting a project from elsewhere, google discovered it's hard...
take a look at their recent alpha of ocropus to get a rough idea:
> http://code.google.com/p/ocropus/
-bowerbird