Scanning books

Nate the great · 10-31-2007, 02:18 PM

Does anyone have any scanned images of public domain books they could share? I am working on a project for my Computer Vision class, and need a diverse sample group. JPEG is preferred, but any common format is okay.

My project is aimed at typed manuscripts. My program will accept an image, correct the orientation, remove the pictures and black marks, and assemble the word images into a new image file.

If it works well, I will likely adapt it to work off PDFs and post it here.
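
For anyone following along, here is a minimal sketch of the last step of that pipeline: cutting a page into word images and reassembling them at a new width. It is not the program described above, only an illustration under stated assumptions. The page is assumed to be already deskewed, cleaned, and thresholded to 0/1 pixels; the names `runs`, `find_words`, and `reflow` and the parameters `word_gap`, `space`, and `leading` are made up for this example. It uses plain Python lists instead of an imaging library.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels: 0 = background, 1 = ink


def runs(flags: List[bool], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return [start, end) spans of True values.

    Spans separated by fewer than `min_gap` False values are merged,
    so small gaps between letters do not split a word.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(flags + [False]):  # sentinel closes the last span
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if spans and start - spans[-1][1] < min_gap:
                spans[-1] = (spans[-1][0], i)
            else:
                spans.append((start, i))
            start = None
    return spans


def find_words(img: Image, word_gap: int = 3) -> List[Image]:
    """Cut a binarized page into word images, in reading order."""
    words: List[Image] = []
    row_has_ink = [any(row) for row in img]
    for top, bottom in runs(row_has_ink):  # one span per text line
        line = img[top:bottom]
        col_has_ink = [any(r[x] for r in line) for x in range(len(line[0]))]
        for left, right in runs(col_has_ink, min_gap=word_gap):
            words.append([r[left:right] for r in line])
    return words


def reflow(words: List[Image], width: int, space: int = 2, leading: int = 1) -> Image:
    """Pack word images left to right into rows no wider than `width`."""
    page: Image = []
    row_words: List[Image] = []
    used = 0

    def flush() -> None:
        nonlocal row_words, used
        if not row_words:
            return
        height = max(len(w) for w in row_words)
        rows = [[0] * width for _ in range(height + leading)]
        x = 0
        for w in row_words:
            offset = height - len(w)  # bottom-align words on the row
            for y, r in enumerate(w):
                rows[offset + y][x:x + len(r)] = r
            x += len(w[0]) + space
        page.extend(rows)
        row_words, used = [], 0

    for w in words:
        w_width = len(w[0])
        if w_width > width:
            raise ValueError("word image is wider than the target page")
        needed = w_width if not row_words else used + space + w_width
        if row_words and needed > width:
            flush()  # current row is full, start a new one
            needed = w_width
        row_words.append(w)
        used = needed
    flush()
    return page
```

Calling `reflow(find_words(page), width=320)` on a thresholded page would give a narrower page with the same word images in the same order. Real scans would need a tuned `word_gap`, and text lines that touch or overlap would defeat the simple projection test used here.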

vivaldirules · 10-31-2007, 02:49 PM

Nate, I'd suggest you go here http://www.archive.org/details/texts and take your pick. You'll have a wide selection including some with images, some without, some with distorted text, different fonts, etc.

Oops. Finally really read your post. You wanted jpg not pdf. Sorry.

Nate the great · 10-31-2007, 03:00 PM

Quote:

Originally Posted by vivaldirules

Oops. Finally really read your post. You wanted jpg not pdf. Sorry.

Correct. I currently don't have tools to extract the images from PDFs.

kovidgoyal · 10-31-2007, 03:03 PM

pdftohtml, which is part of libprs500, extracts images from PDFs. It's not perfect and may not get all the images, but it might be good enough for your needs.

vivaldirules · 11-02-2007, 05:27 PM

Nate, would these do? These grayscale scans are from my 1200 dpi HP scanner. I tried to give you a variety of fonts, some typical foxing and other background problems, some with a bit of a tilt to the scan image, etc. I'm not sure what you want. If you want more or something different let us know and I'll try again.

nekokami · 11-02-2007, 06:30 PM

Why reassemble the word images, instead of OCR? I mean, beyond the obvious thought that this may fill a requirement for your course.

Nate the great · 11-02-2007, 06:58 PM

Quote:

Originally Posted by nekokami

Why reassemble the word images, instead of OCR? I mean, beyond the obvious thought that this may fill a requirement for your course.

Developing a decent OCR program from scratch is a Master's thesis or PhD level project (according to my professor). I think I could do it now. But if it's worth that much I want to get the full value.

Robert Marquard · 11-03-2007, 02:27 AM

Definitely join Distributed Proofreaders. http://www.pgdp.net
They have scans either harvested or done by volunteers. They will also happily answer any question.

nekokami · 11-03-2007, 11:12 PM

Quote:

Originally Posted by Nate the great

Developing a decent OCR program from scratch is a Master's thesis or PhD level project (according to my professor). I think I could do it now. But if it's worth that much I want to get the full value.

Makes sense. It sounds like what you're working on could be a good clean-up phase before OCR, too.

Too bad you're not also a robotics student-- we really need a cheaper page-turning scanbot!

jbenny · 11-04-2007, 01:16 AM

Quote:

Originally Posted by nekokami

Too bad you're not also a robotics student-- we really need a cheaper page-turning scanbot!

Just contract it out to Walmart. They'll have those little Chinese kids turning pages like you wouldn't believe. Oh, I'm bad...

bowerbird · 11-04-2007, 01:20 AM

you don't need to join distributed proofreaders to snag some scans:
> http://www.pgdp.org/ols/

you can also find scan-sets in the same places that d.p. finds 'em:
> http://www.pgdp.net/wiki/Sources_for_Scan_Harvesting

***

nekokami said:
> Why reassemble the word images, instead of OCR?

well, believe it or not, that's one way of doing "reflow".
parc did a paper on it a while back. here's the info:
> Paper to PDA
> Thomas M. Breuel, William C. Janssen, Kris Popat, Henry S. Baird
> 11 August 2002
> TR−01−2

***

nate said:
> Developing a decent OCR program from scratch
> is a Master's thesis or PhD level project
> (according to my professor).

um, he's pulling your leg. developing a decent o.c.r. program is
_immensely_ difficult. even with a headstart they obtained from
adopting a project from elsewhere, google discovered it's hard...
take a look at their recent alpha of ocropus to get a rough idea:
> http://code.google.com/p/ocropus/

-bowerbird
