10-31-2007, 02:18 PM | #1 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
Scanning books
Does anyone have any scanned images of public domain books they could share? I am working on a project for my Computer Vision class, and need a diverse sample group. JPEG is preferred, but any common format is okay.
My project is aimed at typed manuscripts. My program will accept an image, correct the orientation, remove the pictures and black marks, and assemble the word images into a new image file. If it works well, I will likely adapt it to work off PDFs and post it here. |
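The word-finding step in a pipeline like this is often done with projection profiles. The following is only a rough sketch of that idea, not the program described in this post. It assumes the page is already binarized and deskewed, and it represents the page as a plain list of rows (1 = ink, 0 = background). The names `runs` and `find_words` and the `word_gap` parameter are made up for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]          # assumed layout: rows of 0/1 pixels, 1 = ink
Box = Tuple[int, int, int, int]  # (top, bottom, left, right), end-exclusive


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) spans where the profile is non-zero.

    Spans separated by fewer than min_gap empty positions are merged,
    so small gaps between letters do not split a word.
    """
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def find_words(img: Image, word_gap: int = 2) -> List[Box]:
    """Find word boxes: split rows into text lines, then lines into words."""
    row_profile = [sum(row) for row in img]  # ink count per row
    boxes = []
    for top, bottom in runs(row_profile):
        # Ink count per column, restricted to this text line.
        col_profile = [
            sum(img[y][x] for y in range(top, bottom))
            for x in range(len(img[0]))
        ]
        for left, right in runs(col_profile, min_gap=word_gap):
            boxes.append((top, bottom, left, right))
    return boxes
```

On a real scan the gap threshold would have to be estimated from the page itself (for example from the distribution of gap widths), and pictures or black marks would need to be filtered out first, for instance by discarding boxes that are far larger than a typical word.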
10-31-2007, 02:49 PM | #2 |
When's Doughnut Day?
Posts: 10,059
Karma: 13675475
Join Date: Jul 2007
Location: Houston, TX, US
Device: Sony PRS-505, iPad
|
Nate, I'd suggest you go here http://www.archive.org/details/texts and take your pick. You'll have a wide selection including some with images, some without, some with distorted text, different fonts, etc.
Oops. Finally really read your post. You wanted jpg not pdf. Sorry. |
10-31-2007, 03:00 PM | #3 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
|
10-31-2007, 03:03 PM | #4 |
creator of calibre
Posts: 44,346
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
pdftohtml, which is part of libprs500, extracts images from PDFs. It's not perfect and may not get all the images, but it might be good enough for your needs.
|
11-02-2007, 05:27 PM | #5 |
When's Doughnut Day?
Posts: 10,059
Karma: 13675475
Join Date: Jul 2007
Location: Houston, TX, US
Device: Sony PRS-505, iPad
|
Nate, would these do? These grayscale scans are from my 1200 dpi HP scanner. I tried to give you a variety of fonts, some typical foxing and other background problems, some with a bit of a tilt to the scan image, etc. I'm not sure what you want. If you want more or something different let us know and I'll try again.
|
11-02-2007, 06:30 PM | #6 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Why reassemble the word images, instead of OCR? I mean, beyond the obvious thought that this may fill a requirement for your course.
|
11-02-2007, 06:58 PM | #7 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
Developing a decent OCR program from scratch is a Master's thesis or PhD level project (according to my professor). I think I could do it now. But if it's worth that much I want to get the full value.
|
11-03-2007, 02:27 AM | #8 |
Delphi-Guy
Posts: 285
Karma: 1151
Join Date: May 2006
Location: Berlin, Germany
Device: iLiad, Palm T3
|
Definitely join Distributed Proofreaders. http://www.pgdp.net
They have scans either harvested or done by volunteers. They will also happily answer any question. |
11-03-2007, 11:12 PM | #9 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Too bad you're not also a robotics student-- we really need a cheaper page-turning scanbot! |
|
11-04-2007, 01:16 AM | #10 |
Addict
Posts: 323
Karma: 358
Join Date: May 2007
Device: Tablet PC and Nokia N800
|
|
11-04-2007, 01:20 AM | #11 |
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
|
you don't need to join distributed proofreaders to snag some scans:
> http://www.pgdp.org/ols/
you can also find scan-sets in the same places that d.p. finds 'em:
> http://www.pgdp.net/wiki/Sources_for_Scan_Harvesting
***
nekokami said:
> Why reassemble the word images, instead of OCR?
well, believe it or not, that's one way of doing "reflow". parc did a paper on it a while back. here's the info:
> Paper to PDA
> Thomas M. Breuel, William C. Janssen, Kris Popat, Henry S. Baird
> 11 August 2002
> TR-01-2
***
nate said:
> Developing a decent OCR program from scratch
> is a Master's thesis or PhD level project
> (according to my professor).
um, he's pulling your leg. developing a decent o.c.r. program is _immensely_ difficult. even with a headstart they obtained from adopting a project from elsewhere, google discovered it's hard... take a look at their recent alpha of ocropus to get a rough idea:
> http://code.google.com/p/ocropus/
-bowerbird |
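For anyone wondering what "reflow" of word images means in practice, here is a minimal sketch of the layout step only. It is not taken from the paper cited above. It assumes each word has already been cut out and that only its width and height are known. The function name `reflow` and the `space` and `leading` parameters are made up for illustration.

```python
from typing import List, Tuple


def reflow(
    sizes: List[Tuple[int, int]],  # (width, height) of each word image
    page_width: int,               # width of the target page, in pixels
    space: int = 1,                # horizontal gap between words
    leading: int = 1,              # vertical gap between lines
) -> List[Tuple[int, int]]:
    """Greedy layout: return the top-left (x, y) for each word, in order.

    A word that does not fit on the current line starts a new line. A word
    wider than the page is still placed at x = 0 and simply overflows.
    """
    positions = []
    x = y = 0
    line_height = 0
    for width, height in sizes:
        if x > 0 and x + width > page_width:
            # Wrap: move down by the tallest word on the finished line.
            x = 0
            y += line_height + leading
            line_height = 0
        positions.append((x, y))
        x += width + space
        line_height = max(line_height, height)
    return positions
```

With the positions in hand, each word image would be pasted onto a blank page at its (x, y). Aligning words on a common baseline rather than by their top edges would look better, but that needs baseline estimates this sketch does not have.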
|