View Full Version : how to digitize books


user
10-04-2007, 08:49 AM
hello

I would like to digitize a book, by taking photos of
the book pages and then performing OCR on them

can you tell me please what characteristics must a
camera have to do this? big zoom? many megapixels?
specific features?

OCR needs a 300dpi scan from a scanner, so can you
tell me please what the equivalent is for a digital
camera photo? I mean how many megapixels, what
distance from the source, how much lighting etc

any specific settings of the camera? does the room
need to be brightly lit? do I need a tripod? any
specific add-ons to the camera? any software?
any suggestion would be much appreciated

also these book scanners use cameras:
kirtas-tech.com
atiz.com
and their scan samples are marvelous

can I reproduce these results in a cheaper way?

thanks

BKeeper
10-04-2007, 11:51 AM
It depends on the size and distance of the source. Generally you'll get good results starting at 8 MP (but I'd go for 12).
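
To make the 300 dpi comparison concrete, here is a rough sketch of the arithmetic, not a definitive rule. It assumes the page fills the frame exactly and that the camera has a 4:3 sensor; the page sizes and the function name `megapixels_needed` are illustrative assumptions, not figures from this thread.

```python
def megapixels_needed(page_w_in, page_h_in, dpi=300, sensor_aspect=4 / 3):
    """Estimate the sensor megapixels needed to cover a page at a given dpi.

    Assumes the page is framed as tightly as the sensor shape allows.
    """
    # Pixels the page itself needs along its short and long sides.
    short_px = min(page_w_in, page_h_in) * dpi
    long_px = max(page_w_in, page_h_in) * dpi

    # The sensor has a fixed shape, so one side usually has spare pixels.
    sensor_short = max(short_px, long_px / sensor_aspect)
    sensor_long = sensor_short * sensor_aspect
    return sensor_short * sensor_long / 1e6


if __name__ == "__main__":
    # Hypothetical page sizes in inches, chosen only for illustration.
    pages = {"6 x 9 in book page": (6, 9), "US Letter page": (8.5, 11)}
    for name, (w, h) in pages.items():
        print(f"{name}: {megapixels_needed(w, h):.1f} MP")
```

Under these assumptions a 6 x 9 inch page works out to about 5.5 MP and a Letter-size page to about 8.7 MP, which is in line with the 8 MP starting point. Any loose framing around the page raises the requirement, so extra megapixels give useful headroom.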

It will be good if your camera has some kind of auto-shutter function, so that you can program a fixed interval.

Getting the lighting right is kinda tricky.
Also you'll need a capable OCR program. FineReader and OmniPage 16 have options to perform OCR on digital camera pictures (correcting perspective, distortion, and lighting...)

Keep in mind that with a digital camera, OCR results won't be as accurate as with a scanner.

If you can afford to do destructive scanning, then I'd advise you to get a sheet-fed scanner (check Kodak and Fujitsu); you'll get much better results.

If you still want to go with your digital camera, then check this thread (http://www.mobileread.com/forums/showthread.php?t=13848). It has exactly what you need.

Hope this helps

nekokami
10-04-2007, 01:01 PM
You might want to consider using a flat sheet of glass or thick plastic to hold the pages very flat while you photograph them, so you get less distortion from curved pages.

user
10-05-2007, 02:42 AM
thanks for your replies

would it be better to shoot with a film camera, then print the photos from film and then feed the prints through a sheet-fed scanner?

much more time and cost, but will it be better?

slayda
10-05-2007, 09:38 AM
[QUOTE=user]
hello

I would like to digitize a book, by taking photos of
the book pages and then performing OCR on them

can you tell me please what characteristics must a
camera have to do this? big zoom? many megapixels?
specific features?
[/QUOTE]



Check ABBYY FineReader at www.abbyy.com. Their version 8.0 has that ability included & they may have recommendations for camera & TECHNIQUE.

user
10-05-2007, 09:45 AM
I contacted them for recommendations on camera + technique, with no result. Anyone with better luck?

Steven Lyle Jordan
10-05-2007, 10:10 AM
[QUOTE=user]
thanks for your replies

would it be better to shoot with a film camera, then print the photos from film and then feed the prints through a sheet-fed scanner?

much more time and cost, but will it be better?
[/QUOTE]

It would be cheaper, faster and more effective than using a camera if you took the book to a good photocopy machine. Photocopiers already output the image on 8.5x11 or A4 paper, already suited for sheetfed scanners, and cost less than photo output to film (or even paper).

If you go this route, use a photocopier with a zoom control. Increase the zoom until your book page literally fills the photocopy image. Then you'll have the largest-possible text images on paper, which will run perfectly through a sheetfed scanner, be easier for the OCR to recognize, and reduce your recognition errors.

Copying the book page by page will also be faster than doing the same with a camera, then outputting the camera image.

NatCh
10-05-2007, 12:50 PM
Even in the digital age, sometimes the old ways are best, eh, Steve? :grin:

Steven Lyle Jordan
10-05-2007, 02:24 PM
'Fraid so! I've never found a faster, easier and more accurate way to scan text than this. The best part is, it breaks up the job into stages... assembly-line, as it were... making the entire process easier to manage.

user
10-05-2007, 03:14 PM
ok but photocopying the book is the same as scanning it, isn't it?

ereszet
10-05-2007, 03:18 PM
[QUOTE=Steve Jordan]
It would be cheaper, faster and more effective than using a camera if you took the book to a good photocopy machine. Photocopiers already output the image on 8.5x11 or A4 paper, already suited for sheetfed scanners, and cost less than photo output to film (or even paper).

If you go this route, use a photocopier with a zoom control. Increase the zoom until your book page literally fills the photocopy image. Then you'll have the largest-possible text images on paper, which will run perfectly through a sheetfed scanner, be easier for the OCR to recognize, and reduce your recognition errors.

Copying the book page by page will also be faster than doing the same with a camera, then outputting the camera image.
[/QUOTE]

It will be cheaper only if you use an office copier for your private copying (no investment and no running cost for you). It will be faster only if your secretary does the copying. Your workflow will not reproduce color images well enough, even with a color copier. Increasing the zoom beyond a certain limit will spoil the OCR rather than improve it. The advice from FineReader is not to manipulate the images unless you have to. If you flatten a book with the copier cover, you get curved lines of text and you damage the book to some extent.

With my camera I can take photos of documents every 3 seconds or so. No copier can match that. Results are good enough for OCR. High-quality reproduction requires a little more than a camera. See my thread: http://www.mobileread.com/forums/showthread.php?t=13848
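
As a rough illustration of what a 3-second shot interval means for a whole book, here is a minimal sketch. The 300-page book and the option of shooting two-page spreads are assumptions made up for the example, and the estimate covers capture only, not page turning delays, image cleanup or OCR.

```python
import math


def capture_minutes(pages, seconds_per_shot=3.0, pages_per_shot=1):
    """Estimate capture time in minutes for a book photographed page by page.

    pages_per_shot=2 models photographing a full two-page spread at once.
    """
    # A leftover odd page still needs its own shot, hence the ceiling.
    shots = math.ceil(pages / pages_per_shot)
    return shots * seconds_per_shot / 60


if __name__ == "__main__":
    # Hypothetical 300-page book.
    print(capture_minutes(300))                    # one page per shot
    print(capture_minutes(300, pages_per_shot=2))  # two-page spreads
```

With these numbers the single-page case comes to 15 minutes and the spread case to 7.5 minutes. Shooting spreads halves the shot count but also halves the pixels available per page, so it trades speed against resolution.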

Steven Lyle Jordan
10-05-2007, 04:14 PM
[QUOTE=user]ok but photocopying the book is the same as scanning it, isn't it?[/QUOTE]

No: photocopying (aka "Xeroxing") puts the book pages onto standard paper, ready for sheetfed scanners.

Steven Lyle Jordan
10-05-2007, 04:23 PM
[QUOTE=ereszet]It will be faster only if your secretary does the copying.[/QUOTE]

"Faster for who?" the secretary opined.

[QUOTE=ereszet]Your workflow will not reproduce color images well enough, even with a color copier.[/QUOTE]

Excuse me... I thought we were talking about text.

[QUOTE=ereszet]Increasing the zoom beyond a certain limit will spoil the OCR rather than improve it.[/QUOTE]

Generally, a zoom of only about 130% is enough to fill a letter or A4 page. That doesn't spoil OCR.
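
For anyone who wants to check the zoom figure against their own book, here is a small sketch of the calculation. It assumes the copier prints edge to edge with no margins, and the 6 x 9 inch page size is a hypothetical example; the function names are illustrative only.

```python
def fill_zoom(page_w, page_h, paper_w, paper_h):
    """Largest uniform zoom at which the book page still fits on the paper."""
    # The tighter of the two directions limits the enlargement.
    return min(paper_w / page_w, paper_h / page_h)


def effective_dpi(scan_dpi, zoom):
    """Scanner samples per inch of the ORIGINAL page after enlarging the copy."""
    return scan_dpi * zoom


if __name__ == "__main__":
    # Hypothetical 6 x 9 in book page copied onto US Letter (8.5 x 11 in).
    zoom = fill_zoom(6, 9, 8.5, 11)
    print(f"zoom: {zoom:.0%}")
    print(f"effective dpi at a 300 dpi scan: {effective_dpi(300, zoom):.0f}")
```

For that example the zoom comes out at about 122%, so a 300 dpi scan of the enlarged copy samples the original at roughly 367 dpi. That is an upper bound, since the copy itself loses some detail, and smaller pages will call for a higher zoom.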

[QUOTE=ereszet]With my camera I can take photos of documents every 3 seconds or so. No copier can match that.[/QUOTE]

Check out some modern high-speed copiers. A lot of them can match that, and are only slowed up by the rate at which you can change the page.

(Not trying to bust your chops. Just being fair.)

ereszet
10-05-2007, 06:07 PM
[QUOTE=Steve Jordan;103351]
Excuse me... I thought we were talking about text.[/QUOTE]

Books come with images, photos and maps. A disadvantage of Project Gutenberg is that it is limited to text only. I have a collection of thousands of pdf/djvu books and maps from free digital libraries that look exactly like the originals. That is also what I do with my own documents, books, business cards, magazines, newspaper clips, etc. by photoscanning. Then I have to process them to remove whatever went wrong because I did not take proper care at the photoscanning stage, and OCR them for indexing.

For your info: just one of my folders contains over 5 thousand documents with a total word count of over 5 million. The size of the folder is 30 GB and the size of the index is 500 MB. In total my collection of indexed books is close to 100 GB.

Text alone is too easy to scan or photocopy to worry about too much. In practice there are no lighting problems; you just need a steady hand and good focus.