Scanning paper (out of copyright) books.


Charles Gray
06-14-2006, 10:58 PM
I have many, MANY books. Some of them are out of copyright, and for others I was able to get permission to make ebooks of them so long as the ebook isn't distributed and the original copy is destroyed.
But that leaves the question: how do I do it? Flatbed scanners seem destructive, and although I have a very good OCR program (ABBYY FineReader), the "lift" at the spine seems to cause problems. That's not a problem for the "scan and destroy" books, but my out-of-copyright pulps from the 1920s are a different matter (and a rather important one, as I'd like to read them, but too much reading will also destroy them). I didn't see any other place here to ask this question, so I was wondering if I could receive any help.

ath
06-16-2006, 02:38 AM
But that leaves the question: how do I do it? Flatbed scanners seem destructive, and although I have a very good OCR program (ABBYY FineReader), the "lift" at the spine seems to cause problems.

Unless you have access to an overhead scanner, scanning is very probably going to be destructive to some extent.

Scanning books quickly means, unfortunately, cutting them up, and running them through a page-fed scanner.

You can scan page spreads with a flat-bed scanner, but it will stress the spine and the hinges of the book in a way that doesn't happen with ordinary reading. I've done several late 19th century books on a largish flatbed, and if the books don't break up entirely, the back cover is usually ripped afterwards, and some of the sections are starting. There is also some risk of ripping or folding a page due to clumsy handling.

There are scanners where the scanning area extends to the edge of the device (see Plustek OpticBook 3600 (http://www.plustek.com/products/book.htm), or the 3600 Plus if you're going for PDF -- and I think Xerox has/had a similar scanner). This lessens the stress on the spine, but because you scan one page at a time instead of a full spread, it doubles the effort and time, and it also doubles the risk of damaging a page.

I know of some experiments with a camera (a digital camera is a kind of overhead scanner, and with a film camera you can often get decent scans made from the film), but it definitely requires more than just point-and-click. You will at least need some kind of good camera stand, as well as good, even lighting. See Project Runeberg (http://runeberg.org/admin/camera.html) for more info.

tribble
06-16-2006, 02:46 AM
What about taking high-resolution photos of the pages, like the professional book scanners do? Then run a batch transform over the images to straighten the pages so that the distortion is removed, and then do the OCR.
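
A minimal sketch of the "batch transform" idea, assuming the camera and the book stay in a fixed position so that the same four page corners apply to every photo. The corner coordinates below are hypothetical, and the function names are made up for illustration. The code solves for the eight perspective coefficients that map a point on the flat, corrected page back to the matching point in the distorted photo.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so each row operation touches both sides.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("corner points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def perspective_coeffs(flat_corners, photo_corners):
    """Return (a, b, c, d, e, f, g, h) mapping flat-page points to photo points.

    x_photo = (a*x + b*y + c) / (g*x + h*y + 1)
    y_photo = (d*x + e*y + f) / (g*x + h*y + 1)
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(flat_corners, photo_corners):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs)


def flat_to_photo(coeffs, x, y):
    """Apply the perspective mapping to one point of the flat page."""
    a, b, c, d, e, f, g, h = coeffs
    w = g * x + h * y + 1
    return ((a * x + b * y + c) / w, (d * x + e * y + f) / w)


# Hypothetical measurements: where the page corners should end up (a
# 1000 x 1500 rectangle) and where they appear in the skewed photo.
flat = [(0, 0), (1000, 0), (1000, 1500), (0, 1500)]
photo = [(120, 80), (1050, 110), (1010, 1580), (90, 1490)]
coeffs = perspective_coeffs(flat, photo)
```

Because the setup is fixed, the coefficients only need to be computed once and can then be reused for every page. An imaging library's perspective transform (for example Pillow's `Image.transform` with the `PERSPECTIVE` method, which as far as I know takes coefficients in this output-to-input form) could then do the actual resampling. Note that this only corrects a tilted camera, not the curvature of the page near the spine.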

DTM
06-22-2006, 08:16 AM
I'm sure you could get some help in the forum at the Distributed Proofreaders website. You may even want to run your projects through them, getting you an entire network of proofreaders.

Check it out at: www.pgdp.net

ereszet
09-28-2007, 08:34 AM
See my thread "do-it yourself repro v-cradle for paper books" in Reader Accessories

RWood
09-28-2007, 09:44 AM
There was a thread by Bob Russell about a scanner that was designed for bound books: the book hangs over the edge of the scanner so that a page lies flat. It seemed to work well. I will look again for the article.

ricdiogo
09-28-2007, 12:45 PM
I'm sure you could get some help in the forum at the Distributed Proofreaders website. You may even want to run your projects through them, getting you an entire network of proofreaders.

Check it out at: www.pgdp.net

Charles Gray, DTM has given you great advice. You would also be helping to make more public domain ebooks freely available online at Project Gutenberg.

I also suggest you read Project Gutenberg's Scanning FAQ (http://www.gutenberg.org/wiki/Gutenberg:Scanning_FAQ).

Studio717
10-19-2007, 06:30 PM
There was a thread by Bob Russell about a scanner that was designed for bound books: the book hangs over the edge of the scanner so that a page lies flat. It seemed to work well. I will look again for the article.

This is the Opticbook 3600. I have one and it does a great job with scanning. The edge of the glass is almost at the very edge of the scanner, so except for too-tightly bound (or usually, ime, rebound) books, it does a beautiful job of capturing all the text.

Any flatbed scanner is going to take longer to scan than an overhead setup like ereszet's (which is a setup I'm trying to recreate myself for a large book I have), but the Opticbook is the best out there as far as I've found for a low-cost flatbed solution.

latchkeyed
01-22-2009, 03:17 PM
You can also check out what we're doing at http://bkrpr.org. We have instructions for putting together a camera mount using cheap consumer digital cameras and a V-shaped book cradle like ereszet's. Actually, I need to look into his version; mine is pretty much cobbled together from wood.

glenn cornish
02-26-2009, 05:01 PM
Haven't I seen hand-held scanners that you can run over the page? If so, that would be slow, but non-destructive.

Glenn Cornish

Prospect
03-18-2009, 06:52 PM
Does anyone have any clues as to what kind of magic I should ask my favourite image editor to perform in order to reduce the background noise of my scanned pages? I have tried working with saturation, hue, RGB channels, contrast, etc., and the result is better than the one straight from the scanner, but not as good as Google Books. My improvements come by chance, since I am clueless at this. Does anyone have some general advice on the matter, or perhaps a linky?

Of course, how best to proceed depends on a lot of factors, but there are probably some general rules or principles. (I am trying to use IrfanView, which has batch processing with advanced options. The point is to get the pages onto my Cybook in one piece without any OCR.)

zelda_pinwheel
03-18-2009, 06:54 PM
Does anyone have any clues as to what kind of magic I should ask my favourite image editor to perform in order to reduce the background noise of my scanned pages?

i don't know irfanview but in photoshop i would try adjusting the levels (select the text as black, and a slightly noisy area of the page as white), and contrast.
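
A minimal sketch of what a levels adjustment does to grayscale pixel values, assuming 8-bit values where 0 is black and 255 is white. The function name and the sample numbers are hypothetical; this shows the general idea, not how Photoshop or IrfanView implement it. Everything at or below the chosen black point becomes pure black, everything at or above the chosen white point becomes pure white, and values in between are stretched linearly.

```python
def apply_levels(pixels, black_point, white_point):
    """Stretch grayscale values so black_point -> 0 and white_point -> 255."""
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    scale = 255 / (white_point - black_point)
    out = []
    for value in pixels:
        stretched = round((value - black_point) * scale)
        # Clamp so values outside the chosen range become pure black or white.
        out.append(max(0, min(255, stretched)))
    return out


# Hypothetical row of pixels: ink around 40-60, paper between 180 and 215.
row = [40, 45, 200, 180, 215, 60, 190]
# Pick the text as black (45) and a slightly noisy patch of paper as white (180).
cleaned = apply_levels(row, black_point=45, white_point=180)
print(cleaned)  # [0, 0, 255, 255, 255, 28, 255]
```

Choosing a slightly noisy (darker) patch of paper as the white point is what removes the background: every paper pixel at least that bright is pushed to pure white.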

DDHarriman
03-19-2009, 02:57 PM
OpticBook 3600 is the solution: cheap, easy, efficient!

AnemicOak
03-19-2009, 03:56 PM
Does anyone have any clues as to what kind of magic I should ask my favourite image editor to perform in order to reduce the background noise of my scanned pages? I have tried working with saturation, hue, RGB channels, contrast, etc., and the result is better than the one straight from the scanner, but not as good as Google Books. My improvements come by chance, since I am clueless at this. Does anyone have some general advice on the matter, or perhaps a linky?

Of course, how best to proceed depends on a lot of factors, but there are probably some general rules or principles. (I am trying to use IrfanView, which has batch processing with advanced options. The point is to get the pages onto my Cybook in one piece without any OCR.)


I've never scanned a book before, but do use a scanner many times a day. Did you scan your pages as RGB, Grayscale or Line Art (B&W)? Not sure which would be best, might depend on your source book quality.

Depends on what you mean by background noise, but have you tried messing with the image's levels (that's what Photoshop calls it, anyway)? Maybe that's what you meant by RGB channels. Usually, if I have some speckles or something in the background (white) part of a scan, I can mess with the levels and get rid of them. Some software has a despeckle option that I'm told can be useful on some books, but it'll also get rid of punctuation a lot of the time.
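
To illustrate why despeckling can eat punctuation, here is a minimal sketch of one simple despeckle approach: erase every connected blob of black pixels smaller than a size limit. This is an assumed, simplified algorithm with made-up names and a made-up sample page; real programs may work differently. A full stop is just a small blob, so a limit large enough to remove bigger specks also removes it.

```python
def despeckle(bitmap, min_size):
    """Erase connected groups of black pixels (1s) smaller than min_size."""
    height, width = len(bitmap), len(bitmap[0])
    result = [row[:] for row in bitmap]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill to collect one connected blob (8-connectivity).
            blob, stack = [], [(y, x)]
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny in range(max(0, cy - 1), min(height, cy + 2)):
                    for nx in range(max(0, cx - 1), min(width, cx + 2)):
                        if bitmap[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            if len(blob) < min_size:
                for by, bx in blob:
                    result[by][bx] = 0
    return result


def parse(rows):
    """Turn '#'/'.' strings into a grid of 1s (black) and 0s (white)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def count_black(bitmap):
    return sum(sum(row) for row in bitmap)


# Hypothetical 1-bit scan: a letter H, one dust speck, and a 2x2 full stop.
page = parse([
    "#..#......",
    "#..#...#..",  # the lone '#' in this row is the dust speck
    "####......",
    "#..#......",
    "#..#.##...",  # the 2x2 block in the last two rows is the full stop
    ".....##...",
])

gentle = despeckle(page, min_size=2)      # removes only the single-pixel speck
aggressive = despeckle(page, min_size=5)  # also removes the full stop
```

With the gentle setting the full stop survives; with the aggressive setting only the letter is left.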

Elfwreck
03-19-2009, 03:58 PM
In most cases, it's best to scan books as line art. From there, you can play with the brightness & contrast settings (depending on the scanner) to get a better quality scan, and later use Irfanview or something like it if the pages need more editing.

CharlieBird
03-19-2009, 04:58 PM
'Scanning' bookshelves yesterday w/a dust cloth and recalling Bob Russell's informative thread of last October, I finally got around to ordering the OpticBook A4 from
Provantage, which has both the 3600 A4 and the Plus ($223 and $263).
http://www.provantage.com/plustek-91n-bbm31~4PLUS003.htm

Camann
03-25-2009, 10:54 AM
'Scanning' bookshelves yesterday w/a dust cloth and recalling Bob Russell's informative thread of last October, I finally got around to ordering the OpticBook A4 from
Provantage, which has both the 3600 A4 and the Plus ($223 and $263).
http://www.provantage.com/plustek-91n-bbm31~4PLUS003.htm

I just bought the 3600 Plus from Newegg ($250 plus free shipping). I kept vacillating between the basic model and the Plus. I read many reviews about the iffy quality/functionality of the software on the Plus. I think most people advise obtaining additional software anyway. I'm fairly tech-challenged. So, since the price difference between the two models was low, I thought I'd take a chance that what's in the box will enable me to achieve my goal of scanning and converting books to pdf files for use on my reader.

PieOPah
03-25-2009, 12:40 PM
I bought the OpticBook 3600 a few months ago and am really happy with it.

On average it takes me about 2 hours to scan an entire book.

The biggest problem then is correcting the mistakes... While there are few mistakes on the whole, ABBYY FineReader still likes to highlight hundreds of things it believes might be errors (most of which aren't!!!). That part of the process takes several hours (which I do in batches when I have the time).

Finally I edit the document in Word and convert to LRF.

In total it probably takes around 10 hours per book.

pepak
03-25-2009, 02:06 PM
On average it takes me about 2 hours to scan an entire book.
I average about 100 pages in 20 minutes.
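
As a rough way to compare the two figures, here is a small sketch that estimates scanning time at a steady rate. The function name and the book lengths are hypothetical, and it assumes "pages" means the same thing in both posts.

```python
def scan_minutes(pages, pages_per_batch=100, minutes_per_batch=20):
    """Estimate scanning time at a steady rate (default: 100 pages in 20 minutes)."""
    return pages * minutes_per_batch / pages_per_batch


# Hypothetical book lengths, not figures from this thread.
for pages in (250, 400, 600):
    print(pages, "pages:", scan_minutes(pages), "minutes")
```

At that rate, a 2-hour scan would correspond to a book of roughly 600 pages, so the two estimates are not necessarily in conflict.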

The biggest problem then is correcting the mistakes... While there are few mistakes on the whole, ABBYY FineReader still likes to highlight hundreds of things it believes might be errors (most of which aren't!!!). That part of the process takes several hours (which I do in batches when I have the time).
Personally, I just can't stand doing it in FineReader. Generally I just produce an HTML file, convert it to LRF and put it on my Reader. Then I read the book normally, bookmarking every page with errors. When I have some 20-30 pages marked, I go through them again, correcting the HTML file.

While this approach does take time, it seems a lot more efficient than correcting in FR: I would have wanted to read the book anyway, so that time would be spent in any case. Basically, it only costs me the time to go through the bookmarked pages and correct their errors.