Old 08-14-2006, 03:21 PM   #1
Bob Russell
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
Simplest scanning methods and equipment

Let me ask about a topic that I think a lot of ebook fans are curious about, but I'm not sure anyone has a good answer for...

Suppose we want to scan a book that is not under copyright. It could be a hardback, paperback, or a binder of printed 8.5x11" pages. Assume the following:

* We want the whole process to be easy and fast and cheap, but mostly easy and not so slow that it's not worth bothering.
* We want to spend a minimal amount of money on scanners or software.
* We don't care much about quality. Assume it's for personal reading only.
* A reasonable amount of OCR errors is tolerable as long as one can still read the text without too much trouble. So it's okay if "cl" sometimes becomes a "d", or vice versa, and so forth.
* The book has a fairly standard layout and a fairly normal font style.
* We are willing to destroy the book if necessary (unbind, cut, etc)
* No pictures or formatting are required, just the text for reading on a small screen device.

I realize there may be multiple answers depending on whether or not the book is destroyed, or based on the size of the book.

Edited to make the question more evident...
Has anyone already got a good answer for [how to do] this, or do we all need to start from scratch if we decide to jump into such a project?

BTW, the copyright issue may limit some of us to old books that we don't want to destroy. Any answers for the case we don't want to damage the book? Or suppose we are willing to accept images instead of characters, such as a pdf scan of a math book to keep diagrams and symbols, etc. Any different solutions now?

Last edited by Bob Russell; 08-14-2006 at 04:04 PM. Reason: For clarity...
Old 08-14-2006, 03:59 PM   #2
ath
Addict
Posts: 222
Karma: 110
Join Date: Jun 2006
Location: Malmo, Sweden
Device: iLiad, Sony PRS-505, Kindle Paperwhite & Oasis
I'm not sure I found the question in here -- it mostly seemed to be a set of conditions.

Quote:
Originally Posted by Bob Russell
Let me ask about a topic that I think a lot of ebook fans are curious about, but I'm not sure anyone has a good answer for...
You can't have both easy and fast and cheap, I don't think.

If you want it very easy, go to a scanning bureau (fast, but not necessarily cheap). Or buy a good sheet-fed scanner ... but the good and fast ones are rarely cheap.

If you want it cheap, it's going to be a lot of work: cut up the book and scan it page by page on the cheapest scanner you can find. (If the book is small enough to fit on the scanning platen, cutting may not be necessary.) A page feeder is an option, but I don't trust the cheap ones: they tend to either crush the paper or misfeed. And cheap scanning software tends to be a serious pain.

I've done hand scanning (HP IIP & 3C), cheap sheet feed (HP 5550C -- not recommended for large jobs, particularly not those needing sheet feeding), and somewhat more expensive sheet feed (Fujitsu fi-4120). I still do hand scanning from time to time.

If you don't want to cut up the book or otherwise mistreat it, and still use a flatbed scanner, forget it. Hand-scanning it will stress the binding so much that you will probably damage the book in one way or another. The only way to do this is by using an overhead scanner, but they tend to be quite expensive.

There have been some tests with using digital cameras as scanning devices. That is a kind of improvised overhead scanner -- see http://runeberg.org/admin/camera.html . However, for best results you probably need to do single page images, rather than page spreads.

Oh, wait ... I forgot. There's actually one more way. Try http://www.archive.org/ Someone may have done the job already. Just grab the page images, and feed them to whatever program will create the final document. Some works have already been OCR'd.

If you don't particularly want page images, but only want the text in reasonably good shape, also try looking for it at http://digital.library.upenn.edu/books/ . They also have pages with lots of copyright-related information.

Last edited by ath; 08-14-2006 at 04:04 PM.
Old 08-14-2006, 04:14 PM   #3
Bob Russell
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
You're absolutely right, ath, that's probably a bit open-ended. And there are definitely tradeoffs. I'll try to zoom in a bit more on a specific likely scenario. (And I also emphasized my question in the original post, so it doesn't get lost in the verbiage!)

Let's say we want to be able to do it with hardware/software combos in the $100-$600 range (not including the PC, of course). And let's say that we don't want to spend more than 3-4 "attended" man-hours on the whole operation... prepping, scanning, OCR and creating the text file for a 500-page book.

Any unattended time, say overnight unattended processing, would not count in the time requirement.

Is this possible, or still a bit out of the reach of current technology?
Old 08-14-2006, 05:33 PM   #4
ath
Addict
Posts: 222
Karma: 110
Join Date: Jun 2006
Location: Malmo, Sweden
Device: iLiad, Sony PRS-505, Kindle Paperwhite & Oasis
Quote:
Originally Posted by Bob Russell
Let's say we want to be able to do it with hardware/software combos in the $100-$600 range (not including the PC, of course). And let's say that we don't want to spend more than 3-4 "attended" man-hours on the whole operation... prepping, scanning, OCR and creating the text file for a 500-page book.
Doing 500 pages by hand means either 500 pages or 250 page spreads at flatbed scanner speed. (With an old and sluggish scanner, I do 3 spreads per minute, 300 dpi, b/w, so perhaps 2 hours, spread over four sessions: after about 30 minutes, fumbling tends to increase. That means that some pages either get scanned twice or not at all.) Cutting up may not be required, if the scanner is large enough.

Cutting up would mean about 10-15 minutes cutting up the book (Fiskars roller knife & an iron straightedge), and then it's up to the scanner speed for the rest. (Putting the pages in a press for a day or two makes feeding easier. My fi-4120 - sheetfed - would do it with about 5 misfeeds in 30-40 minutes, but it's out of range for the proposed budget.)

For creating a document, I'd use ABBYY FineReader Pro. It can either do OCR or just repackage image files as an image PDF. (I'm told OmniPage can do the job, but I've never had much success with that program myself. YMMV, as always.) As long as there's not going to be any serious proofreading, FineReader will do around one page / second on a modern system, and as long as image resolution is 600 dpi it won't do uncomfortably many misreads. (This would count as 95% unattended, I think.)
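
As a rough check of the figures above, here is a minimal Python sketch of the arithmetic. The function names and default values are illustrative assumptions taken from the numbers in this post (3 spreads per minute, about one OCR page per second); they are not measurements of any particular scanner or PC.

```python
def attended_minutes(pages, spreads_per_minute=3.0, cutting_minutes=0.0):
    """Estimate hands-on flatbed time for a book scanned as two-page spreads."""
    spreads = (pages + 1) // 2  # two pages per spread, rounded up
    return cutting_minutes + spreads / spreads_per_minute


def ocr_minutes(pages, pages_per_second=1.0):
    """Estimate unattended OCR time at a given recognition rate."""
    return pages / pages_per_second / 60.0


if __name__ == "__main__":
    pages = 500
    print(f"scanning: {attended_minutes(pages):.0f} min attended")
    print(f"OCR: {ocr_minutes(pages):.1f} min unattended")
```

For a 500-page book this gives roughly 83 minutes of pure scanning, which fits the "perhaps 2 hours" estimate once pauses and rescans are allowed for, plus under 10 minutes of unattended OCR.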

I think it's possible ... perhaps not at $100, but around $200-250 or so.

Prices and the market for scanners are hard to judge, and I have very little idea of the situation in the US (as you mentioned $, I assume US). For the money, I believe I would have a better chance of getting a good flatbed than an equally good sheetfed scanner (which has more exposed mechanics). I'd check the compatibility list for FineReader to ensure I can scan straight into the program, preferably through WIA drivers (TWAIN often seems to be dumbed-down, I find), and start reading reviews.

Last edited by ath; 08-14-2006 at 05:48 PM.
Old 08-14-2006, 11:31 PM   #5
Liviu_5
Books and more books
Posts: 917
Karma: 69499
Join Date: Mar 2006
Location: White Plains, NY, USA
Device: Nook Color, Itouch, Nokia770, Sony 650, Sony 700(dead), Ebk(given)
Hi,

I use the basic version of the OpticBook scanner (~$250, comes with ABBYY). At 300 dpi, b&w (which works much better for OCR than greyscale), jpg/tif, I do roughly 10 pages (5 dp sheets) per minute for a hc, 14 pages for a pb, while watching a movie.

The results are pretty good, with very few black spots or shadows as long as you press the book properly (how easy that is varies from book to book and with where you are in the book). The book is not damaged in any way by scanning.

I always count 10 minutes per 100 pages for scanning a hc/tp. After that, the OCR is done by the PC, so no time is wasted.

Personally, I complement the OCR by converting the scan into Nokia 770-size (480x800) JPGs, one page per screen. Using two free Windows programs (XnView and Rename Master, for cropping, renaming and resizing) and an HTML template that embeds the JPGs so they are readable by FBReader, I do a scanned book in about 20-30 minutes regardless of size, since everything is mostly a matter of organization. As JPG/HTML it takes about 35 KB/page at appropriate image quality, and it reads nicely and fast on the Nokia. I complement reading the image book with the OCR text, but overall I prefer reading the images on the Nokia and the OCR text on the Ebks1150.
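
The HTML-template step might look something like the following minimal Python sketch. The actual template is not shown in this thread, so the markup, the file names and the helper names here are assumptions for illustration only; whether a given reader program renders it well would need to be checked.

```python
from html import escape


def build_image_book(image_names, title="Scanned book", width=480, height=800):
    """Return one HTML document that embeds each page image in reading order."""
    parts = [
        "<html>",
        f"<head><title>{escape(title)}</title></head>",
        "<body>",
    ]
    for name in image_names:
        # One image per paragraph, sized to the 480x800 screen mentioned above.
        parts.append(
            f'<p><img src="{escape(name, quote=True)}" '
            f'width="{width}" height="{height}"></p>'
        )
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


def estimated_size_kb(page_count, kb_per_page=35):
    """Rough size of the image book at the per-page figure quoted above."""
    return page_count * kb_per_page


if __name__ == "__main__":
    # Hypothetical file names; use whatever the renaming step produced.
    pages = [f"page_{n:04d}.jpg" for n in range(1, 4)]
    print(build_image_book(pages))
    print(estimated_size_kb(500), "KB for 500 pages")
```

At about 35 KB per page, a 500-page book comes to roughly 17,500 KB of images.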

Liviu
Old 08-16-2006, 04:47 AM   #6
ElaHuguet
iLiad freak
Posts: 339
Karma: 243
Join Date: Apr 2006
Location: Mallorca, Spain
Device: iRex iLiad
How much does the Lego-block scanner cost? I know it takes a lot of man-hours out of the scanning part of the project (some of which you could then spend on OCRing more carefully).

Last edited by ElaHuguet; 08-16-2006 at 05:49 AM. Reason: Add link
Old 08-17-2006, 09:59 AM   #7
Bob Russell
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
I'm really impressed with the OpticBook 3600 scanner solution with the ABBYY OCR software included. If it's really about $250, I have to wonder if there's any other solution that can compare for capability and price.

Anyone else have a comparable alternative?

Thanks to those who have posted some great info in this and other threads like...
https://www.mobileread.com/forums/showthread.php?t=6478, and
https://www.mobileread.com/forums/showthread.php?t=7329
Old 09-07-2006, 01:14 PM   #8
Bob Russell
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
I wonder about the possibility of simply taking pictures with some of the new low-priced but high-resolution digital cameras. I know it's expensive if you have a top-of-the-line digital camera and an automatic page turner, but maybe it's practical with a decent camera and some OCR software (who knows, maybe even the free software released recently by Google?).

Even camera phones are supposed to have enough resolution for the ScanR free OCR service, which sends back emails after you send them a picture of text.

Anyone doing this currently?
Old 09-25-2006, 01:07 PM   #9
bowerbird
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
there's no better combination than the optic3600 and finereader.
if you want to do the scanning and the o.c.r. yourself, that is...

but since you specified an "out of copyright" book, you should
look around cyberspace to see if it has already been scanned.
google now has 100,000+ done, and is doing more every day.

if you can get the scans from somewhere, then the easiest
thing to do is to wrap them into a .pdf and just start reading.

of course, the text from such a "book" cannot be _searched_,
or _copied_, or _resized_ for greater readability, nor can it be
_reflowed_ so as to better fit varying screensizes. but if all that
doesn't bother you, then there's no reason to do any more work.

and remember that if you got the scans from google, you can
always return to google whenever you want to search the book.
plus, if you want to copy the text from a page or two, you can
o.c.r. just those page-images; you don't need the whole book.
so you might be able to live comfortably with those limitations.

but if you do want to do the o.c.r., you should know that it is
_not_ that difficult to clean the results and format the e-book.
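
The post does not spell out what the cleanup involves, so the following is only a minimal Python sketch of two common steps, under the assumption that the OCR output is plain text with blank lines between paragraphs: rejoining words hyphenated across line breaks, and reflowing each paragraph onto one line so that a reader program can rewrap it. The function name is hypothetical.

```python
import re


def clean_ocr_text(raw):
    """Rejoin words hyphenated across line breaks and reflow lines into paragraphs."""
    text = raw.replace("\r\n", "\n")
    # "exam-\nple" -> "example"; assumes an end-of-line hyphen is a soft break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines (possibly containing spaces) separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse the remaining line breaks and runs of spaces inside each paragraph.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

One known limitation of this sketch: a genuine compound such as "well-known" that happens to break at the hyphen would be merged into "wellknown", so a pass against a word list, or a manual check, is still needed.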

i'll be posting some messages to the "bookpeople" listserve
this week that walk people through the process with an actual
scan-set that i downloaded from google. not only that, but
the university of michigan is now posting the o.c.r. _results_
on their site, so you can scrape their actual o.c.r. output too,
which means that you don't even have to do the o.c.r yourself.

when i post my messages, i'll come here and give you the url's.

-bowerbird
Old 09-25-2006, 01:41 PM   #10
Bob Russell
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
Thanks bowerbird!

Someone mentioned that the inner page margins in many books are too small to get a good scan, didn't they? Have you had that problem also? Which books? How much margin do you need?
Old 09-26-2006, 12:23 PM   #11
ath
Addict
Posts: 222
Karma: 110
Join Date: Jun 2006
Location: Malmo, Sweden
Device: iLiad, Sony PRS-505, Kindle Paperwhite & Oasis
Quote:
Originally Posted by bowerbird
of course, the text from such a "book" cannot be _searched_,
or _copied_, or _resized_ for greater readability, nor can it be
_reflowed_ so as to better fit varying screensizes. but if all that
doesn't bother you, then there's no reason to do any more work.
ABBYY has something called PDF Transformer 2 Pro: sounds as if it
is just the thing for people who don't want to do anything else
but make scanned image PDF's searchable.

Don't know how well it works myself -- just saw the press release.
Old 10-03-2006, 04:00 AM   #12
bowerbird
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
abbyy's transformer is pc-only, so i've not been able to try it,
but it certainly struck me as a worthwhile piece of software,
since (if my memory serves correctly), it was just 50 bucks.

-bowerbird

p.s. oops. new version, with a new price -- now $100,
which takes it out of the league of "hey what can it hurt?"
and into the league of "this had better work". however...
since they offer a free demo version, why not try it out?
Old 10-03-2006, 04:04 AM   #13
bowerbird
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
oh yeah, i've been posting messages to the bookpeople listserve
about my experiment to scrape o.c.r. text-files from umichigan
and transform 'em into an electronic-book. go to the 2006 index:
> http://onlinebooks.library.upenn.edu/webbin/bparchive
and search for "feedback to umichigan" for my series of posts,
which will conclude with "part 7" in the next day or two...

-bowerbird
Old 10-06-2006, 07:59 PM   #14
Bob Hoswell
Junior Member
Posts: 2
Karma: 10
Join Date: Oct 2006
Hi everybody,

I am a new member and an old hand at OCR; I have been doing it since 1994. I started with OmniPage Direct, then TextBridge Pro and OmniPage 11, and recently discovered ABBYY software, which I have found to be the best OCR program yet. I have also just bought my fourth scanner. The comments about damaging books while scanning them on a flatbed scanner match what I have encountered many times, and I have found no solution. If you press too hard on the book, you risk damaging it. But if you do not press down hard, you end up with shadows on the image. In the postings there is mention of a new scanner that can scan books without damaging them and with no shadows. Has anybody been using this scanner, and what are the results? The posts also mention hand scanners; are they still around?

Regards,

Bob
Old 10-07-2006, 03:51 AM   #15
ath
Addict
Posts: 222
Karma: 110
Join Date: Jun 2006
Location: Malmo, Sweden
Device: iLiad, Sony PRS-505, Kindle Paperwhite & Oasis
Quote:
Originally Posted by Bob Hoswell
There is mention of a new scanner that can scan books without damaging them and with no shadows. Has anybody been using this scanner, and what are the results?
The idea is fairly old -- Xerox used to have a scanner like that, where the scanning area went all the way out to the edge of the scanner, and allowed books to 'hang' off the edge, while the inside page was scanned.

The bonus is that gutter effects disappear, and pages scan flat, even when the book is very stiffly bound. That is a considerable advantage.

The disadvantage is that you will be doing twice as many scanning passes, and as each of them carries a risk of folding a page, that risk increases. Scanning time increases also, but that may perhaps be offset by better OCR results, and less correction work later.

And of course there are always books where the binding is slipping or deteriorating: they won't stand even this much handling.