01-23-2010, 01:25 PM | #1 |
Enthusiast
Posts: 48
Karma: 62
Join Date: Jan 2010
Device: HANLIN V3
|
Current state of OCR/scanner tech?
I have a number of Very Old Books which I'd like to scan non-destructively (these are collectible editions, long OOP and OOC, which I'd like to preserve for my own records and to contrib to PG).
Looking around at the state of scanners etc., my half-baked assessment is this:
1) Automated book scanning requires very expensive industrial machines.
2) Artisanal book scanning requires a lot of time and effort, either using a standard or Opticbook-style bed scanner (the latter is better for fragile old editions) or using a digital camera in some kind of offset stand (and possibly post-processing 100s of images for contrast, ugh).
3) OCR software at present is either (a) very costly or (b) very cheesy. There doesn't seem to be any really good GPL OCRware. (Why is that, I wonder? We have all kinds of other GPL/CC software that's often better than the commercial flavour.) Either way, it's also time-consuming to tend the OCR process and then fix the 5 to 20 percent error rate (depending on the cheesiness of the OCR).
So... has anyone actually dared to compare: is it faster to set up a copy-holder stand and (if you are a fast typist) just re-type the book content? It seems an arduous task, yet I do wonder if it would take any more time than the lengthy, high-tech procedure of book scanning and OCR.
Have I got a grip on the basic situation, or am I missing some recent and exciting development like a blazing new GPL OCR app? |
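For anyone who wants to put rough numbers on the retype-versus-OCR question, here is a minimal back-of-envelope sketch in Python. Only the 5 to 20 percent error range comes from the post above; treating it as a per-word rate, the book size, the typing speed, the seconds per scanned page, and the seconds per correction are all assumptions to be replaced with your own measurements.

```python
def retype_hours(words: int, wpm: float) -> float:
    """Hours to retype `words` words at a sustained `wpm` words per minute."""
    return words / wpm / 60.0


def ocr_hours(pages: int, words: int, error_rate: float,
              scan_sec_per_page: float, fix_sec_per_error: float) -> float:
    """Hours to scan every page and then hand-correct the misread words.

    `error_rate` is interpreted as the fraction of words that come out
    wrong (an assumption; the post does not say what the rate is per).
    """
    scan_sec = pages * scan_sec_per_page
    fix_sec = words * error_rate * fix_sec_per_error
    return (scan_sec + fix_sec) / 3600.0


if __name__ == "__main__":
    # Hypothetical book and working speeds, chosen only for illustration.
    pages, words = 300, 90_000
    print(f"retype at 60 wpm: {retype_hours(words, 60):.1f} h")
    for rate in (0.05, 0.20):
        hours = ocr_hours(pages, words, rate,
                          scan_sec_per_page=30, fix_sec_per_error=5)
        print(f"OCR at {rate:.0%} errors: {hours:.1f} h")
```

The point of the sketch is that the answer is dominated by the error rate and the time per correction, so a cheap OCR engine at the high end of the range can plausibly lose to a fast typist. It ignores proofreading of retyped text, which also takes time.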
01-23-2010, 06:16 PM | #2 |
Booklegger
Posts: 1,801
Karma: 7999816
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
|
See my comments on the Scanner Recommendation thread. I would not call the ABBYY FineReader software cheesy, even if it doesn't run under Linux.
Sorry, I didn't figure out how to reference the thread by linky... |
01-23-2010, 07:44 PM | #3 | |
Enthusiast
Posts: 48
Karma: 62
Join Date: Jan 2010
Device: HANLIN V3
|
no support for osx/linux? yep, cheesy :-)
More seriously though... I do wonder why the free software community, which has produced GIMP and other very viable alternatives to ransomware for other applications, has not managed to produce good OCR. That seems worth a bit of research just as an interesting question in its own right. |
|
01-23-2010, 08:19 PM | #4 |
Grand Sorcerer
Posts: 11,234
Karma: 34817224
Join Date: Jan 2008
Device: Pocketbook
|
Lack of interest in the Linux community, I suppose.
OS preferences are basically passé, from my perspective. I have a job to do: how effective is the answer, and can I afford it? A good scan setup for reflowable text will cost around $600, and that's buying it for single-purpose use. What do you get for the money?
ACER REVO mini PC with Windows OS - $200
Optiscan 3600 scanner - $300
ABBYY Express 10.0 (I run 9.0 on my setup) - $60
Shipping - $40
You don't have to do anything else with the Windows PC; treat it as an embedded machine. (It's 15 cm x 15 cm x 4 cm, i.e. smaller than a typical hardback.) You'll get a defect rate of 1 for every 4-5 pages for hardbacks and 1-2 per page for paperbacks, although it will vary with font type and size.
The snap scan method is for PDFs, and you lose reflowability with it. It causes problems with texts that don't fit the screen. |
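As a quick sanity check on those figures, here is a small Python sketch that totals the quoted prices and turns the quoted defect rates into a per-book range. The prices and rates are the ones from the post; the item labels are shortened, and the 300-page book in the example is an assumption.

```python
# Prices in USD as quoted in the post (item labels shortened).
SETUP_COST_USD = {
    "mini PC with Windows": 200,
    "scanner": 300,
    "OCR software": 60,
    "shipping": 40,
}

# (low, high) defects per page, from "1 for every 4-5 pages" for
# hardbacks and "1-2 per page" for paperbacks.
DEFECTS_PER_PAGE = {
    "hardback": (1 / 5, 1 / 4),
    "paperback": (1.0, 2.0),
}


def expected_defects(pages: int, binding: str) -> tuple[float, float]:
    """Return the (low, high) number of OCR defects to expect in a book."""
    low, high = DEFECTS_PER_PAGE[binding]
    return pages * low, pages * high


if __name__ == "__main__":
    print("setup total: $", sum(SETUP_COST_USD.values()))
    for binding in DEFECTS_PER_PAGE:
        low, high = expected_defects(300, binding)  # 300 pages is hypothetical
        print(f"{binding}: roughly {low:.0f} to {high:.0f} defects to fix")
```

Multiplying the result by your own time per fix gives the proofreading budget for a given book.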
01-23-2010, 10:02 PM | #5 | |
MR Drone
Posts: 1,613
Karma: 15612282
Join Date: Oct 2007
Location: DRONEZONE
Device: PB360+, Huawei MP5, Libra H20
|
Recommend VueScan: http://www.hamrick.com/ Works on Linux, Apple, or Windoze. 40 USD. I have used it for several years. |
|
01-24-2010, 03:09 PM | #6 | |
Enthusiast
Posts: 48
Karma: 62
Join Date: Jan 2010
Device: HANLIN V3
|
Point well taken about the embedded-machine aspect (using a windoze laptop or palmtop as a controller, like the MS-DOS machines that still run many CNC mills). I just don't have a few hundred bux to throw at the project; I'm hoping to do it on the cheap with what I already have (Leopard, a laptop, a digicam, and coding skills). Thanks for the thread! I hope some more folks will pop up and explain their own personal scanning/OCRing setups. |
|
01-24-2010, 03:37 PM | #7 | |||||
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
|
|||||
01-24-2010, 05:04 PM | #8 |
Evangelist
Posts: 412
Karma: 546196
Join Date: Mar 2009
Location: UK canal boat
Device: sony prs505, prs650, kobo Glo HD liseuses
|
I have done this for a couple of books, and decided that despite being a tolerably good touch-typist, the pain was too much. With respect to the time taken, I've found that it's the "editorial" processes required *after* acquiring a digital text that are time consuming. (layout, chapter management, images, front and end matter as well as proof-reading). I'll be experimenting with using my digital camera for the job in the not-too-distant future - I have some aged pbacks that *have* to be digitised before they disintegrate!
|
01-26-2010, 11:31 AM | #9 |
Enthusiast
Posts: 48
Karma: 62
Join Date: Jan 2010
Device: HANLIN V3
|
I built Tesseract with no difficulty -- it works fine on its included test images, which probably doesn't say much. The stable 2.04 version can only read TIFF, but the slightly unstable v3, with some additional libraries, can read jpegs and other formats. Looks like a pretty good OCR engine, and I think this is where I'd put my effort if I get serious about scanning my old books.
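For batch use, something like the following Python sketch could drive the command-line tool over a folder of page scans. It assumes the `tesseract` binary is on the PATH and uses the basic `tesseract <image> <outputbase> -l <lang>` form, which writes `<outputbase>.txt`. The helper names and the `scans` directory are made up for illustration, and only `.tif` files are picked up since that is all the 2.04 build reads.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the argv for `tesseract <image> <outputbase> -l <lang>`."""
    output_base = image.with_suffix("")  # tesseract appends .txt itself
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_directory(scan_dir: Path, lang: str = "eng") -> list[Path]:
    """Run tesseract on every .tif in scan_dir, in name order.

    Returns the paths of the .txt files tesseract is expected to write.
    Raises subprocess.CalledProcessError if any page fails.
    """
    results = []
    for image in sorted(scan_dir.glob("*.tif")):
        subprocess.run(tesseract_command(image, lang), check=True)
        results.append(image.with_suffix(".txt"))
    return results


if __name__ == "__main__":
    # "scans" is a hypothetical folder of page001.tif, page002.tif, ...
    for text_file in ocr_directory(Path("scans")):
        print(text_file)
```

Naming the scans with zero-padded page numbers keeps the sorted order equal to the page order, which makes it easy to concatenate the text files afterwards.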
|