Old 01-23-2010, 01:25 PM   #1
RootlessAgrarian
Enthusiast
Posts: 48
Karma: 62
Join Date: Jan 2010
Device: HANLIN V3
Current state of OCR/scanner tech?

I have a number of Very Old Books which I'd like to scan non-destructively (these are collectible editions, long out of print and out of copyright, which I'd like to preserve for my own records and contribute to Project Gutenberg).

Looking around at the state of scanners etc., my half-baked assessment is this:

1) Automated book scanning requires very expensive industrial machines.

2) Artisanal book scanning requires a lot of time and effort, either
using a standard or OpticBook-style bed scanner (the latter is better for fragile old editions), or
using a digital camera in some kind of offset stand (and possibly post-processing hundreds of images for contrast, ugh).

3) OCR software at present is either (a) very costly or (b) very cheesy. There doesn't seem to be any really good GPL OCRware. (Why is that, I wonder? We have all kinds of other GPL/CC software that's often better than the commercial flavour.) Either way, it's also time consuming to tend the OCR process and then fix the 5 to 20 percent error rate (depending on the cheesiness of the OCR).
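For what it's worth, the contrast post-processing mentioned in (2) is usually a simple linear stretch applied to every shot. Below is a minimal Python sketch of that idea working on a plain list of grayscale values; the function name and the sample values are made up for illustration, and a real batch job would read the pixels through an image library rather than a hand-written list.

```python
def stretch_contrast(pixels, low=0, high=255):
    """Linearly rescale grayscale values so the darkest becomes `low`
    and the brightest becomes `high`."""
    darkest, brightest = min(pixels), max(pixels)
    if brightest == darkest:
        # Flat image: nothing to stretch, return an unchanged copy.
        return list(pixels)
    scale = (high - low) / (brightest - darkest)
    return [round(low + (p - darkest) * scale) for p in pixels]


# Hypothetical washed-out page photo: grey paper around 200, faded ink around 90.
page = [200, 200, 90, 200, 112]
print(stretch_contrast(page))
```

After stretching, the paper goes to white and the ink to black, which generally gives the OCR step more to work with.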

So... has anyone actually dared to compare -- is it faster to set up a copy-holder stand and (if you are a fast typist) just retype the book content? It seems an arduous task, yet I do wonder if it would take any more time than the lengthy, high-tech procedure of book scanning and OCR.

Have I got a grip on the basic situation, or am I missing some recent and exciting development like a blazing new GPL OCR app?
Old 01-23-2010, 06:16 PM   #2
pholy
Booklegger
Posts: 1,789
Karma: 7999034
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
See my comments on the Scanner Recommendation thread. I would not call the ABBYY FineReader software cheesy, even if it doesn't run under Linux.

Sorry, I didn't figure out how to reference the thread by linky...
Old 01-23-2010, 07:44 PM   #3
RootlessAgrarian
no support for osx/linux? yep, cheesy :-)

Quote:
Originally Posted by pholy View Post
See my comments on the Scanner Recommendation thread. I would not call the ABBYY FineReader software cheesy, even if it doesn't run under Linux.

Sorry, I didn't figure out how to reference the thread by linky...
Yeah well, WinDoze-only-ware is categorised in my personal field notebook as cheesy; just an idiosyncratic bias.

More seriously though... I do wonder why the free software community, which has produced GIMP and other very viable alternatives to ransomware for other applications, has not managed to produce good OCR. That seems worth a bit of research just as an interesting question in its own right.
Old 01-23-2010, 08:19 PM   #4
Ralph Sir Edward
Gentleman & Cynic
Posts: 5,587
Karma: 13181086
Join Date: Jan 2008
Location: 5 generation native Texan
Device: BeBook/Openinkpot, CYbook 3rd gen awaiting RTF software upgrade
Lack of interest in the Linux community, I suppose.

OS preferences are basically passé, in my view. I have a job to do: how effective is the answer, and can I afford it?

A good scan setup for reflowable text will cost around $600. And that's to buy it for single-purpose use. What do you get for the money?

ACER REVO mini PC with Windows OS - $200
Optiscan 3600 scanner - $300
ABBYY Express 10.0 (I run 9.0 on my setup) - $60
Shipping - $40

You don't have to do anything else with the Windows PC; treat it as an embedded machine. (It's 15 cm x 15 cm x 4 cm, i.e. smaller than a typical hardback.)

You'll get a defect rate of about 1 error for every 4-5 pages for hardbacks and 1-2 per page for paperbacks, although it will vary with font type and size.

The snap scan method is for PDFs, and you lose reflowability with it. It causes problems with texts that don't fit the screen.
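To put the figures above in one place, here is a rough Python sketch that totals the setup cost and turns the quoted defect rates into an expected number of errors to fix per book. The prices and rates are the ones from this post; the 300-page book length and the function name are only assumptions for illustration.

```python
# Prices as listed in the post, in USD.
setup = {
    "Acer Revo mini PC with Windows": 200,
    "Optiscan 3600 scanner": 300,
    "ABBYY Express 10.0": 60,
    "Shipping": 40,
}
total_cost = sum(setup.values())


def errors_in_book(pages, errors, per_pages=1):
    """Expected error count, given `errors` errors per `per_pages` pages."""
    return pages * errors / per_pages


pages = 300  # hypothetical book length, not a figure from the post

# Hardbacks: 1 error per 4-5 pages. Paperbacks: 1-2 errors per page.
hardback = (errors_in_book(pages, 1, per_pages=5), errors_in_book(pages, 1, per_pages=4))
paperback = (errors_in_book(pages, 1), errors_in_book(pages, 2))

print(f"Setup cost: ${total_cost}")
print(f"Hardback, {pages} pages: {hardback[0]:.0f}-{hardback[1]:.0f} errors to fix")
print(f"Paperback, {pages} pages: {paperback[0]:.0f}-{paperback[1]:.0f} errors to fix")
```

Under those assumptions a hardback leaves a few dozen corrections and a paperback a few hundred, so the proofreading pass matters a lot more for paperbacks.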
Old 01-23-2010, 10:02 PM   #5
hidari
MR Drone
Posts: 1,604
Karma: 15260410
Join Date: Oct 2007
Location: DRONEZONE
Device: OPUS/PB360,Nexus 7,GzONE, Kobo Mini
Quote:
Originally Posted by RootlessAgrarian View Post
Yeah well, WinDoze-only-ware is categorised in my personal field notebook as cheesy; just an idiosyncratic bias.

More seriously though... I do wonder why the free software community, which has produced GIMP and other very viable alternatives to ransomware for other applications, has not managed to produce good OCR. That seems worth a bit of research just as an interesting question in its own right.

I recommend VueScan:

http://www.hamrick.com/


Works on Linux, Apple, or Windoze. 40 USD. I have used it for several years.
Old 01-24-2010, 03:09 PM   #6
RootlessAgrarian
Quote:
Originally Posted by hidari View Post
I recommend VueScan:

http://www.hamrick.com/

Works on Linux, Apple, or Windoze. 40 USD. I have used it for several years.
Googling around, I have discovered that GOCR is still alive, and that Google (the place I would like to have spent my career if I could go back and start over!) is distributing Tesseract. Both are command-line driven and seem well worth a look and a sniff.

Point well taken about the embedded-machine aspect (using a windoze laptop or palmtop as a controller, like the msdos machines that still run many CNC mills). I just don't have a few hundred bux to throw at the project; hoping to do it on the cheap with what I already have (Leopard, a laptop, a digicam, and coding skills).

Thanks for the thread! I hope some more folks will pop up and explain their own personal scanning/OCRing setups.
Old 01-24-2010, 03:37 PM   #7
pepak
Fanatic
Posts: 594
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-505
Quote:
Originally Posted by RootlessAgrarian View Post
2) artisanal book scanning requires a lot of time and effort [...] Opticbook
It's not quick and easy by any means, but it doesn't take all that much time or effort, really. It depends on how collectable your books are and how much you value them, and how much free time you have. Generally, I tend to do some 300 pages per hour while browsing the net or reading another book, so most of my books can easily get scanned over one evening.

Quote:
3) OCR software at present is either (a) very costly or (b) very cheesy.
Don't forget (c) both.

Quote:
it's also time consuming to tend the OCR process and then fix the 5 to 20 percent error rate (depending on the cheesiness of the OCR).
Generally, I do my proofing while reading the book, never in OCR software. It tends to be a lot faster (you would still need to read the spell-checked book anyway), definitely more enjoyable, and gives a far lower rate of unfixed errors (because an automatic spellcheck can only catch certain types of errors, but can't handle context, grammar and typography yet).

Quote:
is it faster to set up a copy-holder stand and (if you are a fast typist) just re-type the book content?
No, but I did download a book that was typed by someone and after reading through several chapters I decided to make a new scan rather than go through the kind of errors the typing process created.

Quote:
I do wonder if it would take any more time than the lengthy, high-tech procedure of book scanning and OCR.
Depends. I am sure that if an uninitiated person tried it, the typing might possibly be faster. As soon as you get some experience scanning and proofing, I am damn sure typing will prove far more costly in time, effort and quality of the result.
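For a rough sense of scale, the comparison can be sketched in a few lines of Python. The 300 pages per hour scanning rate is the one mentioned earlier in this post; the words-per-page and typing-speed defaults are pure assumptions for illustration, and neither figure includes OCR run time or proofreading.

```python
def typing_hours(pages, words_per_page=300, words_per_minute=60):
    """Hours needed to retype a book (both defaults are assumed values)."""
    return pages * words_per_page / words_per_minute / 60


def scanning_hours(pages, pages_per_hour=300):
    """Hours needed to scan a book at the rate quoted in this post."""
    return pages / pages_per_hour


pages = 300  # hypothetical book length
print(f"Typing:   {typing_hours(pages):.1f} h")
print(f"Scanning: {scanning_hours(pages):.1f} h")
```

Even if OCR and proofing add several hours on top of the scan itself, the retyping estimate under these assumptions is far larger, and it still needs its own proofreading pass.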
Old 01-24-2010, 05:04 PM   #8
alecE
Addict
Posts: 399
Karma: 546196
Join Date: Mar 2009
Location: UK canal boat
Device: sony prs505, prs650 liseuses
Quote:
Originally Posted by RootlessAgrarian View Post
...so... has anyone actually dared to compare -- is it faster to set up a copy-holder stand and (if you are a fast typist) just re-type the book content?...
I have done this for a couple of books, and decided that despite being a tolerably good touch-typist, the pain was too much. With respect to the time taken, I've found that it's the "editorial" processes required *after* acquiring a digital text that are time consuming (layout, chapter management, images, front and end matter, as well as proofreading). I'll be experimenting with using my digital camera for the job in the not-too-distant future - I have some aged paperbacks that *have* to be digitised before they disintegrate!
Old 01-26-2010, 11:31 AM   #9
RootlessAgrarian
I built Tesseract with no difficulty -- it works fine on its included test images, which probably doesn't say much. The stable 2.04 version can only read TIFF, but the slightly unstable v3, with some additional libraries, can read JPEGs and other formats. Looks like a pretty good OCR engine, and I think this is where I'd put my effort if I get serious about scanning my old books.
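Since it is command-line driven, batch OCR is mostly a matter of building one command per page. Here is a minimal Python sketch, assuming the basic `tesseract <image> <output base> -l <language>` invocation and TIFF input (the only format 2.04 reads); the file names and helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, lang="eng"):
    """Build the argument list for: tesseract <image> <output base> -l <lang>."""
    image = Path(image)
    output_base = image.with_suffix("")  # tesseract adds the .txt extension itself
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_page(image, lang="eng"):
    """Run tesseract on one page image; raises if the tool is missing or fails."""
    subprocess.run(tesseract_command(image, lang), check=True)


# Hypothetical page scans; print the commands instead of running them.
for name in ["page001.tif", "page002.tif"]:
    print(" ".join(tesseract_command(name)))
```

Calling `ocr_page("page001.tif")` would write `page001.txt` next to the image, provided the `tesseract` binary is on the PATH.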