09-11-2010, 12:35 PM | #1 | |||
Enthusiast
Posts: 36
Karma: 308
Join Date: Aug 2009
Location: Denmark
Device: Kindle 3
|
Low budget scanner + OCR: Test and results
Introduction:
I had a bit of free time and decided to do a little impromptu test of a very low-budget book scanner and how to get the best OCR results from it.

Methodology:
Since I don't own a nice DIY book scanner - or even a decent digicam - I decided to test a worst-case scenario for book scanning: a mobile phone for capture and uneven lighting. I also captured both pages of the open book in a single image, so the pages are distorted. The light was ambient daylight, coming in from the side. The camera was a Nokia 5800, which has a 3.2-megapixel camera if I'm not mistaken. In short, the pictures are far from optimal for book scanning.

Software:
I tested Scan Tailor and Snapter to process the images and ABBYY FineReader 10 Pro for OCR. Both Snapter and FineReader were trial versions, but I think they perform the same as the full versions. For those who don't know, Snapter is a $49 program that is supposed to process camera images and make them look more like they were scanned on a regular flatbed scanner - a bit like Scan Tailor. So in theory it should provide better images to the OCR software.

The test:
I grabbed the first book on my shelf; it was a Danish translation of Faulkner's Light in August. Since I was just doing this to humour myself, I didn't think ahead and use an English book so that you guys could read the sample image too. But it might turn out to be a blessing in disguise, because Danish has three extra letters: æ, ø and å. As we'll see later, the OCR software had a hard time correctly recognizing the letter 'ø' in this low-resolution source image. So this makes the test even more of a worst-case scenario than most of you would normally encounter.

The workflow was:
Camera image --> Scan Tailor / Snapter --> ABBYY FineReader

I tried a few different settings to see if they would make a difference in OCR errors. The settings are clarified in the next section.
Since FineReader has the option to adjust images before OCR, I tested that too. I only used the three 'photo correction' options (straighten text lines, remove motion blur and reduce ISO noise). Finally, I ran the source photo through OCR as is, without any extra processing from either Scan Tailor or Snapter.

I only tested one page, because the FineReader trial limits saves to 1 page at a time (with a 50-page total limit). The error counts are merely approximations, so please take the results with a grain of salt - I don't claim this to be the be-all and end-all test.

Results:
Original source image: Too many errors to be usable
Original source image + FineReader adjustments: 40 errors
Scan Tailor: Too many errors to be usable
Scan Tailor + FineReader: 26 errors
Snapter out-of-box: 36 errors
Snapter out-of-box + FineReader: 21 errors
Snapter 400 dpi + sharpening + contrast: 38 errors
Snapter 400 dpi + sharpening + contrast + FineReader: 22 errors
Snapter 400 dpi + auto color + sharpening: 37 errors
Snapter 400 dpi + auto color + sharpening + FineReader: 20 errors
Snapter 400 dpi + greyscale + contrast: 35 errors
Snapter 400 dpi + greyscale + contrast + FineReader: 18 errors

A little note on speed: The OCR felt noticeably faster on the Snapter-prepared images than on the raw source image. This might be a factor when doing many pages (although then you'd have to spend time on Snapter as well).

Conclusion:
The source image on its own was too bad to get a usable OCR. When I applied the FineReader photo corrections it became much better, with 40 errors (meaning I had to manually delete/replace 40 characters).

Scan Tailor on its own was also too bad to get a usable OCR. With FineReader's help the result had 26 errors, not too bad at all.

The best result came from using Snapter at 400 dpi with greyscale and contrast boost, but without sharpening applied. On its own the result had 35 errors.
With FineReader's help the result was 18 errors - a lot of which were due to 'ø' being recognized as 'o'.

The end result was surprisingly usable considering the low-resolution, distorted, side-lit source image. So if this is as bad as it gets, I imagine one could get quite a good result just by using a tripod, overhead lighting and a decent camera. That would make a nice (and fast) low-budget DIY book scanner, capable of providing an acceptable OCR result without the need for much text editing.

Hope you enjoyed my little test. I can only encourage anyone interested to try it out for themselves. You might be surprised at the results.

Source image:
Sample output - OCR from source image: Quote:
Sample output - OCR from source image + FineReader: Quote:
Sample output - OCR from Snapter + FineReader: Quote:
|
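For anyone who wants to compare the settings at a glance, the error counts from the results list in the first post can be put in a small table. This is only a sketch: the numbers are copied from the post, the function names are made up for illustration, and `None` stands for the runs described as "too many errors to be usable", where no count was given.

```python
# Error counts copied from the results list in the first post.
# Each value is (errors without FineReader adjustments, errors with them).
# None means "too many errors to be usable" (no count was reported).
results = {
    "Original source image": (None, 40),
    "Scan Tailor": (None, 26),
    "Snapter out-of-box": (36, 21),
    "Snapter 400 dpi + sharpening + contrast": (38, 22),
    "Snapter 400 dpi + auto color + sharpening": (37, 20),
    "Snapter 400 dpi + greyscale + contrast": (35, 18),
}


def errors_removed(without_fr, with_fr):
    """Errors removed by the FineReader photo corrections, or None if unknown."""
    if without_fr is None:
        return None
    return without_fr - with_fr


def best_setting(table):
    """Name of the setting with the fewest errors after FineReader corrections."""
    return min(table, key=lambda name: table[name][1])


for name, (without_fr, with_fr) in results.items():
    removed = errors_removed(without_fr, with_fr)
    note = "n/a" if removed is None else f"{removed} fewer"
    print(f"{name}: {with_fr} errors with FineReader ({note})")

print("Best:", best_setting(results))
```

Since only one page was tested and the counts are approximate, small differences between the Snapter settings should not be read as significant.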
|||
09-12-2010, 10:41 AM | #2 |
Booklegger
Posts: 1,801
Karma: 7999816
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
|
Just a quick question - Did you tell the Abbyy software you had a Danish book? Can it figure that out itself? I think Danish is a supported language; then it would have checked for the odd vowels. I'm curious to know how much difference that would make.
|
09-12-2010, 10:48 AM | #3 |
Enthusiast
Posts: 36
Karma: 308
Join Date: Aug 2009
Location: Denmark
Device: Kindle 3
|
Yeah, it has support for Danish and figured out the language itself. The problem with 'ø' is that it looks very much like 'o' and if the source image isn't clear enough that can become a problem.
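If you have a hand-corrected copy of the page, the 'ø' versus 'o' confusion can be counted rather than eyeballed. The sketch below uses Python's standard `difflib` to align the corrected text with the OCR output and tally one-for-one character substitutions. The function name and the sample strings are made up for illustration and are not taken from the scanned page.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(reference, ocr_text):
    """Count (expected, got) character pairs in equal-length replaced spans."""
    counts = Counter()
    matcher = SequenceMatcher(None, reference, ocr_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only spans replaced one-for-one can be paired up character by character.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(reference[i1:i2], ocr_text[j1:j2]):
                counts[(expected, got)] += 1
    return counts


# Hypothetical sample strings, not from the actual test page.
reference = "en rød dør"
ocr_text = "en rod dor"

counts = substitution_counts(reference, ocr_text)
print("ø read as o:", counts[("ø", "o")])
```

Insertions and deletions are ignored here on purpose, so this only measures look-alike letter swaps, not the total error count.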
|
09-12-2010, 10:10 PM | #4 |
Booklegger
Posts: 1,801
Karma: 7999816
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
|
Thanks, M. - I'll have to give this a try using my two Olympi (-uses?) - a 5 MP point&shoot and a 10 MP E-520. On the other hand, it looks like a book has to sit flat open for this to work well. I'll probably keep the OpticBook scanner that came with my copy of Abbyy Sprint (of course, it stays home, and my cameras can go with me anywhere). Very interesting technique!
|
09-13-2010, 01:37 AM | #5 |
Enthusiast
Posts: 36
Karma: 308
Join Date: Aug 2009
Location: Denmark
Device: Kindle 3
|
Just for fun I tried capturing the best picture I could with the same setup and ran it through OCR (ABBYY FineReader 10 again). I tried Scan Tailor, Snapter and the original photo, each with and without the FineReader adjustments.
My (very quick) findings were that Scan Tailor gave the worst OCR result, Snapter was second, and the original photo was best. This time around, using the FineReader adjustments didn't necessarily result in an improvement. Sometimes it fixed an error, other times it introduced new ones. There were very few errors in the OCR using the original image without any postprocessing. I didn't count, but I'd say <5 for the page. You could certainly use it out-of-box for non-archival purposes (say, reading a novel on an ereader).
Again, this is from a handheld mobile phone with no extra lighting. With just a real digital camera and a tripod I would expect near-perfect OCR results out-of-box. It's a bit slower though, since you only capture one page at a time. But you could probably combine the speed of test 1 with the accuracy of test 2 by using a plate of glass or something (or even your fingers) to make sure the pages are flat and uniform.
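The error counts in this thread were made by hand. If a corrected version of the page is available, a character-level edit distance gives a repeatable number for "characters I had to delete/replace". Below is a minimal sketch in plain Python; the function names are illustrative, and this is the standard Levenshtein distance, not whatever FineReader measures internally.

```python
def edit_distance(reference, ocr_text):
    """Levenshtein distance: fewest single-character edits between two strings."""
    previous = list(range(len(ocr_text) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, ocr_char in enumerate(ocr_text, start=1):
            cost = 0 if ref_char == ocr_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from OCR output
                current[j - 1] + 1,      # extra character in OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_text):
    """Edit distance divided by the length of the corrected reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_text) / len(reference)
```

For example, `edit_distance(corrected_page, ocr_page)` on one page's text gives a single number that can be compared across Scan Tailor, Snapter and the untouched photo.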