MobileRead Forums - View Single Post - do-it yourself repro v-cradle for paper books

ereszet · 10-16-2007, 10:29 AM

Quote:

Originally Posted by user

Although 98% OCR success is superb, what if you postprocess the images with software? Do you reach 100%? Does postprocessing increase OCR success for both scans and photos?

I am sure your work will be referenced by all those who try to digitize using a camera, and maybe in the future scanners will become extinct for home/small-office digitization.

1. Please note that my test was a kind of "destructive test" (like crashing your car into a wall to see the damage). The title sleeve was under a reflective plastic cover, the text was minuscule and in ten languages, two of which were not installed in my OCR engine, the pictures (flags) were so small that they could easily be taken for text in color, the background was in color, etc. It is difficult to imagine a similar case in practice. That the error rate with my camera setup was just 2% was amazing. One should read through the text after OCR (ten languages) to get a real taste of what Finereader is capable of doing with moderately good photos (no white balance) of extremely difficult layouts.

On average you can expect a 1% error rate (even with picture-perfect photos/scans of text), unless you have hundreds of similar pages and you train the program (dictionary, patterns) on the first dozen or so pages to make sure that the rest is recognized error-free.

In my experience, there is always an error here and there. Even if you scan the same document twice under identical conditions, the errors will appear in different places. Apart from sophisticated algorithms, the future lies in some kind of huge contextual database combined with artificial intelligence. For handwritten text there are different recognition algorithms, based on the movement of your pen rather than on patterns.
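For anyone who wants to put a number on their own pages: the post does not say how the 2% and 1% figures were counted, so the following is only a rough sketch of one common approach, a character error rate based on edit distance against a hand-corrected reference text. The function names `edit_distance` and `char_error_rate` are made up for this illustration and have nothing to do with Finereader.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Errors per reference character (assumed definition, see lead-in)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    reference = "abcd" * 25              # 100 characters of proofread text
    ocr_output = "x" + reference[1:]     # one misrecognized character
    print(char_error_rate(reference, ocr_output))
```

Running the same page through OCR twice and comparing each result with the reference is a simple way to see the errors landing in different places.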

2. Preprocessing is not required if you take the photos in proper lighting conditions and make sure that they are rectangular (removing black borders may be required when the proportions of the book differ from those of the photo frame). Batch contrast improvement may be useful for photos taken in dim lighting (in a hotel on your business trip). Deskewing and despeckling may be useful for faxes that you receive (a lot of people do not clean their fax machines). Correcting perspective and straightening text lines may be required for photos taken by hand or photos of double pages of thick books. Binarization will help you reduce the size of the resulting pdfs. All this can be done before using Finereader, and some preprocessing can be done by Finereader itself (but Finereader uses default parameters that may not be ideal for your specific documents/books). Finally, sometimes you may need to remove black blobs that are due to non-uniform lighting conditions. When you binarize color or grey pictures with some areas where text and background are barely discernible, you will get the text surrounded by black lines, smudges and blobs. By changing the binarization parameters you can get less of that, but the danger is that the text will disappear as well. The best method I found is to recognize blocks of text and pictures in Finereader and save all the images with blocks of text and pictures only. Then you load the images back into Finereader and check the thumbnails. You will easily see those images that still have black spots, and you can remove them with the eraser. After that you save your images as blocks again, load them back, recognize the text, save the final result in whatever format you want, and add white margins afterwards if you need them.
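To make the binarization step a bit more concrete, here is a minimal sketch of a global threshold chosen with Otsu's method, a standard textbook technique. This is an assumption for illustration only: the post does not say which algorithm or parameters Finereader uses, and the names `otsu_threshold` and `binarize` are hypothetical. The image is a plain list of rows of 8-bit grey values so that no imaging library is needed.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0..255) that maximises the between-class
    variance when pixels <= t are treated as ink and pixels > t as paper."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below t
    sum0 = 0  # sum of grey values at or below t
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue              # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break                 # no bright pixels left
        sum0 += t * hist[t]
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, t):
    """Map every pixel to black (0) or white (255) using threshold t."""
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    page = [[20, 200, 200],
            [200, 20, 200]]
    t = otsu_threshold([p for row in page for p in row])
    print(t, binarize(page, t))
```

A single threshold for the whole page is exactly what goes wrong under uneven lighting: a shadowed corner whose paper is darker than the threshold turns into a black blob, and lowering the threshold enough to clear it can also wipe out faint text elsewhere.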

Please do not be put off by the procedures I described above. If you use my setup with the adjustable v-cradle, you will shoot 10 perfect photos a minute (a book in one hour - with automatic camera shooting at 6-second intervals) and you can OCR them directly. If some pages go wrong (e.g. you turned two pages together instead of one, or you moved the book a little and the binding shows in some pictures), you just retake the photos of those pages.
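The timing above is easy to check, and to redo for a different interval or book length. A small sketch, assuming one page per photo (the `pages_per_photo` parameter and both function names are hypothetical, added only for this illustration):

```python
import math


def photos_per_minute(interval_s: float) -> float:
    """Shots per minute when the camera fires every interval_s seconds."""
    return 60.0 / interval_s


def capture_minutes(pages: int, interval_s: float, pages_per_photo: int = 1) -> float:
    """Minutes needed to photograph a book at a fixed shooting interval."""
    photos = math.ceil(pages / pages_per_photo)
    return photos * interval_s / 60.0


if __name__ == "__main__":
    print(photos_per_minute(6))     # rate at a 6-second interval
    print(capture_minutes(600, 6))  # time for a 600-page book
```

At a 6-second interval this gives 10 photos a minute, so one hour covers about 600 photos, not counting the few pages you have to reshoot.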