Quote:
Originally Posted by user
The approach to testing Linux OCR programs in the link you provided is rather simplistic. It does not take into account that scanned pages are frequently not "clean" and often come at low resolution (like fax pages); that they contain layouts mixing blocks of text and images; that the text may be in different colors (the author did not even test italic and bold print for his few selected fonts); that 8 point and even 6 point fonts are not uncommon in scanned originals; and that the original lines of text may be skewed or even curved. The single accented word he used to test OCR performance (presumably with French and English selected) gives no clue about recognizing text in several different languages within one document. I frequently OCR more than six languages in old books: Latin, English, French, German, Cyrillic, various Slavic languages, plus their old varieties. While the tests provide some hints on the performance of a few Linux OCR programs, they have little to do with real-life situations. I am afraid that Linux OCR programs are about 20 years behind commercial Windows OCR programs like Finereader 8. It is not about programming; it is about the recognition, cleaning, deskewing, binarization and other algorithms.
To give you an idea of how accurate recognition is with photoscanning and Finereader 8, I took the line of text used in the tests above and added 8 point and italic fonts. The results are below.
Number of characters (with spaces): 1852
1. Word document converted by software to jpg at 300 dpi (no scanning), picture attached - 6 uncertain characters (no errors)
2. Word document converted by software to jpg at 400 dpi (no scanning) - 16 uncertain characters
3. Paper printout of the Word document photoscanned at about 300 dpi, picture attached - 17 uncertain characters
4. Paper printout of the Word document photoscanned at about 300 dpi and binarized by ClearImage demo, picture attached - 11 uncertain characters
While most uncertain characters were recognized correctly, some were wrong, mostly misrecognized characters in the article "the". A 1% uncertain/error margin makes little difference when searching for a combination of words in OCRed and indexed files, but it would make a difference when presenting the document as a copy of the original. That is why the best option for OCR output is a PDF with the page picture over the text.
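For anyone who wants to check the "about 1%" figure, here is a small Python sketch that turns the counts listed above into percentages. The dictionary labels and the function name are my own shorthand; only the numbers come from the tests.

```python
TOTAL_CHARS = 1852  # characters with spaces, as reported above

# Uncertain-character counts for the four test images (labels are shorthand)
uncertain = {
    "jpg 300 dpi (no scanning)": 6,
    "jpg 400 dpi (no scanning)": 16,
    "photoscan ~300 dpi": 17,
    "photoscan ~300 dpi, binarized": 11,
}


def uncertain_rate(count, total=TOTAL_CHARS):
    """Return the share of uncertain characters as a percentage."""
    return 100.0 * count / total


rates = {name: uncertain_rate(n) for name, n in uncertain.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```

All four cases come out below 1%. Note that these are uncertain characters, not confirmed errors, so the real error rate is lower still.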
Strange as it may look, the 400 dpi resolution did not improve recognition, but binarization did.
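To show what binarization means in practice, here is a minimal pure-Python sketch of global thresholding with Otsu's method. This is an assumption for illustration only: I do not know which algorithm ClearImage actually uses, and commercial tools very likely do something more adaptive. The function names and the sample data are made up.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold for a flat list of 0-255 grey values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of grey values of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor)
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Made-up sample: dark "ink" around 40-50, greyish "paper" around 200-210
sample = [40] * 10 + [50] * 10 + [200] * 30 + [210] * 30
t = otsu_threshold(sample)
bw = binarize(sample, t)
print(t, bw.count(0), bw.count(255))
```

The threshold lands between the ink and paper values, so uneven grey paper becomes plain white and the characters become plain black, which is the kind of input an OCR engine handles best.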
In the photoscanned picture you may notice that the corners are a little darker than the rest of the page. This is a feature of the camera zoom. To avoid dark corners, just use a lower zoom. The higher the resolution of the camera sensor, the lower the zoom required to get a photo at a given resolution for a page of a given size.
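The relation between sensor resolution and zoom can be put into numbers with a rough Python sketch. It assumes an A4 page (210 x 297 mm) and two hypothetical sensors of 4000 x 3000 and 6000 x 4000 pixels; neither is the camera used for the pictures above, and the function names are made up.

```python
MM_PER_INCH = 25.4


def required_pixels(page_mm, dpi):
    """Pixels needed along each page side to reach the target dpi."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in page_mm)


def frame_fill(page_mm, dpi, sensor_px):
    """Fraction of the frame the page must span, long side on long side.

    A value above 1.0 means the sensor cannot reach the target dpi
    for this page in a single shot.
    """
    page_short, page_long = sorted(required_pixels(page_mm, dpi))
    sensor_short, sensor_long = sorted(sensor_px)
    return max(page_short / sensor_short, page_long / sensor_long)


a4 = (210, 297)  # page size in mm (assumption)
for sensor in [(4000, 3000), (6000, 4000)]:  # hypothetical sensors
    fill = frame_fill(a4, 300, sensor)
    print(f"{sensor[0]}x{sensor[1]}: page must span {fill:.0%} of the frame")
```

With the larger sensor the page has to span a smaller part of the frame for the same 300 dpi, so less zoom is needed.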