Old 10-10-2007, 01:36 PM   #60
ereszet
Zealot
Posts: 118
Karma: 306
Join Date: Sep 2007
Device: Sony PRS-500 Archos 704 wifi
OCR testing


The approach to testing Linux OCR programs in the link you provided is rather simplistic. It does not take into account that scanned pages are frequently not "clean" and often come at lower resolutions (like fax pages), that they contain layouts with blocks of text and images, that the text may be in different colors, that 8 point and even 6 point fonts are not uncommon in scanned originals, and that the original lines of text may be skewed or even curved. The author didn't even test italic and bold print for his few selected fonts. The one accented word he used to test OCR performance (presumably with the French and English languages selected) gives no clue about recognizing text in several different languages within one document (I frequently OCR more than six languages in old books - Latin, English, French, German, Cyrillic, various Slavic languages, plus their old varieties). While they provide some hints on the performance of a few Linux OCR programs, the tests have little to do with real-life situations. I am afraid that Linux OCR programs are about 20 years behind commercial Windows OCR programs like Finereader 8. It is not about programming as such; it is about the recognition, cleaning, deskewing, binarization and other algorithms.

To give you an idea of how accurate recognition is with photoscanning and Finereader 8, I used the same line of text as in the tests above and added 8 point and italic fonts. Below are the results.

Number of characters (with spaces): 1852

1. Word document converted by software to jpg at 300 dpi (no scanning), picture attached - 6 uncertain characters (no errors)
2. Word document converted by software to jpg at 400 dpi (no scanning) - 16 uncertain characters
3. Paper printout of the Word document photoscanned at about 300 dpi, picture attached - 17 uncertain characters
4. Paper printout of the Word document photoscanned at about 300 dpi and binarized by ClearImage demo, picture attached - 11 uncertain characters

While most uncertain characters were recognized correctly, some were wrong (mostly misrecognized characters in the article "the"). An uncertain/error margin of up to about 1% will make little difference when searching for a combination of words in OCRed and indexed files; however, it would make a difference when presenting the document as a copy of the original. That is why the best option for OCR output is a PDF with the picture over the text.
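For anyone who wants to check the margin, here is a minimal sketch that turns the counts reported above into percentages of the 1852 characters. The dictionary labels and the function name are just my own shorthand for the four test cases, not anything produced by Finereader.

```python
TOTAL_CHARS = 1852  # characters with spaces, as reported in the post

# Uncertain-character counts reported for the four test cases.
# The labels are shorthand chosen for this sketch.
uncertain = {
    "jpg 300 dpi (no scanning)": 6,
    "jpg 400 dpi (no scanning)": 16,
    "photoscanned ~300 dpi": 17,
    "photoscanned ~300 dpi, binarized": 11,
}


def uncertain_rate(count, total=TOTAL_CHARS):
    """Return the share of uncertain characters as a percentage."""
    return 100.0 * count / total


rates = {name: uncertain_rate(n) for name, n in uncertain.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```

All four cases stay below 1%, and these are uncertain characters, so the share of actual errors is lower still.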

As strange as it may look, the 400 dpi resolution did not improve the recognition, but binarization did.
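To show what binarization means in practice, here is a rough sketch of Otsu's global thresholding in plain Python. This is an assumption for illustration only: I do not know which algorithm ClearImage uses, and commercial tools most likely do something more adaptive. The function names and the sample pixel values are made up for the example.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no dark pixels yet, nothing to separate
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Made-up sample: dark "ink" around 30-40, grayish "paper" around 200-220.
sample = [30] * 40 + [40] * 10 + [200] * 30 + [220] * 20
t = otsu_threshold(sample)
bw = binarize(sample, t)
```

The point is that the uneven gray of a photographed page collapses to clean black and white before the OCR engine sees it.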

In the photoscanned picture you may notice that the corners are a little darker than the rest of the page. This is a side effect of the camera zoom. To avoid dark corners, just use a lower zoom. The higher the resolution of the camera sensor, the lower the zoom required to get a photo at a given resolution for a page of a given size.
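The relation between sensor resolution, page size and zoom is simple arithmetic, sketched below. The sensor widths (3264 and 4000 pixels on the long side) and the 11 inch page are hypothetical example values, not the camera or paper I used, and treating the fraction of the frame filled by the page as a stand-in for zoom is a simplification.

```python
def effective_dpi(sensor_px, page_in, fill=1.0):
    """DPI along one side when the page spans `fill` of the frame (0..1)."""
    return sensor_px * fill / page_in


def required_fill(target_dpi, sensor_px, page_in):
    """Fraction of the frame the page must span to reach target_dpi.

    A result above 1.0 means the sensor cannot reach that DPI in one shot.
    """
    return target_dpi * page_in / sensor_px


PAGE_LONG_SIDE_IN = 11.0  # hypothetical page, long side in inches
TARGET_DPI = 300

# Hypothetical sensors: long side in pixels.
fill_small = required_fill(TARGET_DPI, 3264, PAGE_LONG_SIDE_IN)
fill_large = required_fill(TARGET_DPI, 4000, PAGE_LONG_SIDE_IN)

print(f"3264 px sensor: page must fill {fill_small:.1%} of the frame")
print(f"4000 px sensor: page must fill {fill_large:.1%} of the frame")
```

With the larger sensor the page needs to fill less of the frame for the same dpi, so you can zoom out and keep the page away from the darker corners.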
Attached Thumbnails: original.jpg (56.0 KB), phtotoscanned.jpg (85.5 KB), binarized.jpg (102.3 KB)

Last edited by ereszet; 10-10-2007 at 03:07 PM.