An OCR success rate of 99% is acceptable in some applications (like book scanning) and unacceptable in others (like bank documents).
For a "paperless office", which today in most cases is based on picture true copies and some keywords describing the documents, an index of 99% properly recognized words is more than one could wish for (and some of the remaining 1% words are just uncertain - not necessarily errors). For special applications (like musical notes) there are special OCR programs.
OCRing books is done for two different reasons:
either to save the result in an editable text format (e.g. Word or HTML) in order to work on the text later (e.g. to convert it to the LRF format),
or to save the recognized text under the page image in order to have a perfect copy of the original with every word indexed.
In the first case a 1% error rate does not prevent you from reading the book. As a matter of fact, the Discovery TV channel shows in its spots that you need only the first and last letters of a word in context for your brain to make the right association. Project Gutenberg books (see the Webster's encyclopedia of 1911) contain a lot of misrecognized letters but are usable anyway.
In the second case, you are able to search thousands of books for some combination of words, even using wildcards (e.g. you can search for all the pages with the words "photo scanning", or with "photo* scan*"), and the displayed pages will look exactly like the original. If you miss a page or two because of OCR errors, you can live with it. I use it for genealogical research, even with 17th-century books (where e.g. "s" is printed like "f"), and I can still find plenty of references to what I am looking for.
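Just to illustrate what such a wildcard search does under the hood (Finereader and PDF indexers handle this internally; the page index and its contents below are invented for the example), here is a minimal Python sketch:

```python
import re

def wildcard_to_regex(pattern):
    # Translate a simple "*" wildcard into a regex fragment:
    # "photo*" matches "photo", "photos", "photographic", ...
    return r"\b" + re.escape(pattern).replace(r"\*", r"\w*") + r"\b"

def pages_matching(pages, *patterns):
    # pages: dict mapping page number -> OCR text of that page.
    # Return the pages whose text contains ALL the patterns.
    regexes = [re.compile(wildcard_to_regex(p), re.IGNORECASE) for p in patterns]
    return [n for n, text in pages.items() if all(r.search(text) for r in regexes)]

# Hypothetical OCR text index of two pages:
pages = {12: "Early photo scanning techniques...",
         47: "Photographic scans of parish ledgers..."}
print(pages_matching(pages, "photo*", "scan*"))   # -> [12, 47]
```

Note that the search runs over the hidden text layer, while what you see on screen is still the original page image.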
As for the ideal parameters for scanning before OCR, there are simply no such strict requirements any more. Twenty years ago, when OCR was based on simple pattern matching, there were very strict rules about resolution and recognizable fonts. Today's OCR algorithms are much more sophisticated and can cope with a plethora of languages and font shapes. However, they cannot make an informed decision whether an object (a big title, a small icon, a musical note) is a picture or text, so warnings and errors are unavoidable. I stress again that even a perfect 300 dpi image of text converted directly from Word (without printing to paper and scanning) will produce OCR errors and warnings. The only way to improve on that is to teach (train) the OCR program on the first few pages of the book, so that it can easily tell the difference between e.g. "n" and "h" in the fonts used to print that book.
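To make the "simple pattern matching" point concrete: early OCR essentially compared each glyph bitmap pixel by pixel against stored templates and picked the closest one, which is why resolution and font had to match exactly, and why similar letters like "n" and "h" were confused until you trained the program. A toy Python sketch (the tiny 3x3 "bitmaps" are invented purely for illustration):

```python
# Toy template-matching OCR: each known glyph is a small black/white
# bitmap; recognition picks the template with the fewest differing
# pixels. Any change of font, size or scan resolution breaks this,
# which is why old OCR engines imposed such strict scanning rules.
TEMPLATES = {
    "n": [(1, 0, 1), (1, 1, 1), (1, 0, 1)],   # invented 3x3 bitmaps,
    "h": [(1, 0, 0), (1, 1, 1), (1, 0, 1)],   # just for the example
}

def recognize(glyph):
    def distance(a, b):
        return sum(pa != pb for row_a, row_b in zip(a, b)
                            for pa, pb in zip(row_a, row_b))
    return min(TEMPLATES, key=lambda ch: distance(TEMPLATES[ch], glyph))

print(recognize([(1, 0, 0), (1, 1, 1), (1, 0, 1)]))   # -> "h"
```

Training the OCR program on a few pages amounts to replacing such generic templates with ones taken from the actual book, so the "n"/"h" ambiguity disappears for that font.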
Quality of scans (focus, uniform lighting, proper positioning) matters more than resolution, whether you plan to OCR the scan or not. Finereader can cope with resolutions from 96 dpi up. The new Finereader 9 can automatically set the proper brightness of greyscale images when scanning, but with a camera you need to set the parameters yourself. Because every book is different, you should always take trial photos of a few pages at different camera settings and preview the results (scale the picture up to see the details). If your eye is happy with the results, your OCR program should be happy as well.
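If you want to script that "scale up and inspect" step, a small Pillow sketch (the file name is just a placeholder for your trial shot) that crops the centre of the page and enlarges it so you can judge sharpness at pixel level:

```python
from PIL import Image

# Open a trial photo of a book page (placeholder file name).
img = Image.open("test_page.jpg")

# Crop a 400x400 region from the centre, where the text usually is.
w, h = img.size
box = (w // 2 - 200, h // 2 - 200, w // 2 + 200, h // 2 + 200)
crop = img.crop(box)

# Enlarge 3x with nearest-neighbour so individual pixels stay visible;
# if strokes look blurred or letters bleed together here, OCR will struggle.
crop.resize((crop.width * 3, crop.height * 3), Image.NEAREST).show()
```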
Finally, as far as books are concerned, scanners are no match for camera photography, because of their limited size, slow operation, the deformation of scanned book pages, damage to the books, etc.
BTW, while Finereader 9 allows for 2-megapixel camera shots, its guide specifically says that cellphone cameras are not fit for the purpose. I wonder why you try all kinds of approaches instead of the solution with which I started this thread, one I have tested on tens of thousands of book and document pages. Just try it, even with the "quick and dirty" cardboard v-cradle.