Old 04-30-2011, 05:43 AM   #7
Iain
Join Date: Jul 2010
Location: Harrogate, England
Device: iPad
I looked at OmniPage and FineReader. I selected FineReader after some informal testing, which indicated that OmniPage was more likely to produce whole pages of incorrectly bolded text.

Since then I've also seen this effect in FineReader, so my coarse cut may have been too coarse!

I've converted about 450 books with FineReader and I've (re-)read/proofed maybe 70 of them.

My impression from this (I've not yet formalised it) is that FineReader has a few systematic errors:

It often sees tl as d (but this depends on the book).
It struggles to tell whether a symbol is I or 1 (eye or one).
It is (relatively) poor at getting paragraph breaks right.
It is (relatively) poor at de-hyphenating words on line breaks.
It struggles more with books where the paper has darkened (poor contrast ratio).
It can get the formatting confused - I think this mainly happens when the page is scanned at an angle - something which is hard to eliminate.
It has trouble with italics and especially exclamation marks.
Punctuation is a bit dodgy. In particular, quote marks are probably often missed out. I can't say that I notice this a huge amount, since it is usually very clear from context what is happening. However, when you get a long dialogue where the speaker change is only indicated by the quotation marks, this can be a bit troublesome.
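
As a rough illustration of how the I/1 confusion could be hunted down in the OCR text, here is a minimal Python sketch. It is not anything FineReader does, and the pattern and the function name flag_suspect_tokens are my own assumptions: it simply flags tokens that mix letters with the digit 1, or digits with a capital I, so they can be checked by eye. It will also flag legitimate tokens such as "1st", so treat the output as a list of candidates rather than a list of errors.

```python
import re

# Hypothetical heuristic (not part of any OCR package): a token that mixes
# letters with the digit 1, or digits with a capital I, is a candidate
# I/1 confusion that deserves a manual look.
SUSPECT = re.compile(
    r"\b(?:[A-Za-z]+1[A-Za-z0-9]*"  # letters followed by a 1, e.g. "sh1p"
    r"|\d+I\d*"                     # digits followed by a capital I, e.g. "19I4"
    r"|1[A-Za-z]+)\b"               # a 1 followed by letters, e.g. "1t" for "It"
)

def flag_suspect_tokens(text):
    """Return the tokens in text that look like possible I/1 mix-ups."""
    return SUSPECT.findall(text)

if __name__ == "__main__":
    sample = "In 19I4 the sh1p sailed; I saw 1 gull."
    print(flag_suspect_tokens(sample))
```

A lone "I" or a lone "1" is deliberately left alone, since either can be correct and only context can decide.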

Having said all that, for probably 9 books out of 10 the conversion is sufficiently close to perfect that I need to be in nit-picky mode to find errors - there might be a mistaken character every ten pages or so. (Paragraph and hyphenation errors are more common than this, but I've preprocessed the FineReader output to correct most of these.)
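
To give an idea of what this kind of preprocessing can look like, here is a minimal Python sketch. It is not the actual script used on these books; it assumes the OCR output is plain text with one newline per line break, and the function names (dehyphenate, merge_false_breaks, clean) are made up for the example. It joins words split by a hyphen at a line end, then merges a line into the previous one when the previous line does not end a sentence and the new line starts in lower case.

```python
import re

# Characters that plausibly end a paragraph (an assumption; tune per book).
END_OF_SENTENCE = (".", "!", "?", '"', "'", "\u201d", "\u2019")

def dehyphenate(text):
    """Join 'cap-\\ntain' into 'captain' when the next line starts lower case."""
    return re.sub(r"([A-Za-z])-\n([a-z])", r"\1\2", text)

def merge_false_breaks(text):
    """Merge a line into the previous one when the break looks spurious."""
    merged = []
    for line in text.split("\n"):
        prev = merged[-1] if merged else ""
        # A break is treated as false if the previous line has no closing
        # punctuation and the current line starts with a lower-case letter.
        if prev and line[:1].islower() and not prev.rstrip().endswith(END_OF_SENTENCE):
            merged[-1] = prev.rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return "\n".join(merged)

def clean(text):
    """Run both passes: de-hyphenate first, then repair paragraph breaks."""
    return merge_false_breaks(dehyphenate(text))

if __name__ == "__main__":
    sample = "The cap-\ntain looked at the\nscreen.\nShe said nothing."
    print(clean(sample))
```

The obvious weakness is genuine compounds: "well-" at a line end followed by "known" would come out as "wellknown", so a dictionary check on the joined word would be a sensible refinement.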

There are probably half a dozen books that are a struggle to read. In those I've investigated so far, the problem seems to be associated with poor contrast ratio. For example, I have the Heris Serrano series. The first few are barely readable. The last few are nearly perfect. They are all from the same house with the same basic layout, font and size. The difference is that the first batch are quite seriously browned.

I'm hoping to do a broader comparison of the main OCR players sometime in the next few months and will post my results!

Iain