MobileRead Forums - View Single Post

DSpider · 01-04-2012, 07:31 AM

tl;dr Stop whining. Use Ctrl+H.

Hour-long elaborate reply:

Quote:

If the OCR is this unreliable I spend more time proofreading the result than just typing it all in myself and that completely defeats the purpose of OCR.

Proofreading is A HECK OF A LOT FASTER than manually typing an entire book. Are you kidding me?

You couldn't type 30 pages without making a mistake somewhere on some line. You're bound to skip an accent or an italic or two. Do this before bedtime or after a hard day's work and the accuracy will drop even lower. Manually typing it in is insane! This isn't 1976, you know... It takes a lot of mental effort to get everything right. It's much faster and more accurate to focus on the little tidbits that an automated process may have interpreted wrong (interpreted wrong, not missed, unlike some of the initial Gutenberg e-books, which were missing entire lines because they were manually typed in).

Anyway.

It's true that FineReader has a lot of idiosyncrasies, especially with multi-language documents. Probably because of its dictionary-based approach. When a group of characters is recognized, it's cross-referenced against a built-in dictionary specific to a language, for better accuracy. And some languages are very similar. You see, most languages have a substrate, a "root" if you will... In Europe, a lot of them have a Latin substrate (the Roman Empire conquered territory like mad left and right), and some are sprinkled with Slavic, Germanic, etc. (btw, English itself has Germanic roots, from the settlers known as "Anglo-Saxons").
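To make the dictionary idea a bit more concrete, here's a rough sketch in Python. This is NOT FineReader's actual algorithm (I have no idea what it does internally); it's just a toy that snaps a raw OCR guess to the closest entry in a word list. The function name, the word list and the cutoff value are all made up for illustration.

```python
import difflib

def snap_to_dictionary(raw_word, dictionary, cutoff=0.7):
    """Return the closest dictionary entry to raw_word.

    Falls back to raw_word unchanged if nothing is similar enough.
    Note: the match is done in lowercase, so the original casing is lost
    whenever a dictionary entry is returned.
    """
    matches = difflib.get_close_matches(
        raw_word.lower(), dictionary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else raw_word

# Hypothetical toy word list, invented for this example.
toy_dictionary = ["hali", "carte"]

# A name that isn't in the list gets pulled toward a similar-looking entry.
print(snap_to_dictionary("Hall", toy_dictionary))
# A word with no close neighbour is left alone.
print(snap_to_dictionary("xyz", toy_dictionary))
```

The point is only that a word missing from the active dictionary can get "corrected" into a lookalike that is in it, which is the kind of behaviour described above.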

And the more similar languages there are in the scanned material, the higher the chance that something gets recognized in a different form... Sometimes without an accent, sometimes as a completely different word. Here's an example from personal experience using the default Romanian dictionary with a multi-language book:

English: "Stuart Hall" is recognized as "Stuart Hali" throughout the book

French: "Xavier Molénat" is recognized as "Xavier Molenat" throughout the book

This usually happens with names. Mainly because there are so, SO many names on this earth that it would be either very difficult or maybe even impossible to add them all to each dictionary for each language. Nor should you try. Because some of them will look very similar, and in multi-language documents they may just cancel each other out. So all that effort would be for nothing.

The solution? Either add them to their specific dictionary so that it gets them right the next time, or, preferably, use the batch replace command (Ctrl+H), but don't use "Replace All"! Use "Find Next" instead and only hit "Replace" if it looks like it should be replaced in the scan window. You get more accuracy this way in case there are instances of "Haliwunderschmidt" or "wisenheimer" in your book.
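If you'd rather do the same thing on the exported text outside FineReader, here's a minimal Python sketch of the "Find Next, then decide" idea. It has nothing to do with FineReader's own Ctrl+H dialog; the function name review_replace and the confirm callback are my own inventions. It matches whole words only and asks for a yes/no on each hit instead of replacing everything blindly.

```python
import re

def review_replace(text, wrong, right, confirm):
    """Replace whole-word occurrences of `wrong` with `right`, one at a time.

    `confirm` is called with a short snippet of surrounding text for each
    hit and must return True to replace or False to leave that hit alone.
    """
    # \b keeps "Hali" from matching inside "Haliwunderschmidt".
    pattern = re.compile(r"\b" + re.escape(wrong) + r"\b")

    def decide(match):
        start, end = match.span()
        context = text[max(0, start - 20):end + 20]
        return right if confirm(context) else match.group(0)

    return pattern.sub(decide, text)

sample = "Stuart Hali met Mr Haliwunderschmidt."
# Accept every hit here; in real use, confirm could prompt with input().
print(review_replace(sample, "Hali", "Hall", lambda context: True))
```

With an interactive confirm (say, one that prints the snippet and reads y/n), you get roughly the same workflow as clicking "Find Next" and "Replace" by hand.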

It may not be perfect, but it's definitely an improvement over version 10 where it always closed quotes with straight quotes and had various other peculiarities. Older versions were even worse! Don't get me started on "cl" or "tl" seen as "d"... FineReader has gotten much better over the years.