#1 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2011
Device: none
|
FineReader 11 word recognition problems
I am currently scanning a document in German with a large number of English names and running into a huge problem.
Every time a name appears that's very similar to a German word, FineReader replaces that name without ever asking, and I can't find an option to switch this extremely annoying behavior off. For example, it persistently renames Damon to Dämon, Elmer to Eimer and Hal to Hai, although the scanned document is clean and high quality and nothing whatsoever suggests that these characters are in any way ambiguous. None of these names gets recognized correctly even once! So, any help, or am I screwed? If the OCR is this unreliable, I spend more time proofreading the result than I would just typing it all in myself, and that completely defeats the purpose of OCR. |
#2 | |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
tl;dr Stop whining. Use Ctrl+H.
Hour-long elaborate reply: Quote:
You couldn't type 30 pages without making a mistake somewhere. You're bound to skip an accent or an italic or two. Do this before bedtime or after a hard day's work and the accuracy will drop even lower. Manually typing it in is insane! This isn't 1976, you know... It takes a lot of mental effort to get everything right. It's much faster and more accurate to focus on the little tidbits that an automated process may have interpreted wrong (not missed, like how some of the initial Gutenberg e-books were missing entire lines because they were manually typed in).

Anyway. It's true that FineReader has a lot of idiosyncrasies, especially with multi-language documents. Probably because of its dictionary-based approach. When a group of characters is recognized, it's cross-referenced against a built-in dictionary specific to a language, for better accuracy. And some languages are very similar. You see, most languages have a substrate, a "root" if you will... In Europe, a lot of them have a Latin substrate (the Roman Empire conquered territory like mad left and right), and some are sprinkled with Slavic, Germanic, etc. (English, btw, has its roots in the language of Germanic settlers known as "Anglo-Saxons"). And the more similar the languages used in the scanned material, the higher the chance that something will get recognized in a different form... Sometimes without an accent, sometimes as a completely different word.

Here's an example from personal experience using the default Romanian dictionary with a multi-language book:

English: "Stuart Hall" is recognized as "Stuart Hali" throughout the book
French: "Xavier Molénat" is recognized as "Xavier Molenat" throughout the book

This usually happens with names. Mainly because there are so, SO many names on this earth that it would be either very difficult or maybe even impossible to add them all to each dictionary for each language. Nor should you, because some of them will look very similar, and in multi-language documents they may just cancel each other out. So all that effort would be for nothing.

The solution? Either add them to their specific dictionary so that it gets them right the next time, or, preferably, use the batch replace command (Ctrl+H), but don't use "Replace All"! Use "Find Next" instead and only hit "Replace" if it looks like it should be replaced in the scan window. That's more accurate in case there are instances of "Haliwunderschmidt" (or "wisenheimer" in your case).

It may not be perfect, but it's definitely an improvement over version 10, where it always closed quotes with straight quotes and had various other peculiarities. Older versions were even worse! Don't get me started on "cl" or "tl" seen as "d"... FineReader has gotten much better over the years.

Last edited by DSpider; 01-04-2012 at 07:45 AM. |
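For anyone who prefers to do this on the exported text rather than inside FineReader, the "Find Next, then Replace only if it looks right" routine can be roughly sketched in Python. This is only a sketch under assumptions: it works on plain text you have already exported, it cannot show you the scan window, and the names `find_candidates` and `apply_selected` are made up for illustration. The whole-word match is what keeps something like "Haliwunderschmidt" from being touched.

```python
import re

def find_candidates(text, suspect, context=30):
    """Yield (start, end, snippet) for each whole-word occurrence of `suspect`.

    Hypothetical helper: the snippet is just surrounding text for a human
    to review, the equivalent of stepping through with "Find Next".
    """
    pattern = re.compile(r"\b" + re.escape(suspect) + r"\b")
    for m in pattern.finditer(text):
        lo = max(0, m.start() - context)
        hi = min(len(text), m.end() + context)
        yield m.start(), m.end(), text[lo:hi].replace("\n", " ")

def apply_selected(text, spans, replacement):
    """Replace only the approved (start, end) spans.

    Works from right to left so that earlier offsets stay valid.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text

sample = "Stuart Hali met Herr Haliwunderschmidt. Hali spoke first."
hits = list(find_candidates(sample, "Hali"))
# In real use you would look at each snippet and approve only some of them;
# here every whole-word hit is approved.
approved = [(start, end) for start, end, _ in hits]
fixed = apply_selected(sample, approved, "Hall")
```

The approval step is the whole point: the code only collects candidates, and a person still decides which ones are really wrong.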
#3 | |
Member
Posts: 10
Karma: 10
Join Date: Dec 2011
Device: none
|
Quote:
But the inverse is not true. If a word gets scanned that does not contain any accent or umlaut, but the dictionary contains a word that is identical except for one such character, it gets mercilessly replaced without any chance of opting out - even if the scanned page is in perfect condition. I tried the dictionary approach, btw. Didn't help.

And now good luck finding all such words! Most of the time it's names, so just proofreading the scanned document wouldn't help, because in many, many cases you can't easily tell if the name is wrong. This is something that would be fine as an option, but having such a feature on by default with no decent option to disable it is - pardon me - just a sign of fundamentally broken software. There is no means to get around it and no error threshold below which this nonsense isn't done.

Yes, FineReader 11 is a lot better than 10, but it still has major problems distinguishing 'i' from 'l' (why? Is the gap below the dot that hard to detect?) or 'm' from 'rn'. These two, along with the stupid umlaut problem, are my main source of frustration with it, mainly because they are so easy to overlook when proofreading. If you want to make sure, you have to proofread at least twice, preferably with different people. Effectively these issues make up 90% of all the proofreading time because they happen far more often than any other misrecognition.

Last edited by Karl Murks; 01-04-2012 at 11:04 AM. |
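One way to at least narrow down where to look, sketched here in Python on exported plain text: take a list of names you expect in the book and flag every word in the OCR output that is exactly one of these confusions away from such a name. Everything here is an assumption for illustration (the `CONFUSIONS` table, the function names, and the idea that you have a name list at all); it is not a FineReader feature, and it only produces a list to review.

```python
import re

# (what the OCR output shows, what was actually printed).
# Assumed from the confusions mentioned in this thread.
CONFUSIONS = [("ä", "a"), ("i", "l"), ("rn", "m"), ("m", "rn")]

def variants(word):
    """All strings obtained by undoing exactly one confusion at one position."""
    out = set()
    for seen, printed in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            out.add(word[:start] + printed + word[start + len(seen):])
            start = word.find(seen, start + 1)
    return out

def flag_suspects(text, expected_names):
    """Map each OCR word to the expected names it is one confusion away from."""
    names = set(expected_names)
    suspects = {}
    for word in set(re.findall(r"\w+", text)):
        if word in names:
            continue  # already matches an expected name exactly
        hit = variants(word) & names
        if hit:
            suspects[word] = sorted(hit)
    return suspects

ocr_text = "Der Dämon sah Eimer und Hai an. Damon blieb."
suspects = flag_suspects(ocr_text, ["Damon", "Elmer", "Hal"])
```

A flagged word is not necessarily wrong: a genuine "Dämon" is flagged just like a misread "Damon", so each occurrence still has to be checked against the scan.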
#4 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
If the only issue with German is that the i and l get confused sometimes, and only for stuff not in the dictionary (i.e. names, English names in particular) which you can easily batch replace, then it means you have it EASY! For Romanian, FineReader has a big, BIG problem with the capital letter Î, which is always recognized as lowercase, no matter the circumstance. And it's like twice its size! î vs Î !

And I always batch replace: ". î", "! î", "? î", ": î", "; î" ...with their appropriate uppercase, depending on the situation. For ":" and ";" I can't just hit "Replace All"; sometimes I have to go through all instances with "Find Next".

Unfortunately I can't batch replace anything that starts with the letter î at the beginning of a paragraph! FineReader can only search for Optional Hyphen and Line Break, not paragraph ending marks, so I can't search for "¶ î" or something like "^p î" to make my life easier...

Then there's the issue with words ending in "-1" instead of "-l" (a short form, kinda like the English it's, don't, grandson's, etc.) ...

Yes, it takes a few extra minutes to sort it all out, but it's NOTHING like typing the whole book from scratch. You should appreciate it more, this FineReader. It's currently the best available, hands down. But hey, if you're really confident that you can do a better job typing it in, be my guest! |
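Outside FineReader, the paragraph-start case is easy to handle once the text is exported. A minimal Python sketch, assuming plain text where every paragraph starts on its own line (the function name is made up, and abbreviations ending in a period would be capitalized wrongly, so the result still needs a look). The ":" and ";" cases are deliberately left out because they need a case-by-case decision.

```python
import re

# Lowercase î directly after ., ! or ? plus whitespace.
AFTER_SENTENCE_END = re.compile(r"([.!?]\s+)î")
# Lowercase î as the first character of a line (assumed paragraph start).
AT_PARAGRAPH_START = re.compile(r"^î", re.MULTILINE)

def capitalize_i_circumflex(text):
    """Uppercase î after sentence-ending punctuation and at line starts."""
    text = AFTER_SENTENCE_END.sub(lambda m: m.group(1) + "Î", text)
    return AT_PARAGRAPH_START.sub("Î", text)

sample = "în casă. în oraș\nîn sat: în fine"
result = capitalize_i_circumflex(sample)
```

In the sample, the î after the colon stays lowercase on purpose.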
#5 | |
Member
Posts: 10
Karma: 10
Join Date: Dec 2011
Device: none
|
Quote:
But I'm currently handling a text that contains both the words 'Dämon' and 'Damon' more than 1500 times, and both appear frequently in similar contexts. This is going to be a real nightmare to sort out, so please forgive me for being pissed at FineReader right now. There's no way to automate this task. It has to be done all by hand, and it wouldn't be necessary if that damn software weren't so persistent about screwing up the result. |