Quote:
Originally Posted by famfam
I had thought like this: If the main text of the book is in Old German, I'll use Old German in OCR.
|
Yes, this is good.
Quote:
Originally Posted by famfam
If other languages and fonts are used in the text, then I generally add the other languages to the OCR list. And with that I start the recognition process for the entire book, volume by volume. With 4 volumes you can easily get to 2000 pages.
|
Yes, this is also good.
For example, I work on books that are "99% English", but they have many:
- German books/names
- occasional French words like "façade"
- quotes/poems in their original language
When you choose an OCR language in the dropdown, this enables two major things:
Alphabets
Choosing English enables these basic characters:
A-Z + a-z
English doesn't commonly use accented characters, so if OCR runs across an 'ö', Finereader will probably think the diaeresis is just specks of dust and guess you meant an 'o'.
Choosing German enables more letters + accented characters:
ßäöü
And let's say you worked on a Spanish book, you'd get letters like the ñ in "mañana":
áéíñóúü
French:
àâæçèéêëÿœ
(The true alphabets Finereader uses are hidden in the SPOILER.)
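To picture what that language choice does to the character set, here's a rough Python sketch of the idea (just my illustration; the character sets below are simplified samples, not Finereader's real code or alphabets): each language unlocks extra characters on top of basic A-Z, and anything outside the allowed set gets squashed down to its base letter, which is the 'ö' to 'o' effect above.
Code:
# Rough illustration of "choosing a language unlocks characters".
# These EXTRA_CHARS sets are simplified examples I made up, NOT
# Finereader's real internal alphabets.
import string
import unicodedata

EXTRA_CHARS = {
    "English": "",
    "German":  "ßäöüÄÖÜ",
    "Spanish": "áéíñóúüÁÉÍÑÓÚÜ",
    "French":  "àâæçèéêëîïôùûüÿœÀÂÆÇÈÉÊËÎÏÔÙÛÜŸŒ",
}

def allowed_alphabet(languages):
    """Characters the engine is willing to output for these languages."""
    chars = set(string.ascii_letters)
    for lang in languages:
        chars.update(EXTRA_CHARS.get(lang, ""))
    return chars

def force_into_alphabet(text, languages):
    """Any letter outside the alphabet loses its accent: the 'ö' -> 'o'
    effect described above."""
    allowed = allowed_alphabet(languages)
    out = []
    for ch in text:
        if ch in allowed or not ch.isalpha():
            out.append(ch)
        else:
            # Decompose 'ö' into 'o' + combining diaeresis, keep the base letter.
            base = unicodedata.normalize("NFD", ch)[0]
            out.append(base if base in allowed else ch)
    return "".join(out)

print(force_into_alphabet("Köln façade", ["English"]))            # Koln facade
print(force_into_alphabet("Köln façade", ["English", "German"]))  # Köln facade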
Dictionaries
Another way OCR becomes more accurate is by using words from the actual language.
Let's say you had a sentence:
Code:
The swordfish was found un der the sea.
If your book was English, OCR might look at that and say:
Hmmm, "un" + "der" isn't English words, but "under" is in the English dictionary. Most likely that little space was a little font issue or scanning artifact.
If it's 99.9% sure, it MAY combine those into "under".
When you add in the German dictionary, it will think differently.
"un" + "der" are two valid German words, so OCR will now think:
Code:
The swordfish was found <--- English
un der <--- German
the sea. <--- English
Now instead of auto-correcting, you're leaving in many OTHER types of potential errors.
The more dictionaries you add, the more of this type of error gets introduced, which is why you want to use the MINIMUM NUMBER OF LANGUAGES POSSIBLE.
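To make the trade-off concrete, here's a toy Python version of that merging heuristic (my own simplification, not Finereader's actual algorithm, and the word lists are tiny made-up samples): with only an English word list, "un der" gets glued back into "under"; add a German word list and the merge is no longer safe, so the split stays.
Code:
# Toy version of the dictionary-based merge heuristic described above.
# My own simplification, NOT Finereader's actual algorithm; the word
# lists are tiny made-up samples.

ENGLISH = {"the", "swordfish", "was", "found", "under", "sea"}
GERMAN  = {"un", "der", "und", "die", "das"}

def merge_split_words(tokens, dictionaries):
    """Join two neighbouring tokens when neither is a known word on its
    own but their concatenation is."""
    known = set().union(*dictionaries)
    out, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and tokens[i] not in known
                and tokens[i + 1] not in known
                and tokens[i] + tokens[i + 1] in known):
            out.append(tokens[i] + tokens[i + 1])  # e.g. "un" + "der" -> "under"
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "the swordfish was found un der the sea".split()

print(merge_split_words(tokens, [ENGLISH]))
# ['the', 'swordfish', 'was', 'found', 'under', 'the', 'sea']

print(merge_split_words(tokens, [ENGLISH, GERMAN]))
# ['the', 'swordfish', 'was', 'found', 'un', 'der', 'the', 'sea']
# "un" and "der" are valid German words now, so the merge never fires.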
Quote:
Originally Posted by famfam
Is it really all that much more complicated than I think?
|
Yes.
You can read a little about this in:
"Strategies for Reducing and Correcting OCR Errors" by Martin Volk, Lenz Furrer and Rico Sennrich (Language Technology for Cultural Heritage)
https://www.researchgate.net/publica...ing_OCR_Errors
They go through a few other corrections at each stage (like patterns + merging + book-level statistics).
Quote:
Originally Posted by famfam
That it doesn't work is a weakness of FR 15, isn't it? I don't understand where the problem is in getting FR 15 up to this level.
|
It already does a great job of detecting which languages are in a document.
If you select the "Document Language" dropdown, you can see a selection called "Automatically select document language from the following list".
There, you can choose which common languages you run across.
For example, mine has:
Code:
English; French; German; Italian; Spanish
This helps Finereader automatically "guess" within a small subset of languages.
If my Finereader runs across a lot of umlauts, it'll go:
"Hmmm, there seem to be A LOT of errors on this page, maybe the English setting is wrong, let me run this paragraph through German again."
Quote:
Originally Posted by famfam
It should actually be possible to make a program that reads the text word for word and automatically recognizes the language and script for every word, and finds the right dictionary for every word.
|
No. This is an infinitely hard problem.
Which language is this word?
Code:
canal <--- English
canal <--- Spanish
canal <--- Portuguese
canal <--- Catalan
"canal" in English is a waterway that boats can pass through.
"canal" in those other 3 languages means channel, as in "change the TV channel".
To guess a document's language, you need more text, like an entire phrase/sentence/paragraph/page.
Then you can begin using statistics + dictionaries.
For example:
Code:
subscribe to my channel. <--- English
suscribirse a mi canal. <--- Spanish
inscreva-se no meu canal. <--- Portuguese (Brazil)
subscriure's al meu canal. <--- Catalan
iscriviti al mio canale. <--- Italian
If you're guessing between those languages, you might look for:
- letters/accents that only exist in a certain language (like German ß, Spanish ñ, French ÿ).
- words/combinations that are more common in a single language.
If that doesn't work, you start looking at longer sequences of letters or words (called n-grams), but there's still a large amount of overlap between languages.
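Here's a minimal Python sketch of that character n-gram idea (a toy trained on a few sample sentences I made up, nothing like a real language detector): count which 3-letter sequences each language "likes", then see which profile a new sentence matches best.
Code:
# Minimal sketch of character n-gram language guessing: count which
# 3-letter sequences each language "likes", then see which profile a
# new sentence matches best.

from collections import Counter

SAMPLES = {
    "English":    "subscribe to my channel please subscribe to the channel",
    "Spanish":    "suscribirse a mi canal por favor suscribirse al canal",
    "Portuguese": "inscreva-se no meu canal por favor inscreva-se no canal",
}

def trigrams(text):
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def guess_language(sentence):
    grams = trigrams(sentence)
    def overlap(profile):
        # how many of the sentence's trigrams also appear in this profile
        return sum(count for gram, count in grams.items() if gram in profile)
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(guess_language("change the tv channel"))
# English

print(guess_language("iscriviti al mio canale"))
# Spanish (Italian isn't in the profiles, so it lands on the nearest neighbour)

print(guess_language("canal"))
# Spanish, but Spanish and Portuguese actually tie on this single word,
# so the "winner" is arbitrary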
Computers are getting pretty good (see pasting into Google Translate), but when you start getting into minutiae, like Portuguese (Portugal) + Portuguese (Brazil)... things become much harder. Better for humans to give the computer hints than to leave the computer 100% guessing.
Quote:
Originally Posted by Quoth
But a book might have English dialect. Then "der" means "there".
At the end of the day you need good proofreading skills.
|
Or you help OCR along by telling it what languages you're dealing with; then the statistics + red squigglies really help.