12-30-2020, 12:36 AM   #5
Tex2002ans
Quote:
Originally Posted by famfam
I had thought like this: If the main text of the book is in Old German, I'll use Old German in OCR.
Yes, this is good.

Quote:
Originally Posted by famfam
If other languages and fonts are used in the text, then I generally add the other languages to the OCR list. And with that I start the recognition process for the entire book. Volume by volume. With 4 volumes you can easily get to 2000 pages.
Yes, this is good.

For example, I work on books that are "99% English", but they have many:
  • German books/names
  • occasional French words like "façade"
  • quotes/poems in original language

When you choose an OCR language in the dropdown, this enables two major things:
  • Alphabets
  • Dictionaries

Alphabets

Choosing English enables these basic characters:

A-Z + a-z

English doesn't commonly use accented characters, so if OCR runs across an 'ö', Finereader will probably treat the diaeresis as specks of dust and guess you meant an 'o'.

Choosing German enables more letters + accented characters:

ßäöü

And if you worked on a Spanish book, you'd get letters like ñ in "mañana":

áéíñóúü

French:

àâæçèéêëÿœ

(The true alphabets Finereader uses are hidden in the SPOILER.)

Spoiler:
Code:
'-.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz’ (English)
'-.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÄÖÜßäöü’ (German)
'-.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÁÉÍÑÓÚÜáéíñóúü’ (Spanish)
'-.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÂÆÇÈÉÊËÎÏÔÙÛÜàâæçèéêëîïôùûüÿŒœŸ’ (French)
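
To make the alphabet idea concrete, here is a minimal Python sketch. The alphabet strings are copied from the list above; the `ALPHABETS` table and the `outside_alphabet` helper are hypothetical names made up for illustration, not anything Finereader exposes.

```python
# Alphabets copied from the list above; the rest is an illustrative assumption.
BASE = "'-.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz’"
ALPHABETS = {
    "English": set(BASE),
    "German": set(BASE + "ÄÖÜßäöü"),
    "Spanish": set(BASE + "ÁÉÍÑÓÚÜáéíñóúü"),
    "French": set(BASE + "ÀÂÆÇÈÉÊËÎÏÔÙÛÜàâæçèéêëîïôùûüÿŒœŸ"),
}


def outside_alphabet(word, languages):
    """Return the characters of `word` that none of the enabled languages allow."""
    allowed = set().union(*(ALPHABETS[lang] for lang in languages))
    return [ch for ch in word if ch not in allowed]


print(outside_alphabet("schön", ["English"]))
print(outside_alphabet("schön", ["English", "German"]))
print(outside_alphabet("mañana", ["English", "German"]))
```

With only English enabled, the 'ö' falls outside the allowed set, which is the situation where OCR is likely to fall back to a plain 'o'.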


Dictionaries

Another way OCR becomes more accurate is by checking its guesses against words from the actual language.

Let's say you had a sentence:

Code:
The swordfish was found un der the sea.
If your book was English, OCR might look at that and say:

Hmmm, "un" + "der" aren't English words, but "under" is in the English dictionary. Most likely that little space was a little font issue or scanning artifact.

If it's 99.9% sure, it MAY combine those into "under".

When you add in the German dictionary, it will think differently.

"un" + "der" are two valid German words, so OCR will now think:

Code:
The swordfish was found <--- English
un der <--- German
the sea. <--- English
Now instead of auto-correcting, you're leaving in many OTHER types of potential errors.

The more dictionaries you add, the more errors of this type get introduced, which is why you want to use the MINIMAL AMOUNT OF LANGUAGES POSSIBLE.
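
The "un der" example can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the word lists are made up to mirror the example above (they are not real dictionary data), and `merge_split_words` is a hypothetical helper, not how Finereader actually works internally.

```python
# Toy word lists chosen to mirror the example above; real dictionaries are far larger.
ENGLISH = {"the", "swordfish", "was", "found", "under", "sea"}
GERMAN = {"un", "der"}  # assumption: both fragments are accepted, as in the example


def merge_split_words(text, dictionary):
    """Join two neighbouring fragments when neither is a known word but their join is."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            left, right = tokens[i], tokens[i + 1]
            joined = left + right
            if (left.lower() not in dictionary
                    and right.lower() not in dictionary
                    and joined.lower() in dictionary):
                out.append(joined)
                i += 2  # both fragments consumed
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)


line = "The swordfish was found un der the sea."
print(merge_split_words(line, ENGLISH))           # English only
print(merge_split_words(line, ENGLISH | GERMAN))  # English + German
```

With English alone, the two fragments are unknown and get joined into "under". Once the German list is added, both fragments count as known words, so the stray space is left in place.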

Quote:
Originally Posted by famfam
Is it really all that much more complicated than I think?
Yes.

You can read a little about this in:

"Strategies for Reducing and Correcting OCR Errors" by Martin Volk, Lenz Furrer and Rico Sennrich (Language Technology for Cultural Heritage)
https://www.researchgate.net/publica...ing_OCR_Errors

They go through a few other corrections at each stage (like patterns + merging + book-level statistics).

Quote:
Originally Posted by famfam
That it doesn't work is a weakness of FR 15, isn't it? I don't understand where the problem is getting FR 15 up to this peak.
It already does a great job of detecting which languages are in a document.

If you select the "Document Language" dropdown, you can see a selection called "Automatically select document language from the following list".

There, you can choose which common languages you run across.

For example, mine has:

Code:
English; French; German; Italian; Spanish
This helps Finereader automatically "guess" within a small subset of languages.

If my Finereader runs across a lot of umlauts, it'll go:

"Hmmm, there seem to be A LOT of errors on this page, maybe English is the wrong language, let me run this paragraph again through German."
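
A rough way to picture that "run it again through another language" step is below. This is a sketch under stated assumptions, not Finereader's actual algorithm: the tiny `DICTIONARIES` table and the `unknown_ratio` / `pick_language` helpers are hypothetical, and the candidate list stands in for the languages chosen in the dropdown.

```python
import re

# Toy word lists (hypothetical); a real engine ships full dictionaries.
DICTIONARIES = {
    "English": {"the", "door", "is", "over", "there", "and", "beautiful", "green"},
    "German": {"die", "tür", "ist", "schön", "und", "grün", "über", "der"},
}


def unknown_ratio(paragraph, language):
    """Fraction of words in the paragraph that the language's word list does not know."""
    words = re.findall(r"\w+", paragraph.lower())
    if not words:
        return 0.0
    unknown = [w for w in words if w not in DICTIONARIES[language]]
    return len(unknown) / len(words)


def pick_language(paragraph, candidates):
    """Pick the candidate language that leaves the fewest unknown words."""
    return min(candidates, key=lambda lang: unknown_ratio(paragraph, lang))


candidates = ["English", "German"]
print(pick_language("Die Tür ist schön und grün", candidates))
print(pick_language("The door is over there", candidates))
```

Keeping the candidate list short matters here for the same reason as before: every extra language is another chance for a wrong guess to look plausible.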

Quote:
Originally Posted by famfam
It should actually be possible to make a program that reads the text word for word and automatically recognizes the language and script for every word, and finds the right dictionary for every word.
No. This is an infinitely hard problem.

Which language is this word?

Code:
canal <--- English
canal <--- Spanish
canal <--- Portuguese
canal <--- Catalan
"canal" in English is a waterway that boats can pass through.

"canal" in those other 3 languages can also mean channel, as in "change the TV channel".

To guess a document's language, you need more text, like an entire phrase/sentence/paragraph/page.

Then you can begin using statistics + dictionaries.

For example:

Code:
subscribe to my channel. <--- English
suscribirse a mi canal. <--- Spanish
inscreva-se no meu canal. <--- Portuguese (Brazil)
subscriure's al meu canal. <--- Catalan
iscriviti al mio canale. <--- Italian
If you're guessing between those languages, you might look for:
  • characters that only exist in certain languages (like German ß, Spanish ñ, French ÿ).
  • words/combinations that are more common in a single language.

If that doesn't work, you start looking at short sequences of adjacent characters or words (called n-grams), but there's still a large amount of overlap between languages.
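
Here is a minimal sketch of the n-gram idea using character trigrams. The five sample sentences are the ones from the example above and serve as (far too small) "training" text; the `trigrams`, `overlap` and `rank_languages` helpers are hypothetical names, and real language detectors use much larger profiles and better scoring.

```python
from collections import Counter

# The sample sentences from the example above, used as tiny language profiles.
SAMPLES = {
    "English": "subscribe to my channel.",
    "Spanish": "suscribirse a mi canal.",
    "Portuguese": "inscreva-se no meu canal.",
    "Catalan": "subscriure's al meu canal.",
    "Italian": "iscriviti al mio canale.",
}


def trigrams(text):
    """Count every run of three adjacent characters, with padding spaces at the ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def overlap(a, b):
    """Number of trigrams two profiles share (multiset intersection)."""
    return sum((a & b).values())


def rank_languages(text):
    """Return (language, score) pairs, best match first."""
    grams = trigrams(text)
    scores = {lang: overlap(grams, profile) for lang, profile in PROFILES.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(rank_languages("change the channel"))  # a whole phrase
print(rank_languages("canal"))               # a single ambiguous word
```

The phrase lands clearly on English, while the lone word "canal" scores about the same against several of the Romance-language samples, which is the overlap problem in miniature.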

Computers are getting pretty good (see pasting into Google Translate), but when you start getting into minutiae, like Portuguese (Portugal) + Portuguese (Brazil)... things become much harder. Better for humans to give the computer hints than to leave the computer 100% guessing.

Quote:
Originally Posted by Quoth
But a book might have English dialect. Then der means there.

At the end of the day you need good proofreading skills.
Or you help OCR along, by telling it what languages you're dealing with, then the statistics + red squigglies really help.

Last edited by Tex2002ans; 12-30-2020 at 01:09 AM.