Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman

famfam · 12-27-2020, 11:04 AM

OCR for Old German, Old English, and Old French works very well. But I now have a book set in Gothic type by an author who uses many foreign languages in the text. The main text is in Old German (Fraktur), but for long stretches of the book there are pages with quotations and comparisons in Greek, Latin, French, and English, all on one page. FR 15 is a bit overwhelmed by this: the foreign languages are recognized either as garbage (Greek) or with many errors. Does anyone have an idea how to deal with these weaknesses of FR 15?

Tex2002ans · 12-28-2020, 09:36 PM

Quote:

Originally Posted by famfam

OCR for Old German, Old English, and Old French works very well. But I now have a book set in Gothic type by an author who uses many foreign languages in the text. The main text is in Old German (Fraktur), but for long stretches of the book there are pages with quotations and comparisons in Greek, Latin, French, and English, all on one page. FR 15 is a bit overwhelmed by this.

1. Under Document Language, you want to select the dropdown, then "More Languages...".

2. Choose "Specify Languages Manually", then check the checkboxes for which languages you want to detect:

For example, I use this:

Code:

English; German; French;

This allows Finereader to detect ç, or other accented characters.

Note: Don't go too overboard with languages though. Finereader uses this to look up dictionary words + add certain letters in the alphabet. The more languages you add, the more likely there will be false positives.

For example, "der" is a German word, but isn't an English word, so an English OCR error like "un der" will be considered okay (since it'll think it's German).

Quoth · 12-29-2020, 03:59 PM

But a book might have English dialect. Then der means there.

At the end of the day you need good proofreading skills.

famfam · 12-29-2020, 04:59 PM

Quote:

Originally Posted by Tex2002ans

1. Under Document Language, you want to select the dropdown, then "More Languages...".

2. Choose "Specify Languages Manually", then check the checkboxes for which languages you want to detect:

For example, I use this:

Code:

English; German; French;

This allows Finereader to detect ç, or other accented characters.

Note: Don't go too overboard with languages though. Finereader uses this to look up dictionary words + add certain letters in the alphabet. The more languages you add, the more likely there will be false positives.

For example, "der" is a German word, but isn't an English word, so an English OCR error like "un der" will be considered okay (since it'll think it's German).

This is what I had in mind: if the main text of the book is in Old German, I use Old German for OCR. If other languages and scripts are used in the text, I add those languages to the OCR list as well. Then I start the recognition process for the entire book, volume by volume. With 4 volumes you easily get to 2000 pages. That this doesn't work is a weakness of FR 15, isn't it? I don't understand what makes it so hard to get FR 15 to this level of performance. It should be possible to write a program that reads the text word by word, automatically recognizes the language and script of each word, and assigns the right dictionary. The program would then write these detection results to a list, or to one list per page. In the final OCR pass it would only need to convert the text using the list or lists written at the beginning, since they say which dictionary is responsible for which word. Is it really all that much more complicated than I think?

Tex2002ans · 12-30-2020, 12:36 AM

Quote:

Originally Posted by famfam

I had thought like this: If the main text of the book is in Old German, I'll use Old German in OCR.

Yes, this is good.

Quote:

Originally Posted by famfam

If other languages and scripts are used in the text, I add those languages to the OCR list as well. Then I start the recognition process for the entire book, volume by volume. With 4 volumes you easily get to 2000 pages.

Yes, this is good.

For example, I work on books that are "99% English", but they have many:

German books/names
Occasional French word like "façade"
quotes/poems in original language

When you choose an OCR language in the dropdown, this enables two major things:

Alphabets
Dictionaries

Alphabets

Choosing English enables these basic characters:

A-Z + a-z

English doesn't commonly use accented characters, so if OCR runs across an 'ö' with only English enabled, Finereader will probably treat the diaeresis as specks of dust and guess you meant an 'o'.

Choosing German enables more letters + accented characters:

ßäöü

And let's say you worked on a Spanish book, you'd get letters like ñ in "mañana":

áéíñóúü

French:

àâæçèéêëÿœ
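
To make the alphabet idea concrete, here is a rough Python sketch. It is not how Finereader works internally; it only assumes that an engine may output nothing but letters from the enabled alphabets, and it uses the short letter lists from this post. The names `EXTRA`, `enabled_alphabet`, and `recognize` are made up for the illustration.

```python
import unicodedata

# Basic letters plus the extra letters per language, copied from the lists above.
# These are illustrative only, not the full alphabets an OCR engine ships with.
BASIC = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
EXTRA = {
    "English": "",
    "German": "ßäöü",
    "Spanish": "áéíñóúü",
    "French": "àâæçèéêëÿœ",
}


def enabled_alphabet(languages):
    """Union of the basic letters and the extra letters of each chosen language."""
    allowed = set(BASIC)
    for lang in languages:
        allowed |= set(EXTRA[lang])
    return allowed


def recognize(text, languages):
    """Pretend-OCR: only letters from the enabled alphabets can be output.

    A letter outside the alphabet loses its accent (the "specks of dust" case)
    if its base letter is allowed; otherwise it is replaced with '?'.
    """
    allowed = enabled_alphabet(languages)
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in allowed or not ch.isalpha():
            out.append(ch)
            continue
        # NFD splits an accented letter into base letter + combining mark.
        base = unicodedata.normalize("NFD", ch)[0]
        out.append(base if base in allowed else "?")
    return "".join(out)


print(recognize("schön", ["English"]))             # umlaut dropped
print(recognize("schön", ["English", "German"]))   # umlaut kept
print(recognize("façade", ["English", "French"]))  # cedilla kept
```

With only English enabled, the accents have nowhere to go, which is why adding the language that owns the letter matters.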

Dictionaries

Another way OCR becomes more accurate is using words from the actual language.

Let's say you had a sentence:

Code:

The swordfish was found un der the sea.

If your book was English, OCR might look at that and say:

Hmmm, "un" + "der" aren't English words, but "under" is in the English dictionary. Most likely that little space was a font issue or scanning artifact.

If it's 99.9% sure, it MAY combine those into "under".

When you add in the German dictionary, it will think differently.

"un" + "der" are two valid German words, so OCR will now think:

Code:

The swordfish was found <--- English
un der <--- German
the sea. <--- English

Now instead of auto-correcting, you're leaving in many OTHER types of potential errors.

The more dictionaries you add, the more of this type gets introduced, which is why you want to use the MINIMAL AMOUNT OF LANGUAGES POSSIBLE.
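
Here is a small Python sketch of that trade-off. It is only a toy model of the "un der" example, not Finereader's real correction logic: the word lists are tiny and hypothetical (they follow the example in this post), and `repair_split_words` is a made-up helper name.

```python
# Toy word lists following the example above; real OCR dictionaries are far larger.
DICTIONARIES = {
    "English": {"the", "swordfish", "was", "found", "under", "sea"},
    "German": {"un", "der"},
}


def known_words(languages):
    """All words accepted by the enabled dictionaries."""
    words = set()
    for lang in languages:
        words |= DICTIONARIES[lang]
    return words


def repair_split_words(text, languages):
    """Join two neighbouring tokens when they are not both known words
    but their concatenation is a known word."""
    known = known_words(languages)
    tokens = text.split()
    result = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            first, second = tokens[i], tokens[i + 1]
            both_valid = first.lower() in known and second.lower() in known
            if not both_valid and (first + second).lower() in known:
                result.append(first + second)
                i += 2
                continue
        result.append(tokens[i])
        i += 1
    return " ".join(result)


line = "The swordfish was found un der the sea."
print(repair_split_words(line, ["English"]))            # "un der" gets joined
print(repair_split_words(line, ["English", "German"]))  # "un der" is left alone
```

With English alone the stray space is repaired. Once German is enabled, both halves look like valid words, so nothing flags the error.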

Quote:

Originally Posted by famfam

Is it really all that much more complicated than I think?

Yes.

You can read a little about this in:

"Strategies for Reducing and Correcting OCR Errors" by Martin Volk, Lenz Furrer and Rico Sennrich (Language Technology for Cultural Heritage)
https://www.researchgate.net/publica...ing_OCR_Errors

They go through a few other corrections at each stage (like patterns + merging + book-level statistics).

Quote:

Originally Posted by famfam

That this doesn't work is a weakness of FR 15, isn't it? I don't understand what makes it so hard to get FR 15 to this level of performance.

It already does a great job of detecting which languages are in a document.

If you select the "Document Language" dropdown, you can see a selection called "Automatically select document language from the following list".

There, you can choose which common languages you run across.

For example, mine has:

Code:

English; French; German; Italian; Spanish

This helps Finereader automatically "guess" within a small subset of languages.
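
As a very rough picture of guessing within a small subset, here is a Python sketch. It is an assumption about the general idea, not Finereader's actual algorithm: it tries each candidate language in order and moves on when too many words are unknown. The word lists, the 0.5 threshold, and the function names are all hypothetical.

```python
# Tiny hypothetical word lists; a real engine would use full dictionaries.
DICTIONARIES = {
    "English": {"the", "book", "is", "about", "old", "houses"},
    "German": {"das", "buch", "handelt", "von", "alten", "häusern"},
}


def unknown_ratio(paragraph, language):
    """Fraction of words in the paragraph that the language's dictionary does not know."""
    words = [w.strip(".,;:!?").lower() for w in paragraph.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in DICTIONARIES[language])
    return unknown / len(words)


def choose_language(paragraph, candidates, threshold=0.5):
    """Return the first candidate with few enough unknown words,
    or the candidate with the lowest unknown ratio if none passes."""
    best = candidates[0]
    best_ratio = unknown_ratio(paragraph, best)
    for lang in candidates:
        ratio = unknown_ratio(paragraph, lang)
        if ratio <= threshold:
            return lang
        if ratio < best_ratio:
            best, best_ratio = lang, ratio
    return best


print(choose_language("Das Buch handelt von alten Häusern.", ["English", "German"]))
print(choose_language("The book is about old houses.", ["English", "German"]))
```

The shorter the candidate list, the fewer wrong languages can win by accident.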

Let's say my Finereader runs across a lot of umlauts, it'll go:

"Hmmm, there seems to be A LOT of errors on this page, maybe English language is wrong, let me run this paragraph again through German."

Quote:

Originally Posted by famfam

It should actually be possible to make a program that reads the text word for word and automatically recognizes the language and script for every word, and finds the right dictionary for every word.

No. This is an infinitely hard problem.

Which language is this word?

Code:

canal <--- English
canal <--- Spanish
canal <--- Portuguese
canal <--- Catalan

"canal" in English is a waterway that boats can pass through.

"canal" in those other 3 languages is also the word for channel, as in "change the TV channel".

To guess a document's language, you need more text, like an entire phrase/sentence/paragraph/page.

Then you can begin using statistics + dictionaries.

For example:

Code:

subscribe to my channel. <--- English
suscribirse a mi canal. <--- Spanish
inscreva-se no meu canal. <--- Portuguese (Brazil)
subscriure's al meu canal. <--- Catalan
iscriviti al mio canale. <--- Italian
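
The gap between one word and a whole sentence can be sketched in a few lines of Python. This is only an illustration: the "vocabularies" are built from the five example sentences above (plus "canal" as an English word), and the helper names are made up.

```python
# The example sentences from above, used here as tiny stand-in vocabularies.
SENTENCES = {
    "English": "subscribe to my channel.",
    "Spanish": "suscribirse a mi canal.",
    "Portuguese": "inscreva-se no meu canal.",
    "Catalan": "subscriure's al meu canal.",
    "Italian": "iscriviti al mio canale.",
}


def words(text):
    """Lowercase words with surrounding punctuation stripped."""
    stripped = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in stripped if w]


VOCAB = {lang: set(words(sentence)) for lang, sentence in SENTENCES.items()}
VOCAB["English"].add("canal")  # "canal" is also an English word


def languages_for_word(word):
    """Every language whose vocabulary contains the word."""
    return sorted(lang for lang, vocab in VOCAB.items() if word.lower() in vocab)


def guess_sentence(text):
    """Pick the language whose vocabulary knows the most words of the sentence."""
    tokens = words(text)
    hits = {lang: sum(1 for t in tokens if t in vocab) for lang, vocab in VOCAB.items()}
    return max(hits, key=hits.get)


print(languages_for_word("canal"))                 # four languages, no way to choose
print(guess_sentence("suscribirse a mi canal."))   # the other words settle it
```

A single word gives four equally good answers; the surrounding words are what break the tie.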

Let's say you're guessing between those languages, you might see:

accents that only exist in a certain language (like German ß, Spanish ñ, French ÿ).
words/combinations that are more common in a single language.

If that doesn't work, you start looking at larger collections of characters or words (called n-grams), but there's still a large amount of overlap between languages.
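
A minimal Python sketch of those two steps, assuming character trigrams as the n-grams and using only the example sentences from this post as "training data" (real language identifiers train on far more text). The distinctive letters are the three mentioned above; `guess` and `scores` are hypothetical names.

```python
from collections import Counter

# Letters the post names as belonging to one language only.
DISTINCTIVE = {"ß": "German", "ñ": "Spanish", "ÿ": "French"}

# One sample sentence per language, taken from the examples above.
SAMPLES = {
    "English": "subscribe to my channel.",
    "Spanish": "suscribirse a mi canal.",
    "Portuguese": "inscreva-se no meu canal.",
    "Catalan": "subscriure's al meu canal.",
    "Italian": "iscriviti al mio canale.",
}


def trigrams(text):
    """Count overlapping 3-character windows, padded with a space on each side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def scores(text):
    """Number of trigrams the text shares with each language profile."""
    grams = trigrams(text)
    return {lang: sum((grams & profile).values()) for lang, profile in PROFILES.items()}


def guess(text):
    """Use a distinctive letter if there is one, else the best trigram overlap."""
    for ch in text.lower():
        if ch in DISTINCTIVE:
            return DISTINCTIVE[ch]
    result = scores(text)
    return max(result, key=result.get)


print(guess("mañana"))        # decided by the ñ alone
print(guess("al meu canal"))  # decided by trigram overlap
print(scores("canal"))        # a single word: several languages score the same
```

The last line shows the overlap problem: for a lone "canal", Spanish and Portuguese get identical scores, so the n-grams alone cannot separate them.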

Computers are getting pretty good (see pasting into Google Translate), but when you start getting into minutiae, like Portuguese (Portugal) + Portuguese (Brazil)... things become much harder. Better for humans to give the computer hints than to leave the computer 100% guessing.

Quote:

Originally Posted by Quoth

But a book might have English dialect. Then der means there.

At the end of the day you need good proofreading skills.

Or you help OCR along, by telling it what languages you're dealing with, then the statistics + red squigglies really help.

isaacbh · 02-23-2021, 06:18 PM

Also you can specify a language for each text zone, if you have a block of text of the same language.
