Old 12-27-2020, 11:04 AM   #1
famfam
Connoisseur
Posts: 77
Karma: 2178856
Join Date: Oct 2013
Device: Kobo Clara HD
Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman

The OCR for Old German, Old English, and Old French works very well. But my current example is a book set in Gothic type by an author who uses many foreign languages in the text. The main text is in Old German (Fraktur). However, many pages carry quotations and comparisons in Greek, Latin, French, and English, all on the same page, and this continues over long stretches of the book. FR 15 is a bit overwhelmed by this. The foreign-language passages are recognized either as pure garbage (Greek) or with many errors. Does anyone have an idea how to work around these weaknesses of FR 15?
Old 12-28-2020, 09:36 PM   #2
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by famfam View Post
The OCR for Old German, Old English, and Old French works very well. But my current example is a book set in Gothic type by an author who uses many foreign languages in the text. The main text is in Old German (Fraktur). However, many pages carry quotations and comparisons in Greek, Latin, French, and English, all on the same page, and this continues over long stretches of the book. FR 15 is a bit overwhelmed by this.
1. Under Document Language, you want to select the dropdown, then "More Languages...".

2. Choose "Specify Languages Manually", then check the checkboxes for which languages you want to detect:

For example, I use this:

Code:
English; German; French;
This allows Finereader to detect ç, or other accented characters.

Note: Don't go too overboard with languages though. Finereader uses this to look up dictionary words + add certain letters in the alphabet. The more languages you add, the more likely there will be false positives.

For example, "der" is a German word, but isn't an English word, so an English OCR error like "un der" will be considered okay (since it'll think it's German).
Old 12-29-2020, 03:59 PM   #3
Quoth
the rook, bossing Never.
Posts: 11,045
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
But a book might contain English dialect, in which case "der" means "there".

At the end of the day you need good proofreading skills.
Old 12-29-2020, 04:59 PM   #4
famfam
Connoisseur
Posts: 77
Karma: 2178856
Join Date: Oct 2013
Device: Kobo Clara HD
Quote:
Originally Posted by Tex2002ans View Post
1. Under Document Language, you want to select the dropdown, then "More Languages...".

2. Choose "Specify Languages Manually", then check the checkboxes for which languages you want to detect:

For example, I use this:

Code:
English; German; French;
This allows Finereader to detect ç, or other accented characters.

Note: Don't go too overboard with languages though. Finereader uses this to look up dictionary words + add certain letters in the alphabet. The more languages you add, the more likely there will be false positives.

For example, "der" is a German word, but isn't an English word, so an English OCR error like "un der" will be considered okay (since it'll think it's German).
Here is what I had in mind: if the main text of the book is in Old German, I use Old German for OCR. If further languages and scripts appear in the text, I add those languages to the OCR language list as well. With that setup I start the recognition process for the entire book, volume by volume. With 4 volumes you easily reach 2000 pages. That this doesn't work is surely a weakness of FR 15, isn't it? I don't understand what makes it so hard to get FR 15 to perform at this level. It should be possible to write a program that reads the text word by word, automatically recognizes the language and script of each word, and assigns the right dictionary to it. The program would then write its detection results to a list, or to one such list per page. On the final OCR pass, the program would only need to convert the text using the list or lists written at the start, since the lists record which dictionary is responsible for which word. Is all of this really so much more complicated than I imagine?
Old 12-30-2020, 12:36 AM   #5
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by famfam View Post
I had thought like this: If the main text of the book is in Old German, I'll use Old German in OCR.
Yes, this is good.

Quote:
Originally Posted by famfam View Post
If other languages and fonts are used in the text, then I generally add the other languages to the OCR list. And with that I start the recognition process for the entire book. Volume by volume. With 4 volumes you can easily get to 2000 pages.
Yes, this is good.

For example, I work on books that are "99% English", but they have many:
  • German books/names
  • Occasional French word like "façade"
  • quotes/poems in original language

When you choose an OCR language in the dropdown, this enables two major things:
  • Alphabets
  • Dictionaries

Alphabets

Choosing English enables these basic characters:

A-Z + a-z

English doesn't commonly use accented characters, so if OCR runs across an 'ö', Finereader will probably think the diaeresis is specks of dust. It will guess you meant an 'o'.

Choosing German enables more letters + accented characters:

ßäöü

And let's say you worked on a Spanish book, you'd get letters like ñ in "mañana":

áéíñóúü

French:

àâæçèéêëÿœ

(The true alphabets Finereader uses are hidden in the SPOILER.)

Spoiler:
Code:
'-.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz’ (English)
'-.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÄÖÜßäöü’ (German)
'-.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÁÉÍÑÓÚÜáéíñóúü’ (Spanish)
'-.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÂÆÇÈÉÊËÎÏÔÙÛÜàâæçèéêëîïôùûüÿŒœŸ’ (French)
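
To make the alphabet idea concrete, here is a minimal Python sketch. The extra letters per language are copied from the spoiler list; the helper names `allowed_chars` and `unrecognized` are made up for illustration, and this is only a toy model, not how Finereader works internally.

```python
import string

# Shared base set from the spoiler list: A-Z, a-z, and ' - . ’
BASE = set(string.ascii_letters) | set("'-.’")

# Extra letters each language adds (copied from the spoiler list).
EXTRA = {
    "English": "",
    "German": "ÄÖÜßäöü",
    "Spanish": "ÁÉÍÑÓÚÜáéíñóúü",
    "French": "ÀÂÆÇÈÉÊËÎÏÔÙÛÜàâæçèéêëîïôùûüÿŒœŸ",
}

def allowed_chars(languages):
    """Union of the alphabets of all selected languages."""
    chars = set(BASE)
    for lang in languages:
        chars |= set(EXTRA[lang])
    return chars

def unrecognized(text, languages):
    """Characters in text that no selected alphabet covers (whitespace ignored)."""
    allowed = allowed_chars(languages)
    return sorted({c for c in text if not c.isspace() and c not in allowed})

print(unrecognized("schön façade", ["English"]))                      # ['ç', 'ö']
print(unrecognized("schön façade", ["English", "German"]))            # ['ç']
print(unrecognized("schön façade", ["English", "German", "French"]))  # []
```

Each added language widens the set of characters the engine is allowed to output, which is why the ö and ç only survive once German and French are switched on.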


Dictionaries

Another way OCR becomes more accurate is using words from the actual language.

Let's say you had a sentence:

Code:
The swordfish was found un der the sea.
If your book was English, OCR might look at that and say:

Hmmm, "un" + "der" aren't English words, but "under" is in the English dictionary. Most likely that little space was a little font issue or scanning artifact.

If it's 99.9% sure, it MAY combine those into "under".

When you add in the German dictionary, it will think differently.

"un" + "der" are two valid German words, so OCR will now think:

Code:
The swordfish was found <--- English
un der <--- German
the sea. <--- English
Now instead of auto-correcting, you're leaving in many OTHER types of potential errors.

The more dictionaries you add, the more errors of this type get introduced, which is why you want to use the MINIMAL AMOUNT OF LANGUAGES POSSIBLE.
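
The "un der" example can be sketched in a few lines of Python. Everything here is an assumption for illustration: the tiny word lists, the helper names `known` and `merge_split_words`, and the merge rule itself (join two neighbours only when neither is a known word but the joined form is). "un" and "der" are listed under German simply because the example treats them that way. Real OCR engines use far larger dictionaries and confidence scores.

```python
# Toy word lists; real OCR dictionaries are far larger.
DICTIONARIES = {
    "English": {"the", "swordfish", "was", "found", "under", "sea"},
    "German": {"un", "der"},
}

def known(word, languages):
    """True if the word (punctuation stripped) is in any enabled dictionary."""
    bare = word.strip(".,;:!?").lower()
    return any(bare in DICTIONARIES[lang] for lang in languages)

def merge_split_words(text, languages):
    """Join two neighbouring tokens when neither is a known word
    but their concatenation is."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            if (not known(a, languages) and not known(b, languages)
                    and known(a + b, languages)):
                out.append(a + b)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)

line = "The swordfish was found un der the sea."
print(merge_split_words(line, ["English"]))
# The swordfish was found under the sea.
print(merge_split_words(line, ["English", "German"]))
# The swordfish was found un der the sea.
```

With only English enabled the split is repaired; once the German list is added, both fragments count as valid words and the error stays in.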

Quote:
Originally Posted by famfam View Post
Is it really all that much more complicated than I think?
Yes.

You can read a little about this in:

"Strategies for Reducing and Correcting OCR Errors" by Martin Volk, Lenz Furrer and Rico Sennrich (Language Technology for Cultural Heritage)
https://www.researchgate.net/publica...ing_OCR_Errors

They go through a few other corrections at each stage (like patterns + merging + book-level statistics).

Quote:
Originally Posted by famfam View Post
That it doesn't work is a weakness of FR 15, isn't it? I don't understand where the problem is getting FR 15 up to this peak.
It already does a great job of detecting which languages are in a document.

If you select the "Document Language" dropdown, you can see a selection called "Automatically select document language from the following list".

There, you can choose which common languages you run across.

For example, mine has:

Code:
English; French; German; Italian; Spanish
This helps Finereader automatically "guess" within a small subset of languages.

Let's say my Finereader runs across a lot of umlauts, it'll go:

"Hmmm, there seem to be A LOT of errors on this page, maybe English language is wrong, let me run this paragraph again through German."
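
A rough Python sketch of that "too many errors, try another language" loop follows. It is only a guess at the general idea, not Finereader's actual logic: the word lists are tiny toy samples, the 0.5 threshold is arbitrary, and the helper names `miss_rate` and `pick_language` are invented. It also counts dictionary misses rather than umlauts, which is a simplification.

```python
# Toy dictionaries for two candidate languages.
CANDIDATES = {
    "English": {"the", "book", "is", "on", "table", "and", "a"},
    "German": {"das", "buch", "liegt", "auf", "dem", "tisch", "und", "ein"},
}

def miss_rate(paragraph, words):
    """Fraction of tokens that are not in the given word list."""
    tokens = [t.strip(".,;:!?").lower() for t in paragraph.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t not in words for t in tokens) / len(tokens)

def pick_language(paragraph, default="English", threshold=0.5):
    """Keep the default language unless its miss rate is high; then retry
    with every candidate and keep the one with the fewest misses."""
    if miss_rate(paragraph, CANDIDATES[default]) <= threshold:
        return default
    return min(CANDIDATES, key=lambda lang: miss_rate(paragraph, CANDIDATES[lang]))

print(pick_language("The book is on the table."))      # English
print(pick_language("Das Buch liegt auf dem Tisch."))  # German
```

Keeping the candidate list short matters here too: every extra language is one more chance for a wrong language to score best on a short or damaged paragraph.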

Quote:
Originally Posted by famfam View Post
It should actually be possible to make a program that reads the text word for word and automatically recognizes the language and script for every word, and finds the right dictionary for every word.
No. This is an infinitely hard problem.

Which language is this word?

Code:
canal <--- English
canal <--- Spanish
canal <--- Portuguese
canal <--- Catalan
"canal" in English is a waterway that boats can pass through.

"canal" in those other 3 languages can also mean "channel", as in "change the TV channel".

To guess a document's language, you need more text, like an entire phrase/sentence/paragraph/page.

Then you can begin using statistics + dictionaries.

For example:

Code:
subscribe to my channel. <--- English
suscribirse a mi canal. <--- Spanish
inscreva-se no meu canal. <--- Portuguese (Brazil)
subscriure's al meu canal. <--- Catalan
iscriviti al mio canale. <--- Italian
Let's say you're guessing between those languages, you might see:
  • accents that only exist in a certain language (like German ß, Spanish ñ, French ÿ).
  • words/combinations that are more common in a single language.

If that doesn't work, you start looking at sequences of adjacent characters or words (called n-grams), but there's still a large amount of overlap between languages.
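
Here is a small Python sketch of those two steps: first look for a distinctive character, then fall back to comparing character trigrams. The "profiles" are built from nothing but the five example sentences in this post, so this is a toy; real language identifiers train on large corpora. The names `trigrams`, `overlap`, and `guess_language` are invented for the example.

```python
from collections import Counter

# One sample sentence per language, taken from the example above.
SAMPLES = {
    "English": "subscribe to my channel.",
    "Spanish": "suscribirse a mi canal.",
    "Portuguese": "inscreva-se no meu canal.",
    "Catalan": "subscriure's al meu canal.",
    "Italian": "iscriviti al mio canale.",
}

# Characters treated as a giveaway for one language (from the bullet list).
DISTINCTIVE = {"ß": "German", "ñ": "Spanish", "ÿ": "French"}

def trigrams(text):
    """Count every run of three adjacent characters, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def overlap(text, lang):
    """Number of trigrams the text shares with a language profile."""
    return sum((trigrams(text) & PROFILES[lang]).values())

def guess_language(text):
    for char, lang in DISTINCTIVE.items():
        if char in text:
            return lang
    return max(PROFILES, key=lambda lang: overlap(text, lang))

print(guess_language("mañana"))                    # Spanish
print(guess_language("iscriviti al mio canale."))  # Italian
# A single word gives the same score for more than one language:
print(overlap("canal", "Spanish"), overlap("canal", "Portuguese"))  # 4 4
```

The last line shows the overlap problem directly: with only the word "canal" to go on, Spanish and Portuguese are indistinguishable.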

Computers are getting pretty good (see pasting into Google Translate), but when you start getting into minutiae, like Portuguese (Portugal) + Portuguese (Brazil)... things become much harder. Better for humans to give the computer hints than to leave the computer 100% guessing.

Quote:
Originally Posted by Quoth View Post
But a book might have English dialect. Then der means there.

At the end of the day you need good proofreading skills.
Or you help OCR along, by telling it what languages you're dealing with, then the statistics + red squigglies really help.

Last edited by Tex2002ans; 12-30-2020 at 01:09 AM.
Old 02-23-2021, 06:18 PM   #6
isaacbh
Connoisseur
Posts: 57
Karma: 98196
Join Date: Mar 2015
Location: Israel
Device: Kobo Aura H20
You can also specify a language for each text zone, which helps when a whole block of text is in a single language.
All times are GMT -4.