Old 12-15-2022, 07:17 PM   #1692
Fiat_Lux
Addict
Posts: 398
Karma: 6700000
Join Date: Jan 2012
Location: Gimel
Device: tablets
Quote:
Originally Posted by The Holy
I have two ideas for plugins to identify books with incorrect metadata.

Misidentified check:
A plugin that runs full text search on all books in text based formats
It matches the title and last name of the author and makes sure an exact match exists inside the book.
If multiple authors exist, one last-name match from any of them is enough, but the title must always match exactly. Case-insensitive.
This will find many if not all misidentified books. Some false positives can be expected.
Sheila Lowe's _Dead Letters_ is a potential example of a false positive. Nine months after it was released, the cover designer, author, and publishing team discovered that the name on the cover and the name of the book were different.
CF https://www.goodreads.com/book/show/...6-dead-letters
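A rough sketch of what that check could look like in Python (all names here are mine, not calibre's API; a real plugin would get the text via calibre's format-conversion machinery):

```python
# Hypothetical "misidentified check": flag a book whose text never
# mentions its own title or any of its authors' last names.
def looks_misidentified(book_text, title, authors):
    """Case-insensitive exact substring match: the title must appear,
    and at least one author's last name must appear."""
    text = book_text.lower()
    if title.lower() not in text:
        return True
    last_names = [a.split()[-1].lower() for a in authors if a.strip()]
    return not any(name in text for name in last_names)
```

As noted, a book whose printed title or byline differs from its metadata would trip this even when the metadata is right.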

Quote:
Language check:
Compare the language that is set for each book to its actual contents => only for text based formats
and
Compare the language that is set for each book to what languages are used in the title and comments.
For example by looking for non-English characters and words in title or comments when a book is set to language: English
English has never met a word that it has not claimed to be a part of the language.
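For the metadata side of the language check, a minimal heuristic (hypothetical names, and deliberately crude) is to flag any book tagged English whose title or comments contain characters outside basic Latin. Accented loanwords like "café" and "naïve" will be false positives, which is exactly the "English claims every word" problem:

```python
import re

def flag_english_metadata_mismatch(metadata_lang, title, comments=""):
    """Flag books tagged 'eng' whose title/comments contain non-ASCII
    characters. Loanwords with diacritics will be false positives."""
    if metadata_lang != "eng":
        return False
    return bool(re.search(r"[^\x00-\x7F]", f"{title} {comments}"))
```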

Quote:
E.g. The, Der, Die, Das, La, Le, Il, Å, Ä, Ö, Æ, 诶, ēi, も, अ, ب. Perhaps only do the most common languages if it gets to be too complicated.
Perhaps include a setting for minimum matches per page/number of words and/or matches total per book to avoid false positives.
And perhaps only check the first 10 pages, 10 in the middle, and the last 5.
Dictionaries may be a frequent false positive.
There are a couple of fairly standard algorithms to determine the language of a text.
The most common utilize letter frequency tables.
The least common utilize both word frequency and letter frequency tables.
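A minimal sketch of the letter-frequency approach: score the observed letter distribution against each language's reference table and pick the best fit. The tables below are truncated and purely illustrative; a real plugin would ship full tables per language.

```python
from collections import Counter
import math

# Illustrative letter-frequency tables (top letters only, rough %);
# real tables would cover the full alphabet of each language.
FREQ = {
    "eng": {"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7},
    "deu": {"e": 17.4, "n": 9.8, "i": 7.6, "s": 7.3, "r": 7.0, "a": 6.5},
}

def guess_language(text):
    """Cosine similarity between observed and reference letter
    frequencies; the highest-scoring language wins."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters) or 1
    obs = {c: 100 * n / total for c, n in counts.items()}
    best, best_score = None, -1.0
    for lang, table in FREQ.items():
        dot = sum(obs.get(c, 0.0) * f for c, f in table.items())
        norm = (math.sqrt(sum(v * v for v in obs.values()))
                * math.sqrt(sum(v * v for v in table.values())))
        score = dot / norm if norm else 0.0
        if score > best_score:
            best, best_score = lang, score
    return best
```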

Depending upon how the program and database are structured, adding a new language can be as easy as dropping a new, language-specific database into a specific folder and telling the program what the language is, or as complicated as adding new fields to the database and replacing the old database with the new, updated one.
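The "drop a file in a folder" variant can be sketched in a few lines (a hypothetical layout, not how any existing plugin works): each `<lang>.json` file in a folder holds that language's frequency table, so adding a language needs no code change.

```python
import json
from pathlib import Path

def load_language_tables(folder):
    """Load every '<lang>.json' in `folder` as a frequency table,
    keyed by the file's stem (e.g. 'eng.json' -> 'eng')."""
    return {p.stem: json.loads(p.read_text(encoding="utf-8"))
            for p in Path(folder).glob("*.json")}
```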

Quote:
Maybe these would be best combined into one plugin so that it checks the language is the same in metadata and the book as well as matching the author and title.
"Misidentified check" or "Fix match" for example.
Or perhaps be added to a plugin like quality check?
Make it two, or maybe three different plugins.
a) It isn't uncommon for official documents from either governments or NGOs to be in two or more languages.
b) Databases of word frequency tables can become very large, very quickly.

###

_Ethnologue_ claims that there are 7151 spoken languages today, with 4169 having a developed writing system, and a further 151 languages that are exclusively signed.

_Wycliffe Bible Translators_ claims that there are 7388 spoken languages, of which the Bible has been fully translated into 724, and a further 3266 languages have an ongoing translation project.

For various reasons, I put slightly greater credence in _Wycliffe Bible Translators_ data than in Ethnologue data.

For a first cut plug-in, I'd mandate UTF-8 encoding, use the glyphs to break the book into writing systems, and from there apply letter frequencies for the specific language. The virtue of this approach is that it can guess the language of any document thrown at it, with an acceptable degree of inaccuracy.
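The "break into writing systems" step could be approximated by bucketing characters by Unicode script. Using the first word of each character's Unicode name, as below, is a crude stand-in of my own devising; a proper implementation would use real script properties (e.g. the third-party `regex` module's `\p{Script=...}`).

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Crude script bucket from the Unicode character name,
    e.g. 'LATIN', 'CYRILLIC', 'HIRAGANA', 'CJK'."""
    try:
        name = unicodedata.name(ch)
    except ValueError:  # unnamed characters (controls, etc.)
        return None
    return name.split()[0]

def dominant_script(text):
    """Most common script bucket among the letters of `text`."""
    counts = Counter(s for c in text
                     if c.isalpha() and (s := script_of(c)))
    return counts.most_common(1)[0][0] if counts else None
```

Once the dominant script is known, only the letter-frequency tables for languages written in that script need to be consulted.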

Either a second plug-in, or a more advanced version, would use word frequencies, starting with an initial English/not-English draft, then expanding to the ten most common languages; once that is bug-free, go to the 20 most common, then jump to 50, 100, and maybe 200 of the most common spoken languages.
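The English/not-English first draft is essentially a common-word hit rate. A minimal sketch (the stopword list and threshold here are tiny and illustrative, not tuned values):

```python
# A real table would hold hundreds of words with frequencies;
# ten common English words are enough to show the idea.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "a", "in",
                     "is", "that", "it", "was"}

def is_probably_english(text, threshold=0.2):
    """True if at least `threshold` of the words are common
    English words (after stripping punctuation, lowercased)."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold
```

This is also where the database-size warning in b) bites: word-frequency tables grow much faster than letter-frequency tables as languages are added.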

Third parties willing to provide letter and/or word frequency tables would enable faster expansion to minority, endangered, and extinct languages, and to conlangs, than would otherwise be the case.