Old 12-14-2022, 12:21 PM   #1690
theducks
Well trained by Cats
Quote:
Originally Posted by The Holy
I have two ideas for plugins to identify books with incorrect metadata.

Misidentified check:
A plugin that runs a full-text search on all books in text-based formats.
It checks that the title and the author's last name each appear as an exact match inside the book.
If there are multiple authors, a last-name match for any one of them is enough, but the title must always match exactly. Matching is case-insensitive.
This will find many, if not all, misidentified books. Some false positives can be expected.

Language check:
Compare the language that is set for each book to its actual contents (text-based formats only),
and
compare the language that is set for each book to the languages used in the title and comments,
for example by looking for non-English characters and words in the title or comments when a book is set to language: English.
E.g. The, Der, Die, Das, La, Le, Il, Å, Ä, Ö, Æ, 诶, ēi, も, अ, ب. Perhaps only do the most common languages if it gets to be too complicated.
Perhaps include a setting for minimum matches per page/number of words and/or matches total per book to avoid false positives.
And perhaps only check the first 10 pages, 10 pages in the middle and the last 5 pages.
Dictionaries may be a frequent false positive.

Maybe these would be best combined into one plugin, so that it checks that the language is the same in the metadata and in the book, as well as matching the author and title.
"Misidentified check" or "Fix match" for example.
Or perhaps they could be added to a plugin like Quality Check?
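
For reference, the "Misidentified check" proposed in the quote could be sketched roughly as below. This is only an illustration under stated assumptions, not an existing plugin: the function names (`last_name`, `normalize`, `looks_misidentified`) are hypothetical, author names are assumed to be in "First Last" order, and the book's text is assumed to have been extracted to a plain string already.

```python
import re
from typing import Iterable


def last_name(author: str) -> str:
    """Return the last whitespace-separated token (assumes 'First Last' order)."""
    parts = author.strip().split()
    return parts[-1] if parts else ""


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so line breaks do not defeat a match."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def looks_misidentified(title: str, authors: Iterable[str], book_text: str) -> bool:
    """Flag a book unless the exact title AND at least one author's last name
    appear somewhere in its text (case-insensitive)."""
    body = normalize(book_text)
    title_found = normalize(title) in body
    author_found = any(
        name and normalize(name) in body
        for name in (last_name(a) for a in authors)
    )
    return not (title_found and author_found)
```

A plain substring test like this is where the expected false positives come from: a book whose title page is an image, or whose title is spelled slightly differently inside the file, would be flagged even though its metadata is right.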
It took me less than 30 seconds to come up with an example that would fail:
Quote:
Das Boot

Das Boot is a 1981 West German war film written and directed by Wolfgang Petersen, produced by Günter Rohrbach
It clearly trips your test for non-English words in the title, but the book/movie is in English (could be either).
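
To make the false positive concrete, here is a minimal sketch of a title check along the lines proposed in the quote. Everything in it is an assumption for illustration: the `MARKERS` table is a deliberately tiny, hypothetical word list built from the examples in the quote, and `foreign_hints` is a made-up helper, not part of any existing plugin.

```python
import re
from typing import Dict, Set

# Hypothetical, deliberately tiny marker lists taken from the quoted examples.
# A real check would need far larger lists, or a proper language detector.
MARKERS: Dict[str, Set[str]] = {
    "German": {"der", "die", "das"},
    "French": {"la", "le"},
    "Italian": {"il"},
}


def foreign_hints(title: str, declared: str = "English") -> Dict[str, Set[str]]:
    """Return marker words in the title that point to a language other than
    the declared one, keyed by the language they suggest."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    words = set(re.findall(r"[^\W\d_]+", title.casefold()))
    return {
        lang: words & markers
        for lang, markers in MARKERS.items()
        if lang != declared and words & markers
    }


print(foreign_hints("Das Boot"))   # {'German': {'das'}}
print(foreign_hints("Die Hard"))   # {'German': {'die'}}
```

Both titles get flagged as German even when the edition on the shelf is in English, so a title-only test cannot settle the question on its own; it would have to be combined with a look at the actual contents.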