Quote:
Originally Posted by Fiat_Lux
There are a couple of fairly standard algorithms to determine the language of a text.
The most common utilize letter frequency tables.
The least common utilize both word frequency and letter frequency tables.
|
Getting the top 100 (or however many) most common words for each language and deleting all duplicates would produce lists of words that are common within a language but unique to it. Perhaps that would be a good start.
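As a rough sketch of that idea (the word lists below are tiny placeholders, not real top-100 frequency data), keeping only the words that occur in exactly one language's list and then counting hits could look something like this:

Code:
# Keep only words that appear in exactly one language's top-N list,
# then guess by counting hits against those per-language marker words.
# The word sets here are illustrative placeholders only.
from collections import Counter

TOP_WORDS = {
    "en": {"the", "and", "of", "to", "in"},
    "fr": {"le", "la", "et", "de", "les"},
    "es": {"el", "la", "de", "y", "los"},
}

def unique_words(top_words):
    """Per language, drop every word that also occurs in another language's list."""
    counts = Counter(w for words in top_words.values() for w in words)
    return {lang: {w for w in words if counts[w] == 1}
            for lang, words in top_words.items()}

def guess_language(text, markers):
    """Return the language whose marker words match the text most often."""
    tokens = set(text.lower().split())
    return max(markers, key=lambda lang: len(tokens & markers[lang]))

markers = unique_words(TOP_WORDS)    # "la" and "de" drop out (shared fr/es)
print(guess_language("el perro y los gatos", markers))   # -> "es"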
Quote:
Depending upon how the program and database are structured, adding a new language can be as easy as dropping a new, language-specific database into a specific folder and telling the program what the language is, or as complicated as adding new fields to the database and replacing the old database with the new, updated one.
Make it two, or maybe three different plugins.
a) It isn't uncommon for official documents from either governments or NGOs to be in two or more languages.
b) Databases of word frequency tables can become very large, very quickly.
For a first-cut plug-in, I'd mandate UTF-8 glyphs and use them to break the book into writing systems, and from that, use letter frequencies for the specific language. The virtue of this approach is that it can guess the language of any document thrown at it, with an acceptable degree of inaccuracy.
Either a second plug-in, or a more advanced version, would use word frequencies, with an initial draft of English/not-English, then expand to the ten most common languages; when that is bug-free, go to the 20 most common languages, and then jump to the 50, 100, and maybe 200 most commonly spoken languages.
Third parties willing to provide letter and/or word frequency tables would enable faster expansion, and the inclusion of minority, endangered, extinct, and constructed languages (conlangs), than would otherwise be the case.
|
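To make the quoted letter-frequency idea concrete, a minimal sketch could score the text's relative letter frequencies against per-language reference profiles and pick the closest one. The profile numbers below are rough illustrative values, not real frequency tables:

Code:
# Minimal letter-frequency classifier. The reference profiles are
# made-up illustrative numbers; a real plugin would load full tables,
# after first splitting the text by writing system as suggested above.
from collections import Counter

PROFILES = {
    "en": {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "n": 0.067},
    "de": {"e": 0.174, "n": 0.098, "i": 0.076, "s": 0.073, "r": 0.070},
}

def letter_frequencies(text):
    """Relative frequency of each alphabetic character in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters) or 1
    return {c: n / total for c, n in Counter(letters).items()}

def guess_language(text, profiles=PROFILES):
    """Return the language whose profile is closest (least squared error)."""
    freqs = letter_frequencies(text)
    def distance(profile):
        keys = set(freqs) | set(profile)
        return sum((freqs.get(k, 0.0) - profile.get(k, 0.0)) ** 2 for k in keys)
    return min(profiles, key=lambda lang: distance(profiles[lang]))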
I did think a little more about this earlier today and had written up the text in the image below. I post it as an image to avoid the screen getting too crowded. In short, I was quickly able to find a few words each for English, French, German, Spanish, Swedish, and Italian that appeared in only one of their books; that is, a whole-word Ctrl + F search in the e-book viewer returned results for only one of the books.
If an algorithm or system that can identify any language out of the box already exists, it would be interesting to test. I do wonder, however, how feasible that approach would be in terms of complexity and compute requirements.
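For what it's worth, off-the-shelf detectors do exist; the langdetect package on PyPI (a port of Google's language-detection library, covering around 55 languages) is one that could be worth benchmarking. A quick usage sketch, assuming it is installed:

Code:
# pip install langdetect
from langdetect import detect, detect_langs, DetectorFactory

DetectorFactory.seed = 0   # detection is probabilistic; seed it for repeatable results

print(detect("Ceci est une phrase en français."))             # -> 'fr'
print(detect_langs("This is clearly an English sentence."))   # -> [en:0.99...]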
I agree we should start small before expanding to multiple languages, perhaps with just English and one other. A basic plugin would be a great way to start testing.