Old 12-16-2022, 03:45 AM   #1695
Fiat_Lux
Quote:
Originally Posted by The Holy
Getting the top 100 (or x amount) most common words from the languages and deleting all duplicates would make a list of the most common and unique words. Perhaps that would be a good start.
Offhand, I don't remember how useful deleting duplicate words from top X word lists is.

What you don't want to happen is what happened with the Afrikaans dictionary for OpenOffice.org. The final automated proofreading step ran the word list against the South African English dictionary and deleted every word found there. There was a list of words to be added back in --- "boer", "bakkie", other obviously Afrikaans words that English captured --- but it took almost a decade for the word "die" to migrate onto that add-back list. "Die" is Afrikaans for "the".
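A toy sketch of that de-duplication step, with made-up miniature word lists, shows how "die" falls through the cracks when the add-back list is incomplete:

```python
# Hypothetical miniature word lists; real ones would be full dictionaries.
afrikaans = {"die", "en", "boer", "bakkie", "huis", "appel"}
english = {"die", "the", "boer", "bakkie", "house", "apple"}

# The add-back list that should also have contained "die".
add_back = {"boer", "bakkie"}

# Delete everything also found in the English dictionary,
# then restore the known exceptions.
deduped = (afrikaans - english) | (add_back & afrikaans)

# "die" is gone: it is a valid English word, and it was missing
# from the add-back list for almost a decade.
```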

Quote:
I was able to quickly find a few words each for English, French, German, Spanish, Swedish, and Italian which were only found in one of their books. Meaning, a Ctrl + f search for whole words in the e-book viewer, which only returned results for one of the books.
When languages are very closely related --- Catalan, Valencian, and Spanish, for example --- the unique word list gets very big, if reliability and accuracy are to be maintained.

Quote:
Algorithms or a system that could identify any language out of the box would be interesting to test if it already exists. I do wonder, however, what the feasibility of that approach would be in terms of complexity and compute intensity.
/opt/libreoffice7.4/share/fingerprint/ contains the data that LibreOffice uses to differentiate between languages.

I've forgotten where in the LibreOffice codebase their implementation resides.

The algorithm LibreOffice uses is neither complex nor compute-intensive.
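If memory serves, what ships in that fingerprint directory feeds the classic character-n-gram "out of place" method (Cavnar & Trenkle's TextCat, via libexttextcat). A toy Python sketch of that idea, with made-up sample texts standing in for the real fingerprint files:

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Ranked list of the most frequent character n-grams (1..max_n)."""
    text = f"_{text.lower()}_"
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Cavnar-Trenkle distance: sum of rank differences; missing
    n-grams get the maximum penalty."""
    ranks = {g: r for r, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(r - ranks.get(g, max_penalty))
               for r, g in enumerate(doc_profile))

# Hypothetical training samples; real fingerprints are built from
# large corpora per language.
profiles = {lang: ngram_profile(sample) for lang, sample in {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "afrikaans": "die vinnige bruin jakkals spring oor die lui hond en die kat",
}.items()}

def detect(text):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

Counting n-grams and summing rank differences is all it is, which is why it is so cheap at run time.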

I learned to program using "If Then" and GoTo statements. (Standard library? What is that?) If the wanted algorithm wasn't in either Knuth's _The Art of Computer Programming_ or Sedgewick, brute-force a working solution --- an approach that is guaranteed to produce umpteen bugs per line of code. Once a working version exists, throw it all away and rewrite the program using procedures and functions.

Quote:
I agree we should start small before expanding to multiple languages, perhaps just English and one other. A basic plugin would be great to start testing.
Start with English/Not English, and then expand languages.
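A minimal sketch of such an English/Not-English gate, assuming a stopword-ratio test; the word set and the threshold here are guesses, not tuned values:

```python
# A handful of very common English function words (hypothetical
# starter set; a real gate would use a larger, frequency-ranked list).
ENGLISH_STOPWORDS = {
    "the", "and", "of", "to", "a", "in", "is", "it", "that", "was",
    "for", "on", "with", "as", "at", "by", "be", "this", "have", "not",
}

def looks_english(text, threshold=0.15):
    """True if enough of the tokens are common English function words."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    hits = sum(t in ENGLISH_STOPWORDS for t in tokens)
    return hits / len(tokens) >= threshold
```

Once this binary gate works, each additional language is just another stopword set and another ratio to compare.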

###

After thinking some more about it, I'd push for two plugins. One glyph/letter based, and one word based. The former for rough identification and the latter for precise identification.
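The glyph-based pass can be very crude and still useful, since it only has to narrow things down to a script before the word-based plugin takes over. A sketch, assuming Unicode character names are enough to bucket letters by alphabet:

```python
import unicodedata
from collections import Counter

def script_histogram(text):
    """Tally letters by the first word of their Unicode name,
    e.g. LATIN, CYRILLIC, GREEK. Crude, but separates alphabets."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts

def dominant_script(text):
    """The most common script in the text, or None if no letters."""
    counts = script_histogram(text)
    return counts.most_common(1)[0][0] if counts else None
```

Anything that comes back LATIN then gets handed to the word-based plugin for the precise call among the Latin-alphabet languages.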