Old 12-16-2022, 03:45 AM   #1695
Fiat_Lux
Quote:
Originally Posted by The Holy
Getting the top 100 (or x amount) most common words from the languages and deleting all duplicates would make a list of the most common and unique words. Perhaps that would be a good start.
Offhand, I don't remember how useful deleting duplicate words from top X word lists is.

What you don't want to happen is what happened with the Afrikaans dictionary for OpenOffice.org. The final automated proofreading step ran the word list against the South African English dictionary and deleted every word found there. There was a list of words to be added back in --- "boer", "bakkie", other obviously Afrikaans words that English captured --- but it took almost a decade for the word "die" to migrate onto that add-back list. "Die" is Afrikaans for "the".
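A toy sketch of that de-duplication step, with made-up miniature word lists, shows how "die" falls through the cracks when the add-back list is incomplete:

```python
# Hypothetical miniature word lists; real ones would be full dictionaries.
afrikaans = {"die", "en", "boer", "bakkie", "huis", "appel"}
english = {"die", "the", "boer", "bakkie", "house", "apple"}

# The add-back list that should also have contained "die".
add_back = {"boer", "bakkie"}

# Delete everything also found in the English dictionary,
# then restore the known exceptions.
deduped = (afrikaans - english) | (add_back & afrikaans)

# "die" is gone: it is a valid English word, and it was missing
# from the add-back list for almost a decade.
```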

Quote:
I was able to quickly find a few words each for English, French, German, Spanish, Swedish, and Italian which were only found in one of their books. Meaning, a Ctrl + f search for whole words in the e-book viewer, which only returned results for one of the books.
When languages are very closely related --- Catalan, Valencian, and Spanish, for example --- the unique word list gets very big, if reliability and accuracy are to be maintained.

Quote:
Algorithms or a system that could identify any language out of the box would be interesting to test if it already exists. I do wonder, however, what the feasibility of that approach would be in terms of complexity and compute intensity.
/opt/libreoffice7.4/share/fingerprint/ contains the data that LibreOffice uses to differentiate between languages.

I've forgotten where in the LibreOffice codebase their implementation resides.

The algorithm LibreOffice uses is neither complex nor compute-intensive.
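If memory serves, what ships in that fingerprint directory feeds the classic character-n-gram "out of place" method (Cavnar & Trenkle's TextCat, via libexttextcat). A toy Python sketch of that idea, with made-up sample texts standing in for the real fingerprint files:

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Ranked list of the most frequent character n-grams (1..max_n)."""
    text = f"_{text.lower()}_"
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Cavnar-Trenkle distance: sum of rank differences; missing
    n-grams get the maximum penalty."""
    ranks = {g: r for r, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(r - ranks.get(g, max_penalty))
               for r, g in enumerate(doc_profile))

# Hypothetical training samples; real fingerprints are built from
# large corpora per language.
profiles = {lang: ngram_profile(sample) for lang, sample in {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "afrikaans": "die vinnige bruin jakkals spring oor die lui hond en die kat",
}.items()}

def detect(text):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

Counting n-grams and summing rank differences is all it is, which is why it is so cheap at run time.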

I learned to program using "If Then" and GoTo statements. (Standard library? What is that?) If the wanted algorithm wasn't in either Knuth's _The Art of Computer Programming_ or Sedgewick, brute-force a working solution --- an approach that is guaranteed to produce umpteen bugs per line of code. Once a working version exists, throw it all away and rewrite the program using procedures and functions.

Quote:
I agree we should start small before expanding to multiple languages, perhaps just English and one other. A basic plugin would be great to start testing.
Start with English/Not English, and then expand languages.
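A minimal sketch of such an English/Not-English gate, assuming a stopword-ratio test; the word set and the threshold here are guesses, not tuned values:

```python
# A handful of very common English function words (hypothetical
# starter set; a real gate would use a larger, frequency-ranked list).
ENGLISH_STOPWORDS = {
    "the", "and", "of", "to", "a", "in", "is", "it", "that", "was",
    "for", "on", "with", "as", "at", "by", "be", "this", "have", "not",
}

def looks_english(text, threshold=0.15):
    """True if enough of the tokens are common English function words."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    hits = sum(t in ENGLISH_STOPWORDS for t in tokens)
    return hits / len(tokens) >= threshold
```

Once this binary gate works, each additional language is just another stopword set and another ratio to compare.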

###

After thinking some more about it, I'd push for two plugins. One glyph/letter based, and one word based. The former for rough identification and the latter for precise identification.
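The glyph-based pass can be very crude and still useful, since it only has to narrow things down to a script before the word-based plugin takes over. A sketch, assuming Unicode character names are enough to bucket letters by alphabet:

```python
import unicodedata
from collections import Counter

def script_histogram(text):
    """Tally letters by the first word of their Unicode name,
    e.g. LATIN, CYRILLIC, GREEK. Crude, but separates alphabets."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts

def dominant_script(text):
    """The most common script in the text, or None if no letters."""
    counts = script_histogram(text)
    return counts.most_common(1)[0][0] if counts else None
```

Anything that comes back LATIN then gets handed to the word-based plugin for the precise call among the Latin-alphabet languages.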