100 most frequently used words by language
Hi All,
After recent changes in Sigil, spell checking in CodeView is now the biggest consumer of time when loading a new file.
So does anyone know where I might find a list of the 100 most frequently used words in each language?
I have found a list for English that contains the expected words ("the", "and", "a", "of"), but I could not find one for most other languages. Of course, a page with such a list may well exist in the language itself, but since I cannot read it, I would not recognize it.
Hunspell spellchecking is really not optimized for speed, because in many languages it must handle words that carry prefixes or suffixes or that are compounds.
My idea is to create an en_US_cache.txt file containing the 100 most frequently used words in en_US.
I would like to do the same for as many of the other languages Sigil supports as possible.
With those files I would like to test how much faster CodeView spellchecking becomes when these caches are loaded and consulted first.
If the speedup is significant enough to warrant the extra code of keeping these cache lists, then I can go ahead and add them to Sigil.
But I want to test as many other languages as possible so that the speedup is not just for en_US.
If anyone has such a list (one word per line, utf-8 encoded) for any other language and is willing to post it, I would be grateful.
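To make the idea concrete, here is a rough Python sketch of what I have in mind (Sigil itself is C++, so this is only an illustration, not the actual code). The names `load_frequent_words`, `CachedChecker`, and `slow_check` are made up for this example; `slow_check` stands in for whatever call ends up asking Hunspell about a word.

```python
def load_frequent_words(path):
    """Read a cache file (one word per line, UTF-8) into a set.

    Blank lines are skipped and surrounding whitespace is stripped.
    """
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


class CachedChecker:
    """Consult the frequent-word set first, then fall back to the slow checker.

    `slow_check` is a hypothetical callable (word -> bool) standing in for
    the real Hunspell lookup.
    """

    def __init__(self, frequent_words, slow_check):
        self.frequent_words = frequent_words
        self.slow_check = slow_check
        self.slow_calls = 0  # counts fallbacks, handy when measuring the speedup

    def check(self, word):
        # Exact match only: a set lookup is much cheaper than affix analysis.
        if word in self.frequent_words:
            return True
        self.slow_calls += 1
        return self.slow_check(word)
```

Matching is exact in this sketch, so a capitalized "The" at the start of a sentence would simply fall through to the slow checker; the result stays correct, it just is not accelerated. Comparing `slow_calls` against the total number of words checked would give a quick estimate of how much work the cache saves for a given language.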
If not enough of these most frequently used word lists are available, my second idea is to cache the last 100 words spellchecked. I do not expect it to provide the same speedup factor, but it might be worth a try if the frequent word lists turn out to be unavailable for most languages.
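The second idea would look roughly like the sketch below: a small least-recently-used cache in front of the slow checker. Again this is Python for illustration only, and `RecentWordCache` and `slow_check` are hypothetical names. I am assuming here that both correct and misspelled results are worth remembering, since a misspelled word repeated in a file costs just as much to recheck.

```python
from collections import OrderedDict


class RecentWordCache:
    """Remember the results for the most recently checked words.

    `slow_check` is a hypothetical callable (word -> bool) standing in for
    the real Hunspell lookup; `capacity` defaults to the 100 words proposed.
    """

    def __init__(self, slow_check, capacity=100):
        self.slow_check = slow_check
        self.capacity = capacity
        self.results = OrderedDict()  # word -> bool, oldest entry first
        self.slow_calls = 0

    def check(self, word):
        if word in self.results:
            # Cache hit: mark the word as most recently used.
            self.results.move_to_end(word)
            return self.results[word]
        # Cache miss: ask the slow checker and remember the answer.
        self.slow_calls += 1
        ok = self.slow_check(word)
        self.results[word] = ok
        if len(self.results) > self.capacity:
            # Drop the least recently used word.
            self.results.popitem(last=False)
        return ok
```

Unlike the fixed lists, this needs no per-language data, but it starts empty on every load, so the first occurrence of each word still pays the full cost. In Python the same behavior is available from `functools.lru_cache(maxsize=100)`; the explicit class is only there to show the mechanism.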
Any thoughts, links to word lists, word lists themselves, or feedback on the idea are very welcome.