04-16-2021, 11:29 AM | #1 |
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
|
100 most frequently used words by language
Hi All,
After recent changes in Sigil, spell checking in CodeView is the biggest time consumer when loading a new file.
So does anyone know where I might find a list of the 100 most frequently used words in each language? I have found a list for English that contains the expected words ("the", "and", "a", "of") but could not find one for most other languages. Of course, a page may well have such a list for a language, but if I cannot read that language I would not recognize it, so ...
Hunspell spellchecking is not really optimized for speed because it must handle words that have prefixes, suffixes, or are compound in many languages.
My idea is to create an en_US_cache.txt file of the 100 most frequently used words in en_US. I would like to do that for as many of the other languages that Sigil supports as possible. With those files I would like to test how much faster we can make CodeView spellchecking by loading and using these caches.
If the speedup is significant enough to warrant the extra code of keeping these cache lists, then I can go ahead and add them to Sigil. But I want to test as many other languages as possible so that the speedup is not just for en_US.
If anyone has such a list (one word per line, utf-8 encoded) for any other language and is willing to post it, I would be grateful.
If not enough of these most frequently used word lists are available, my second idea is to cache the last 100 words spellchecked. I do not expect it to provide the same speedup factor, but it might be worth a try if the frequent word lists are not generally available for most languages.
Any thoughts, links to word lists, word lists, or feedback on the idea are greatly welcome. |
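For illustration only, here is a minimal Python sketch of the two caching ideas described in the post above. Sigil itself is written in C++, so this shows the logic rather than any real Sigil code. The names `load_word_cache`, `CachedChecker`, and `recent_size` are hypothetical, and `slow_check` is an assumed stand-in for the Hunspell lookup. Note that `functools.lru_cache` evicts the least recently used entry, which only approximates "the last 100 words spellchecked".

```python
from functools import lru_cache


def load_word_cache(path):
    """Read a one-word-per-line, UTF-8 encoded file into a set of words."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


class CachedChecker:
    """Wrap a slow spellcheck callable with two optional shortcuts.

    slow_check is a hypothetical stand-in for the real (Hunspell) lookup:
    it takes a word and returns True if the word is spelled correctly.
    """

    def __init__(self, slow_check, frequent_words=(), recent_size=100):
        # Idea 1: a fixed set of frequent words that are known to be correct.
        self._frequent = frozenset(frequent_words)
        # Idea 2: remember the results for recently checked words.
        self._recent = lru_cache(maxsize=recent_size)(slow_check)

    def check(self, word):
        if word in self._frequent:
            return True  # answered from the frequent-word list
        return self._recent(word)  # answered from the recent cache or slow_check
```

A caller would build the checker once per dictionary, for example `CachedChecker(slow_check, load_word_cache("en_US_cache.txt"))`, and call `check(word)` for every word in the file. Whether either shortcut pays off depends on how expensive the underlying lookup really is, so it needs to be measured.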
04-16-2021, 12:27 PM | #2 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Yes, these are called Stop Words:
https://en.wikipedia.org/wiki/Stop_words
They're used in things like search engines to ignore common words like "of" or "the".
Natural Language Toolkit (NLTK) should have a list of stopwords for different languages:
https://www.nltk.org/
And according to this Stack Exchange question, "NLTK available languages for stopwords", the files should be located in:
C:/Users/username/AppData/Roaming/nltk_data/corpora/stopwords
I was also poking around the NLTK site, and think I found the stopwords as a separate download:
https://www.nltk.org/nltk_data/
Download the "Stopwords Corpus" and extract the ZIP, and you get files for:
arabic, azerbaijani, danish, dutch, english, finnish, french, german, greek, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovene, spanish, swedish, tajik, turkish
If you want the top 100, you could also use (for English at least) a list of common words:
https://en.wikipedia.org/wiki/Most_c...rds_in_English
Related Side Note: There's this fascinating thing called Zipf's Law:
1st most commonly used word is used a huge % of the time.
2nd place = ~1/2 1st place.
3rd place = ~1/3 1st place.
4th place = ~1/4 1st place.
[...]
The frequency falls off steeply (roughly in proportion to 1/rank, so a power law rather than an exponential), and all languages follow the same pattern. The top 5 words are used ~25% of the time, and the top 25 words are used ~50% of the time.
For more information on that, check out VSauce's fantastic video: "The Zipf Mystery".
You can also see this in action in a few of my Reddit posts:
(Plus the link in my signature... definitely coming soon!*) Last edited by Tex2002ans; 04-16-2021 at 01:01 PM. |
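As a rough illustration of the Zipf side note, the Python sketch below computes what share of all word occurrences the top k words would cover under an idealized Zipf distribution, where frequency is exactly proportional to 1/rank. The function name `zipf_share` and the vocabulary size used in the demo are assumptions made for this example. Real text deviates from the ideal curve, and the result depends strongly on the vocabulary size chosen, so treat the numbers as a ballpark only.

```python
def zipf_share(top_k, vocab_size):
    """Fraction of word occurrences covered by the top_k most frequent words,
    assuming frequency is exactly proportional to 1/rank (idealized Zipf)."""
    if not 1 <= top_k <= vocab_size:
        raise ValueError("need 1 <= top_k <= vocab_size")
    covered = sum(1.0 / rank for rank in range(1, top_k + 1))
    total = sum(1.0 / rank for rank in range(1, vocab_size + 1))
    return covered / total


if __name__ == "__main__":
    vocab_size = 50_000  # assumed vocabulary size, not a measured value
    for k in (5, 25, 100):
        print(f"top {k:>3} words: {zipf_share(k, vocab_size):.1%} of occurrences")
```

The same calculation suggests why a short frequent-word list is attractive as a cache: under this model a small number of ranks accounts for a disproportionate share of the words that get checked.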
04-16-2021, 12:29 PM | #3 |
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
There's a stop-words Python package with UTF-8 encoded stop word lists for 23 languages that you might find helpful.
|
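A possible way to turn downloaded stop word lists into the one-word-per-line cache files asked for in the first post is sketched below in Python. It assumes a directory of plain-text, UTF-8 files with one word per line, each named after its language (as in the extracted "Stopwords Corpus" described above). The function name `build_cache_files`, the `skip` parameter, and the `<name>_cache.txt` output naming are illustrative assumptions. Mapping language names such as `english` to dictionary codes such as `en_US` is left out. Stop word lists are not necessarily sorted by frequency, so keeping the first 100 entries is not the same as keeping the 100 most frequent words.

```python
from pathlib import Path


def build_cache_files(src_dir, dst_dir, limit=100, skip=("README",)):
    """Write <name>_cache.txt (one word per line, UTF-8) for each list in src_dir.

    Returns a dict mapping each source file name to the number of words written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = {}
    for path in sorted(src.iterdir()):
        if not path.is_file() or path.name in skip:
            continue
        words, seen = [], set()
        for line in path.read_text(encoding="utf-8").splitlines():
            word = line.strip()
            if word and word not in seen:  # drop blank lines and duplicates
                seen.add(word)
                words.append(word)
        words = words[:limit]  # first `limit` entries, in file order
        out = dst / f"{path.name}_cache.txt"
        out.write_text("".join(w + "\n" for w in words), encoding="utf-8")
        written[path.name] = len(words)
    return written
```

For example, `build_cache_files("stopwords", "caches")` would produce `caches/english_cache.txt`, `caches/german_cache.txt`, and so on, one per input file.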
04-16-2021, 01:23 PM | #4 |
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Wonderful!
Thank you both. I will grab these, change Sigil to use them, and try to measure their impact on CodeView spellcheck speed. Thanks |
04-16-2021, 02:36 PM | #5 |
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Well, I tested with Tex2002ans's huge test case (all those chapters merged into one file) and compared two versions of Sigil while loading it, which includes spellchecking.
One checked the stopWords list first and the other did not. The timings to load and spellcheck that huge merged chapter were almost identical: 8.5 seconds on my laptop for both. So the overhead of checking for stop words just cancels out any speed-up in handling those words, and it is not an effective speedup. Thanks for the stop word list links. I will keep them in case there is another way they could be used. |
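The kind of comparison described above can be approximated outside Sigil with a small timing harness. The Python sketch below is only an assumption-laden stand-in: `check` represents whatever spellcheck callable is being measured, and `time_checker`, `with_stop_list`, and `hit_rate` are hypothetical helper names. Timings taken in Python say nothing about Hunspell's real cost inside Sigil, so this shows the method of measurement, not the result.

```python
import time


def with_stop_list(check, stop_words):
    """Return a checker that consults the stop word set before calling check."""
    stop = frozenset(stop_words)

    def wrapped(word):
        return word in stop or check(word)

    return wrapped


def hit_rate(words, stop_words):
    """Fraction of the given words that the stop word set would answer."""
    stop = frozenset(stop_words)
    words = list(words)
    if not words:
        return 0.0
    return sum(word in stop for word in words) / len(words)


def time_checker(check, words, repeats=3):
    """Best wall-clock time in seconds to run check over all words."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            check(word)
        best = min(best, time.perf_counter() - start)
    return best
```

Running `time_checker(check, words)` and `time_checker(with_stop_list(check, stop_words), words)` on the same word list gives the two numbers to compare, and `hit_rate` shows how many lookups the list could have short-circuited in the first place.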