MobileRead Forums - View Single Post - 100 most frequently used words by language

Tex2002ans · 04-16-2021, 12:27 PM

Yes, these are called Stop Words:

https://en.wikipedia.org/wiki/Stop_words

They're used in things like search engines to ignore common words like "of" or "the".
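As a rough sketch of what "ignoring" those words looks like in practice, here is a tiny filter. The `STOP_WORDS` set and the `strip_stop_words` name are made up for illustration; a real search engine would use a much longer list and proper tokenization.

```python
# Tiny hand-picked stop list, purely illustrative (not any official list).
STOP_WORDS = {"of", "the", "a", "an", "and", "in", "to"}


def strip_stop_words(query):
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


print(strip_stop_words("The history of the printing press"))
# ['history', 'printing', 'press']
```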

The Natural Language Toolkit (NLTK) should have a list of stopwords for different languages:

https://www.nltk.org/

And according to this Stack Exchange question:

"NLTK available languages for stopwords"

the files should be located in:

C:/Users/username/AppData/Roaming/nltk_data/corpora/stopwords

I was also poking around the NLTK site, and I think I found the stopwords as a separate download:

https://www.nltk.org/nltk_data/

Download the "Stopwords Corpus", extract the ZIP, and you get files for:

arabic
azerbaijani
danish
dutch
english
finnish
french
german
greek
hungarian
indonesian
italian
kazakh
nepali
norwegian
portuguese
romanian
russian
slovene
spanish
swedish
tajik
turkish
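If you go the extract-the-ZIP route, something like the following should read those files with nothing but the standard library. This is a sketch that assumes each language file is plain UTF-8 text with one word per line, and that anything named README in the folder is not a language. The function names and the folder argument are my own placeholders.

```python
from pathlib import Path


def available_languages(stopwords_dir):
    """List the language files in an extracted stopwords folder."""
    return sorted(
        p.name
        for p in Path(stopwords_dir).iterdir()
        if p.is_file() and p.name.lower() != "readme"  # assumption: skip README
    )


def load_stopwords(stopwords_dir, language):
    """Read one language file into a set, assuming one word per line."""
    text = (Path(stopwords_dir) / language).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}
```

Point `stopwords_dir` at wherever you extracted the folder. If you have NLTK installed and the corpus downloaded, `stopwords.words("english")` from `nltk.corpus` should give you the same words without touching the files yourself.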

If you want the top 100, you could also use (for English at least) a list of common words:

https://en.wikipedia.org/wiki/Most_c...rds_in_English
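If you would rather build a top 100 from your own books than trust a published list, counting is only a few lines. A minimal sketch: the `top_words` name and the crude regex tokenizer (lowercase letters and apostrophes only, so really English-only) are my assumptions, not anything from NLTK or Wikipedia.

```python
import re
from collections import Counter


def top_words(text, n=100):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Crude tokenizer: runs of a-z and apostrophes after lowercasing.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


print(top_words("the cat and the dog and the bird", 2))
# [('the', 3), ('and', 2)]
```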

Related Side Note: There's this fascinating thing called Zipf's Law:

1st most commonly used word is used a huge % of the time.
2nd place = ~1/2 1st place.
3rd place = ~1/3 1st place.
4th place = ~1/4 1st place.
[...]

The frequency drops off as roughly 1/rank (a power law rather than a true exponential), and pretty much every language studied follows the same pattern.
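To check the 1/2, 1/3, 1/4 pattern against real counts, you can put the observed count for each rank next to what an ideal Zipf curve would predict. A small sketch, assuming you already have word counts sorted from most to least frequent; `zipf_table` is a name I made up.

```python
def zipf_table(counts):
    """Pair each observed count with the ideal Zipf prediction.

    counts must be sorted in descending order. The prediction for
    rank r is simply (count of the top word) / r.
    """
    top = counts[0]
    return [(rank, count, top / rank) for rank, count in enumerate(counts, start=1)]


# Made-up counts that happen to follow the pattern exactly.
for rank, observed, predicted in zipf_table([1200, 600, 400, 300]):
    print(rank, observed, predicted)
```

Real text will not match this neatly, but the observed and predicted columns usually land in the same ballpark for the first several ranks.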

The top 5 words are used ~25% of the time, and the top 25 words are used ~50% of the time.
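For a back-of-the-envelope check on percentages like these, you can ask what share of all word uses the top k words would get under a perfectly ideal Zipf curve. This is only a sketch of the idealized model: it assumes frequency is exactly proportional to 1/rank and that the vocabulary has a fixed size, which you have to pick yourself (the 50,000 below is an arbitrary choice).

```python
def harmonic(n):
    """Sum of 1/1 + 1/2 + ... + 1/n."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def zipf_coverage(top_k, vocab_size):
    """Share of all word uses taken by the top_k words under ideal Zipf."""
    return harmonic(top_k) / harmonic(vocab_size)


# Arbitrary assumed vocabulary size of 50,000 distinct words.
print(f"top 5:  {zipf_coverage(5, 50_000):.0%}")
print(f"top 25: {zipf_coverage(25, 50_000):.0%}")
```

With that assumed vocabulary size the ideal curve gives roughly 20% for the top 5 and roughly 33% for the top 25, so somewhat lower than the figures above. The answer moves quite a bit with the vocabulary size you assume, and real text does not sit exactly on the ideal curve, so treat it as a sanity check only.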

For more information on that, check out VSauce's fantastic video: "The Zipf Mystery".

You can also see this in action in a few of my Reddit posts:

/r/writing: "How much variety should there be among initial words in sentences?"
/r/writing: "How many 'The' sentence starters is too much?"

(Plus the link in my signature... definitely coming soon!)

Last edited by Tex2002ans; 04-16-2021 at 01:01 PM.