Yes, these are called Stop Words:
https://en.wikipedia.org/wiki/Stop_words
They're used in things like search engines to ignore common words like "of" or "the".
The Natural Language Toolkit (NLTK) ships lists of stopwords for a number of languages:
https://www.nltk.org/
And according to this Stack Exchange question:
"NLTK available languages for stopwords"
the files should be located in:
C:/Users/username/AppData/Roaming/nltk_data/corpora/stopwords
I also was poking around the NLTK site, and think I found the stopwords as a separate download:
https://www.nltk.org/nltk_data/
Download the "Stopwords Corpus" and extract the ZIP, and you get one file for each of these languages:
arabic
azerbaijani
danish
dutch
english
finnish
french
german
greek
hungarian
indonesian
italian
kazakh
nepali
norwegian
portuguese
romanian
russian
slovene
spanish
swedish
tajik
turkish
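If you go the manual-download route, here is a rough Python sketch of reading one of those files and filtering text with it. It assumes each file is plain UTF-8 text with one word per line and is named after its language (e.g. "english"), and that you pass in whatever folder you extracted the ZIP to. The names load_stopwords and remove_stopwords are made up for this example; they are not part of NLTK.

```python
from pathlib import Path


def load_stopwords(corpus_dir, language):
    """Read one stopword file (assumed: one word per line) into a set."""
    path = Path(corpus_dir) / language
    with path.open(encoding="utf-8") as handle:
        # Skip blank lines and normalise case so lookups are consistent.
        return {line.strip().lower() for line in handle if line.strip()}


def remove_stopwords(text, stopwords):
    """Return the words of `text` that are not stopwords, in original order."""
    # Naive whitespace split; punctuation stays attached to words.
    return [word for word in text.lower().split() if word not in stopwords]
```

You would call it along the lines of remove_stopwords("some text", load_stopwords("path/to/stopwords", "english")). If NLTK itself is installed and the corpus has been downloaded through it, nltk.corpus.stopwords.words("english") should give you the same list without touching the files directly.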
If you want the top 100, you could also use (for English at least) a list of the most common words:
https://en.wikipedia.org/wiki/Most_c...rds_in_English
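Another option is to build your own "top N" list from whatever text you are working with. A minimal sketch, assuming a crude tokenizer that treats runs of letters and apostrophes as words (top_words is a hypothetical helper name, not from any library):

```python
import re
from collections import Counter


def top_words(text, n=100):
    """Return the n most frequent words in `text`, most frequent first."""
    # Crude tokenizer: lowercase, then keep runs of a-z and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _count in Counter(words).most_common(n)]
```

The upside is that the list reflects your own material; the downside is that a small or specialised sample will push topic words into the list that you would not normally treat as stopwords.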
Related Side Note: There's this fascinating thing called Zipf's Law:
1st most commonly used word is used a huge % of the time.
2nd place = ~1/2 1st place.
3rd place = ~1/3 1st place.
4th place = ~1/4 1st place.
[...]
Frequency falls off in proportion to 1/rank (a power law, not an exponential), and nearly every natural language that has been studied follows roughly the same pattern.
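A tiny set of words covers a huge share of all text: in the Brown Corpus of English, "the" alone is nearly 7% of all word occurrences, and only about 135 distinct words account for half of them.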
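To get a feel for the numbers, here is a sketch of what an idealized Zipf distribution would predict. It assumes frequency is exactly proportional to 1/rank and that the vocabulary has a fixed size, neither of which is strictly true of real text; zipf_share is a made-up helper name.

```python
def harmonic(n):
    """Sum of 1/1 + 1/2 + ... + 1/n."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def zipf_share(top_k, vocab_size):
    """Fraction of all word occurrences covered by the top_k ranks,
    assuming frequency is exactly proportional to 1/rank."""
    return harmonic(top_k) / harmonic(vocab_size)
```

With an arbitrarily chosen vocabulary of 50,000 words, zipf_share(5, 50000) comes out at roughly 0.20 and zipf_share(25, 50000) at roughly 0.33. Real corpora deviate from the ideal curve, so treat these as ballpark figures only.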
For more information on that, check out VSauce's fantastic video: "The Zipf Mystery".
You can also see this in action in a few of my Reddit posts:
(Plus the link in my signature... definitely coming soon!*)