04-16-2021, 12:27 PM   #2
Tex2002ans
Yes, these are called Stop Words:

https://en.wikipedia.org/wiki/Stop_words

They're used in things like search engines to ignore common words like "of" or "the".
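To make that concrete, here is a minimal sketch of stop-word filtering. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are made up for illustration; a real list would be much longer.

```python
import re

# Tiny illustrative stop list (an assumption for this example, not a real corpus).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop_words]

print(remove_stop_words("The Return of the King"))  # ['return', 'king']
```

A search engine doing something like this would index only "return" and "king" for that title.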

The Natural Language Toolkit (NLTK) includes lists of stopwords for different languages:

https://www.nltk.org/

And according to this Stack Exchange question:

"NLTK available languages for stopwords"

the files should be located in:

C:/Users/username/AppData/Roaming/nltk_data/corpora/stopwords

I also was poking around the NLTK site, and think I found the stopwords as a separate download:

https://www.nltk.org/nltk_data/

Download the "Stopwords Corpus" and extract the ZIP, and you get files for:

arabic
azerbaijani
danish
dutch
english
finnish
french
german
greek
hungarian
indonesian
italian
kazakh
nepali
norwegian
portuguese
romanian
russian
slovene
spanish
swedish
tajik
turkish
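Each of those files is plain text with one word per line, so you don't strictly need NLTK to use them. A rough sketch of reading one directly is below; the `load_stopwords` helper and the `stopwords` folder name are my own placeholders, so point it at wherever you extracted the ZIP. (With NLTK installed, `nltk.corpus.stopwords.words("english")` is the usual route.)

```python
from pathlib import Path

def load_stopwords(corpus_dir, language):
    """Read one extracted stopwords file (one word per line) into a set."""
    path = Path(corpus_dir) / language
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}

if __name__ == "__main__":
    # Hypothetical location: the folder you extracted the ZIP into.
    corpus_dir = Path("stopwords")
    if corpus_dir.is_dir():
        english = load_stopwords(corpus_dir, "english")
        print(len(english), "English stop words loaded")
```

Returning a set keeps the later "is this word a stop word?" checks fast.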

If you want top 100, you could also use (for English at least) a list of common words:

https://en.wikipedia.org/wiki/Most_c...rds_in_English
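If a ready-made list doesn't match your material, you could also build a top-N list from your own text. A minimal sketch, where `top_words` is a name I made up and the regex is a deliberately crude tokenizer:

```python
from collections import Counter
import re

def top_words(text, n=100):
    """Return the n most frequent words in the text, most common first."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _ in Counter(words).most_common(n)]

sample = "the cat and the dog and the bird"
print(top_words(sample, n=2))  # ['the', 'and']
```

Run it over a whole book instead of `sample` and the first 100 entries make a serviceable stop list for that book.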

Related Side Note: There's this fascinating thing called Zipf's Law:

1st most commonly used word is used a huge % of the time.
2nd place = ~1/2 1st place.
3rd place = ~1/3 1st place.
4th place = ~1/4 1st place.
[...]

The frequency falls off in inverse proportion to rank (a power law rather than an exponential), and the languages that have been studied follow roughly the same pattern.

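The exact shares depend on the corpus, but a small number of words covers a large fraction of any text. In the Brown Corpus of American English, for example, "the" alone is nearly 7% of all word occurrences, and only about 135 distinct words account for half of the corpus.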
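If you want to check the 1/rank pattern on your own text, something like the sketch below would do it. `zipf_table` and `coverage` are hypothetical helper names, and the toy sample is rigged to fit the pattern nicely; real text will only match approximately.

```python
from collections import Counter
import re

def tokenize(text):
    """Crude tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def zipf_table(text, top=5):
    """Rows of (rank, word, count, predicted), with predicted = top count / rank."""
    ranked = Counter(tokenize(text)).most_common(top)
    if not ranked:
        return []
    first = ranked[0][1]
    return [(rank, word, count, first / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]

def coverage(text, top):
    """Fraction of all word occurrences covered by the `top` most frequent words."""
    words = tokenize(text)
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(c for _, c in counts.most_common(top)) / len(words)

sample = "a a a a a a b b b c c d"
for row in zipf_table(sample, top=4):
    print(row)
print(coverage(sample, top=1))  # 0.5
```

Compare the `count` and `predicted` columns: the closer they track each other, the more Zipf-like the text.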

For more information on that, check out VSauce's fantastic video: "The Zipf Mystery".

You can also see this in action in a few of my Reddit posts:

(Plus the link in my signature... definitely coming soon!*)

Last edited by Tex2002ans; 04-16-2021 at 01:01 PM.