Lately I've been thinking we might be able to use the same method as in this word prevalence paper (http://btjohns.com/pubs/JDJ_QJEP_2020.pdf) to calculate data for other languages. I know for sure there are some ebook hoarders on the forum, so gathering data from tens of thousands of books might be possible.
I already know how to extract text from ebook files; next I need to figure out how to calculate the SD-AP value, and finally convert the result data to a difficulty value (this last step is also already completed). The plugin can then use this data to enable only hard words in Wiktionary.
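To show the shape of that last conversion step, here is a toy sketch: words that appear in fewer books get a higher difficulty score. This is only a simple book-prevalence proxy, not the paper's actual SD-AP measure, and the function name and log scaling are just my own assumptions:

```python
import math

def difficulty_from_prevalence(books_containing, total_books):
    # Hypothetical mapping: rarer words score higher. A word found in
    # 1 of 10,000 books scores 4.0; a word found in every book scores 0.0.
    # This is a stand-in for the paper's SD-AP measure, not the real thing.
    prevalence = books_containing / total_books
    return -math.log10(max(prevalence, 1 / total_books))
```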
Clearly I need help with this project, especially with reproducing the paper and gathering data. Maybe we could build a command-line tool that gathers data from book files and saves it to an SQLite file, so anyone can contribute with their own books.
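As a rough starting point for that tool, something like the following could record per-book word counts in SQLite using only Python's standard library. The schema and the naive tokenizer are just placeholders; a real tool would need proper per-language tokenization (especially for CJK languages):

```python
import re
import sqlite3

def tokenize(text):
    # Naive tokenizer: runs of letters, lowercased. Placeholder only;
    # languages without spaces (e.g. Chinese, Japanese) need a real one.
    return re.findall(r"[^\W\d_]+", text.lower())

def open_db(path):
    # One row per (word, book) pair; contributors can merge their files later.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS occurrences (
               word TEXT, book_id TEXT, count INTEGER,
               PRIMARY KEY (word, book_id))"""
    )
    return conn

def add_book(conn, book_id, text):
    # Count each word's occurrences in this book and store them.
    counts = {}
    for word in tokenize(text):
        counts[word] = counts.get(word, 0) + 1
    conn.executemany(
        "INSERT INTO occurrences (word, book_id, count) VALUES (?, ?, ?)",
        [(w, book_id, c) for w, c in counts.items()],
    )
    conn.commit()
```

A word's book prevalence is then just `SELECT word, COUNT(DISTINCT book_id) FROM occurrences GROUP BY word`.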
I don't know if anyone is interested in this; any suggestions?
Or, even better, we could find existing word prevalence data for non-English languages...
I've already found some word frequency data for non-English languages; using that data would be much easier.