Old 08-15-2022, 09:36 AM   #436
xxyzz
Evangelist
 
Posts: 442
Karma: 2666666
Join Date: Nov 2020
Device: none
Lately I've been thinking that maybe we can use the same method as in that prevalence paper (http://btjohns.com/pubs/JDJ_QJEP_2020.pdf) to calculate data for other languages. I know for sure there are some ebook hoarders on the forum, so gathering data from tens of thousands of books might be possible.
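
I haven't reproduced the paper yet, but as I read it, the core measure is prevalence-style: for each word, count how many distinct books (or authors) use it, rather than raw token frequency. A minimal sketch of that counting step in Python, assuming books arrive as already-tokenized word lists (the tokenizer and the exact statistic are my assumptions, not the paper's exact procedure):

Code:
from collections import Counter

def prevalence_counts(books):
    """For each word, count the number of books it appears in at least once."""
    counts = Counter()
    for tokens in books:
        counts.update(set(tokens))  # each word counted once per book
    return counts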

I already know how to get text out of ebook files; then I need to figure out how to calculate the SD-AP value, and finally convert the resulting data to a difficulty value (this last step is also already completed). The plugin can then use this data to enable only the hard words from Wiktionary.
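
The SD-AP calculation is the open part; the extraction part is more straightforward because an EPUB is just a zip of XHTML files. A rough Python sketch (real books also need the spine order, encoding quirks, and other formats handled, so treat this as a starting point only):

Code:
import zipfile
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def epub_text(path):
    """Concatenate the text of every (X)HTML file inside an EPUB."""
    parser = _TextExtractor()
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                parser.feed(zf.read(name).decode("utf-8", errors="ignore"))
    return " ".join(parser.chunks)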

Clearly I need help with this project, especially with reproducing the paper and gathering data. Maybe we can create a command-line tool that extracts data from book files and saves it to a SQLite file, so anyone can contribute with their own books.
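
Something like this hypothetical gatherer, reusing the epub_text() helper from the sketch above; the table schema and the per-word count merging are only suggestions:

Code:
import re
import sqlite3
import sys
from collections import Counter
from pathlib import Path

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, any script

def gather(book_dir, db_path="prevalence.db"):
    counts = Counter()
    books = 0
    for path in Path(book_dir).rglob("*.epub"):
        # epub_text() is the helper from the extraction sketch above
        words = set(WORD_RE.findall(epub_text(path).lower()))
        counts.update(words)
        books += 1
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS prevalence (word TEXT PRIMARY KEY, books INTEGER)"
    )
    con.executemany(
        "INSERT INTO prevalence VALUES (?, ?) "
        "ON CONFLICT(word) DO UPDATE SET books = books + excluded.books",
        counts.items(),
    )
    con.commit()
    con.close()
    print(f"{books} books, {len(counts)} distinct words")

if __name__ == "__main__":
    gather(sys.argv[1])

Contributors would then run it over their own library folder and share the resulting SQLite file, which can be merged thanks to the upsert.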

I don't know if anyone is interested in this. Any suggestions?

Or, even better, find existing word prevalence data like this for non-English languages...

I have already found some word frequency data for non-English languages; using that data is much easier.
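
With plain frequency lists, the conversion could be as simple as normalizing to the Zipf scale (log10 of a word's frequency per billion tokens) and flagging everything below a cutoff; the 3.0 threshold below is a placeholder I picked, not a tested value:

Code:
import math

def zipf(count, corpus_size):
    """Zipf value: log10 of the word's frequency per billion tokens."""
    return math.log10((count + 1) / (corpus_size + 1) * 1e9)

def is_hard(count, corpus_size, cutoff=3.0):
    # Zipf values of roughly 1-3 are usually considered low frequency
    return zipf(count, corpus_size) < cutoff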

Last edited by xxyzz; 08-15-2022 at 09:55 AM.