Lately I've been thinking we might be able to use the same method as in this word prevalence paper (http://btjohns.com/pubs/JDJ_QJEP_2020.pdf) to calculate data for other languages. I know for sure there are some ebook hoarders on the forum, so gathering data from tens of thousands of books might be possible.
I already know how to extract text from ebook files; next I need to figure out how to calculate the SD-AP value, and finally convert the result data to a difficulty value (this last step is also already completed). The plugin can then use this data to enable only hard words in Wiktionary.
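To show the shape of that last conversion step, here is a toy sketch: words that appear in fewer books get a higher difficulty score. This is only a simple book-prevalence proxy, not the paper's actual SD-AP measure, and the function name and log scaling are just my own assumptions:

```python
import math

def difficulty_from_prevalence(books_containing, total_books):
    # Hypothetical mapping: rarer words score higher. A word found in
    # 1 of 10,000 books scores 4.0; a word found in every book scores 0.0.
    # This is a stand-in for the paper's SD-AP measure, not the real thing.
    prevalence = books_containing / total_books
    return -math.log10(max(prevalence, 1 / total_books))
```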
Clearly I need help with this project, especially with reproducing the paper and gathering data. Maybe we could build a command-line tool that gathers data from book files and saves it to an SQLite file, so anyone can contribute with their own books.
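As a rough starting point for that tool, something like the following could record per-book word counts in SQLite using only Python's standard library. The schema and the naive tokenizer are just placeholders; a real tool would need proper per-language tokenization (especially for CJK languages):

```python
import re
import sqlite3

def tokenize(text):
    # Naive tokenizer: runs of letters, lowercased. Placeholder only;
    # languages without spaces (e.g. Chinese, Japanese) need a real one.
    return re.findall(r"[^\W\d_]+", text.lower())

def open_db(path):
    # One row per (word, book) pair; contributors can merge their files later.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS occurrences (
               word TEXT, book_id TEXT, count INTEGER,
               PRIMARY KEY (word, book_id))"""
    )
    return conn

def add_book(conn, book_id, text):
    # Count each word's occurrences in this book and store them.
    counts = {}
    for word in tokenize(text):
        counts[word] = counts.get(word, 0) + 1
    conn.executemany(
        "INSERT INTO occurrences (word, book_id, count) VALUES (?, ?, ?)",
        [(w, book_id, c) for w, c in counts.items()],
    )
    conn.commit()
```

A word's book prevalence is then just `SELECT word, COUNT(DISTINCT book_id) FROM occurrences GROUP BY word`.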
I don't know if anyone is interested in this; any suggestions?
Or, even better, we could find existing word prevalence data for non-English languages...
I've already found some word frequency data for non-English languages; using that data would be much easier.