[GUI Plugin] WordDumb - Page 30

xxyzz · 08-15-2022, 09:36 AM

Lately I was thinking maybe we can use the same method in that prevalence paper(http://btjohns.com/pubs/JDJ_QJEP_2020.pdf) to calculate data for other languages. I know for sure there are some ebook hoarders in the forum, so maybe gathering data from tens of thousands books is possible.

I already know how to get text from ebook files, then I need to figure out how to calculate the SD-AP value, and finally convert the result data to difficulty value(this step is also already completed). The plugin can use this data to only enable hard words in Wiktionary.

Clearly I need help for this project, especially for reproducing that paper and data gathering. Maybe we can create a command line tool to gather data from book files and save the data to a sqlite file so anyone can contribute with their books.

I don't know if anyone is interested in this, any suggestions?

Or even better, find existing similar word prevalence data for non-English languages...

I already find some word frequency data for non-English languages, using these data is way more easier.

Shark69 · 08-15-2022, 11:29 AM

Quote:

Originally Posted by xxyzz

Lately I was thinking maybe we can use the same method in that prevalence paper(http://btjohns.com/pubs/JDJ_QJEP_2020.pdf) to calculate data for other languages. I know for sure there are some ebook hoarders in the forum, so maybe gathering data from tens of thousands books is possible.

I already know how to get text from ebook files, then I need to figure out how to calculate the SD-AP value, and finally convert the result data to difficulty value(this step is also already completed). The plugin can use this data to only enable hard words in Wiktionary.

Clearly I need help for this project, especially for reproducing that paper and data gathering. Maybe we can create a command line tool to gather data from book files and save the data to a sqlite file so anyone can contribute with their books.

I don't know if anyone is interested in this, any suggestions?

Or even better, find existing similar word prevalence data for non-English languages...

I already find some word frequency data for non-English languages, using these data is way more easier.

I'm interested in the project. I'd would like to help you I am interested in the project. I would like to help you to the best of my ability.

xxyzz · 08-15-2022, 10:45 PM

Quote:

Originally Posted by Shark69

I'm interested in the project. I'd would like to help you I am interested in the project. I would like to help you to the best of my ability.

I think we should use currently available data since there are already many researchers working on this topic. I find some useful data:

Maybe we can calculate the word occurrence frequency from Google's data for languages that didn't filtered with a spellchecker in Wordlex and only enable words that have frequency lower than a threshold.

Which datasets do you think is more suitable for disabling easy words in Wiktionary? Or maybe you find some better datasets please let me know, because I think word frequency is not very accuracy compared to other metrics.

Shark69 · 08-19-2022, 02:35 AM

I've been checking what you're saying and maybe the approach would be to use data that's on the internet rather than doing the work all over again.
For example, with reference to Wedlex, I have been studying the Spanish language and I think it is quite good.
A problem could be that there are a lot of words. Many of them are derivations of other.
As you say, the issue is to take the thresholds, but my proposal is that they be dynamic thresholds.
I explain. In the Spanish language there are 208,078 entries in Wedlex. By default, equal thresholds could be set for the 5 levels of difficulty. 208,078/5 = 41615 entries per level. If I look at the entries since 41615, it may seem to me that the thresholds are not correct. Well, it could be that each person decided the threshold based on percentages.
For example:
Level 1: 0-10%
Level 2: 20-50%
Level 3: 50-60%
Level 4: 60-70%
Level 5: 70-100%
But, the user could change the thresholds and the level should change.
I don't know what you think.

xxyzz · 08-20-2022, 09:57 AM

WorldLex's *CDPc values drop sharply after a few rows, most values are below one, these values looks like a "percentage" number. I not sure the meaning of *Freq and *FreqPm columns.

Google's Ngram has more words and is released more recently, but the frequency data needs to be computed from the "1-grams" files and the "Total counts for 1-grams" file. According to the Ngram viewer, the frequencies of google's data is also mostly below one.

Wiktionary also has many word frequency lists, they don't have frequency data though: https://en.wiktionary.org/wiki/Categ...ts_by_language
https://es.wiktionary.org/wiki/Wikci...de_frecuencias

I not sure which data source is better then the others. I'm planning to release a new version so this feature probably will be added in a future release.

Shark69 · 08-20-2022, 03:47 PM

Quote:

Originally Posted by xxyzz

WorldLex's *CDPc values drop sharply after a few rows, most values are below one, these values looks like a "percentage" number. I not sure the meaning of *Freq and *FreqPm columns.

Google's Ngram has more words and is released more recently, but the frequency data needs to be computed from the "1-grams" files and the "Total counts for 1-grams" file. According to the Ngram viewer, the frequencies of google's data is also mostly below one.

Wiktionary also has many word frequency lists, they don't have frequency data though: https://en.wiktionary.org/wiki/Categ...ts_by_language
https://es.wiktionary.org/wiki/Wikci...de_frecuencias

I not sure which data source is better then the others. I'm planning to release a new version so this feature probably will be added in a future release.

Yes, it is quite difficult to guess the better way to face it. I only can support your idea and contribute when possible.
Thanks

xxyzz · 08-20-2022, 09:04 PM

Check out this Wiktionary page:https://en.wiktionary.org/wiki/Wikti...requency_lists

I find two language proficiency test lists that can be converted to difficulty list, they happen to group the words by difficulty to four or five levels:
Chinese: https://en.wiktionary.org/wiki/Appen...Mandarin_words
Japanese: https://en.wiktionary.org/wiki/Appendix:JLPT

The page doesn't have any Spanish test list, it has some frequency lists but again it's kind hard to set difficulty value based on their frequencies.

Edit:
For Chinese data, I found an excellent Excel file from a link in the TOCFL website(https://tocfl.edu.tw/index.php/teach/download): https://coct.naer.edu.tw/download/tech_report/

Shark69 · 08-21-2022, 04:04 AM

It sounds promising. It would be nice if there were similar lists for other languages. In my case, the interest is focused on the distribution of the complexity of the words of the English language more than in Spanish because it is the one that I do not understand perfectly. There are several lists in English, but it would be nice to find a list with these levels. It would be perfect. But I think we're digressing because I can't figure out how to implement those kinds of lists in the plugin. I don't know if you have something in mind.

xxyzz · 08-21-2022, 06:29 AM

You can try the test plugin, I've added difficulty values to English Wiktionary and you can set the difficulty limit. But you can't re-enable words that have higher difficulty value after disabling them and saving the changes, for example: set limit to 4 then save, reopen the dialog and set limit to 5 but previously enabled words that has difficulty level of 5 won't be enabled again.

Shark69 · 08-21-2022, 12:27 PM

Quote:

Originally Posted by xxyzz

You can try the test plugin, I've added difficulty values to English Wiktionary and you can set the difficulty limit. But you can't re-enable words that have higher difficulty value after disabling them and saving the changes, for example: set limit to 4 then save, reopen the dialog and set limit to 5 but previously enabled words that has difficulty level of 5 won't be enabled again.

Ok. I'll try it and keep you updated.

xxyzz · 08-22-2022, 12:50 AM

I create a new GitHub repo Proficiency to create Word Wise files. WordDumb will download Wiktionary files from this repo's releases. These files are processed and compressed so the plugin won't need to download and extract the kakki.org JSON anymore, which saves so much time.

I also pushed some changes to re-enable words that higher difficulty values.

spaCy recently added Ukrainian model, I have added Ukrainian support to the master branch.

If I didn't break something, I hope I can create a new release in the following days.

Shark69 · 08-22-2022, 01:54 AM

Quote:

Originally Posted by xxyzz

I create a new GitHub repo Proficiency to create Word Wise files. WordDumb will download Wiktionary files from this repo's releases. These files are processed and compressed so the plugin won't need to download and extract the kakki.org JSON anymore, which saves so much time.

I also pushed some changes to re-enable words that higher difficulty values.

spaCy recently added Ukrainian model, I have added Ukrainian support to the master branch.

If I didn't break something, I hope I can create a new release in the following days.

Thx. I'll give a try this afternoon.

beecom · 08-22-2022, 10:49 AM

I found out that we can customize difficulty level of words.
But I couldn't find out if I could customize the word titles or definitions.

What I am trying to do is making my own dictionary from which the worddumb is made from. I want to add English words one by one with meaning in Korean.

Will it be possible?

Shark69 · 08-22-2022, 12:02 PM

Quote:

Originally Posted by xxyzz

I create a new GitHub repo Proficiency to create Word Wise files. WordDumb will download Wiktionary files from this repo's releases. These files are processed and compressed so the plugin won't need to download and extract the kakki.org JSON anymore, which saves so much time.

I also pushed some changes to re-enable words that higher difficulty values.

spaCy recently added Ukrainian model, I have added Ukrainian support to the master branch.

If I didn't break something, I hope I can create a new release in the following days.

Hi, I've just giving a try to the version. It's is very nice, but I've found several things:

1. When I generate all files, they are generated only for epub. The dialog to choose format has dissapeard and it is imposible to generate X-Ray for Kindle formats.

2. I have the big spaCy model downloded. After installing the version, the preferences window tell me that I have the medium one, but it is not true. I've compared with previous files and all of them are the same. If I change spaCy to big again.... nothings is done.... because I've already got it. No more things so far.

It is perfect and works fine the possibilitu to re-enable words that higher difficulty values.
Very important

xxyzz · 08-22-2022, 07:25 PM

Quote:

Originally Posted by beecom

But I couldn't find out if I could customize the word titles or definitions.

What I am trying to do is making my own dictionary from which the worddumb is made from. I want to add English words one by one with meaning in Korean.

Will it be possible?

You can double click the definition to edit it in the EPUB Wiktionary dialog and click the "Save" button to save the change. Edit Kindle's definition is more difficult because the definition data are stored in the Kindle Word Wise database file.

Edit manually is possible but tedious.

08-15-2022, 09:36 AM	#436
xxyzz Evangelist Posts: 444 Karma: 3000000 Join Date: Nov 2020 Device: none	Lately I was thinking maybe we can use the same method in that prevalence paper(http://btjohns.com/pubs/JDJ_QJEP_2020.pdf) to calculate data for other languages. I know for sure there are some ebook hoarders in the forum, so maybe gathering data from tens of thousands books is possible. I already know how to get text from ebook files, then I need to figure out how to calculate the SD-AP value, and finally convert the result data to difficulty value(this step is also already completed). The plugin can use this data to only enable hard words in Wiktionary. Clearly I need help for this project, especially for reproducing that paper and data gathering. Maybe we can create a command line tool to gather data from book files and save the data to a sqlite file so anyone can contribute with their books. I don't know if anyone is interested in this, any suggestions? Or even better, find existing similar word prevalence data for non-English languages... I already find some word frequency data for non-English languages, using these data is way more easier. Last edited by xxyzz; 08-15-2022 at 09:55 AM.

08-19-2022, 02:35 AM	#439
Shark69 Zealot Posts: 136 Karma: 493152 Join Date: Mar 2012 Location: Spain Device: Kindle Oasis 2	I've been checking what you're saying and maybe the approach would be to use data that's on the internet rather than doing the work all over again. For example, with reference to Wedlex, I have been studying the Spanish language and I think it is quite good. A problem could be that there are a lot of words. Many of them are derivations of other. As you say, the issue is to take the thresholds, but my proposal is that they be dynamic thresholds. I explain. In the Spanish language there are 208,078 entries in Wedlex. By default, equal thresholds could be set for the 5 levels of difficulty. 208,078/5 = 41615 entries per level. If I look at the entries since 41615, it may seem to me that the thresholds are not correct. Well, it could be that each person decided the threshold based on percentages. For example: Level 1: 0-10% Level 2: 20-50% Level 3: 50-60% Level 4: 60-70% Level 5: 70-100% But, the user could change the thresholds and the level should change. I don't know what you think. Last edited by Shark69; 08-19-2022 at 02:45 AM.

08-20-2022, 09:57 AM	#440
xxyzz Evangelist Posts: 444 Karma: 3000000 Join Date: Nov 2020 Device: none	WorldLex's CDPc values drop sharply after a few rows, most values are below one, these values looks like a "percentage" number. I not sure the meaning of Freq and FreqPm columns. Google's Ngram has more words and is released more recently, but the frequency data needs to be computed from the "1-grams" files and the "Total counts for 1-grams" file. According to the Ngram viewer, the frequencies of google's data is also mostly below one. Wiktionary also has many word frequency lists, they don't have frequency data though: https://en.wiktionary.org/wiki/Categ...ts_by_language https://es.wiktionary.org/wiki/Wikci...de_frecuencias I not sure which data source is better then the others. I'm planning to release a new version so this feature probably will be added in a future release. Last edited by xxyzz; 08-20-2022 at 10:16 AM.*

08-20-2022, 09:04 PM	#442
xxyzz Evangelist Posts: 444 Karma: 3000000 Join Date: Nov 2020 Device: none	Check out this Wiktionary page:https://en.wiktionary.org/wiki/Wikti...requency_lists I find two language proficiency test lists that can be converted to difficulty list, they happen to group the words by difficulty to four or five levels: Chinese: https://en.wiktionary.org/wiki/Appen...Mandarin_words Japanese: https://en.wiktionary.org/wiki/Appendix:JLPT The page doesn't have any Spanish test list, it has some frequency lists but again it's kind hard to set difficulty value based on their frequencies. Edit: For Chinese data, I found an excellent Excel file from a link in the TOCFL website(https://tocfl.edu.tw/index.php/teach/download): https://coct.naer.edu.tw/download/tech_report/ Last edited by xxyzz; 08-21-2022 at 02:53 AM.

08-22-2022, 10:49 AM	#448
beecom Enthusiast Posts: 40 Karma: 1000 Join Date: Jun 2022 Device: Kindle Oasis	Can I customize titles or meaning? I found out that we can customize difficulty level of words. But I couldn't find out if I could customize the word titles or definitions. What I am trying to do is making my own dictionary from which the worddumb is made from. I want to add English words one by one with meaning in Korean. Will it be possible?

08-21-2022, 04:04 AM	#443
Shark69 Zealot Posts: 136 Karma: 493152 Join Date: Mar 2012 Location: Spain Device: Kindle Oasis 2	It sounds promising. It would be nice if there were similar lists for other languages. In my case, the interest is focused on the distribution of the complexity of the words of the English language more than in Spanish because it is the one that I do not understand perfectly. There are several lists in English, but it would be nice to find a list with these levels. It would be perfect. But I think we're digressing because I can't figure out how to implement those kinds of lists in the plugin. I don't know if you have something in mind.

08-21-2022, 06:29 AM	#444
xxyzz Evangelist Posts: 444 Karma: 3000000 Join Date: Nov 2020 Device: none	You can try the test plugin, I've added difficulty values to English Wiktionary and you can set the difficulty limit. But you can't re-enable words that have higher difficulty value after disabling them and saving the changes, for example: set limit to 4 then save, reopen the dialog and set limit to 5 but previously enabled words that has difficulty level of 5 won't be enabled again.

08-22-2022, 12:50 AM	#446
xxyzz Evangelist Posts: 444 Karma: 3000000 Join Date: Nov 2020 Device: none	I create a new GitHub repo Proficiency to create Word Wise files. WordDumb will download Wiktionary files from this repo's releases. These files are processed and compressed so the plugin won't need to download and extract the kakki.org JSON anymore, which saves so much time. I also pushed some changes to re-enable words that higher difficulty values. spaCy recently added Ukrainian model, I have added Ukrainian support to the master branch. If I didn't break something, I hope I can create a new release in the following days.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
[GUI Plugin] KindleUnpack - The Plugin	DiapDealer	Plugins	527	08-15-2025 01:36 PM
[GUI Plugin] CalibreSpy	DaltonST	Plugins	245	08-18-2024 09:33 PM
[GUI Plugin] Manga plugin	mastertea	Plugins	6	01-06-2022 02:43 AM
[GUI Plugin] Save Virtual Libraries To Column (GUI)	chaley	Plugins	14	04-04-2021 05:25 AM
[GUI Plugin] Plugin Updater Deprecated	kiwidude	Plugins	159	06-19-2011 12:27 PM