#436 | xxyzz (Evangelist) | Posts: 442 | Karma: 2666666 | Join Date: Nov 2020 | Device: none
Lately I've been thinking we might be able to use the same method as that word prevalence paper (http://btjohns.com/pubs/JDJ_QJEP_2020.pdf) to calculate data for other languages. I know for sure there are some ebook hoarders in this forum, so gathering data from tens of thousands of books might be possible.

I already know how to get text from ebook files; then I need to figure out how to calculate the SD-AP value and finally convert the result to a difficulty value (this last step is already done). The plugin can use this data to enable only the hard words in Wiktionary. Clearly I need help with this project, especially for reproducing the paper and gathering the data. Maybe we could create a command line tool that gathers data from book files and saves it to a SQLite file, so anyone could contribute with their books (see the sketch below). I don't know if anyone is interested in this; any suggestions? Or even better, we could find existing word prevalence data for non-English languages... I have already found some word frequency data for non-English languages, and using that data would be much easier.

Last edited by xxyzz; 08-15-2022 at 09:55 AM.
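If anyone wants to play with the idea, here is a minimal sketch of such a command line tool. It only counts contextual diversity (in how many books each word appears), not the full SD-AP measure from the paper, and the file, table, and column names are made up for illustration:

```python
#!/usr/bin/env python3
"""Hypothetical helper: count in how many books each word appears and
store the counts in a SQLite file so contributions can be merged."""

import re
import sqlite3
import sys
import zipfile
from collections import Counter
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")        # crude HTML tag stripper
WORD_RE = re.compile(r"[^\W\d_]+")     # runs of letters only

def epub_words(path: Path) -> set[str]:
    """Return the set of lower-cased words used in one EPUB."""
    words: set[str] = set()
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                text = TAG_RE.sub(" ", zf.read(name).decode("utf-8", "ignore"))
                words.update(w.lower() for w in WORD_RE.findall(text))
    return words

def main(book_dir: str, db_path: str = "word_counts.db") -> None:
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS word_books (word TEXT PRIMARY KEY, books INTEGER)"
    )
    doc_counts: Counter[str] = Counter()
    for epub in Path(book_dir).rglob("*.epub"):
        doc_counts.update(epub_words(epub))  # +1 per book the word occurs in
    con.executemany(
        "INSERT INTO word_books VALUES (?, ?) "
        "ON CONFLICT(word) DO UPDATE SET books = books + excluded.books",
        doc_counts.items(),
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    main(*sys.argv[1:])
```

Everyone could run it over their own library and share the resulting SQLite file, and the per-book counts could then be merged and normalised.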
#437 | Shark69 (Zealot) | Posts: 136 | Karma: 493152 | Join Date: Mar 2012 | Location: Spain | Device: Kindle Oasis 2
Quote:
#438 | xxyzz (Evangelist) | Posts: 442 | Karma: 2666666 | Join Date: Nov 2020 | Device: none
Quote:
Maybe we could calculate word occurrence frequencies from Google's data for the languages that weren't filtered with a spellchecker in WorldLex, and only enable words whose frequency is below a threshold (a rough sketch of that filter is below). Which dataset do you think is more suitable for disabling easy words in Wiktionary? Or if you find better datasets, please let me know, because I think word frequency is not very accurate compared to other metrics.

Last edited by xxyzz; 08-15-2022 at 10:49 PM.
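As a very small sketch of that thresholding step (the WorldLex column names "Word" and "BlogFreqPm" and the threshold value are just assumptions on my part), the filter could look like this:

```python
import csv

def hard_words(worldlex_tsv: str, freq_column: str = "BlogFreqPm",
               threshold: float = 1.0) -> set[str]:
    """Return words whose frequency per million is below the threshold;
    those are the ones that would keep their Word Wise gloss."""
    hard: set[str] = set()
    with open(worldlex_tsv, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if float(row[freq_column]) < threshold:
                hard.add(row["Word"].lower())
    return hard
```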
#439 | Shark69 (Zealot) | Posts: 136 | Karma: 493152 | Join Date: Mar 2012 | Location: Spain | Device: Kindle Oasis 2
I've been looking into what you're saying, and maybe the better approach is to use data that's already on the internet rather than redoing all the work.

For example, regarding WorldLex, I have been studying the Spanish data and I think it is quite good. One problem could be that there are a lot of words, many of them derivations of other words. As you say, the issue is how to set the thresholds, and my proposal is to make them dynamic. Let me explain. The Spanish WorldLex list has 208,078 entries. By default, equal thresholds could be set for the 5 difficulty levels: 208,078 / 5 ≈ 41,616 entries per level. But if I look at the entries beyond the first level's cut-off, it may seem to me that those thresholds are not right. So each user could instead define the thresholds as percentages, for example:

Level 1: 0-10%
Level 2: 20-50%
Level 3: 50-60%
Level 4: 60-70%
Level 5: 70-100%

The user could then change the thresholds and the levels would change accordingly (a small sketch of this idea follows). I don't know what you think.

Last edited by Shark69; 08-19-2022 at 02:45 AM.
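Something along these lines, where the band edges are user preferences rather than fixed counts (the edges below loosely follow the example percentages above and are only illustrative):

```python
def assign_levels(words_by_freq: list[str],
                  band_edges: tuple[float, ...] = (0.10, 0.50, 0.60, 0.70, 1.00)
                  ) -> dict[str, int]:
    """words_by_freq: words sorted from most to least frequent.
    band_edges: cumulative fractions of the list where each level ends."""
    total = len(words_by_freq)
    levels: dict[str, int] = {}
    for rank, word in enumerate(words_by_freq):
        fraction = rank / total
        # First band whose upper edge the word falls under decides its level.
        level = next(i + 1 for i, edge in enumerate(band_edges) if fraction < edge)
        levels[word] = level  # 1 = easiest ... 5 = hardest
    return levels
```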
#440 | xxyzz (Evangelist) | Posts: 442 | Karma: 2666666 | Join Date: Nov 2020 | Device: none
WorldLex's *CDPc values drop sharply after a few rows; most values are below one, so they look like percentages. I'm not sure about the meaning of the *Freq and *FreqPm columns.

Google's Ngram dataset has more words and was released more recently, but the frequency data has to be computed from the "1-grams" files and the "Total counts for 1-grams" file (a rough sketch of that computation is below). According to the Ngram Viewer, the frequencies in Google's data are also mostly below one.

Wiktionary also has many word frequency lists, although they don't include the frequency values themselves:
https://en.wiktionary.org/wiki/Categ...ts_by_language
https://es.wiktionary.org/wiki/Wikci...de_frecuencias

I'm not sure which data source is better than the others. I'm planning to release a new version soon, so this feature will probably be added in a later release.

Last edited by xxyzz; 08-20-2022 at 10:16 AM.
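For reference, computing per-million frequencies from the Ngram export could look roughly like this. It assumes the 2012-format files (one word and year per line in the 1-gram files, and a total-counts file of comma-separated year entries); the newer 2020 files pack all years onto one line, so the parsing would need to change:

```python
import gzip
from collections import defaultdict

def load_total_counts(path: str, min_year: int = 1980) -> int:
    """Sum match counts from the total-counts file for the chosen years."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for entry in f.read().split("\t"):
            if not entry.strip():
                continue
            year, match_count, *_ = entry.split(",")
            if int(year) >= min_year:
                total += int(match_count)
    return total

def frequency_per_million(onegram_gz: str, total: int, min_year: int = 1980):
    counts: dict[str, int] = defaultdict(int)
    with gzip.open(onegram_gz, "rt", encoding="utf-8") as f:
        for line in f:
            word, year, match_count, _ = line.rstrip("\n").split("\t")
            if int(year) >= min_year:
                # Drop part-of-speech suffixes such as "_NOUN" if present.
                counts[word.split("_")[0].lower()] += int(match_count)
    return {w: c * 1_000_000 / total for w, c in counts.items()}
```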
#441 | Shark69 (Zealot) | Posts: 136 | Karma: 493152 | Join Date: Mar 2012 | Location: Spain | Device: Kindle Oasis 2
Quote:
Thanks
#442 | xxyzz (Evangelist) | Posts: 442 | Karma: 2666666 | Join Date: Nov 2020 | Device: none
Check out this Wiktionary page: https://en.wiktionary.org/wiki/Wikti...requency_lists

I found two language proficiency test lists that can be converted to difficulty lists; they happen to group words into four or five difficulty levels (a sketch of the conversion is below):

Chinese: https://en.wiktionary.org/wiki/Appen...Mandarin_words
Japanese: https://en.wiktionary.org/wiki/Appendix:JLPT

The page doesn't have any Spanish test list; it has some frequency lists, but again it's kind of hard to assign difficulty values based on frequencies alone.

Edit: For Chinese data, I found an excellent Excel file through a link on the TOCFL website (https://tocfl.edu.tw/index.php/teach/download): https://coct.naer.edu.tw/download/tech_report/

Last edited by xxyzz; 08-21-2022 at 02:53 AM.
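Converting such test lists into five difficulty values could be as simple as the sketch below. The input format ({test level: list of words}) and the JLPT mapping are my assumptions; a four-band list like the old HSK appendix would need its own mapping, reusing or skipping one level:

```python
# JLPT N5 is the easiest test level, N1 the hardest.
JLPT_TO_DIFFICULTY = {"N5": 1, "N4": 2, "N3": 3, "N2": 4, "N1": 5}

def difficulty_from_test_lists(lists: dict[str, list[str]],
                               level_map: dict[str, int]) -> dict[str, int]:
    difficulty: dict[str, int] = {}
    for test_level, words in lists.items():
        for word in words:
            # If a word appears in several lists, keep the easiest level,
            # since it is taught early.
            difficulty[word] = min(difficulty.get(word, 5), level_map[test_level])
    return difficulty
```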
#443 | Shark69 (Zealot) | Posts: 136 | Karma: 493152 | Join Date: Mar 2012 | Location: Spain | Device: Kindle Oasis 2
That sounds promising. It would be nice if there were similar lists for other languages. In my case I'm more interested in the difficulty distribution of English words than of Spanish ones, since English is the language I don't understand perfectly. There are several English lists, but finding one with these levels would be perfect. Still, I think we're getting ahead of ourselves, because I can't figure out how to implement that kind of list in the plugin. I don't know if you have something in mind.
#444 | xxyzz (Evangelist) | Posts: 442 | Karma: 2666666 | Join Date: Nov 2020 | Device: none
You can try the test plugin: I've added difficulty values to the English Wiktionary data, and you can set a difficulty limit. However, you can't re-enable words with a higher difficulty value once they have been disabled and the changes saved. For example, set the limit to 4 and save, then reopen the dialog and set the limit to 5: the words with difficulty 5 that were disabled before won't be enabled again.
#445 | Shark69 (Zealot) | Posts: 136 | Karma: 493152 | Join Date: Mar 2012 | Location: Spain | Device: Kindle Oasis 2
Quote:
#446 | xxyzz (Evangelist) | Posts: 442 | Karma: 2666666 | Join Date: Nov 2020 | Device: none
I created a new GitHub repo, Proficiency, for building the Word Wise files. WordDumb will download the Wiktionary files from that repo's releases (a rough sketch of what that download could look like is below). These files are already processed and compressed, so the plugin won't need to download and extract the kaikki.org JSON anymore, which saves a lot of time.

I also pushed some changes to re-enable words that have higher difficulty values. spaCy recently added a Ukrainian model, so I have added Ukrainian support to the master branch. If I haven't broken anything, I hope to create a new release in the next few days.
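For the curious, fetching a pre-built file from a GitHub release could look roughly like this; the repository owner, the asset name, and the .xz compression are guesses on my part, not necessarily what the plugin actually does:

```python
import lzma
import urllib.request
from pathlib import Path

# Hypothetical release URL pattern; the real asset names may differ.
RELEASE_URL = ("https://github.com/xxyzz/Proficiency/releases/"
               "latest/download/{asset}")

def download_wiktionary_data(asset: str, dest_dir: Path) -> Path:
    """Download one compressed release asset and write it decompressed."""
    dest = dest_dir / asset.removesuffix(".xz")
    with urllib.request.urlopen(RELEASE_URL.format(asset=asset)) as resp:
        data = lzma.decompress(resp.read())  # assumes .xz compression
    dest.write_bytes(data)
    return dest
```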
#447 | Shark69 (Zealot) | Posts: 136 | Karma: 493152 | Join Date: Mar 2012 | Location: Spain | Device: Kindle Oasis 2
Quote:
#448 | Enthusiast | Posts: 40 | Karma: 1000 | Join Date: Jun 2022 | Device: Kindle Oasis
Can I customize titles or meanings?

I found out that we can customize the difficulty level of words, but I couldn't find out whether I can customize the word titles or definitions. What I am trying to do is build my own dictionary of the kind WordDumb is built from: I want to add English words one by one with their meanings in Korean. Would that be possible?
#449 | Shark69 (Zealot) | Posts: 136 | Karma: 493152 | Join Date: Mar 2012 | Location: Spain | Device: Kindle Oasis 2
Quote:
1. When I generate all the files, they are generated only for EPUB. The dialog to choose the format has disappeared, and it is impossible to generate X-Ray for Kindle formats.

2. I have the large spaCy model downloaded. After installing this version, the preferences window tells me that I have the medium one, but that is not true. I've compared it with the previous files and they are all the same. If I switch spaCy back to large, nothing happens, because I've already got it.

Nothing else so far. The option to re-enable words that have higher difficulty values works fine, which is very important.

Last edited by Shark69; 08-22-2022 at 12:07 PM.
#450 | xxyzz (Evangelist) | Posts: 442 | Karma: 2666666 | Join Date: Nov 2020 | Device: none
Quote:
Editing it manually is possible but tedious.

Last edited by xxyzz; 08-22-2022 at 10:48 PM.
Tags: worddumb, x-ray