Old 08-15-2022, 09:36 AM   #436
xxyzz
Evangelist
Posts: 442
Karma: 2666666
Join Date: Nov 2020
Device: none
Lately I've been thinking that we could use the method from that prevalence paper (http://btjohns.com/pubs/JDJ_QJEP_2020.pdf) to calculate data for other languages. I know for sure there are some ebook hoarders on the forum, so gathering data from tens of thousands of books may be possible.

I already know how to get text from ebook files. Next I need to figure out how to calculate the SD-AP value, and finally convert the result to a difficulty value (this last step is already done). The plugin could then use this data to enable only hard words in Wiktionary.

Clearly I need help with this project, especially with reproducing the paper and gathering data. Maybe we can create a command-line tool that gathers data from book files and saves it to a SQLite file, so anyone can contribute with their own books.
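
A minimal sketch of what the storage step of such a contribution tool might look like, assuming plain text has already been extracted from each book. The table layout, the function names `count_words` and `add_book`, and the tokenising regex are all hypothetical, and the per-book count stored here is only a simple stand-in, not the SD-AP measure from the paper:

```python
import re
import sqlite3
from collections import Counter

# Assumption: a "word" is any run of Unicode letters (no digits or underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(text: str) -> Counter:
    """Count lower-cased word occurrences in one book's text."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


def add_book(conn: sqlite3.Connection, text: str) -> None:
    """Merge one book's counts into a shared SQLite table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT PRIMARY KEY, "
        "occurrences INTEGER NOT NULL, "
        "books INTEGER NOT NULL)"
    )
    for word, n in count_words(text).items():
        # Create the row on first sight, then add this book's contribution.
        conn.execute("INSERT OR IGNORE INTO words VALUES (?, 0, 0)", (word,))
        conn.execute(
            "UPDATE words SET occurrences = occurrences + ?, books = books + 1 "
            "WHERE word = ?",
            (n, word),
        )
    conn.commit()
```

Because each contributor only adds integer counts, several people's SQLite files could later be merged by summing the two columns per word.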

I don't know if anyone is interested in this. Any suggestions?

Or, even better, we could find existing word prevalence data for non-English languages...

I have already found some word frequency data for non-English languages; using that data would be much easier.

Last edited by xxyzz; 08-15-2022 at 09:55 AM.
Old 08-15-2022, 11:29 AM   #437
Shark69
Zealot
Posts: 136
Karma: 493152
Join Date: Mar 2012
Location: Spain
Device: Kindle Oasis 2
I'm interested in the project. I would like to help you to the best of my ability.
Old 08-15-2022, 10:45 PM   #438
xxyzz
Evangelist
Quote:
Originally Posted by Shark69 View Post
I'm interested in the project. I would like to help you to the best of my ability.
I think we should use currently available data, since many researchers are already working on this topic. I found some useful data:
Maybe we can calculate word occurrence frequencies from Google's data for the languages that weren't filtered with a spellchecker in WorldLex, and only enable words whose frequency is below a threshold.

Which dataset do you think is more suitable for disabling easy words in Wiktionary? If you find a better dataset, please let me know, because I think word frequency is not very accurate compared to other metrics.

Last edited by xxyzz; 08-15-2022 at 10:49 PM.
Old 08-19-2022, 02:35 AM   #439
Shark69
Zealot
I've been looking into what you said, and I agree that the best approach may be to use data that is already on the internet rather than doing the work all over again.
For example, I have been studying WorldLex's Spanish data and I think it is quite good.
One problem could be that there are a lot of words, and many of them are derived forms of others.
As you say, the issue is choosing the thresholds, but my proposal is to make them dynamic.
Let me explain. There are 208,078 Spanish entries in WorldLex. By default, equal thresholds could be set for the 5 difficulty levels: 208,078 / 5 ≈ 41,615 entries per level. If I look at the entries from 41,615 onward, the thresholds may not seem right to me. So each person could set the thresholds as percentages instead.
For example:
Level 1: 0-10%
Level 2: 10-50%
Level 3: 50-60%
Level 4: 60-70%
Level 5: 70-100%
The user could then change the thresholds, and the levels would change accordingly.
I don't know what you think.
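
The percentage idea above can be sketched roughly as follows, assuming the word list is already sorted from most to least frequent. The function name `assign_levels`, the `DEFAULT_CUTS` tuple and the choice to put a word that falls exactly on a boundary into the higher level are assumptions for illustration, not how the plugin works:

```python
from bisect import bisect_right

# Upper bounds (in percent of the ranked list) of levels 1 to 4; level 5 is the rest.
# These mirror the example percentages and are meant to be user-adjustable.
DEFAULT_CUTS = (10, 50, 60, 70)


def assign_levels(words_by_frequency, cuts=DEFAULT_CUTS):
    """Map each word to a level 1..len(cuts)+1 from its rank percentile.

    `words_by_frequency` must be ordered from most to least frequent.
    """
    if list(cuts) != sorted(cuts) or not all(0 < c < 100 for c in cuts):
        raise ValueError("cuts must be increasing percentages between 0 and 100")
    total = len(words_by_frequency)
    levels = {}
    for rank, word in enumerate(words_by_frequency):
        percent = rank * 100 / total  # position of this word in the ranked list
        levels[word] = bisect_right(cuts, percent) + 1
    return levels
```

Changing the thresholds is then just a matter of calling `assign_levels` again with different `cuts`; the stored word list itself never has to change.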

Last edited by Shark69; 08-19-2022 at 02:45 AM.
Old 08-20-2022, 09:57 AM   #440
xxyzz
Evangelist
WorldLex's *CDPc values drop sharply after a few rows and most values are below one; these values look like percentages. I'm not sure what the *Freq and *FreqPm columns mean.

Google's Ngram data has more words and was released more recently, but the frequency has to be computed from the "1-grams" files and the "Total counts for 1-grams" file. According to the Ngram Viewer, the frequencies in Google's data are also mostly below one.
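
As a rough illustration of that computation, here is a sketch that assumes the 1-gram files have already been parsed into `(word, match_count)` pairs and the total count has been read from the totals file. The parsing is left out on purpose because the file layout depends on the dataset version, and the names `relative_frequencies` and `rare_words` are hypothetical:

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def relative_frequencies(rows: Iterable[Tuple[str, int]], total: int) -> Dict[str, float]:
    """Turn raw match counts into relative frequencies (count / total).

    Counts for different casings of the same word are merged.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    counts: Counter = Counter()
    for word, match_count in rows:
        counts[word.lower()] += match_count
    return {word: n / total for word, n in counts.items()}


def rare_words(freqs: Dict[str, float], threshold: float) -> Set[str]:
    """Words whose relative frequency is below the threshold (candidates to enable)."""
    return {word for word, freq in freqs.items() if freq < threshold}
```

Multiplying a relative frequency by 100 gives a percentage, which would be consistent with most values being below one.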

Wiktionary also has many word frequency lists, but they don't include frequency values: https://en.wiktionary.org/wiki/Categ...ts_by_language
https://es.wiktionary.org/wiki/Wikci...de_frecuencias

I'm not sure which data source is better than the others. I'm planning to release a new version soon, so this feature will probably be added in a later release.

Last edited by xxyzz; 08-20-2022 at 10:16 AM.
Old 08-20-2022, 03:47 PM   #441
Shark69
Zealot
Yes, it is quite difficult to guess the best way to approach it. I can only support your idea and contribute when possible.
Thanks
Old 08-20-2022, 09:04 PM   #442
xxyzz
Evangelist
Check out this Wiktionary page: https://en.wiktionary.org/wiki/Wikti...requency_lists

I found two language proficiency test lists that can be converted to difficulty lists. They happen to group the words by difficulty into four or five levels:
Chinese: https://en.wiktionary.org/wiki/Appen...Mandarin_words
Japanese: https://en.wiktionary.org/wiki/Appendix:JLPT

The page doesn't have any Spanish test list. It has some frequency lists, but again it's kind of hard to set difficulty values based on frequency.

Edit:
For Chinese data, I found an excellent Excel file via a link on the TOCFL website (https://tocfl.edu.tw/index.php/teach/download): https://coct.naer.edu.tw/download/tech_report/

Last edited by xxyzz; 08-21-2022 at 02:53 AM.
Old 08-21-2022, 04:04 AM   #443
Shark69
Zealot
It sounds promising. It would be nice if there were similar lists for other languages. In my case, I'm more interested in the distribution of word difficulty in English than in Spanish, because English is the language I don't understand perfectly. There are several lists for English, but it would be perfect to find one with these levels. But I think we're digressing, because I can't figure out how to implement those kinds of lists in the plugin. I don't know if you have something in mind.
Old 08-21-2022, 06:29 AM   #444
xxyzz
Evangelist
You can try the test plugin. I've added difficulty values to the English Wiktionary data and you can set the difficulty limit. But you can't re-enable words that have a higher difficulty value after disabling them and saving the changes. For example: set the limit to 4 and save, then reopen the dialog and set the limit to 5; the words with difficulty level 5 that were enabled before won't be enabled again.
Old 08-21-2022, 12:27 PM   #445
Shark69
Zealot
Ok. I'll try it and keep you updated.
Old 08-22-2022, 12:50 AM   #446
xxyzz
Evangelist
I created a new GitHub repo, Proficiency, to create Word Wise files. WordDumb will download Wiktionary files from this repo's releases. These files are already processed and compressed, so the plugin won't need to download and extract the kaikki.org JSON anymore, which saves a lot of time.

I also pushed some changes to re-enable words that have higher difficulty values.

spaCy recently added a Ukrainian model, so I have added Ukrainian support to the master branch.

If I didn't break anything, I hope to create a new release in the next few days.
Old 08-22-2022, 01:54 AM   #447
Shark69
Zealot
Thanks. I'll give it a try this afternoon.
Old 08-22-2022, 10:49 AM   #448
beecom
Enthusiast
Posts: 40
Karma: 1000
Join Date: Jun 2022
Device: Kindle Oasis
Can I customize titles or meaning?

I found out that we can customize the difficulty level of words.
But I couldn't find out whether I can customize the word titles or definitions.

What I am trying to do is make my own dictionary for WordDumb to build from. I want to add English words one by one with their meanings in Korean.

Is that possible?
Old 08-22-2022, 12:02 PM   #449
Shark69
Zealot
Quote:
Originally Posted by xxyzz View Post
I created a new GitHub repo, Proficiency, to create Word Wise files. WordDumb will download Wiktionary files from this repo's releases. These files are already processed and compressed, so the plugin won't need to download and extract the kaikki.org JSON anymore, which saves a lot of time.

I also pushed some changes to re-enable words that have higher difficulty values.

spaCy recently added a Ukrainian model, so I have added Ukrainian support to the master branch.

If I didn't break anything, I hope to create a new release in the next few days.
Hi, I've just tried this version. It's very nice, but I've found a couple of things:

1. When I generate all files, they are generated only for EPUB. The dialog for choosing the format has disappeared and it is impossible to generate X-Ray for Kindle formats.

2. I have the large spaCy model downloaded. After installing this version, the preferences window tells me that I have the medium one, but that is not true: I've compared with the previous files and they are all the same. If I change the spaCy model to large again, nothing happens, because I've already got it. Nothing else so far.

Being able to re-enable words that have higher difficulty values works fine.
That is very important.

Last edited by Shark69; 08-22-2022 at 12:07 PM.
Old 08-22-2022, 07:25 PM   #450
xxyzz
Evangelist
Quote:
Originally Posted by beecom View Post
But I couldn't find out if I could customize the word titles or definitions.

What I am trying to do is make my own dictionary for WordDumb to build from. I want to add English words one by one with their meanings in Korean.

Is that possible?
You can double-click the definition in the EPUB Wiktionary dialog to edit it, then click the "Save" button to save the change. Editing Kindle definitions is more difficult because the definition data is stored in the Kindle Word Wise database file.

Editing it manually is possible but tedious.

Last edited by xxyzz; 08-22-2022 at 10:48 PM.