Old 04-16-2021, 11:29 AM   #1
KevinH
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
100 most frequently used words by language

Hi All,

After recent changes in Sigil, spell checking in CodeView is now the biggest consumer of time when loading a new file.

So does anyone know where I might find a list of the 100 most frequently used words in each language?

I have found a list for English containing the expected words ("the", "and", "a", "of"), but could not find one for most other languages. Of course, a page may well have such a list in another language, but I would not be able to read it, so ...

Hunspell spellchecking is really not optimized for speed because it must handle words that have prefixes, suffixes, or are compound in many languages.

My idea is to create an en_US_cache.txt file of the 100 most frequently used words in en_US.

I would like to do that for as many of the other languages that Sigil supports as possible.

With those files I would like to test how much faster we can make CodeView spellchecking by loading these caches and checking them first.

If the speedup is significant enough to warrant the extra code of keeping these cache lists, then I can go ahead and add them to Sigil.

But I want to test as many other languages as possible so that the speedup is not just for en-US.

If anyone has such a list (one word per line, utf-8 encoded) for any other language and is willing to post it, I would be grateful.
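To make the idea concrete, here is a minimal sketch in Python (Sigil itself is C++, so this only illustrates the logic). It assumes the cache file format described above (one word per line, UTF-8). The names `load_word_cache`, `make_cached_checker`, and `slow_check` are hypothetical; `slow_check` stands in for whatever call does the real Hunspell lookup. Case handling is deliberately left out.

```python
from pathlib import Path
from typing import Callable, FrozenSet


def load_word_cache(path: Path) -> FrozenSet[str]:
    """Read a one-word-per-line UTF-8 file into a set, skipping blank lines."""
    words = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        word = line.strip()
        if word:
            words.add(word)
    return frozenset(words)


def make_cached_checker(
    cache: FrozenSet[str],
    slow_check: Callable[[str], bool],
) -> Callable[[str], bool]:
    """Return a checker that consults the cache before calling slow_check.

    slow_check is a placeholder for the real (expensive) spell check.
    """
    def check(word: str) -> bool:
        # Words in the cache are assumed to be correctly spelled, so the
        # expensive lookup is skipped for them (exact match only).
        if word in cache:
            return True
        return slow_check(word)

    return check
```

Whether this helps depends on how the cost of the extra set lookup on every word compares with the cost of the lookups it avoids, which is exactly what would need measuring.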

If not enough of these most frequently used word lists are available, my second idea is to cache the last 100 words spellchecked. I do not expect it to provide the same speedup factor but it might be worth a try if the frequent word lists are not generally available for most languages.
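A rough sketch of that second idea, again in Python purely for illustration: the standard library's `functools.lru_cache` keeps the most recently used results and evicts the least recently used one when full. The wrapper name `make_recent_word_checker` and the `slow_check` callable are hypothetical stand-ins for the real spell check call.

```python
from functools import lru_cache
from typing import Callable


def make_recent_word_checker(slow_check: Callable[[str], bool], size: int = 100):
    """Wrap slow_check so results for the last `size` distinct words are reused."""

    @lru_cache(maxsize=size)
    def check(word: str) -> bool:
        # Only reached on a cache miss; both "correct" and "misspelled"
        # results are remembered.
        return slow_check(word)

    return check
```

Unlike a fixed frequent-word list, this needs no per-language data, and `check.cache_info()` reports hits and misses, which would show how often the cache is actually being used on a real document.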

Any thoughts, links to word lists, word lists, or feedback on the idea greatly welcome.
Old 04-16-2021, 12:27 PM   #2
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Yes, these are called Stop Words:

https://en.wikipedia.org/wiki/Stop_words

They're used in things like search engines to ignore common words like "of" or "the".

Natural Language Toolkit (NLTK) should have a list of stopwords for different languages:

https://www.nltk.org/

And according to this Stack Exchange question:

"NLTK available languages for stopwords"

the files should be located in:

C:/Users/username/AppData/Roaming/nltk_data/corpora/stopwords

I also was poking around the NLTK site, and think I found the stopwords as a separate download:

https://www.nltk.org/nltk_data/

Download the "Stopwords Corpus", then extract the ZIP, and you get files for:

arabic
azerbaijani
danish
dutch
english
finnish
french
german
greek
hungarian
indonesian
italian
kazakh
nepali
norwegian
portuguese
romanian
russian
slovene
spanish
swedish
tajik
turkish
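If it helps, here is a small Python sketch for turning the extracted files into per-language cache files. It assumes each file in the extracted folder is plain UTF-8 text with one word per line and is named after its language, and that a README file may sit alongside them; I have not verified that for every file. The function names and the `<language>_cache.txt` output naming are made up for illustration. Note that stop word lists are not necessarily ordered by frequency, so taking the first 100 lines is not the same as taking the 100 most frequent words.

```python
from pathlib import Path
from typing import List


def read_stopwords(path: Path, limit: int = 100) -> List[str]:
    """Return up to `limit` distinct words from a one-word-per-line UTF-8 file."""
    seen = set()
    words: List[str] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if len(words) >= limit:
            break
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


def write_caches(corpus_dir: Path, out_dir: Path, limit: int = 100) -> List[Path]:
    """Write <language>_cache.txt for every word list found in corpus_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for src in sorted(corpus_dir.iterdir()):
        # Skip subdirectories and any README shipped with the corpus.
        if not src.is_file() or src.name.upper() == "README":
            continue
        dest = out_dir / f"{src.name}_cache.txt"
        dest.write_text("\n".join(read_stopwords(src, limit)) + "\n", encoding="utf-8")
        written.append(dest)
    return written
```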

If you want top 100, you could also use (for English at least) a list of common words:

https://en.wikipedia.org/wiki/Most_c...rds_in_English

Related Side Note: There's this fascinating thing called Zipf's Law:

1st most commonly used word is used a huge % of the time.
2nd place = ~1/2 1st place.
3rd place = ~1/3 1st place.
4th place = ~1/4 1st place.
[...]

The frequency falls off in inverse proportion to rank (a power law rather than an exponential decay), and all languages follow roughly the same pattern.

The top 5 words are used ~25% of the time, and the top 25 words are used ~50% of the time.
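For a feel of the arithmetic, here is a short Python sketch of the idealized version of the law, where the word at rank r has frequency exactly proportional to 1/r. The function name `zipf_share` and the `vocab_size` parameter are my own; real text only follows this approximately, and the result depends strongly on the vocabulary size you assume, so it should not be expected to reproduce the percentages above exactly.

```python
def zipf_share(top_k: int, vocab_size: int) -> float:
    """Share of all word occurrences covered by the top_k ranks, assuming
    frequency is exactly proportional to 1/rank over vocab_size words."""
    if vocab_size < 1 or not 0 <= top_k <= vocab_size:
        raise ValueError("need vocab_size >= 1 and 0 <= top_k <= vocab_size")
    total = sum(1.0 / rank for rank in range(1, vocab_size + 1))
    top = sum(1.0 / rank for rank in range(1, top_k + 1))
    return top / total


if __name__ == "__main__":
    # Try a few assumed vocabulary sizes to see how the share shifts.
    for size in (1_000, 10_000, 100_000):
        print(size, round(zipf_share(100, size), 3))
```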

For more information on that, check out VSauce's fantastic video: "The Zipf Mystery".

You could also see this in actual action in a few of my Reddit posts:

(Plus the link in my signature... definitely coming soon!*)

Last edited by Tex2002ans; 04-16-2021 at 01:01 PM.
Old 04-16-2021, 12:29 PM   #3
Doitsu
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
There's a stop-words Python package with UTF-8 encoded stop word lists for 23 languages that you might find helpful.
Old 04-16-2021, 01:23 PM   #4
KevinH
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
Wonderful!
Thank you both.

I will grab these, change Sigil to use them, and try to measure the impact on CodeView spellcheck speed.

Thanks
Old 04-16-2021, 02:36 PM   #5
KevinH
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
Well, I tested with Tex2002ans's huge test case (all those chapters merged together into one file) and timed loading it, which includes spellchecking, in two versions of Sigil.

One checked the stopWords list first and the other did not.

The timings to load and spellcheck that huge merged chapter were almost identical.
8.5 seconds on my laptop for both.

So the overhead of checking for stop words roughly cancels out any speed-up in handling those words.

So it is not an effective speedup.
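For anyone who wants to try a similar comparison outside Sigil, here is a minimal Python sketch of this kind of A/B timing. It is not the code used for the test above; `time_checker`, `compare_precheck`, and `slow_check` are hypothetical names, with `slow_check` standing in for the real spell check call.

```python
import time
from typing import Callable, Iterable, Set, Tuple


def time_checker(check: Callable[[str], bool], words: Iterable[str]) -> float:
    """Return the seconds taken to run check over every word."""
    word_list = list(words)
    start = time.perf_counter()
    for word in word_list:
        check(word)
    return time.perf_counter() - start


def compare_precheck(
    words: Iterable[str],
    slow_check: Callable[[str], bool],
    stop_words: Set[str],
) -> Tuple[float, float]:
    """Time slow_check alone, then with a stop word lookup in front of it."""
    word_list = list(words)

    def with_precheck(word: str) -> bool:
        # The set lookup runs for every word, so it adds a cost even
        # when the word turns out not to be a stop word.
        return word in stop_words or slow_check(word)

    return time_checker(slow_check, word_list), time_checker(with_precheck, word_list)
```

Running each variant several times and comparing the best times helps separate a real difference from noise.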

Thanks for the stop word list links. I will keep them in case there is another way they could be used.