Old 01-18-2022, 11:39 AM   #8
KevinH
Sigil Developer
 
For the record, I was able to unpack the current Hunspell en_US.aff and en_US.dic into its "working list" of words by running Hunspell's "wordforms" tool repeatedly, one root word at a time. It took quite a while to do that.
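
For anyone who wants to repeat the unpacking, here is a rough sketch of the loop described above, not the exact script I used. It assumes the "wordforms" script from the Hunspell tools is on your PATH, that it is called as "wordforms en_US.aff en_US.dic word", and that it prints the generated forms separated by whitespace. The helper names (read_roots, expand) are made up for this example, and the .dic parsing is simplified: it only strips the flags after "/" and anything after a tab.

```python
import subprocess


def read_roots(dic_text):
    """Return the root words found in the text of a Hunspell .dic file."""
    lines = dic_text.splitlines()
    # The first line of a .dic file is normally an approximate entry count.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    roots = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        # Simplification: drop morphological fields after a tab and
        # affix flags after the first "/".
        root = entry.split("\t", 1)[0].split("/", 1)[0]
        roots.append(root)
    return roots


def run_wordforms(aff_path, dic_path, root):
    """Call the external wordforms tool for a single root (assumed CLI)."""
    result = subprocess.run(
        ["wordforms", str(aff_path), str(dic_path), root],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def expand(aff_path, dic_path, roots, runner=run_wordforms):
    """Collect the forms reported for every root, one root at a time."""
    words = set()
    for root in roots:
        words.update(runner(aff_path, dic_path, root))
    return words
```

Starting one process per root is the reason this is slow on a dictionary with tens of thousands of roots. The runner argument is only there so the loop can be tried out without Hunspell installed.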

As it turns out, the current Hunspell dictionary covers 124,340 different words with 52,890 root words.

My old MySpell en_US dictionary, based on Kevin Atkinson's aspell wordlists and related material, actually covers:

152,468 different words with 62,072 root words.

Now coverage isn't everything, but the older wordlist, based on what the authors of scowl used at that time, has greater coverage.

So I am not sure why the Hunspell dictionary has regressed so much (from a spell-check perspective only). Since new words are constantly being created, I would have expected the latest Hunspell en_US dictionary to be larger than 124k words.

I will compare the two lists to each other, and to scowl at different frequency levels, to try to come up with a good compromise.
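
The list-to-list comparison can be sketched roughly as below, assuming each unpacked dictionary has been saved as plain text with one word per line. The function names are made up for this example, and the "+" / "-" prefixes are just a convention chosen here: "+" marks a word found only in the newer list, "-" a word found only in the older one.

```python
def load_words(text):
    """Turn one-word-per-line text into a set, ignoring blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def diff_lines(old_words, new_words):
    """List words unique to either set, prefixed with '+' or '-'."""
    added = sorted(new_words - old_words)    # only in the newer list
    removed = sorted(old_words - new_words)  # only in the older list
    return ["+" + w for w in added] + ["-" + w for w in removed]


if __name__ == "__main__":
    old = load_words("cat\ndog\n")
    new = load_words("cat\nAaren\nAaren's\n")
    for line in diff_lines(old, new):
        print(line)
```

Using sets makes the comparison independent of the order in which the words were unpacked. The comparison is case-sensitive, so "Yasmin" and "yasmin" would count as different words.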

Update:

It seems the differences in coverage are many. Compared to the older MySpell dictionary, the Hunspell en_US dictionary includes lots of proper first names (where spelling differences typically abound) and some rare forms of words that may not merit inclusion.

Here are a few examples:

+Aachen's
+Aaren
+Aaren's
+Aarhus
+Aarhus's
+Aarika
+Aarika's
+Abagael
+Abagael's
+Abagail
+Abagail's
...
+Yasmeen
+Yasmeen's
+Yasmin
+Yasmin's

and things like

+allegoricalness
+allegoricalness's

So the question remains: does anyone expect a spelling dictionary to know all of the variations of people's first names? I would think not.

I would think those are better suited to the User Dictionary, not the main en_US dictionary.
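
One crude way to do a first pass at that split is sketched below. The heuristic is purely an assumption for illustration: any word that starts with a capital letter (after dropping a trailing 's) is set aside as a User Dictionary candidate, and everything else stays a main-dictionary candidate. The function name is made up for this example.

```python
def split_candidates(words):
    """Separate capitalized words (and their possessives) from the rest."""
    main, user = [], []
    for word in sorted(words):
        # Judge possessives such as "Aaren's" by their base form.
        base = word[:-2] if word.endswith("'s") else word
        if base[:1].isupper():
            user.append(word)  # likely a proper name
        else:
            main.append(word)
    return main, user


if __name__ == "__main__":
    sample = {"Aaren", "Aaren's", "Aachen's", "allegoricalness"}
    main, user = split_candidates(sample)
    print("main:", main)
    print("user:", user)
```

This cannot tell a rare first name from a well-known place name, since both are capitalized, and it does nothing about rare lowercase forms like "allegoricalness". The output would still need a manual review, or a check against a frequency-ranked list, before anything is dropped.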

Quite the mess of things indeed. Perhaps using scowl and beginning from scratch would be better.

Any thoughts from anyone on which words are best suited for a main dictionary, given all of the above, are welcome.

Last edited by KevinH; 01-18-2022 at 12:28 PM.