Old 01-17-2022, 01:07 PM   #2
KevinH
Sigil Developer
Thanks for posting those links. I ran both lists through the SCOWL search website, and it found some of those words in full in SCOWL, but not that many.

It seems that Hunspell is in a sorry state itself. Back before it became Hunspell, it was MySpell, and MySpell had an unmunch tool that took the current .dic and .aff files and created a corpus wordlist from them. Similarly, it also had a munch tool that did the opposite. Unfortunately, these tools are now broken for Hunspell, because Hunspell changed and greatly expanded how prefixes and suffixes are defined in the .aff file and changed how compound words are supported. They do not document their new format anyplace. This completely broke munch and unmunch, but no one there felt it was important enough to fix them.
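
To make the unmunch idea concrete, here is a minimal sketch in Python of expanding a .dic entry using simple MySpell-style PFX/SFX rules. It is not the real unmunch: it assumes single-character flags, ignores the cross-product field, compounds, and every Hunspell extension, and treats each rule condition as a regular expression. The function names and the sample rules are made up for illustration.

```python
import re
from collections import defaultdict

def parse_aff(aff_text):
    """Collect simple PFX/SFX rules keyed by flag.

    Rule lines look like: SFX <flag> <strip> <add> <condition>.
    Header lines (e.g. "SFX D Y 3") have only four fields and are skipped.
    """
    rules = defaultdict(list)
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] not in ("PFX", "SFX"):
            continue
        kind, flag, strip, add, cond = parts[:5]
        strip = "" if strip == "0" else strip  # "0" means "strip nothing"
        add = "" if add == "0" else add        # "0" means "add nothing"
        rules[flag].append((kind, strip, add, cond))
    return rules

def expand(entry, rules):
    """Return the root word plus every form its flags allow."""
    word, _, flags = entry.partition("/")
    forms = {word}  # the root itself is always a valid word
    for flag in flags:  # assumption: one character per flag
        for kind, strip, add, cond in rules.get(flag, []):
            if kind == "SFX":
                # Condition must match at the end of the root.
                if word.endswith(strip) and re.search(f"(?:{cond})$", word):
                    forms.add(word[:len(word) - len(strip)] + add)
            else:
                # Condition must match at the start of the root.
                if word.startswith(strip) and re.match(cond, word):
                    forms.add(add + word[len(strip):])
    return forms

# Illustrative rules only, not copied from any shipped dictionary.
SAMPLE_AFF = """\
PFX A Y 1
PFX A 0 re .
SFX D Y 3
SFX D 0 d e
SFX D y ied [^aeiou]y
SFX D 0 ed [aeiou]y
"""

rules = parse_aff(SAMPLE_AFF)
print(sorted(expand("try/D", rules)))     # ['tried', 'try']
print(sorted(expand("create/D", rules)))  # ['create', 'created']
print(sorted(expand("aflame", rules)))    # ['aflame']
```

Note that an entry with no flags still yields the root word itself, and that the conditions are what keep "try" from becoming "tryed".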

They do have a wordforms program that works on only a single base word at a time. It does grok the latest extended .aff file format, but very slowly. I am going to give that a try, alongside their affixcompress.

There is even a Hunspell bug report about this that has been open for literally years, with recent posts but no solution.

https://github.com/hunspell/hunspell/issues/404


So it appears most dictionary maintainers have to just drop prefixes and even suffixes and add new words to the end of the .dic file, which completely defeats the whole purpose of affix compression: shrinking wordlists for much faster access and a much smaller memory footprint.
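
For reference, that workaround amounts to something like the sketch below. It assumes the usual .dic layout (first line is the entry count, then one `word` or `word/FLAGS` entry per line); the function name is hypothetical. Every new word goes in as a bare entry with no flags, so each inflected form has to be listed separately.

```python
def append_words(dic_text, new_words):
    """Append bare (flagless) words to .dic text and refresh the count line."""
    lines = dic_text.splitlines()
    entries = lines[1:]  # lines[0] is the old entry count
    known = {entry.split("/", 1)[0] for entry in entries}
    for word in new_words:
        if word not in known:  # skip roots that are already present
            entries.append(word)
            known.add(word)
    return "\n".join([str(len(entries))] + entries) + "\n"

print(append_words("2\ncreate/D\ntry/D\n", ["aflame", "try"]))
```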

So this leaves me in a bit of a quandary. The tools are not there to do things properly with current Hunspell. Luckily, Hunspell can still read and work with the older .aff format that MySpell developed, which is more than enough for many languages like English, Spanish, Italian, etc., but not for languages like Hungarian, Polish, etc.

So I can use the latest SCOWL wordlists up through and including size 70, add some curated additional words, and then munch them with the MySpell .aff file to create a proper .dic that will still work with Hunspell.
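
The merge step before munching could be as simple as the following sketch. It assumes each wordlist is an iterable of lines with one word per line; the function name is hypothetical, and reading the actual SCOWL files (and their encoding) is left out.

```python
def merge_wordlists(*wordlists):
    """Combine several one-word-per-line lists into a sorted, de-duplicated list."""
    words = set()
    for wordlist in wordlists:
        for line in wordlist:
            word = line.strip()
            if word:  # ignore blank lines
                words.add(word)
    return sorted(words)

scowl_words = ["abet\n", "aback\n"]  # stand-in for lines read from SCOWL files
curated = ["aflame", "abet", ""]      # stand-in for the hand-curated additions
print(merge_wordlists(scowl_words, curated))  # ['aback', 'abet', 'aflame']
```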

Sad really.

Update: I tried Hunspell's wordforms script, but it is buggy enough that it will not produce words that exist only as root words (no prefixes or suffixes). So running wordforms on "aflame", "aback", "abet", etc. will not produce output showing that the root word itself is correct.

That makes automating the generation of a word list much harder than it needs to be. Worst of all, all it really does is generate every possible prefix and suffix combination, even complete nonsense, and pass each one to the Hunspell spell checker. The conditions for adding a prefix or suffix are not even used. It is quadratic or higher in time. Languages with lots of root words, lots of prefixes, lots of suffixes, and/or compound words would never be able to use it.
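
To show why that approach scales so badly, here is a rough Python sketch of generate-everything-then-check. It is not the wordforms script itself, just the pattern as I read it; `is_valid` is a hypothetical stand-in for a call to the spell checker, and the affix lists are made up.

```python
from itertools import product

def brute_force_forms(root, prefixes, suffixes, is_valid):
    """Try every prefix/suffix pairing, ignoring all affix conditions."""
    candidates = [
        prefix + root + suffix
        for prefix, suffix in product([""] + list(prefixes), [""] + list(suffixes))
    ]
    accepted = [c for c in candidates if is_valid(c)]
    return accepted, len(candidates)

lexicon = {"play", "plays", "played", "replay", "replays"}
accepted, tried = brute_force_forms(
    "play", ["re", "un"], ["ed", "s", "ing"], lexicon.__contains__
)
print(tried)     # 12 candidates checked for a single root
print(accepted)  # ['play', 'played', 'plays', 'replay', 'replays']
```

The number of spell-checker calls per root is (prefixes + 1) × (suffixes + 1), and that is before multiplying by the number of root words or considering compounds.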

I will have to write real code to do this properly.

Last edited by KevinH; 01-17-2022 at 03:12 PM.