MobileRead Forums - View Single Post

KevinH · 01-18-2022, 11:15 AM

In common US English, the plural form of "swine" is in fact "swine", not "swines". I would guess your online dictionary is not quite correct with regard to this particular word, especially for US English.

But this may also be a point of divergence between GB vs US based dictionaries.

Please understand that how a dictionary is built or rebuilt is different from what words it actually recognizes as correct. To build a typical western dictionary, you start with a giant list of correctly spelled words in every unique variation ("windows" is different from "window", etc). These can be generated from the edited material of scholars, edited newspapers, etc. (the corpus).

Once you have this list (call it the corpus word list), you then define how your language makes plural forms, possessive forms, what suffixes are typically attached to the end, what prefixes are prepended to the beginning, what rules or conditions need to be met for those to be added, etc. This forms the bulk of the .aff file. Please note that after stripping prefixes and suffixes following the rules and conditions, the resulting root word *MUST* exist in the wordlist on its own (this is not true for general compression, just affix compression).
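To make the idea of an affix rule concrete, here is a minimal Python sketch. It is not the MySpell or Hunspell code. The SuffixRule class, the plurals helper and the four plural rules are illustrative assumptions, loosely modelled on the strip/add/condition shape of suffix entries in an English .aff file.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class SuffixRule:
    """Hypothetical in-memory form of one suffix entry (an assumption for illustration)."""
    flag: str       # flag letter that enables this rule on a root word
    strip: str      # characters removed from the end of the root ("" for none)
    add: str        # characters appended after stripping
    condition: str  # regex that the end of the root must match

    def apply(self, root):
        """Return the derived word, or None if the rule does not fit this root."""
        if not re.search(self.condition + "$", root):
            return None
        if not root.endswith(self.strip):
            return None
        return root[: len(root) - len(self.strip)] + self.add

# Illustrative plural rules only; a real .aff file holds many more entries.
PLURAL_RULES = [
    SuffixRule("S", "y", "ies", "[^aeiou]y"),  # pony -> ponies
    SuffixRule("S", "", "s", "[aeiou]y"),      # day -> days
    SuffixRule("S", "", "es", "[sxzh]"),       # box -> boxes
    SuffixRule("S", "", "s", "[^sxzhy]"),      # window -> windows
]

def plurals(root):
    """Apply every plural rule whose condition fits the root."""
    return [w for r in PLURAL_RULES if (w := r.apply(root)) is not None]

print(plurals("window"))  # ['windows']
print(plurals("swine"))   # ['swines']
```

Note that the rules happily generate "swines" from "swine". Whether that form is accepted depends on whether the root carries the flag, not on the rule itself.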

As a next step, you look through the list and remove any "rare" words that are similar enough to "common" words that they could be generated by a simple typo (short edit distance). This is what SCOWL excels at (it groups words based on usage frequency). The remaining list of words is a "working set" for your language.
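A rough sketch of that filtering step in Python follows. The frequency counts, the rare_below threshold and the build_working_set name are all made up for illustration, and a plain Levenshtein distance stands in for whatever similarity measure a real toolchain would use.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def build_working_set(freq, rare_below, max_dist=1):
    """Keep common words, plus rare words that are not near any common word.

    freq maps word -> usage count (hypothetical data); words with a count
    below rare_below are treated as "rare".
    """
    common = {w for w, n in freq.items() if n >= rare_below}
    keep = set(common)
    for word, n in freq.items():
        if n < rare_below and all(edit_distance(word, c) > max_dist for c in common):
            keep.add(word)
    return keep

# Made-up counts: "calender" is a real but rare word, one edit from "calendar".
freq = {"calendar": 900, "calender": 3, "window": 500, "zygote": 2}
print(sorted(build_working_set(freq, rare_below=10)))
# ['calendar', 'window', 'zygote']
```

Dropping "calender" means a likely typo of "calendar" gets flagged, at the cost of rejecting a legitimate but rare word.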

This working set is used in conjunction with the .aff (affix rules) to "compress" the working set to a set of "root" words plus flags to mark which affix rules are allowed to be applied to each root word. If, for example, a suffix fits the rules for a word but the variation of the word with that suffix was not in the "working set", then no flag is added. This process was called "munching" the "working set".
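Here is a toy version of munching in Python, as a sketch under stated assumptions rather than the real munch tool. The RULES table (flags "S" and "D") and the derive and munch functions are hypothetical, and only suffixes are handled.

```python
import re

# Hypothetical, simplified suffix rules: flag -> list of (strip, add, condition).
RULES = {
    "S": [("y", "ies", r"[^aeiou]y$"), ("", "s", r"[aeiou]y$"),
          ("", "es", r"[sxzh]$"), ("", "s", r"[^sxzhy]$")],
    "D": [("", "d", r"e$"), ("", "ed", r"[^ey]$")],
}

def derive(root, flag):
    """All words that the given flag's rules generate from root."""
    out = set()
    for strip, add, cond in RULES[flag]:
        if re.search(cond, root) and root.endswith(strip):
            out.add(root[: len(root) - len(strip)] + add)
    return out

def munch(working_set):
    """Compress a set of words into {root: flags}.

    A flag is attached only when every form it generates is itself in the
    working set. Words generated by some flagged root are dropped from the
    root list, unless they carry flags of their own.
    """
    flags, covered = {}, set()
    for word in sorted(working_set):
        ok = ""
        for flag in sorted(RULES):
            forms = derive(word, flag)
            if forms and forms <= working_set:
                ok += flag
                covered |= forms
        flags[word] = ok
    return {w: f for w, f in flags.items() if f or w not in covered}

words = {"window", "windows", "swine", "walk", "walked", "walks"}
print(munch(words))
# {'swine': '', 'walk': 'DS', 'window': 'S'}
```

Because "swines" is not in the working set, "swine" ends up with no "S" flag, so a checker using this output would mark "swines" as misspelled.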

The result is a .aff file (with lots of extra pieces added to help it make better suggestions, handle part-of-speech identification, and make phonetic-based spelling corrections) and a .dic file (which is a list of root words plus any flags).
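For reference, a .dic file is plain text: the first line is an entry count, and each following line is a root word, optionally followed by "/" and its flags. A minimal Python reader might look like the sketch below. The parse_dic name is an assumption, and the sketch assumes single-character flags and no extra morphological fields on the line.

```python
def parse_dic(text):
    """Parse simple .dic content into {root: flags}.

    Assumes one-character flags and nothing after the flags on each line;
    real dictionaries can be more elaborate.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = {}
    for ln in lines[1:]:  # the first line holds the entry count
        word, _, flags = ln.partition("/")
        entries[word] = flags
    return entries

sample = "3\nswine\nwalk/DS\nwindow/S\n"
print(parse_dic(sample))
# {'swine': '', 'walk': 'DS', 'window': 'S'}
```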

This works well for the English family of languages and many others originally covered by MySpell or ispell dictionaries. It does not work on languages that allow any combination of words to be itself a word, any combination of prefixes to make new prefixes, any combination of suffixes to make new suffixes, any combinations with pieces that are not actually a word in and of itself, etc.

Hungarian has this issue, which is why, when I retired from MySpell and the OpenOffice lingucomponent project, Hunspell absorbed my old MySpell codebase that was used in OpenOffice and Mozilla. They then had to greatly extend the basic affix compression approach (and the format of the .aff file as a result) to try to do better than MySpell/ispell ever could with those languages. In doing so, they broke the original ability to unmunch a dictionary and really have no way to do that now.

This is great for the many languages that MySpell/ispell never supported (or did not support well), but really does not help "en" and other original MySpell languages, where a wordlist corpus built from edited and scholarly texts, books and other materials is the right approach.

This is where we are now.

I can still unpack "en" based languages and build a wordlist. I can then look at where these words now sit in common usage frequency, build an improved "working set", and affix compress it back to improve the en dictionaries. This approach would help with any other language that started out life as a MySpell based dictionary.

Hope this explains things better.

Quote:

Originally Posted by Ashjuk

That is what I would have expected, but not what I experience in practice -
as you can see from these screen shots.

The word 'swines' has been highlighted as misspelled.

Removing the 's' corrects the problem

But (as you can see) swines is the valid plural of swine.

OK, I will upload it later.