01-17-2022, 07:55 AM | #1 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
Sigil Dictionary Update
Following on from my post in the future items thread regarding dictionaries, I have now worked my way through my default user dictionary, comparing it against the US dictionary as bundled with Sigil 1.8.
I checked for validity by using the on-line version of the Merriam-Webster dictionary, and I now have a file of words that I consider possible candidates for future inclusion. I also have a short list of new words that have come into common usage and could also be included. I will now check my default file against a UK dictionary to update the en-GB dictionary currently bundled. I have uploaded the files processed so far to my Google Drive - https://drive.google.com/drive/folde...Vo?usp=sharing |
01-17-2022, 01:07 PM | #2 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Thanks for posting those links. I ran both lists through the scowl search website and it found some of those words in scowl in full but not that many.
It seems that Hunspell is in a sorry state itself. Before it became Hunspell it was MySpell, and MySpell had an unmunch tool that took the current .dic and .aff files and created a corpus wordlist from them. Similarly, it also had a munch tool that did the opposite. Unfortunately, these tools are now broken for Hunspell, because Hunspell changed and greatly expanded how prefixes and suffixes are defined in the .aff file and changed how compound words are supported. They do not document their new format anyplace. This completely broke munch and unmunch, but no one there felt it was important enough to fix them. They do have a wordforms program that does grok the latest extended .aff file formats, but it works on only a single base word at a time and is very slow. I am going to give that a try, alongside their affixcompress. There is even a bug report in Hunspell about this that has been open for literally years, with recent posts but no solutions: https://github.com/hunspell/hunspell/issues/404 So it appears most dictionary maintainers have to just drop prefixes, or even affixes altogether, and add new words on the end of the .dic file, which completely defeats the whole purpose of affix compression: shrinking wordlists for much faster access and a much smaller memory footprint. So this leaves me in a bit of a quandary. The tools are not there to do things properly with current Hunspell. Luckily Hunspell can still read and work with the older .aff format that MySpell developed, which is more than enough for many languages like English, Spanish, Italian, etc. but not for languages like Hungarian, Polish, etc. So I can use the latest scowl wordlists up through and including 70, add some curated additional words, and then munch them with the MySpell .aff file to create a proper .dic that will still work with Hunspell. Sad really.
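For anyone following along, here is a rough sketch of what "unmunching" means for the simple, MySpell-style affix format. It is not the real unmunch tool: the tiny AFF and DIC strings are made-up sample data loosely modelled on English plural rules, flags are assumed to be single characters, and cross products, compounds and every Hunspell extension are ignored.

```python
import re

# Hypothetical sample data in the simple MySpell-style layout.
AFF = """\
SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxzh]
SFX S 0 s [^sxzhy]
"""

DIC = """\
3
window/S
pony/S
abet
"""


def parse_aff(text):
    """Collect PFX/SFX rules as flag -> [(kind, strip, add, condition)]."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        # Rule lines have at least 5 fields; 4-field header lines are skipped.
        if len(parts) < 5 or parts[0] not in ("PFX", "SFX"):
            continue
        kind, flag, strip, add, cond = parts[:5]
        strip = "" if strip == "0" else strip
        add = "" if add == "0" else add
        # Suffix conditions apply to the end of the root, prefix ones to the start.
        pattern = re.compile(cond + "$" if kind == "SFX" else "^" + cond)
        rules.setdefault(flag, []).append((kind, strip, add, pattern))
    return rules


def unmunch(dic_text, rules):
    """Expand every root/flags entry into the full list of words it covers."""
    words = set()
    for entry in dic_text.splitlines()[1:]:  # first line is the entry count
        entry = entry.strip()
        if not entry:
            continue
        root, _, flags = entry.partition("/")
        words.add(root)  # the root is always a valid word on its own
        for flag in flags:
            for kind, strip, add, pattern in rules.get(flag, []):
                if not pattern.search(root):
                    continue  # the rule's condition is not met for this root
                if kind == "SFX" and root.endswith(strip):
                    words.add(root[: len(root) - len(strip)] + add)
                elif kind == "PFX" and root.startswith(strip):
                    words.add(add + root[len(strip):])
    return sorted(words)


print(unmunch(DIC, parse_aff(AFF)))
```

Note that an entry with no flags at all (like "abet" here) still ends up in the output, because the root is added before any rule is tried.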
Update: I tried Hunspell's wordforms script, but it is buggy enough that it will not produce words that exist only as root words (no prefixes or suffixes). So running wordforms on "aflame", "aback", "abet", etc. will not produce output confirming that the root word is itself correct. That makes automating the generation of a word list much harder than it needs to be. Worst of all, all it really does is generate every possible prefix and suffix combination, even complete nonsense, and pass each one to the Hunspell spell checker. The conditions for adding the prefix or suffix are not even used. It is quadratic or higher in time. Languages with lots of root words, lots of prefixes, lots of suffixes and/or compound words would never be able to use it. I will have to write real code to do this properly. Last edited by KevinH; 01-17-2022 at 03:12 PM. |
01-18-2022, 04:39 AM | #3 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
I have to admit I was surprised myself that the majority in my list were derivations of existing words and not unknown.
I know nothing about how spellcheckers work, but I was puzzled that they do not seem to be able to detect the simple addition of 's' to pluralise a word, or an apostrophe to denote possession. I have deleted a lot from my list that were just that. From what you have said, it appears that trying to fix this might open up a huge can of worms. Perhaps it would be best for all just to leave things as they are and carry on adding words to a default user dictionary as they are encountered. I will probably abandon the checking for the UK. It's a very time-consuming operation checking each word in the list against a dictionary. But if you do manage to find a way of amending the word lists I will pick it up again. Thanks for taking a look anyway. |
01-18-2022, 04:57 AM | #4 |
Evangelist
Posts: 482
Karma: 2267928
Join Date: Nov 2015
Device: none
|
Not all Hunspell dictionaries can be unmunched because of compounding. Even without compounding, some languages have extremely productive affixes (for example, some Slavic languages use a separate adjective for each ordinal numeral, each of which can take 10-15 different adjectival endings).
Sadly, munching and unmunching were left behind in the ispell days of simple affix files (like the English one from the ispell documentation). |
01-18-2022, 08:33 AM | #5 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
@Sarmat89
Understood, but munch and unmunch in MySpell were much better than in ispell, which did not handle cross products well at all. But many languages, as you said, need much more. @Ashjukj As for simple plurals and possessives, they should in fact be added to the wordlist corpus. When the list is affix compressed (munched), the plurals are properly detected and the suffixes are stripped and replaced with a flag, which keeps the root word list (.dic) small but word coverage large. That is the whole point of affix compression. Please do generate your UK wordlist (no need to check each one); we can do that via scowl. Include all variations of the words you have encountered. We can fix the en based dictionaries. Last edited by KevinH; 01-18-2022 at 09:41 AM. Reason: fixed my typos and made it clearer who I was responding to |
01-18-2022, 09:08 AM | #6 | |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
Quote:
as you can see from these screen shots. The word 'swines' has been highlighted as misspelled. Removing the 's' corrects the problem. But (as you can see) swines is the valid plural of swine. OK, I will upload it later. |
|
01-18-2022, 10:15 AM | #7 | |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
In common US English, the plural form of "swine" is in fact "swine", not "swines". I would guess your online dictionary is not quite correct in regards to this particular word, especially for US English.
But this may also be a point of divergence between GB and US based dictionaries. Please understand that how a dictionary is built or rebuilt is different from what words it actually recognizes as correct. To build a typical western dictionary, you start with a giant list of correctly spelled words in every unique variation ("windows" is different from "window", etc.). These can be gathered from the edited material of intelligent scholars, edited newspapers, etc. (the corpus). Once you have this list (call it the corpus word list), you then define how your language makes plural forms and possessive forms, what suffixes are typically attached to the end, what prefixes are prepended to the beginning, what rules or conditions need to be met for those to be added, etc. This forms the bulk of the .aff file. Please note that after stripping prefixes and suffixes following the rules and conditions, the resulting root word *MUST* exist in the wordlist on its own (this is not true for general compression, just affix compression). As a next step, you look through the list removing any "rare" words that are similar enough to "common" words that they could be generated by a simple typo (short edit distance). This is what scowl excels at (it groups words based on usage frequency). The remaining list of words is a "working set" for your language. This working set is used in conjunction with the .aff (affix rules) to "compress" the working set to a set of "root" words plus flags that mark which affix rules are allowed to be applied to each root word. If, for example, a suffix fits the rules for a word but the variation of the word with that suffix was not in the "working set", then no flag is added. This process was called "munching" the "working set". The result is a .aff file (with lots of extra pieces added to help it make better suggestions, handle part-of-speech identification, and correct phonetic spelling errors) and a .dic file (which is a list of root words plus any flags).
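To make the munching step above a bit more concrete, here is a toy sketch. It is not the actual munch tool: the RULES table is an invented two-flag subset of English-like suffix rules, prefixes and cross products are left out, and a flag is attached only when every form it would generate is already in the working set.

```python
import re

# Hypothetical suffix rules keyed by flag: (strip, add, condition on the root).
RULES = {
    "S": [("y", "ies", r"[^aeiou]y$"), ("", "s", r"[aeiou]y$"),
          ("", "es", r"[sxzh]$"), ("", "s", r"[^sxzhy]$")],
    "D": [("", "d", r"e$"), ("", "ed", r"[^ey]$")],
}


def forms(root, flag):
    """Words the flag would generate from root (only rules whose condition matches)."""
    out = set()
    for strip, add, cond in RULES[flag]:
        if re.search(cond, root) and root.endswith(strip):
            out.add(root[: len(root) - len(strip)] + add)
    return out


def munch(working_set):
    """Compress a working set into 'root/FLAGS' entries."""
    words = set(working_set)
    flags = {w: "" for w in words}
    covered = set()  # words that some flagged root can regenerate
    for root in sorted(words):
        for flag in sorted(RULES):
            generated = forms(root, flag)
            # No flag is added unless every generated form is in the working set.
            if generated and generated <= words:
                flags[root] += flag
                covered |= generated
    # Keep a word as a root if nothing regenerates it, or if it carries flags itself.
    roots = sorted(w for w in words if w not in covered or flags[w])
    return [r + ("/" + flags[r] if flags[r] else "") for r in roots]


working = ["window", "windows", "pony", "ponies", "swine", "bake", "baked"]
print(munch(working))
```

In this sample "swine" gets no S flag, because "swines" is not in the working set, so a checker built from the result would still flag "swines".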
This works well for the English family of languages and many others originally covered by MySpell or ispell dictionaries. It does not work on languages that allow any combination of words to be itself a word, any combination of prefixes to make new prefixes, any combination of suffixes to make new suffixes, any combinations with pieces that are not actually words in and of themselves, etc. Hungarian has this issue, which is why, when I retired from MySpell and the OpenOffice lingucomponent project, Hunspell absorbed my old MySpell codebase that was used in OpenOffice and Mozilla. They then had to greatly extend the basic affix compression approach (and the format of the .aff file as a result) to try to do better than MySpell/ispell ever could with those languages. In doing so they broke the original ability to unmunch a dictionary and really have no way to do that now. This is great for the many, many languages that MySpell/ispell never supported (or did not support well), but it really does not help "en" and other original MySpell languages, where a wordlist corpus built from edited and scholarly texts, books and other materials is the right approach. This is where we are now. I can still unpack "en" based languages and build a wordlist. I can look at where these words are now in common usage frequency, build an improved "working set", and affix compress it back to improve the en dictionaries. This approach would help with any other language that started out life as a MySpell based dictionary. Hope this explains things better.
|
|
01-18-2022, 11:39 AM | #8 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
For the record, it took a while but I was able to unpack the current Hunspell en_US.aff and en_US.dic into its "working list" of words using repeated use of Hunspell's "wordforms" one root word at a time. It took quite a while to do that.
As it turns out, the current Hunspell dictionary covers 124,340 different words with 52,890 root words. My old MySpell en_US dictionary, based on Kevin Atkinson's aspell wordlists and things, actually covers 152,468 different words with 62,072 root words. Now coverage isn't everything, but the older wordlist, based on what the authors of scowl used at that time, has greater coverage. So I am not sure why the Hunspell dictionary has regressed so much (from a spell check perspective only). I would have thought that, as new words are constantly being created, the coverage of the latest Hunspell en_US dictionary would be larger than 124k words. I will compare the two lists to each other and to scowl at different frequency levels to try to come up with a good compromise. Update: It seems the differences in coverage are many. The Hunspell en_US dictionary includes lots of proper first names (where spelling differences typically abound), and some rare forms of words that may not merit inclusion when compared to the older MySpell dictionary. Here are a few examples: +Aachen's +Aaren +Aaren's +Aarhus +Aarhus's +Aarika +Aarika's +Abagael +Abagael's +Abagail +Abagail's ... +Yasmeen +Yasmeen's +Yasmin +Yasmin's and things like +allegoricalness +allegoricalness's So the question remains: does anyone expect a spelling dictionary to know all of the variations of people's first names? I would think not. I would think those are better suited to the User Dictionary, not the main en_US dictionary. Quite the mess of things indeed. Perhaps using scowl and beginning from scratch would be better. Any thoughts on which words are best suited to a main dictionary, given all of the above, are welcome from anyone. Last edited by KevinH; 01-18-2022 at 12:28 PM. |
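Comparing two unpacked word lists like this comes down to set differences. A minimal sketch, assuming each list is already in memory as plain strings; the two short sample lists and the function names are purely illustrative and are not the real dictionaries.

```python
def coverage_diff(list_a, list_b):
    """Return (words only in A, words only in B), each sorted."""
    a, b = set(list_a), set(list_b)
    return sorted(a - b), sorted(b - a)


def summarize(words):
    """Count entries that look like proper names or possessive forms."""
    capitalized = sum(1 for w in words if w[:1].isupper())
    possessive = sum(1 for w in words if w.endswith("'s"))
    return {"total": len(words), "capitalized": capitalized, "possessive": possessive}


# Hypothetical sample lists.
list_a = ["Aachen", "Aachen's", "Aaren", "Aaren's", "aback", "abet", "allegoricalness"]
list_b = ["Aachen", "aback", "abet", "aflame"]

only_a, only_b = coverage_diff(list_a, list_b)
for word in only_a:
    print("+" + word)  # same "+word" style as the examples in this post
print(summarize(only_a))
```

Counting capitalized and possessive entries separately gives a quick feel for how much of a coverage gap is proper names rather than ordinary words.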
01-18-2022, 12:37 PM | #9 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
Thanks for the explanation, Kevin. I think I have a slightly better understanding of how spellchecking works now.
There is always going to be this issue with US vs UK - as Shaw is reputed to have said "England and America are two countries separated by a common language". I know you said not to bother but I have opened my UK word list in MS Word and I am using that to spellcheck it. Using that it has returned less than 20% of them as misspelled, which I am checking against a couple of on-line dictionaries. It would be great to be able to use the OED as a reference, but unfortunately my budget does not run to that. I would never expect a dictionary to include people's names. I put them in a separate user Names dictionary as I come across them. |
01-18-2022, 12:40 PM | #10 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Just to test if scowl includes coverage of common first names:
Columns are listed below; the last three columns are the Google Books Stats[*].
Word | In en_US | Found In | Notes | Should Include | Frequency (per million) | Newness
Aachen's | YES | | | | |
Aaren | NO | | | * | 0.0043 | 1.7
Aaren's | NO | | | | |
Aarhus | NO | en_US-large | | *** | 0.5090 | 1.0
Aarhus's | NO | | | | |
Aarika | NO | | | * | 0.0002 | 0.5
Aarika's | NO | | | | |
Abagael | NO | | | * | 0 |
Abagael's | NO | | | | |
Abagail | NO | | | ** | 0.0163 | 1.1
Abagail's | NO | | | | |
...
Yasmeen | NO | | | ** | 0.0576 | 1.9
Yasmeen's | NO | | | | |
Yasmin | NO | | | *** | 0.2611 | 1.5
Yasmin's | NO | | | | |
So unless your first name overlaps with a city name or region name, scowl does not include it, although these names are detected in Google's search. I would think common first names that do not coincide with rivers, counties, countries, states, regions, etc. should not be in a spelling tool like Hunspell's en_US dictionary. |
01-18-2022, 01:16 PM | #11 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Actually, I have that backwards! It was the old MySpell raw word list that has so many first names not the Hunspell one.
Either way, I think restarting with the scowl lists and then carefully adding curated words makes the most sense. |
01-18-2022, 01:38 PM | #12 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Even worse ... I pasted the long list of first names that MySpell has and Hunspell does not into Apple's Pages app, and most of them are actually marked as correct by the official Apple spellchecker built into Pages!
So it appears that people's first names are included in many official spellcheckers. So I am really at a loss here. Should they be kept or removed? They are not part of scowl, but official Apple spellcheckers say they are okay. I wonder how platform specific spellchecking is. I do not have Word to compare against, and LibreOffice uses Hunspell. Then of course, there are the differences in word lists attributed to urban slang, for example "zorch" or "zorched". I had to look that one up, and the only place I found it was an "urban dictionary", where it meant "ruined" or "burnt out", as in you "zorched your iphone". Based on the official commercial spellcheckers in Word and Pages, there are major differences. So spell checker dictionary building is quite subjective, and comparisons of "quality" are very hard to make. For example, if I am writing a formal document or dissertation, "zorch" should probably be marked wrong as it is just one character away from "porch". But if I am writing modern fiction, "zorch" being correct might be okay! Last edited by KevinH; 01-18-2022 at 02:17 PM. |
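The "one character away" idea is the usual edit distance (Levenshtein) measure. A small sketch of how a rare word could be checked against a list of common words; the COMMON list and the near_common helper are made up for illustration and are not how scowl or Hunspell actually do it.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (or match)
        prev = cur
    return prev[-1]


def near_common(word, common_words, max_dist=1):
    """Common words within max_dist edits of word (possible intended spellings)."""
    return [c for c in common_words if 0 < edit_distance(word, c) <= max_dist]


COMMON = ["porch", "torch", "swine", "window"]  # hypothetical "common" list

print(near_common("zorch", COMMON))
print(near_common("swines", COMMON))
```

A rare word with one or more close common neighbours is a candidate for leaving out of a conservative main dictionary, since accepting it can hide a genuine typo.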
01-19-2022, 04:47 AM | #13 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
It appears that the word base in applications such as Pages and Word is far more comprehensive than the Hunspell one.
Pasting your list in #10 into my copy of Word 2010, only Aaren, Aarika and Abagael are shown as misspelled. This is also borne out by the fact that when I opened my list of words flagged by Hunspell as misspelled in Word, probably less than 20% were showing as being so. Personally I don't think it's a good idea to include people's names in a dictionary. These days people seem to want to change the spelling of their 'common' name just to stand out from the crowd - and some are just completely bizarre. As for 'zorch', I would have put that into my slang dictionary. Its use is probably quite common in certain groups, but it is not a word in wide usage (at least here in my part of the UK). Just going back a few posts to my example of how removing the 's' affects the spellchecker, where we debated the use of swine vs swines as a plural: I was thinking further about this, and it is all dependent on how you use the word swine. 1. Swine as a pig: In that situation I would use swine as the plural - 'a herd of swine'. 2. Swine as in a person behaving badly: In that situation I would use swines - 'you bunch of swines!' Anyway, I really had not intended it to become this complicated. I thought that it would have been a simple matter of just adding some new words to the default dictionary. If it is going to take a lot of work then forget the whole thing, and we can just carry on using our user-defined dictionaries. |
01-19-2022, 08:00 AM | #14 | ||||||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
But there's things such as:
You can see why that's not commonly accepted, because of the typo with:
That typo was so sneaky, it was even sitting in Sigil forever until I spotted it: I even wrote a whole LanguageTool bug/request about this: Here's a few of the common phrases I jotted down: Columbia (with a 'u')
but almost everything else that's popular is actually speaking about the country: Colombia (with an 'o')
You can see how the spellchecker might want to err on the side of showing that common error... where the grammarchecker can take into account the surrounding words! If you accidentally wrote:
the grammarchecker will go: "Uhh, did you mean the country?" Or if you accidentally wrote:
you'd want spellcheck to say: "Uhh, did you mean Colombian + Colombian's + Colombians + Columbia's". These words are WAY more likely (see Google n-grams). See lots of the fantastic discussion 6 years ago in Firefox Bug #1235506: "en-US dictionary: Additional Mozilla words need to be cleaned up". Quote:
Names that are extremely common, like:
Yes. Names of famous cities/places:
Yes. But getting into rarer and more extreme names? And every name under the sun? No. That's why SCOWL's "size 60" list is used as the default. SCOWL has already gone through and included the most popular names/places. As you rise up through "size 70" (Large) and "size 80", the list of "correctly spelled names" explodes. In most cases though, these rarer names/spellings only make sense in very specific contexts. - - - Side Note: There is also the case of: Company Names "Are company names words?" Most dictionaries say NO. A word like "Facebook" isn't an actual, definable word, and shouldn't belong in the actual dictionary. ... but in the context of spellchecking, yes, some famous companies such as:
or programs:
should be included as exceptions. (This is where some of LanguageTool's lists help... but they go too far and begin accepting TOO MANY company names. Again, I agree/side with SCOWL's assessment. See some discussion about LT's lists like wordlist Issue #181.) Acronyms Similar situation with acronyms. Super common ones that exist in dictionaries?
... but accepting every acronym under the sun? No! (LT leans VERY far into that direction. Accept as much as they can, because they're worried mostly about the grammar squigglies, not the spelling.) - - - Side Note #2: Anyway, some of this is also described in detail in the: - SCOWL Readme especially the section on "proper-names". There's also a list of how many new Words vs. Names (+ Total) are added in each list: Code:
Size     Words    Names  Running Total      %
  10     4,425       13          4,438    0.7
  20     8,126        0         12,564    1.9
  35    37,260      220         50,044    7.6
  40     6,858      489         57,391    8.7
  50    25,289   18,683        101,363   15.4
  55     6,487        0        107,850   16.4
  60    14,552      850        123,252   18.7
  70    35,294    7,897        166,443   25.3
  80   144,164   33,368        343,975   52.3
  95   227,630   86,630        658,235  100.0
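As a quick sanity check on that table, the Running Total and % columns can be recomputed from the Words and Names columns. A small sketch using only the numbers quoted above; the function name and tuple layout are just for illustration.

```python
# (size, new words, new names) per SCOWL size level, copied from the table above.
ROWS = [
    (10, 4425, 13), (20, 8126, 0), (35, 37260, 220), (40, 6858, 489),
    (50, 25289, 18683), (55, 6487, 0), (60, 14552, 850),
    (70, 35294, 7897), (80, 144164, 33368), (95, 227630, 86630),
]


def running_totals(rows):
    """Return (size, cumulative total, percent of the full list) per level."""
    total = 0
    cumulative = []
    for size, words, names in rows:
        total += words + names
        cumulative.append((size, total))
    grand_total = total
    return [(size, t, round(100 * t / grand_total, 1)) for size, t in cumulative]


for size, total, percent in running_totals(ROWS):
    print(f"{size:>4} {total:>9,} {percent:>6}")
```

The recomputed values agree with the quoted columns, including the jump from 18.7% at size 60 to 25.3% at size 70.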
But beyond the defaults ("size 60"), the names begin exploding, leading to MUCH more chances of false positives. Like size 70 begins introducing:
size 80 begins introducing smaller cities/towns (I believe everything over 10k population?):
and by "size 95", you're getting all these obscure animal/biology terms too (Genuses):
... Again, SCOWL has already done the "most commonly used words" legwork!!! Stick with the defaults. Everything beyond that point would be the very rare exceptions! (Although you might catch stuff like "Facebook", etc.) Quote:
Sigil misses many things Word catches. InDesign misses things Sigil/Word catch. (This is why I recommend a layered approach when spellchecking! 1 (or more) rounds of spellchecking in multiple programs.) Quote:
Again, these things are mostly obscure subcultures, or not "actual" English! Perhaps one day, the terms rise in popularity and become "actual words" in the general language... but definitely not polluting default spellcheck lists. :P These spellchecking dictionaries have to lean much more towards the conservative side, because it's much better to:
than to:
The default lists should be "size 60", leaning more towards the conservative side, with very rare exceptions added on top. Quote:
This is why SCOWL strongly bases itself on actual English popularity+usage, and heavily curates new additions. (Like we discussed in the previous topic [and I went into detail in my Reddit posts]... a small fraction of all possible words covers more than 90%+ of real-life usage.) And again, I wouldn't worry too much about Sigil's default lists, because we have the fantastic Spellcheck Lists. This is the ultimate tool, and allows you to spellcheck an entire book WAY WAY faster than those one-by-one methods. (You could even use it to quickly find "misspelled words" + Add to Dictionary or Ignore. Similar to the trick I did back in 2019 to catch "foreign words".) Quote:
(The 's' is extremely close to the 'z'.) More Side Note: This kind of spelling (+autocorrecting) mess is also becoming much more prevalent with the keyboards+swiping on phones. Do you know how many actual typos occur because of the virtual keyboard... and then how many autocorrect typos get introduced? Way, way too many. Especially frustrating are the valid words where it magically adds a space too! (away -> a way). (This has been angering me so much, that for the last year I've been compiling a big ol' list to submit to LT... soon... soon. ) Last edited by Tex2002ans; 01-19-2022 at 09:11 AM. |
01-19-2022, 08:53 AM | #15 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
I have now uploaded the UK word list to my Google Drive (link as previous).
I checked these by opening the file in Word 2010 (set with UK spelling) and checking those that Word highlighted as misspelled. I used the following as references to confirm validity. https://www.lexico.com/ https://www.collinsdictionary.com/ and a digital copy of the OED |
|