Quote:
Originally Posted by Ashjuk
Following on from my post in the future items thread regarding dictionaries I have now worked my way through my default user dictionary [...].
|
Thanks for the list. I'll take a closer look.
But there's things such as:
You can see why that's not commonly accepted, because of the typo with:
- Colombians
- People from the South American country of "Colombia".
- Notice the 'o'!!!
That typo was so sneaky, it was even sitting in Sigil forever until I spotted it:
I even wrote a whole LanguageTool bug/request about this:
Here's a few of the common phrases I jotted down:
Columbia (with a 'u')
- British Columbia
- District of Columbia
- Columbia River
but almost everything else that's popular is actually speaking about the country:
Colombia (with an 'o')
- Colombian peso
- Colombian government
- Colombian women
- Colombian military
- Colombian drugs
- [...]
You can see how the spellchecker might want to err on the side of showing that common error... where the grammarchecker can take into account the surrounding words!
If you accidentally wrote:
the grammarchecker will go: "Uhh, did you mean the country?"
Or if you accidentally wrote:
- The Columbians wallet fell to the floor.
you'd want spellcheck to say: "Uhh, did you mean Colombian + Colombian's + Colombians + Columbia's".
These words are WAY more likely (see Google n-grams).
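To make the "look at the surrounding words" idea concrete, here's a rough Python sketch. It is NOT how LanguageTool or hunspell actually do it. The function name and the three-phrase safe list are just my assumptions, built from the phrases I jotted down above, and a real rule would need a much longer list:

```python
import re

# Assumption: only the three phrases listed above count as "safe" uses of Columbia.
SAFE_PHRASES = ("British Columbia", "District of Columbia", "Columbia River")
PATTERN = re.compile(r"\bColumbia(ns?)?\b", re.IGNORECASE)

def flag_columbia(sentence):
    """Return the Columbia/Columbian(s) spellings worth a 'did you mean the country?' hint."""
    # Character spans covered by a known-good phrase.
    safe_spans = [
        (m.start(), m.end())
        for phrase in SAFE_PHRASES
        for m in re.finditer(re.escape(phrase), sentence, re.IGNORECASE)
    ]
    flagged = []
    for m in PATTERN.finditer(sentence):
        is_demonym = m.group(1) is not None  # "Columbian" / "Columbians"
        in_safe_phrase = any(
            start <= m.start() and m.end() <= end for start, end in safe_spans
        )
        # Demonyms are always flagged; plain "Columbia" only outside a safe phrase.
        if is_demonym or not in_safe_phrase:
            flagged.append(m.group(0))
    return flagged

print(flag_columbia("The Columbians wallet fell to the floor."))
print(flag_columbia("She moved to British Columbia."))
```

The first sentence gets flagged, the second one passes, which is the kind of context a plain word-by-word spellchecker can't see.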
Quote:
Originally Posted by KevinH
So I am really at a loss here. Should they be kept or removed?
|
See lots of the fantastic discussion 6 years ago in Firefox Bug #1235506: "en-US dictionary: Additional Mozilla words need to be cleaned up".
Quote:
Originally Posted by KevinH
Even worse ... I checked the long list of First names that MySpell has that hunspell does not into Apple's Pages app and checked them and most of them are actually marked as correct based on the official Apple spellchecker as built into Pages!
So it appears that people's first names are included in many official spellcheckers.
|
I side with SCOWL.
Names that are extremely common, like:
- Einstein
- Newton
- Aristotle
- Beethoven
Yes.
Names of famous cities/places:
- Everest
- Paris
- Berlin
- Washington
Yes.
But getting into rarer and more extreme names? And every name under the sun?
No.
That's why SCOWL's "size 60" list is used as the default. SCOWL has already gone through and included the most popular names/places.
As you rise up through "size 70" (Large) and "size 80", the list of "correctly spelled names" explodes.
In most cases though, these rarer names/spellings only make sense in very specific contexts.
- - -
Side Note: There is also the case of:
Company Names
"Are company names words?" Most dictionaries say NO.
A word like "Facebook" isn't an actual, definable word, and shouldn't belong in the actual dictionary.
... but in the context of spellchecking, yes, some famous companies such as:
- Microsoft
- Google
- Facebook
- Coca-Cola
- IBM
- NVIDIA
- Qualcomm
or programs:
should be included as exceptions.
(This is where some of LanguageTool's lists help... but they go too far and begin accepting TOO MANY company names. Again, I agree/side with SCOWL's assessment. See some discussion about LT's lists, like wordlist Issue #181.)
Acronyms
Similar situation with acronyms. Super common ones that exist in dictionaries?
... but accepting every acronym under the sun? No!
(LT leans VERY far into that direction. Accept as much as they can, because they're worried mostly about the grammar squigglies, not the spelling.)
- - -
Side Note #2: Anyway, some of this is also described in detail in the:
- SCOWL Readme
especially the section on "proper-names".
There's also a list of how many new Words vs. Names (+ Total) are added in each list:
Code:
Size     Words    Names   Running Total       %
10       4,425       13           4,438     0.7
20       8,126        0          12,564     1.9
35      37,260      220          50,044     7.6
40       6,858      489          57,391     8.7
50      25,289   18,683         101,363    15.4
55       6,487        0         107,850    16.4
60      14,552      850         123,252    18.7
70      35,294    7,897         166,443    25.3
80     144,164   33,368         343,975    52.3
95     227,630   86,630         658,235   100.0
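If you want to double-check how the "Running Total" and "%" columns fall out of the Words + Names columns, here's a small Python sketch. The numbers are copied straight from the table above. The function and variable names are my own, and the extra "cumulative names" figure is something I added to see where the names pile up:

```python
# (size, new words, new names) copied from the table above.
ROWS = [
    (10, 4425, 13),
    (20, 8126, 0),
    (35, 37260, 220),
    (40, 6858, 489),
    (50, 25289, 18683),
    (55, 6487, 0),
    (60, 14552, 850),
    (70, 35294, 7897),
    (80, 144164, 33368),
    (95, 227630, 86630),
]

def running_totals(rows):
    """Return (size, running total, cumulative names, percent of full list) per row."""
    grand_total = sum(words + names for _, words, names in rows)
    result = []
    total = 0
    names_so_far = 0
    for size, words, names in rows:
        total += words + names
        names_so_far += names
        percent = round(100 * total / grand_total, 1)
        result.append((size, total, names_so_far, percent))
    return result

for size, total, names_so_far, percent in running_totals(ROWS):
    print(f"{size:>4} {total:>9,} {names_so_far:>9,} {percent:>6}")
```

The cumulative names column is the interesting bit: it sits just under 20k at size 50, barely moves by size 60, then balloons at 80 and 95.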
You can see by "size 50", there's the most common ~20k names found in actual dictionaries, like:
But beyond the defaults ("size 60"), the names begin exploding, leading to MUCH more chances of false positives.
Like size 70 begins introducing:
- Addressograph
- Adelbert
- Adigranth
- Beaverboard
- Benedicite
- Blackmun
- Pianolas
- [...]
size 80 begins introducing smaller cities/towns (I believe everything over 10k population?):
- Alstead
- Altaloma
- Amburgey
- Amherstdale
- Plumtree
- Spitalfields
- [...]
and by "size 95", you're getting all these obscure animal/biology terms too (genera):
- Heterodontus
- Hexamita
- [...]
... Again, SCOWL has already done the "most commonly used words" legwork!!! Stick with the defaults.
Everything beyond that point would be the very rare exceptions! (Although you might catch stuff like "Facebook", etc.)
Quote:
Originally Posted by KevinH
They are not part of scowl but official Apple spellcheckers say they are okay. I wonder how platform specific spellchecking is. I do not have Word to compare it and LibreOffice uses hunspell.
|
Word misses many things Sigil catches.
Sigil misses many things Word catches.
InDesign misses things Sigil/Word catch.
(This is why I recommend a layered approach when spellchecking! 1 (or more) rounds of spellchecking in multiple programs.)
Quote:
Originally Posted by KevinH
Then of course, they are the differences in word lists attributed to urban slang. For example "zorch" or "zorched". I had to look that one up and the only place I found it was an "urban dictionary" and that is meant "ruined" or "burnt out" as in you "zorched your iphone".
|
Slang, "Hacker" words, 1337speak, and all this other stuff gets relegated to other "variant" lists (or not at all).
Again, these things are mostly obscure subcultures, or not "actual" English!
Perhaps one day, the terms rise in popularity and become "actual words" in the general language... but definitely not polluting default spellcheck lists. :P
These spellchecking dictionaries have to lean much more towards the conservative side, because it's much better to:
- CATCH the typo (and recommend ACTUAL WORDS in the right-click)
than to:
- MISS the error (or recommend junk like "zorched")
The default lists should be "size 60", leaning more towards the conservative side, with very rare exceptions added on top.
Quote:
Originally Posted by KevinH
Based on official commercial spellcheckers in Word and Pages, there are major differences. So spell checker dictionary building is quite subjective and so comparisons of "quality" are very hard to make.
|
Yep. There's the balancing act between:
- "red squigglies on too many words" vs. "missing too many actual errors"
This is why SCOWL strongly bases itself on actual English popularity+usage, and heavily curates new additions.
(Like we discussed in the previous topic [and I went into detail in my Reddit posts]... a small fraction of all possible words covers 90%+ of real-life usage.)
And again, I wouldn't worry too much about Sigil's default lists, because we have the fantastic Spellcheck Lists. This is the ultimate tool, and allows you to spellcheck an entire book WAY WAY faster than those one-by-one methods.
(You could even use it to quickly find "misspelled words" + Add to Dictionary or Ignore. Similar to the trick I did back in 2019 to catch "foreign words".)
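For anyone wondering why a whole-book list beats the one-by-one method, here's the idea as a tiny Python sketch. This is NOT Sigil's actual code, just my assumption of the general approach: collect every word that isn't in the dictionary, count it, and review the list once. The function name, the regex, and the lowercase comparison are all simplifications of mine:

```python
import re
from collections import Counter

def unknown_words(text, dictionary):
    """Return (word, count) pairs for words missing from the dictionary, most frequent first."""
    # Simplification: ASCII letters plus inner apostrophes, compared in lowercase.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    counts = Counter(w for w in words if w.lower() not in dictionary)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

book = "The Columbians wallet fell. The Columbians left, zorched."
known = {"the", "wallet", "fell", "left"}
for word, count in unknown_words(book, known):
    print(count, word)
```

One pass, one list. You fix (or Add/Ignore) "Columbians" once instead of stopping at every single occurrence.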
Quote:
Originally Posted by KevinH
For example, if I am writing a formal document or dissertation, "zorch" should probably be marked wrong as it is just one character away from "porch". But if I am writing modern fiction, "zorch" being correct might be okay!
|
And "scorch".
(The 's' is extremely close to the 'z'.)
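Here's the "zorch" problem as a quick Python sketch. It uses plain Levenshtein edit distance, which is my choice for illustration and not what hunspell does internally (its suggestion logic is more involved). The tiny word sets and the function names are made up for the example:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def check(word, dictionary, max_distance=2):
    """Return (accepted, suggestions); suggestions are dictionary words within max_distance edits."""
    if word in dictionary:
        return True, []  # accepted silently, no red squiggly
    scored = sorted((edit_distance(word, known), known) for known in dictionary)
    return False, [known for dist, known in scored if dist <= max_distance]

conservative = {"porch", "scorch", "torch", "wallet"}  # toy list, not a real dictionary
permissive = conservative | {"zorch"}

print(check("zorch", conservative))
print(check("zorch", permissive))
```

With the conservative list, "zorch" gets flagged and the right-click offers real words. Once "zorch" is in the list, the same typo sails through without a peep.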
More Side Note: This kind of spelling (+autocorrecting) mess is also becoming much more prevalent with the keyboards+swiping on phones.
Do you know how many actual typos occur because of the virtual keyboard... and then how many autocorrect typos get introduced? Way, way too many.
Especially frustrating are the valid words where it magically adds a space too! (away -> a way).
(This has been angering me so much, that for the last year I've been compiling a big ol' list to submit to LT... soon... soon.)