Old 01-19-2022, 08:00 AM   #14
Tex2002ans
Quote:
Originally Posted by Ashjuk
Following on from my post in the future items thread regarding dictionaries I have now worked my way through my default user dictionary [...].
Thanks for the list. I'll take a closer look.

But there are things such as:
  • Columbians

You can see why that's not commonly accepted: it's an easy typo for:
  • Colombians
    • People from the South American country of "Colombia".
    • Notice the 'o'!!!

That typo was so sneaky, it was even sitting in Sigil forever until I spotted it:

I even wrote a whole LanguageTool bug/request about this:

Here's a few of the common phrases I jotted down:

Columbia (with a 'u')
  • British Columbia
  • District of Columbia
  • Columbia River

but almost everything else in common use is actually referring to the country:

Colombia (with an 'o')
  • Colombian peso
  • Colombian government
  • Colombian women
  • Colombian military
  • Colombian drugs
  • [...]

You can see how the spellchecker might want to err on the side of flagging that common error... whereas the grammarchecker can take the surrounding words into account!

If you accidentally wrote:
  • Columbian peso

the grammarchecker will go: "Uhh, did you mean the country?"

Or if you accidentally wrote:
  • The Columbians wallet fell to the floor.

you'd want spellcheck to say: "Uhh, did you mean Colombian + Colombian's + Colombians + Columbia's".

These words are WAY more likely (see Google n-grams).
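
To make the spellchecker vs. grammarchecker difference concrete, here is a minimal sketch of a context-aware rule. It is not how LanguageTool implements its rules; the function name `flag_columbia` and the tiny phrase list (taken from the examples above) are assumptions for illustration only.

```python
import re

# Illustrative phrase list taken from the examples above; a real rule set
# would be far larger.
COLUMBIA_PHRASES = ("british columbia", "district of columbia", "columbia river")

def flag_columbia(text):
    """Return 'Columbia...' words that are not part of a known 'Columbia' phrase."""
    lowered = text.lower()
    # Character ranges covered by the known-good phrases.
    safe_spans = [
        (m.start(), m.end())
        for phrase in COLUMBIA_PHRASES
        for m in re.finditer(re.escape(phrase), lowered)
    ]
    flagged = []
    for m in re.finditer(r"\bColumbian?s?\b", text):
        inside_safe = any(
            start <= m.start() and m.end() <= end for start, end in safe_spans
        )
        if not inside_safe:
            flagged.append(m.group())
    return flagged

print(flag_columbia("She lives in British Columbia."))
print(flag_columbia("Columbian peso"))
print(flag_columbia("The Columbians wallet fell to the floor."))
```

A plain wordlist lookup can only say "Columbian" is valid or invalid everywhere. A rule that looks at the neighbouring words can leave "British Columbia" alone and still question "Columbian peso".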

Quote:
Originally Posted by KevinH
So I am really at a loss here. Should they be kept or removed?
See lots of the fantastic discussion 6 years ago in Firefox Bug #1235506: "en-US dictionary: Additional Mozilla words need to be cleaned up".

Quote:
Originally Posted by KevinH
Even worse ... I checked the long list of First names that MySpell has that hunspell does not into Apple's Pages app and checked them and most of them are actually marked as correct based on the official Apple spellchecker as built into Pages!

So it appears that people's first names are included in many official spellcheckers.
I side with SCOWL.

Names that are extremely common, like:
  • Einstein
  • Newton
  • Aristotle
  • Beethoven

Yes.

Names of famous cities/places:
  • Everest
  • Paris
  • Berlin
  • Washington

Yes.

But getting into rarer and more extreme names? And every name under the sun?

No.

That's why SCOWL's "size 60" list is used as the default. SCOWL has already gone through and included the most popular names/places.

As you rise up through "size 70" (Large) and "size 80", the list of "correctly spelled names" explodes.

In most cases though, these rarer names/spellings only make sense in very specific contexts.

- - -

Side Note: There is also the case of:

Company Names

"Are company names words?" Most dictionaries say NO.

A word like "Facebook" isn't an actual, definable word, and shouldn't belong in the actual dictionary.

... but in the context of spellchecking, yes, some famous companies such as:
  • Microsoft
  • Google
  • Facebook
  • Coca-Cola
  • IBM
  • NVIDIA
  • Qualcomm

or programs:
  • Firefox
  • Photoshop
  • Linux

should be included as exceptions.

(This is where some of LanguageTool's lists help... but they go too far and begin accepting TOO MANY company names. Again, I side with SCOWL's assessment. See some discussion about LT's lists, like wordlist Issue #181.)

Acronyms

Similar situation with acronyms. Super common ones that exist in dictionaries? Yes:
  • FBI
  • CIA
  • USA
  • JPG
  • [...]

... but accepting every acronym under the sun? No!

(LT leans VERY far in that direction. They accept as much as they can, because they're mostly worried about the grammar squigglies, not the spelling.)

- - -

Side Note #2: Anyway, some of this is also described in detail in the:

- SCOWL Readme

especially the section on "proper-names".

There's also a list of how many new Words vs. Names (+ Running Total) are added at each size:

Code:
  Size   Words       Names    Running Total  %
   10    4,425          13        4,438     0.7
   20    8,126           0       12,564     1.9
   35   37,260         220       50,044     7.6
   40    6,858         489       57,391     8.7
   50   25,289      18,683      101,363    15.4
   55    6,487           0      107,850    16.4
   60   14,552         850      123,252    18.7
   70   35,294       7,897      166,443    25.3
   80  144,164      33,368      343,975    52.3
   95  227,630      86,630      658,235   100.0

You can see that by "size 50", the most common ~20k names found in actual dictionaries are already included, like:
  • Einstein
  • Newton
  • Hawking

But beyond the defaults ("size 60"), the names begin exploding, leading to a MUCH higher chance of false positives.

Like size 70 begins introducing:
  • Addressograph
  • Adelbert
  • Adigranth
  • Beaverboard
  • Benedicite
  • Blackmun
  • Pianolas
  • [...]

size 80 begins introducing smaller cities/towns (I believe everything over 10k population?):
  • Alstead
  • Altaloma
  • Amburgey
  • Amherstdale
  • Plumtree
  • Spitalfields
  • [...]

and by "size 95", you're getting all these obscure animal/biology terms too (genera):
  • Heterodontus
  • Hexamita
  • [...]

... Again, SCOWL has already done the "most commonly used words" legwork!!! Stick with the defaults.

Everything beyond that point would be the very rare exceptions! (Although you might catch stuff like "Facebook", etc.)
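
To see the "names explosion" in numbers, here is a small sketch that tallies the Names column of the table above into a cumulative count per size. The figures are copied from that table; the names `SCOWL_ROWS` and `cumulative_names` are just illustrative.

```python
from itertools import accumulate

# (size, new words, new names) copied from the table above.
SCOWL_ROWS = [
    (10, 4425, 13), (20, 8126, 0), (35, 37260, 220), (40, 6858, 489),
    (50, 25289, 18683), (55, 6487, 0), (60, 14552, 850),
    (70, 35294, 7897), (80, 144164, 33368), (95, 227630, 86630),
]

def cumulative_names(rows):
    """Map each size to the total number of names included up to that size."""
    sizes = [size for size, _, _ in rows]
    totals = accumulate(names for _, _, names in rows)
    return dict(zip(sizes, totals))

names_by_size = cumulative_names(SCOWL_ROWS)
for size, total in names_by_size.items():
    print(f"size {size:>2}: {total:>7,} names")
```

By these figures, the default "size 60" carries 20,255 names in total, while "size 95" carries 148,150, roughly seven times as many.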

Quote:
Originally Posted by KevinH
They are not part of scowl but official Apple spellcheckers say they are okay. I wonder how platform specific spellchecking is. I do not have Word to compare it and LibreOffice uses hunspell.
Word misses many things Sigil catches.

Sigil misses many things Word catches.

InDesign misses things Sigil/Word catch.

(This is why I recommend a layered approach when spellchecking! 1 (or more) rounds of spellchecking in multiple programs.)

Quote:
Originally Posted by KevinH
Then of course, there are the differences in word lists attributed to urban slang. For example "zorch" or "zorched". I had to look that one up and the only place I found it was an "urban dictionary" and that it meant "ruined" or "burnt out" as in you "zorched your iphone".
Slang, "Hacker" words, 1337speak, and all this other stuff gets relegated to other "variant" lists (or not at all).

Again, these things mostly come from obscure subcultures, or are not "actual" English!

Perhaps one day, the terms rise in popularity and become "actual words" in the general language... but until then, they definitely shouldn't be polluting default spellcheck lists. :P

These spellchecking dictionaries have to lean much more towards the conservative side, because it's much better to:
  • CATCH the typo (and recommend ACTUAL WORDS in the right-click)

than to:
  • MISS the error (or recommend junk like "zorched")

The default lists should be "size 60", leaning more towards the conservative side, with very rare exceptions added on top.
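
Here is a toy sketch of that trade-off. The two wordlists are tiny hypothetical stand-ins (real SCOWL lists hold many thousands of entries), and `check`/`suggest` are illustrative helpers built on Python's standard `difflib`, not the way hunspell generates suggestions.

```python
from difflib import get_close_matches

def check(word, wordlist):
    """Return True when the spellchecker would accept the word."""
    return word.lower() in wordlist

def suggest(word, wordlist, n=1):
    """Return up to n close matches from the wordlist."""
    return get_close_matches(word.lower(), sorted(wordlist), n=n)

# Hypothetical lists for illustration only.
conservative = {"porch", "scorch", "scorched", "torch"}
permissive = conservative | {"zorch", "zorched"}

typo = "zorched"  # the writer meant "scorched"
print(check(typo, conservative))    # typo is caught (red squiggly)
print(suggest(typo, conservative))  # and an actual word gets suggested
print(check(typo, permissive))      # typo slips through silently
```

With the conservative list the typo is flagged and "scorched" comes back as the suggestion. With the permissive list the same typo is silently accepted.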

Quote:
Originally Posted by KevinH
Based on official commercial spellcheckers in Word and Pages, there are major differences. So spell checker dictionary building is quite subjective and so comparisons of "quality" are very hard to make.
Yep. There's the balancing act between:
  • "red squigglies on too many words" vs. "missing too many actual errors"

This is why SCOWL strongly bases itself on actual English popularity+usage, and heavily curates new additions.

(Like we discussed in the previous topic [and I went into detail in my Reddit posts]... a small fraction of all possible words covers 90%+ of real-life usage.)

And again, I wouldn't worry too much about Sigil's default lists, because we have the fantastic Spellcheck Lists. This is the ultimate tool, and allows you to spellcheck an entire book WAY WAY faster than those one-by-one methods.

(You could even use it to quickly find "misspelled words" + Add to Dictionary or Ignore. Similar to the trick I did back in 2019 to catch "foreign words".)
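
The whole-book idea can be sketched in a few lines. This is not Sigil's implementation, just an assumed minimal version of the concept: gather every word that is missing from the dictionary, with counts, so each one can be reviewed once instead of one occurrence at a time. The wordlist and sample text are made up for the example.

```python
import re
from collections import Counter

def unknown_word_counts(text, wordlist):
    """Count every word in the text that is missing from the wordlist."""
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(w for w in words if w.lower() not in wordlist)

# Hypothetical mini-dictionary and "book".
wordlist = {"the", "wallet", "fell", "to", "floor", "a", "peso"}
book = "The Columbians wallet fell to the floor. A Columbian peso fell to the floor."

for word, count in unknown_word_counts(book, wordlist).most_common():
    print(count, word)
```

Each unknown word shows up once in the list, no matter how often it occurs in the book, which is what makes the review so much faster.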

Quote:
Originally Posted by KevinH
For example, if I am writing a formal document or dissertation, "zorch" should probably be marked wrong as it is just one character away from "porch". But if I am writing modern fiction, "zorch" being correct might be okay!
And "scorch".

(The 's' is extremely close to the 'z'.)

More Side Note: This kind of spelling (+autocorrecting) mess is also becoming much more prevalent with the keyboards+swiping on phones.

Do you know how many actual typos occur because of the virtual keyboard... and then how many autocorrect typos get introduced? Way, way too many.

Especially frustrating are the valid words where it magically adds a space too! (away -> a way).

(This has been angering me so much that for the last year I've been compiling a big ol' list to submit to LT... soon... soon.)

Last edited by Tex2002ans; 01-19-2022 at 09:11 AM.