09-01-2022, 11:10 PM   #21
Tex2002ans
Wizard
Okay, so I woke up today... and after giving it some thought, I needed to put these numbers into context.

I grabbed a 1.9 million word journal I've worked on and ran it through Spellcheck Lists.

I tested 4 combinations of 2 settings:
  • Yes/No Periods
    • Sigil 1.9.10 default vs. Pre-1.9.10.
  • Yes/No Numbers
    • "Check Numbers" option on/off.

Here are the results:

[Image: Periods.and.Nums.Comparison.png]

Here's the chart of "sentence-enders" vs. "acronyms":

[Image: Sentence-Ender.vs.Acronyms.png]

Here's the raw data:

Code:
      Total Words = 1921208

Category             Unique   Differ.  % Drop from Prev.

Periods               75160
No Periods            61666    13494    17.95%

Periods+No Nums       66399
No Periods+No Nums    54334    12065    18.17%


# of "Sentence-End"   13494
# of "Acronyms"         188

The way I see it:
  • Yes, Sigil 1.9.10 made spellchecking of ~200 acronyms (0.2%) more accurate...
  • but left ~18% of the list as "false positives".

These false positives:
  • Create "visual clutter"
  • + exacerbate all the problems I mentioned in the previous posts, making every step of the "proofing" chain slower + less effective.
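
For anyone who wants to recheck the percentages, here is a minimal Python sketch that recomputes the drops from the four unique-word counts in the raw data. The dictionary keys and the `pct_drop` helper are names made up for this example, and it assumes the plain "Periods" / "No Periods" rows were run with "Check Numbers" on.

```python
# Unique-word counts copied from the raw data in this post.
# Assumption: the rows without "No Nums" had "Check Numbers" switched on.
UNIQUE = {
    ("periods", "nums"): 75160,
    ("no_periods", "nums"): 61666,
    ("periods", "no_nums"): 66399,
    ("no_periods", "no_nums"): 54334,
}


def pct_drop(before: int, after: int) -> float:
    """Percent drop going from `before` unique entries to `after`."""
    return 100.0 * (before - after) / before


# Effect of the period handling, with numbers checked and unchecked.
periods_effect = {
    nums: pct_drop(UNIQUE[("periods", nums)], UNIQUE[("no_periods", nums)])
    for nums in ("nums", "no_nums")
}

# Effect of "Check Numbers", with and without periods kept on words.
numbers_effect = {
    p: pct_drop(UNIQUE[(p, "nums")], UNIQUE[(p, "no_nums")])
    for p in ("periods", "no_periods")
}

for label, table in (("periods", periods_effect), ("numbers", numbers_effect)):
    for setting, pct in table.items():
        print(f"{label:8} effect ({setting}): {pct:.2f}%")
```

Each drop is always measured against the larger of the two lists being compared, so the period effect and the number effect can be read off separately.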

- - - -

Info (Acronyms)

I considered "acronyms" as all words with intra-word periods:
  • Common Phrases
    • i.e. + e.g.
    • a.m. + p.m.
    • A.D. + B.C.
    • Ph.D.
  • First + Middle Initial
    • F.A. (Hayek)
    • W.E.B. (Du Bois)
  • Acronyms
    • F.B.I.
    • C.I.A.
    • U.S.A.
    • U.S.S.R.
    • U.C.L.A.
  • States
    • N.Y.
    • N.J.
    • P.A.

Flaws in Counting Method

I did not include URLs (this journal didn't have any) or many of the categories I listed in:

I considered "sentence-ending" to be "only letters + 1 period at end":

While this included the valid:
  • Mr. / Mrs. / Dr.
  • St. (Saint / Street)

these are just a tiny fraction (maybe a few dozen). The vast majority are "duplicate word + period" entries.
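
The two counting rules can be sketched in Python roughly like this. This is not the script used for the numbers in this post, just an illustration: `classify` and both regexes are hypothetical, and "acronym" is taken to mean "a period with a letter on each side of it".

```python
import re
from collections import Counter

# "Sentence-ender": only letters, then exactly one period at the end.
SENTENCE_ENDER = re.compile(r"[A-Za-z]+\.")
# "Acronym": a period sitting between two letters somewhere in the token.
INNER_PERIOD = re.compile(r"[A-Za-z]\.[A-Za-z]")


def classify(token: str) -> str:
    """Sort one Spellcheck List entry into a rough bucket."""
    if "." not in token:
        return "plain"
    if INNER_PERIOD.search(token):
        return "acronym"
    if SENTENCE_ENDER.fullmatch(token):
        return "sentence-ender"
    return "other"


entries = ["word", "word.", "Clayton.", "Mr.", "F.B.I.", "Ph.D.", "e.g.", "3.14"]
print(Counter(classify(e) for e in entries))
```

Note that "Mr." lands in the sentence-ender bucket, which is the flaw described above, and a URL like `www.example.com` would be counted as an acronym, so this only works on text without URLs.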

- - -

Side Note: Quick Acronyms

In Sigil's Spellcheck Lists, searching for '.' instantly listed nearly all acronyms.

This is "Show All Words" Checked/Unchecked:

Pre-1.9.10:

[Images: Pre-Sigil.-.Spellcheck.List.-.Checked.Show.All.png, Pre-Sigil.-.Spellcheck.List.-.Unchecked.Show.All.png]

vs. Sigil 1.9.10:

[Images: Sigil.1.9.10.-.Spellcheck.List.-.Checked.png, Sigil.1.9.10.-.Spellcheck.List.-.Unchecked.png]

As you can see, in Sigil 1.9.10, whether "Show All" is on or off, the list is still flooded with thousands of extras:
  • Checked
    • = Every acronym
    • + every "word + period"
  • Unchecked
    • = Every acronym
    • + nearly every "ALL CAPS + period"
    • + every "misspelled + period".
      • Including nearly everyone's last names, like "Clayton."!

Pre-Sigil 1.9.10:
  • This was a split-second skim.
  • The list was almost pure true acronyms.

Sigil 1.9.10 default:
  • ~200 true acronyms are buried under thousands of sentence-enders.
    • Marginally better when toggling "Show All" ON/OFF.

- - -

Acronym Differences (Sigil 1.9.10 vs. Pre-Change)

When I compared between:
  • Misspelled On/Off

I got 14/188 different acronyms:

Spoiler:
Code:
A.D.
a.m.
A.M.
B.C.
E.g.
e.g.
E.G.
m.p.h.
M.P.H.
N.P.
n.p.
Y.W.C.A.


Sigil 1.9.10 shifted these from misspelled -> correct.

The rest were all the same pre- + post-change.

Acronym Recommendations (Sigil 1.9.10 vs. Pre-change)

Yes, here, I agree, Sigil 1.9.10 handles the acronyms much better:

Code:
Original    1.9.10     Pre

A.C.L.U.    A.C.L.U.   ACOLYTE
A.F.L.      A.F.       AWFUL
C.I.A.      C.I.A.     ACACIA
Y.W.C.A.    Y.W.C.A.   ACADEMY
F.B.I.      B.F.A.     FABIAN
F.B.        FIB        FIB
U.S.A.      U.S.A.     USAGE
U.S.S.R.    U.S.S.R.   SAUSSURE
E.g.        Eg         Eng
Ph.D.       Ph. D.     Ph. D.

but, again, at what cost?
  • ~0.2% of cases getting more accurate recommendations.
    • And acronyms are hard to even find in the List now!
  • vs. ~10–20% guaranteed "visual clutter".
    • In all use-cases of Spellcheck Lists.

- - -

Thought: Hmmmm.... just spitballing ideas out there.

Perhaps something could be done like:
  • If period at end + all letters are capital
    • Consider '.' part of word + use better recommendations.
  • If period at end + any letters are lowercase
    • Trim '.' off end + act the old way.

This would still not be good for things like "Ph.D.", but I believe the vast majority of these true acronyms are the ALL CAPS type:
  • F.B.I.
  • C.I.A.
  • A.D. / B.C.

This would then remove duplicates like:
  • word.
  • Clayton.
  • Rothbard.
  • Jumbled.

and lower the clutter by a ton (plus keep accurate word counts for all non-acronym words!).
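
As a rough Python sketch of that rule (not Sigil's actual code; `normalize` is a made-up helper that just follows the two bullet points above):

```python
from collections import Counter


def normalize(token: str) -> str:
    """Keep a trailing period only when every letter in the token is a capital."""
    if not token.endswith("."):
        return token
    letters = [ch for ch in token if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        return token        # ALL CAPS: treat '.' as part of the word
    return token[:-1]       # any lowercase letter: trim '.' and act the old way


sample = ["word", "word.", "Clayton", "Clayton.", "F.B.I.", "A.D.", "Ph.D.", "Jumbled."]
counts = Counter(normalize(t) for t in sample)
print(counts)
```

In this sample "word." and "Clayton." fold back into "word" and "Clayton", while "F.B.I." and "A.D." keep their periods. The known weak spots are mixed-case or lowercase abbreviations ("Ph.D." becomes "Ph.D", and "e.g." would become "e.g") and any ALL CAPS word that happens to end a sentence, which would keep its period.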

- - -

Thought #2: I still think a toggle for "Check Periods" would be great.

Again, I can see some usage for this.

(It actually helped me catch a few typos where I missed the closing period on a "U.S.S.R"!)

But, just like the Numbers, it creates MANY more "false positives".

Allowing it to be toggled ON/OFF would allow advanced users to use it, if needed.

As you can see in the stats above:
  • Periods On adds ~10–20% clutter.
  • Numbers On adds ~12%+ clutter.
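
To make the toggle idea concrete, here is a toy Python sketch of how two independent options would change the size of the unique-word list. Everything in it is an assumption for illustration: `unique_entries` and its two parameters are hypothetical, "periods off" is modeled as stripping trailing periods, and "numbers off" as skipping any entry that contains a digit. Sigil's real tokenizer may behave differently.

```python
def unique_entries(tokens, check_periods=True, check_numbers=True):
    """Return the set of unique list entries under the two hypothetical toggles."""
    seen = set()
    for tok in tokens:
        # Assumption: with numbers off, entries containing digits are skipped.
        if not check_numbers and any(ch.isdigit() for ch in tok):
            continue
        # Assumption: with periods off, trailing periods are stripped.
        if not check_periods:
            tok = tok.rstrip(".")
        if tok:
            seen.add(tok)
    return seen


tokens = ["word", "word.", "1984", "1984.", "F.B.I.", "Clayton."]
for periods in (True, False):
    for numbers in (True, False):
        size = len(unique_entries(tokens, periods, numbers))
        print(f"periods={periods!s:5} numbers={numbers!s:5} -> {size} unique")
```

With both options on, all six sample entries are distinct. Turning either one off shrinks the list, which is the same kind of effect the stats above show on the real journal.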

- - - -

Quote:
Originally Posted by KevinH
And if you want to make accurate counts, I recommend using the new Saved Search Group Counts Report feature and not trying to use SpellCheck for that. It was added for just that purpose.

This is madness!

Why are Spellcheck Lists great?

Because they list all unique words (1-grams) and display them in such a compact form!

To see how/why n-grams are so powerful, see my recent posts in:

Again, I've written about all this stuff since Spellcheck Lists were first introduced back in 2013 (Sigil 0.7.0) based on my recommendation!

You already had near-perfection for all these years. And then you:

And now, 2022, all Spellcheck Lists needed was a little tweak along the edge (acronyms)!

But this new way... no. In my mind, it's 1 micro-step forward, 2 giant leaps backward!

- - -

Come on, KevinH (and Diap)...

Listen to your bestest buddy Tex. When have I ever led you wrong in all these years?

Last edited by Tex2002ans; 09-02-2022 at 12:22 AM.