Old 08-31-2022, 10:50 AM   #16
KevinH
Sigil Developer
Posts: 8,439
Karma: 5702578
Join Date: Nov 2009
Device: many
Danish may have this "feature" also, but that means it had that feature (i.e., WORDCHARS includes the ".") before Sigil made any changes. The dictionary itself must be designed to support spellchecking abbreviations. The old Sigil dictionary did not. The new Sigil dictionary does. You will need to take that up with the Danish hunspell dictionary creator why they included the "." in WORDCHARS in their dictionary.

We use it in the Sigil "en" one to support correct spelling checks for abbreviations ("etc.", "Mrs.", "Mr.", ...) and to force the Hunspell dictionary to make better suggestions in English. Not every single uppercase character is a valid standalone word, but the old dictionary included them just to "simulate" abbreviation handling.
Old 08-31-2022, 01:16 PM   #17
elibrarian
Imperfect Perfectionist
Posts: 616
Karma: 863576
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
Quote:
Originally Posted by KevinH View Post
You will need to take that up with the Danish hunspell dictionary creator why they included the "." in WORDCHARS in their dictionary.
Yes, but that's not my point. What I mean is, that it's not correct to say that "any non-Sigil hunspell dictionary will suffice. It doesn't need to be old". (I haven't checked the other 120 or so hunspell dictionaries available, but any one of them could have the "feature" - now or in the future).

Regards

Kim
Old 08-31-2022, 02:00 PM   #18
KevinH
Sigil Developer
Posts: 8,439
Karma: 5702578
Join Date: Nov 2009
Device: many
Yes, understood, that should have been limited to "en" hunspell dictionaries.

That said, if a dictionary maintainer includes "." in its aff WORDCHARS, then it should be adding abbreviations to the word lists that are used to create its .dic files. That is exactly what WORDCHARS means. It is a list of selected punctuation and other special chars (digits, etc.) that the wordlist designers want included as part of "words" when tokenizing text into separate "words". This greatly impacts the size and contents of the word lists employed to build up the dictionary. But some dictionary designers just copied theirs from other western dictionary aff files, probably not understanding exactly what it means.

If that is the case for the Danish dictionary, just removing the "." from the list in WORDCHARS should get you what you want.
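
To make the effect of WORDCHARS concrete, here is a minimal Python sketch of a tokenizer that treats extra characters as part of "words". It is a toy illustration only, not Sigil's or Hunspell's actual tokenizer; the function name `tokenize` and the sample sentence are assumptions made up for this example.

```python
def tokenize(text, wordchars=""):
    """Split text into "words": runs of letters plus any extra WORDCHARS.

    Toy illustration only; a real tokenizer handles many more cases.
    """
    tokens, current = [], []
    for ch in text:
        if ch.isalpha() or ch in wordchars:
            current.append(ch)          # still inside a word
        elif current:
            tokens.append("".join(current))
            current = []
    if current:                         # flush the last word
        tokens.append("".join(current))
    return tokens


sample = "Mr. Clayton met Clayton."
print(tokenize(sample))        # ['Mr', 'Clayton', 'met', 'Clayton']
print(tokenize(sample, "."))   # ['Mr.', 'Clayton', 'met', 'Clayton.']
```

With "." in WORDCHARS, "Clayton" and "Clayton." become two distinct words, so the dictionary has to know about every form that legitimately ends in a period.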

If not, let me know and I will see about creating a **non-gui** setting just in the sigil.ini file that could be manually added or removed to force ignoring that WORDCHARS period just at the ends of words. But that would be only for a future release, not our upcoming one this weekend (hopefully).

Last edited by KevinH; 08-31-2022 at 03:24 PM.
Old 08-31-2022, 06:09 PM   #19
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by KevinH View Post
ps. Since it is only a one line change that will not completely invalidate the user guide spellcheck images, I have aligned the count field right (numerically). Any change in column order will have to come in a future release just not our upcoming one.


Thanks! Looking forward to seeing the changes!

Quote:
Originally Posted by Turtle91 View Post
Way to throw mbear and GregBell under the bus!! Lol
lol. Well, I assume Mbear is a "typical user"!

The past year, I've been spending more time on author subreddits, like /r/selfpublishing, and see a lot of the common (new user) ebook questions/issues that keep getting brought up again and again.

I've also been speaking with Gregg when he writes a new book (few times a year)—so I tend to get an author, non-expert, non-person-who-sits-on-MR-every-day-and-absorbs-every-post perspective:
  • "Should I update Sigil to the latest?"
    • Yes! Of course!
    • (With very rare exceptions.)
    • And keep your LibreOffice up-to-date too!
  • "My Linux version is older and my Windows version is newer. Is this going to mess up my book?"
    • No! It's fine if they're slightly different versions.
  • "How do I do X again? I haven't used Sigil in 9 months."
  • "What's that trick you told me 4 years ago?"
  • [...]

- - -

Side Note: Last month, I came across this fantastic talk:

It discussed the 4 distinct sets of documentation:
  • Tutorials
  • How-To Guides
  • Discussions
  • Reference

and explained how each one serves a different purpose:
  • Tutorials = Learning-Oriented
  • How-To Guides = Problem-Oriented
  • Discussions = Understanding-Oriented
  • Reference = Information-Oriented

The talk completely blew my mind... and anyone who is interested in helping the Sigil ecosystem should watch it.

(Personally, I'll be focusing more of my efforts into Tutorials+How-To Guides. We already have enough buried in Discussions/Reference.)

- - -

Quote:
Originally Posted by Turtle91 View Post
I actually played around with this spellcheck stuff… and the periods after a duplicate word was a slight annoyance, but not that big a deal. Once you add the root word to a dictionary you can refresh the spelling list… it finds the word(s) in the dictionary and doesn’t display them in the misspelled list anymore. The refresh is very fast.
Toggling between 4 states:
  • ON/OFF = Show All Words
    • OFF = Only show misspelled words.
  • ON/OFF = Case-Insensitive Sort

+ sorting by Alphabetical/Count reveals all sorts of useful things.

Each one has its own uses. For example, easily finding all US<->UK spellings or finding all "foreign" words:

The near-doubling of hits (and messed up counts) completely regressed such workflows.

And, as I explained above, the sheer amount of work you can get done by:
  • pure visuals
  • + pattern recognition

is immense.

- - -

Side Note: It's very similar to the great table design principles shown in this fantastic GIF:

And my 2 posts in:

"The Visual Display of Quantitative Information" by Edward Tufte lays it all out.

When working with (tabular) data, you want to remove as much "visual clutter" as possible, and the data becomes much more readable/understandable.

People think they need all those horizontal/vertical lines. No.

People think they need the same info repeated on every single row. No.

Once you begin removing redundancies, and use simple whitespace, things become infinitely more understandable.

Less is more!

- - -

That's all I have to say on this subject for now. I'll be backing off for a while.
Old 08-31-2022, 07:31 PM   #20
DiapDealer
Grand Sorcerer
Posts: 28,341
Karma: 203719142
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
I like Greg and try to help him out whenever I can. But that has nothing to do with the fact that I don't really consider tech-weak author-types to be Sigil's primary audience. It's never going to be a turnkey solution for someone looking to jump in the ebook game. Some assembly will always be required.
Old 09-01-2022, 11:10 PM   #21
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Okay, so I woke up today... and after giving it some thought, I needed to put these numbers into context.

I grabbed a 1.9 million word journal I've worked on and ran it through Spellcheck Lists.

There were 4 categories:
  • Yes/No Periods
    • Sigil 1.9.10 default vs. Pre-1.9.10.
  • Yes/No Numbers
    • "Check Numbers" option on/off.

Here are the results:

[Image: Periods.and.Nums.Comparison.png]

Here's the chart of "sentence-enders" vs. "acronyms":

[Image: Sentence-Ender.vs.Acronyms.png]

Here's the raw data:

Code:
      Total Words = 1921208

Category             Unique   Differ.  % Drop from Prev.

Periods               75160
No Periods            61666    13494    17.95%

Periods+No Nums       66399
No Periods+No Nums    54334     7332    11.04%


# of "Sentence-End"   13494
# of "Acronyms"         188
The way I see it:
  • Yes, Sigil 1.9.10 made spellchecking of ~200 acronyms (0.2%) more accurate...
  • but created ~11–18% more "false positives".
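
For anyone who wants to re-derive the percentages, here is a small Python sketch that recomputes the drops from the unique counts in the raw data above. The helper name `drop` is made up for this example. If I am reading the table right, the 7332 in the second pair matches the "No Periods" vs. "No Periods+No Nums" difference, while the direct "Periods+No Nums" vs. "No Periods+No Nums" difference comes out as 12065.

```python
def drop(before, after):
    """Return (absolute difference, percent drop relative to `before`)."""
    diff = before - after
    return diff, round(100 * diff / before, 2)


# Unique-word counts copied from the raw data above.
periods, no_periods = 75160, 61666
periods_no_nums, no_periods_no_nums = 66399, 54334

print(drop(periods, no_periods))                  # (13494, 17.95)
print(drop(periods_no_nums, no_periods_no_nums))  # (12065, 18.17)
print(drop(no_periods, no_periods_no_nums))       # (7332, 11.89)
print(drop(periods, periods_no_nums))             # (8761, 11.66)
```

The first two lines show the effect of periods (with and without numbers); the last two show the effect of the "Check Numbers" option.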

These false positives:
  • Create "visual clutter"
  • + exacerbate all the problems I mention in the previous posts, making every step of the "proofing" chain slower + less effective.

- - - -

Info (Acronyms)

I considered "acronyms" as all inter-word periods:
  • Common Phrases
    • i.e. + e.g.
    • a.m. + p.m.
    • A.D. + B.C.
    • Ph.D.
  • First + Middle Initial
    • F.A. (Hayek)
    • W.E.B. (Du Bois)
  • Acronyms
    • F.B.I.
    • C.I.A.
    • U.S.A.
    • U.S.S.R.
    • U.C.L.A.
  • States
    • N.Y.
    • N.J.
    • P.A.

Flaws in Counting Method

I did not include URLs (this journal didn't have any) or many of the categories I listed in:

I considered "sentence-ending" to be "only letters + 1 period at end":

While this included the valid:
  • Mr. / Mrs. / Dr.
  • St. (Saint / Street)

these are just a tiny fraction (maybe a few dozen). The vast majority are "duplicate word + period"s.
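
The two counting rules above can be sketched in Python roughly like this. It is only an approximation of what I did by hand; the names `classify` and `SENTENCE_ENDER` and the sample word list are assumptions for illustration.

```python
import re
from collections import Counter

# "Sentence-ender": only letters + exactly 1 period at the end.
SENTENCE_ENDER = re.compile(r"[^\W\d_]+\.")


def classify(token):
    """Classify a Spellcheck List entry by where its periods sit."""
    if SENTENCE_ENDER.fullmatch(token):
        return "sentence-ender"
    if "." in token.rstrip("."):    # a period somewhere inside the word
        return "acronym"
    return "other"


words = ["Clayton.", "word.", "Mr.", "F.B.I.", "Ph.D.", "e.g.", "word"]
print(Counter(classify(w) for w in words))
# Counter({'sentence-ender': 3, 'acronym': 3, 'other': 1})
```

Note that "Mr." lands in the "sentence-ender" bucket, which is exactly the flaw described above.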

- - -

Side Note: Quick Acronyms

In Sigil's Spellcheck Lists, searching for '.' instantly listed nearly all acronyms.

This is "Show All Words" Checked/Unchecked:

Pre-1.9.10:

[Image: Pre-Sigil.-.Spellcheck.List.-.Checked.Show.All.png]

[Image: Pre-Sigil.-.Spellcheck.List.-.Unchecked.Show.All.png]

vs. Sigil 1.9.10:

[Image: Sigil.1.9.10.-.Spellcheck.List.-.Checked.png]

[Image: Sigil.1.9.10.-.Spellcheck.List.-.Unchecked.png]

As you can see, in Sigil 1.9.10—no matter if "Show All" is on/off—it's still flooded with multiple thousands of extras:
  • Checked
    • = Every acronym
    • + every "word + period"
  • Unchecked
    • = Every acronym
    • + nearly every "ALL CAPS + period"
    • + every "misspelled + period".
      • Including nearly everyone's last names, like "Clayton."!

Pre-Sigil 1.9.10:
  • This was a split second skimming.
  • The list was almost pure true acronyms.

Sigil 1.9.10 default:
  • ~200 true acronyms are buried under thousands of sentence-enders.
    • Marginally better when toggling "Show All" ON/OFF.

- - -

Acronym Differences (Sigil 1.9.10 vs. Pre-Change)

When I compared between:
  • Misspelled On/Off

I got 14/188 different acronyms:

Spoiler:
Code:
A.D.
a.m.
A.M.
B.C.
E.g.
e.g.
E.G.
m.p.h.
M.P.H.
N.P.
n.p.
Y.W.C.A.


Sigil 1.9.10 shifted these from misspelled -> correct.

The rest were all the same pre- + post-change.

Acronym Recommendations (Sigil 1.9.10 vs. Pre-change)

Yes, here, I agree, Sigil 1.9.10 handles the acronyms much better:

Code:
Original    1.9.10     Pre

A.C.L.U.    A.C.L.U.   ACOLYTE
A.F.L.      A.F.       AWFUL
C.I.A.      C.I.A.     ACACIA
Y.W.C.A.    Y.W.C.A.   ACADEMY
F.B.I.      B.F.A.     FABIAN
F.B.        FIB        FIB
U.S.A.      U.S.A.     USAGE
U.S.S.R.    U.S.S.R.   SAUSSURE
E.g.        Eg         Eng
Ph.D.       Ph. D.     Ph. D.
but, again, at what cost?
  • ~0.2% of cases getting more accurate recommendations.
    • And acronyms are hard to even find in the List now!
  • vs. ~10–20% guaranteed "visual clutter".
    • In all use-cases of Spellcheck Lists.

- - -

Thought: Hmmmm.... just spitballing ideas out there.

Perhaps something could be done like:
  • If period at end + all letters are capital
    • Consider '.' part of word + use better recommendations.
  • If period at end + any letters are lowercase
    • Trim '.' off end + act the old way.

This would still not be good for things like "Ph.D.", but I believe the vast majority of these true acronyms are of the:

ALL CAPS-type:
  • F.B.I.
  • C.I.A.
  • A.D. / B.C.

This would then remove duplicates like:
  • word.
  • Clayton.
  • Rothbard.
  • Jumbled.

and lower the cluttering by a ton (plus keeping accurate word counts for all non-acronym words!).
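
A rough Python sketch of that rule, just to show the idea. This is not how Sigil works today and not a proposal for the actual implementation; the function name `normalize` and the sample words are made up.

```python
from collections import Counter


def normalize(token):
    """Keep a trailing period on ALL-CAPS tokens; trim it otherwise."""
    if not token.endswith("."):
        return token
    letters = [ch for ch in token if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        return token        # e.g. "F.B.I.": treat '.' as part of the word
    return token[:-1]       # e.g. "Clayton.": trim '.' and act the old way


words = ["word", "word.", "Clayton", "Clayton.", "F.B.I.", "Ph.D."]
print(Counter(normalize(w) for w in words))
# Counter({'word': 2, 'Clayton': 2, 'F.B.I.': 1, 'Ph.D': 1})
```

The duplicates collapse back into single entries with correct counts, while "Ph.D." loses its final period, which is the known weak spot. An ALL CAPS word at the end of a sentence would also keep its period under this rule.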

- - -

Thought #2: I still think a toggle for "Check Periods" would be great.

Again, I can see some usage for this.

(It actually helped me catch a few typos where I missed the closing period on a "U.S.S.R"!)

But, just like the Numbers, it creates MANY more "false positives".

Allowing it to be toggled ON/OFF would allow advanced users to use it, if needed.

As you can see in the stats above:
  • Periods On adds ~10–20% clutter.
  • Numbers On adds ~12%+ clutter.

- - - -

Quote:
Originally Posted by KevinH View Post
And if you want to make accurate counts, I recommend using the new Saved Search Group Counts Report feature and not trying to use SpellCheck for that. It was added for just that purpose.
This is madness!

Why are Spellcheck Lists great?

Because they list all unique words (1-grams) and display them in such a compact form!

To see how/why n-grams are so powerful, see my recent posts in:

Again, I've written about all this stuff since Spellcheck Lists were first introduced back in 2013 (Sigil 0.7.0) based on my recommendation!

You already had near-perfection for all these years. And then you:

And now, 2022, all Spellcheck Lists needed was a little tweak along the edge (acronyms)!

But this new way... no. In my mind, it's 1 micro-step forward, 2 giant leaps backward!

- - -

Come on, KevinH (and Diap)...

Listen to your bestest buddy Tex. When have I ever led you wrong in all these years?

Last edited by Tex2002ans; 09-02-2022 at 12:22 AM.
Old 09-01-2022, 11:24 PM   #22
KevinH
Sigil Developer
Posts: 8,439
Karma: 5702578
Join Date: Nov 2009
Device: many
Again, just install the normal en hunspell dictionary you get online.

And the extra period at the end of unknown words is not a false positive. It is a truly misspelt word that may or may not be an abbreviation.

Your analysis also ignores the improved suggestions for unknown words that proper handling of abbreviations gets you (i.e., single capital letters are no longer treated as valid root words for suggestion generation).

Leaving Sigil exactly as it is now allows either type of en US dictionary to be used by the user. So choose the one you want and install it once. Sigil will not overwrite it.

If I can improve it in some future version I will but for now the new version is staying. So please just install a newer or older standard hunspell en dictionary.

Last edited by KevinH; 09-01-2022 at 11:55 PM.
Old 09-02-2022, 04:04 AM   #23
elibrarian
Imperfect Perfectionist
Posts: 616
Karma: 863576
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
Quote:
Originally Posted by KevinH View Post
If that is the case for the Danish dictionary, just removing the "." from the list in WORDCHARS should get you what you want.

If not, let me know and I will see about creating a **non-gui** setting just in the sigil.ini file that could be manually added or removed to force ignoring that WORDCHARS period just at the ends of words. But that would be only for a future release, not our upcoming one this weekend (hopefully).
Removing the "." from WORDCHARS in the .aff file, and a fresh start of Sigil does indeed do the trick.
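
For anyone who would rather script the change than edit by hand, here is a minimal Python sketch that removes "." from the WORDCHARS line of an .aff file's text. The function name and the sample .aff content are made up for illustration (this is not the real Danish file). Reading and writing the file in the encoding named on its SET line, and keeping a backup copy, are left to the reader.

```python
def strip_period_from_wordchars(aff_text):
    """Return aff_text with "." removed from any WORDCHARS line."""
    out = []
    for line in aff_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "WORDCHARS":
            remaining = fields[1].replace(".", "")
            if not remaining.strip():
                continue            # nothing left: drop the line entirely
            line = "WORDCHARS " + remaining
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical .aff fragment, for illustration only.
sample_aff = "SET UTF-8\nWORDCHARS 0123456789.-\nTRY esan\n"
print(strip_period_from_wordchars(sample_aff))
```

All other lines are passed through unchanged. Restart Sigil afterwards so the dictionary is reloaded.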

I've edited my first posting and removed the wrong info about "REP $_ ._" so as not to confuse future readers of this thread.

That said, I still don't think this setting should be the default, given the very few real misspelled abbreviations and acronyms vs. the number of misspelled words at the end of sentences. But it may be useful in some situations - a setting like the "Check numbers" in preferences in some future version of Sigil perhaps?

Regards,

Kim
All times are GMT -4.