#16 |
Sigil Developer
Posts: 8,438
Karma: 5702578
Join Date: Nov 2009
Device: many
|
Danish may have this "feature" also, but that means it had that feature (i.e. WORDCHARS includes the ".") before Sigil made any changes. The dictionary itself must be designed to support spellchecking abbreviations. The old Sigil dictionary did not. The new Sigil dictionary does. You will need to ask the Danish hunspell dictionary creator why they included the "." in WORDCHARS in their dictionary.
We use it in the Sigil "en" one to support correct spelling checks for abbreviations ("etc.", "Mrs.", "Mr.", ...) and to force the Hunspell dictionary to make better suggestions in English, as not every single uppercase character is a valid standalone word, but the old dictionary included them just to "simulate" abbreviation handling. |
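To make the effect of putting "." in WORDCHARS concrete, here is a minimal sketch of a tokenizer that treats extra characters as part of a word. It is only an illustration: `tokenize` and its `wordchars` parameter are made-up names, and this is not Hunspell's or Sigil's actual tokenizer.

```python
import re


def tokenize(text: str, wordchars: str = "") -> list[str]:
    """Split text into "words": a letter followed by letters or WORDCHARS.

    Simplified stand-in for a spellchecker tokenizer (ASCII letters only).
    `wordchars` plays the role of the aff file's WORDCHARS list.
    """
    extra = re.escape(wordchars)
    pattern = rf"[A-Za-z][A-Za-z{extra}]*"
    return re.findall(pattern, text)


# Without "." the abbreviation loses its period before lookup.
print(tokenize("Mrs. Smith met Mr. Jones, etc."))
# With "." the period stays attached, so "Mrs." can be a dictionary entry...
print(tokenize("Mrs. Smith met Mr. Jones, etc.", wordchars="."))
# ...but so does the period on an ordinary sentence-ending word.
print(tokenize("The end.", wordchars="."))
```

The last call shows the side effect: "end." comes out as a separate token from "end", so the word list has to cope with both forms.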
#17 | |
Imperfect Perfectionist
Posts: 616
Karma: 863576
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
Quote:
Regards Kim |
|
#18 |
Sigil Developer
Posts: 8,438
Karma: 5702578
Join Date: Nov 2009
Device: many
|
Yes, understood, that should have been limited to "en" hunspell dictionaries.
That said, if a dictionary maintainer includes "." in its aff WORDCHARS, then it should be adding abbreviations to the word lists that are used to create its .dic files. That is exactly what WORDCHARS means. It is a list of selected punctuation and other special chars (digits, etc.) that the wordlist designers want included as part of "words" when tokenizing text into separate "words". This greatly impacts the size and contents of the word lists employed to build up the dictionary. But some dictionary designers just copied theirs from other western dictionary aff files, probably not understanding exactly what it means.
If that is the case for the Danish dictionary, just removing the "." from the list in WORDCHARS should get you what you want. If not, let me know and I will see about creating a **non-gui** setting just in the sigil.ini file that could be manually added or removed to force ignoring that WORDCHARS period just at the ends of words. But that would be only for a future release, not our upcoming one this weekend (hopefully).
Last edited by KevinH; 08-31-2022 at 03:24 PM. |
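For anyone who prefers to script the "remove the "." from WORDCHARS" step rather than edit the aff file by hand, a rough sketch follows. It assumes the aff content is already loaded as text and that the directive sits on its own line as `WORDCHARS <chars>`; the function name and the sample aff content are invented for illustration and are not taken from the Danish dictionary.

```python
def drop_period_from_wordchars(aff_text: str) -> str:
    """Return aff_text with "." removed from any WORDCHARS line.

    All other lines, and the original line endings, are left untouched.
    """
    out = []
    for line in aff_text.splitlines(keepends=True):
        stripped = line.rstrip("\r\n")
        ending = line[len(stripped):]
        parts = stripped.split(None, 1)
        if len(parts) == 2 and parts[0] == "WORDCHARS":
            stripped = "WORDCHARS " + parts[1].replace(".", "")
        out.append(stripped + ending)
    return "".join(out)


# Hypothetical aff fragment, for illustration only.
sample = "SET UTF-8\nWORDCHARS 0123456789.'\nTRY esianrt\n"
print(drop_period_from_wordchars(sample))
```

Work on a copy of the dictionary and keep the original aff file, since the sketch does not handle the case where "." was the only character listed.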
#19 | ||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Thanks! Looking forward to seeing the changes!

lol. Well, I assume Mbear is a "typical user"!

The past year, I've been spending more time on author subreddits, like /r/selfpublishing, and see a lot of the common (new user) ebook questions/issues that keep getting brought up again and again. I've also been speaking with Gregg when he writes a new book (a few times a year), so I tend to get an author, non-expert, non-person-who-sits-on-MR-every-day-and-absorbs-every-post perspective:

- - -

Side Note: Last month, I came across this fantastic talk:

It discussed the 4 distinct sets of documentation:

and explained how each one serves a different purpose:

The talk completely blew my mind... and anyone who is interested in helping the Sigil ecosystem should watch it.

(Personally, I'll be focusing more of my efforts into Tutorials+How-To Guides. We already have enough buried in Discussions/Reference.)

- - -

Quote:

+ sorting by Alphabetical/Count reveals all sorts of useful things. Each one has its own uses. For example, easily finding all US<->UK spellings or finding all "foreign" words:

The near-doubling of hits (and messed up counts) completely regressed such workflows. And, as I explained above, the sheer amount of work you can get done by:

is immense.

- - -

Side Note: It's very similar to the great table design principles shown in this fantastic GIF:

And my 2 posts in:

"The Visual Display of Quantitative Information" by Edward Tufte lays it all out. When working with (tabular) data, you want to remove as much "visual clutter" as possible, and the data becomes much more readable/understandable.

People think they need all those horizontal/vertical lines. No.

People think they need the same info repeated on every single row. No.

Once you begin removing redundancies, and use simple whitespace, things become infinitely more understandable. Less is more!

- - -

That's all I have to say on this subject for now. I'll be backing off for a while. |
||
#20 |
Grand Sorcerer
Posts: 28,340
Karma: 203719142
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I like Greg and try to help him out whenever I can. But that has nothing to do with the fact that I don't really consider tech-weak author-types to be Sigil's primary audience. It's never going to be a turnkey solution for someone looking to jump into the ebook game. Some assembly will always be required.
|
#21 | |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Okay, so I woke up today... and after giving it some thought, I needed to put these numbers into context.
I grabbed a 1.9 million word journal I've worked on and ran it through Spellcheck Lists. There were 4 categories:
Here are the results:

Here's the chart of "sentence-enders" vs. "acronyms":

Here's the raw data:

Code:
Total Words = 1921208

Category | Unique | Differ. | % Drop from Prev.
Periods | 75160 | |
No Periods | 61666 | 13494 | 17.95%
Periods+No Nums | 66399 | |
No Periods+No Nums | 54334 | 7332 | 11.04%

# of "Sentence-End" | 13494
# of "Acronyms" | 188
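For readers who want to redo the "Differ." and "% Drop" arithmetic on their own book, here is a small sketch using the first pair of rows from the raw data (Periods vs. No Periods). The helper name `drop` is made up for this example; the two counts are copied from the table.

```python
def drop(before: int, after: int) -> tuple[int, float]:
    """Return (difference, percent drop relative to `before`)."""
    diff = before - after
    return diff, round(100 * diff / before, 2)


unique_with_periods = 75160     # "Periods" row
unique_without_periods = 61666  # "No Periods" row

diff, pct = drop(unique_with_periods, unique_without_periods)
print(f"{diff} fewer unique entries ({pct}% drop)")
# prints "13494 fewer unique entries (17.95% drop)"
```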
These false positives:
- - - -

Info (Acronyms)

I considered "acronyms" as all inter-word periods:

Flaws in Counting Method

I did not include URLs (this journal didn't have any) or many of the categories I listed in:

I considered "sentence-ending" to be "only letters + 1 period at end":

While this included the valid:

this is just a tiny fraction (maybe a few dozen). The vast majority are "duplicate word + period"s.

- - -

Side Note: Quick Acronyms

In Sigil's Spellcheck Lists, searching for '.' instantly listed nearly all acronyms. This is "Show All Words" Checked/Unchecked:

Pre-1.9.10:

vs. Sigil 1.9.10:

As you can see, in Sigil 1.9.10, no matter if "Show All" is on/off, it's still flooded with multiple thousands of extras:
Pre-Sigil 1.9.10:
Sigil 1.9.10 default:
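The counting rules described earlier in this post ("acronyms" = periods between letters, "sentence-ending" = only letters plus one period at the end) can be approximated with two regular expressions. This is a sketch of one possible reading of those rules, not the method actually used for the numbers above; the regexes, the `classify` name, and the sample tokens are all assumptions.

```python
import re
from collections import Counter

# Letters, then at least one ".letters" group, with an optional final period.
INTER_WORD = re.compile(r"^[A-Za-z]+(\.[A-Za-z]+)+\.?$")
# Only letters plus exactly one period at the very end.
SENTENCE_END = re.compile(r"^[A-Za-z]+\.$")


def classify(token: str) -> str:
    """Bucket a spellcheck-list entry by where its periods sit."""
    if INTER_WORD.match(token):
        return "acronym"
    if SENTENCE_END.match(token):
        return "sentence-end"
    return "other"


tokens = ["end.", "end", "U.S.S.R.", "Ph.D.", "Mr."]
print(Counter(classify(t) for t in tokens))
```

Note that a genuine abbreviation such as "Mr." lands in the "sentence-end" bucket here, which is the same kind of overlap the post acknowledges in its own counting.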
- - -

Acronym Differences (Sigil 1.9.10 vs. Pre-Change)

When I compared between:

I got 14/188 different acronyms:

Spoiler:

Sigil 1.9.10 shifted these from misspelled -> correct. The rest were all the same pre- + post-change.

Acronym Recommendations (Sigil 1.9.10 vs. Pre-change)

Yes, here, I agree, Sigil 1.9.10 handles the acronyms much better:

Code:
Original | 1.9.10 | Pre
A.C.L.U. | A.C.L.U. | ACOLYTE
A.F.L. | A.F. | AWFUL
C.I.A. | C.I.A. | ACACIA
Y.W.C.A. | Y.W.C.A. | ACADEMY
F.B.I. | B.F.A. | FABIAN
F.B. | FIB | FIB
U.S.A. | U.S.A. | USAGE
U.S.S.R. | U.S.S.R. | SAUSSURE
E.g. | Eg | Eng
Ph.D. | Ph. D. | Ph. D.
- - -

Thought: Hmmmm.... just spitballing ideas out there. Perhaps something could be done like:

This would still not be good for things like "Ph.D.", but I believe the vast majority of these true acronyms are of the ALL CAPS-type:

This would then remove duplicates like:

and lower the cluttering by a ton (plus keeping accurate word counts for all non-acronym words!).

- - -

Thought #2: I still think a toggle for "Check Periods" would be great. Again, I can see some usage for this. (It actually helped me catch a few typos where I missed the closing period on a "U.S.S.R"!)

But, just like the Numbers, it creates MANY more "false positives". Allowing it to be toggled ON/OFF would allow advanced users to use it, if needed.

As you can see in the stats above:
- - - - Quote:
Why are Spellcheck Lists great? Because they list all unique words (1-grams) and display them in such a compact form!

To see how/why n-grams are so powerful, see my recent posts in:

Again, I've written about all this stuff since Spellcheck Lists were first introduced back in 2013 (Sigil 0.7.0) based on my recommendation!

You already had near-perfection for all these years. And then you:

And now, 2022, all Spellcheck Lists needed was a little tweak along the edge (acronyms)! But this new way... no. In my mind, it's 1 micro-step forward, 2 giant leaps backward!

- - -

Come on, KevinH (and Diap)... Listen to your bestest buddy Tex. When have I ever led you wrong in all these years?

Last edited by Tex2002ans; 09-02-2022 at 12:22 AM. |
|
#22 |
Sigil Developer
Posts: 8,438
Karma: 5702578
Join Date: Nov 2009
Device: many
|
Again, just install the normal en hunspell dictionary you get online.
And the extra period at the end of unknown words is not a false positive. It is a truly misspelt word that may or may not be an abbreviation. Your analysis also ignores the improved suggestions for unknown words that proper handling of abbreviations gets you (i.e. single capital letters are no longer treated as valid root words for suggestion generation).
Leaving Sigil exactly as it is now allows either type of en US dictionary to be used by the user. So choose the one you want and install it once. Sigil will not overwrite it. If I can improve it in some future version I will, but for now the new version is staying. So please just install a newer or older standard hunspell en dictionary.
Last edited by KevinH; 09-01-2022 at 11:55 PM. |
#23 | |
Imperfect Perfectionist
Posts: 616
Karma: 863576
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
Quote:
I've edited my first posting and removed the wrong info about "REP $_ ._" so as not to confuse future readers of this thread.
That said, I still don't think this setting should be the default, given the very few real misspelled abbreviations and acronyms vs. the number of misspelled words at the end of sentences. But it may be useful in some situations. Perhaps a setting like "Check numbers" in Preferences, in some future version of Sigil?
Regards, Kim |
|