01-26-2022, 04:10 AM | #61 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
Your suggestion of removing the '-' from WORDCHARS sounds like the best way forward to me, Kevin.
Adding, and maintaining, another 50k words to the word list sounds like it is just creating unnecessary work if the result can be achieved by other means. You saying the old dictionary worked this way answers something that was puzzling me. I was sure I had come across self-defence previously without it being flagged as misspelled. |
01-26-2022, 08:32 AM | #62 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
I am still a little puzzled by how the suggested replacement words work.
Today I right-clicked 'Theater' expecting the first word in the replacement list to be Theatre, but not so. The suggested replacements for Theater are: Heater, Cheater, T heater, Th eater, The ater, The-ater, Heather, Thatcher. Theatre does not even make it onto the list. Yet if I right-click center, the first word it offers is centre. Why is that? Last edited by Ashjuk; 01-26-2022 at 08:36 AM. |
01-26-2022, 10:43 AM | #63 | |
Sigil Developer
Posts: 7,643
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi Ashjuk,
No need to remove it, as I used the scowl dictionary to spellcheck the 47000-word list and found under 100 that were not already properly covered. That is a list I can manage. So we should be good to go.
|
|
01-26-2022, 11:07 AM | #64 | |
Sigil Developer
Posts: 7,643
Karma: 5433388
Join Date: Nov 2009
Device: many
|
No spellcheck dictionary can tell what the original author meant. It is not a grammar checker, it does not know parts of speech, nor can it see the words that surround the misspelled word.
So they take the misspelled word and look for correct words that are only 1 edit distance away, then they try swapping adjacent chars, then they try inserting a new character at every position (including a space), then they run through the replacement table provided in the .aff file, then, if phonetic changes are enabled in the .aff, they try those, and finally, if still no good words are found, they use ngrams to make a suggestion. So the word Theater, which is not spelled correctly under en-GB, is modified to look for "close" words that the original author could have meant. In this case we get the following list: Heater, Cheater, T heater, Th eater, Th-eater, Thea ter, Thea-ter, Theatre, Treater, Heather, which are all only 1 character edits, swaps, or insertions. "Thea" is being generated as a proper name for someone and "ter" is a known abbreviation for "Total Expense Ratio", etc. Having things like "ter" and "th" be considered "words" is generally not a good idea, but scowl obviously included them at some point, most likely to make things like 105th work. Here is where those pieces come from in scowl: english-upper.50:Th, english-abbreviations.70:ter, english-proper-names.50:Thea. The spellchecker has no way to know what you meant by Theater, and based on small changes it could be any of these valid combinations. When suggesting, the case is changed to match the misspelled word's case, which makes Thea (a woman's proper name) quite likely as it would need no case change. If you try the lowercase version "theater" you will get a much smaller list of suggestions, as its case rules out proper first names. Hope this explains things a bit better. This is a great illustration of why adding proper first names and bunches of abbreviations to the spellchecker is not the best idea.
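To make the first few steps above concrete, here is a minimal Python sketch of one-edit candidate generation. It is not hunspell's actual code: the function names and the toy word list are made up for illustration, and it leaves out the .aff replacement table, the phonetic rules, and the ngram fallback entirely.

```python
import string

def one_edit_candidates(word, alphabet=string.ascii_lowercase + " -"):
    """Return every string that is one delete, swap, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | swaps | replaces | inserts) - {word}

def suggest(word, dictionary):
    """Keep candidates whose space- or hyphen-separated parts are all known.

    `dictionary` is assumed to be a set of lowercase words (hypothetical toy
    data here). The first letter's case is restored to match the input.
    """
    good = []
    for cand in sorted(one_edit_candidates(word.lower())):
        parts = cand.replace("-", " ").split(" ")
        if all(p and p in dictionary for p in parts):
            good.append(cand[0].upper() + cand[1:] if word[0].isupper() else cand)
    return good

# Toy word list containing the short fragments discussed above.
toy = {"heater", "cheater", "theatre", "treater",
       "t", "th", "eater", "thea", "ter"}
print(suggest("Theater", toy))
```

With fragments such as "th", "thea" and "ter" in the word list, the split suggestions appear alongside Theatre; remove them from `toy` and only the whole-word suggestions remain.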
You might want to try the size 60 en-GB dictionaries to see if you like those better, as they should have fewer proper names and fewer abbreviations without periods in them, which should prevent those from being considered as valid suggestions. And if a misspelled word begins with an uppercase letter, be prepared for proper first names to be part of the suggestions. People complain when a first name is marked as not correctly spelled and they put pressure on the spellchecker to include it, but such entries really make no sense. One of the reasons is that some programs that use spellcheckers do not allow user word lists to be kept, edited, and used, which in turn leads to main dictionary bloat. Hope this helps.
Last edited by KevinH; 01-26-2022 at 12:10 PM. |
|
01-26-2022, 12:10 PM | #65 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
Thanks for the explanation, Kevin.
It's not an issue as I can easily correct it with a simple edit, but I was a little puzzled that what worked for center/centre seemed not to for Theater. As you say, if it had been theater instead of Theater then the first suggestion is theatre. I will remember in future to look out for capitalised words. Good news about the hyphenated words - hopefully we now have a definitive dictionary. |
01-26-2022, 12:22 PM | #66 |
Sigil Developer
Posts: 7,643
Karma: 5433388
Join Date: Nov 2009
Device: many
|
But I want to fix scowl's handling of abbreviations first. They stick in lots of abbreviations without the proper use of ".", which puts things like "ter" in as a word, which is absurd.
The scowl authors do not seem to understand the meaning or use of the WORDCHARS in the .aff file as they do not include - or . which makes no sense at all. I will endeavour to fix that before any final release. |
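As a rough illustration of why the extra word characters matter, here is a small Python sketch of a word splitter that can optionally treat additional characters as part of a word. It assumes the host application's word splitter honours the characters listed in WORDCHARS; `split_words` and its `extra_wordchars` parameter are hypothetical names, not Sigil or hunspell code.

```python
import re

def split_words(text, extra_wordchars=""):
    """Split text into tokens made of ASCII letters and digits, plus any
    extra characters the caller wants treated as part of a word."""
    pattern = "[A-Za-z0-9" + re.escape(extra_wordchars) + "]+"
    return re.findall(pattern, text)

# Without '-' the two halves are checked separately; with '-' the whole
# hyphenated word has to be found in (or derived from) the dictionary.
print(split_words("a self-defence class"))
print(split_words("a self-defence class", extra_wordchars="-"))
```

When "-" is not a word character, "self-defence" passes as long as "self" and "defence" are both valid words. When it is, the hyphenated form itself has to be known to the dictionary.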
01-26-2022, 01:19 PM | #67 |
Sigil Developer
Posts: 7,643
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Unfortunately, scowl includes over 700 abbreviations (even after ignoring the ones in all caps which are acronyms). None of them have an ending period that in any way would indicate that the word is abbreviated.
The problem is people have started dropping the "." from the most common abbreviations like cm, mm, ft, in, Mrs, Mr, Dr, PhD, etc., which just confuses things even more. I have attached the list of over 700 abbreviations that should probably either have an ending period added or be removed, as they hide common spelling errors and end up polluting suggestions with nonsense. Feedback welcome as to how best to treat these "words". |
01-27-2022, 04:21 AM | #68 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
I have had a quick look at the list and agree with you, there are a lot that should have a period at the end.
I will try to find some time today to go through it and pull out the ones I think definitely should be included as proper abbreviations. Sadly, this is a sign of the times. I now see people writing texts and comments online where there is no punctuation used at all, and often sentences starting with lower case letters. |
01-27-2022, 08:54 AM | #69 |
Sigil Developer
Posts: 7,643
Karma: 5433388
Join Date: Nov 2009
Device: many
|
That would help. Thanks.
|
01-28-2022, 10:16 AM | #70 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
I have now worked my way through the abbreviations file and broken it down into four categories.
1. Items that MS Word accepted as is (probably acronyms etc.). 2. What I think, with the addition of a period, are valid abbreviations and are possible candidates for inclusion. 3. What appear to be valid abbreviations, but are probably rarely encountered. 4. Unknown items. |
01-28-2022, 10:59 AM | #71 |
Sigil Developer
Posts: 7,643
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Thanks so much! I will incorporate all of this into the new dictionary.
|
01-28-2022, 12:07 PM | #72 |
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
|
|
01-28-2022, 01:08 PM | #73 |
Sigil Developer
Posts: 7,643
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Anything is better than ignoring the ending period of abbreviations and making these word fragments indistinguishable from normal words. It seems scowl's defaults of stripping the periods, stripping accents, and including way too many first names are going to lead to exactly the wrong behaviour: hidden typos and poor suggestions. For example, scowl considers "Th" to be a proper name. Any two-letter proper name is going to hide spelling mistakes for words like "To", and "Th" is considered size 50. And "Th" is not the only one.
scowl has its good points but it also has some horrible points. I am thinking of removing all but the top 100 first names and all names less than 3 characters in length for this reason as well. User word lists are much better places for those things than a spellchecker dictionary. Last edited by KevinH; 01-28-2022 at 02:00 PM. |
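The pruning rule described above could be sketched in Python roughly as follows. This is only an illustration: `prune_names`, the sample names, and the idea of passing in a ready-made "top names" list are all assumptions, since the thread does not say where such a list would come from.

```python
def prune_names(names, top_names, min_len=3):
    """Keep only names that are on the allowed list and at least min_len long."""
    allowed = set(top_names)
    return [n for n in names if len(n) >= min_len and n in allowed]

# Hypothetical data: "Th" and "Al" are too short, "Thea" is not on the list.
print(prune_names(["Th", "Thea", "Al", "James"], top_names=["James", "Al"]))
```

Anything dropped this way can still be added back by an individual user through their own user word list.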
01-28-2022, 03:16 PM | #74 |
Sigil Developer
Posts: 7,643
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Okay, I took your abbreviations and the 12-dicts abbreviations and merged them. Then I tweaked Sigil's word parser for spellchecking to pass along ending periods, as hunspell itself is smart enough to check for a valid abbreviation first and then strip the period off and recheck in case it is just the end of a sentence.
This seems to help prevent errors being hidden by bad abbreviations, which also greatly improves suggestions. All of this will be part of the next release of Sigil, which will be a beta release because of the large number of internal changes and new or completely redesigned features. |
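The check-then-strip behaviour described above can be sketched in a few lines of Python. This is a simplified model, not hunspell's implementation: `is_spelled_ok` and the toy word set are hypothetical, and real affix handling is ignored.

```python
def is_spelled_ok(token, words):
    """Accept a token that may carry a trailing period.

    First try the token exactly as given (so "etc." matches an abbreviation
    stored with its period); if that fails and the token ends in ".", strip
    the period and recheck, treating it as the end of a sentence.
    """
    if token in words:
        return True
    if token.endswith("."):
        return token[:-1] in words
    return False

# Toy word list: the abbreviation is stored WITH its period.
words = {"etc.", "cat"}
for token in ["etc.", "etc", "cat.", "ter"]:
    print(token, is_spelled_ok(token, words))
```

Because "etc" without its period is no longer accepted, and a fragment like "ter" is not in the list at all, such fragments stop hiding typos and stop showing up in suggestions.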
01-28-2022, 03:19 PM | #75 | |
Sigil Developer
Posts: 7,643
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Sigil could include a single hunspell French dictionary for Windows and macOS users since it is LGPL'd.
Is the classic one the one we should include?
|
|
|