Old 01-17-2022, 07:55 AM   #1
Ashjuk
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
Sigil Dictionary Update

Following on from my post in the future items thread regarding dictionaries I have now worked my way through my default user dictionary comparing this against the US dictionary as bundled with Sigil 1.8.

I checked for validity by using the on-line version of the Merriam-Webster dictionary, and I now have a file of words that I consider possible candidates for future inclusion.

I have also a short list of new words that have come into common usage that could also be included.

I will now check my default file against a UK dictionary for updating the en-GB dictionary currently bundled.

I have uploaded the files processed so far to my Google Drive - https://drive.google.com/drive/folde...Vo?usp=sharing
Old 01-17-2022, 01:07 PM   #2
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
Thanks for posting those links. I ran both lists through the scowl search website and it found some of those words in scowl in full but not that many.

It seems that Hunspell is in a sorry state itself. Back before it became Hunspell, it was MySpell, and MySpell had an unmunch tool that took the current .dic and .aff files and created a corpus wordlist from them. Similarly, it also had a munch tool that did the opposite. Unfortunately, these tools are broken for Hunspell now, as Hunspell changed and greatly expanded how prefixes and suffixes are defined in the .aff file and changed how compound words are supported. They do not document their new format anyplace. This completely broke munch and unmunch, but no one there felt it was important enough to fix them.

They do have a wordforms program which works only on a single base word at a time but that does grok the latest extended .aff file formats, but very slowly. I am going to give that a try, alongside their affixcompress.

There is even a bug report in Hunspell about this that has been open for literally years with recent posts but no solutions.

https://github.com/hunspell/hunspell/issues/404


So it appears most dictionary maintainers have to just drop prefixes and even affixes and just add new words on the end of the .dic file, which completely defeats the whole purpose of affix compression to shrink wordlists for much faster access and much smaller memory footprints.

So this leaves me in a bit of a quandary. The tools are not there to do things properly with current Hunspell. Luckily Hunspell can still read and work with the older .aff format that MySpell developed, which is more than enough for many languages like English, Spanish, Italian, etc., but not for languages like Hungarian, Polish, etc.

So I can use the latest scowl wordlists up through and including 70, and then add some curated additional words and then munch them with the MySpell aff file to create a proper .dic that will still work with hunspell.

Sad really.

Update: I tried Hunspell's wordforms script, but it is buggy enough that it will not produce words that only exist as root words (no prefixes or suffixes). So running wordforms on "aflame", "aback", "abet", etc. will not produce output showing that the root word is itself correct.

That makes automating the generation of a word list much harder than it needs to be. Worst of all, all it really does is generate every possible prefix and suffix combination, even complete nonsense, and pass each one through the Hunspell spell checker. The conditions for adding a prefix or suffix are not even used. It is quadratic or higher in time. Languages with lots of root words, lots of prefixes, lots of suffixes and/or compound words would never be able to use it.

I will have to write real code to do this properly.

Last edited by KevinH; 01-17-2022 at 03:12 PM.
Old 01-18-2022, 04:39 AM   #3
Ashjuk
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
I have to admit I was surprised myself that the majority in my list were derivations of existing words and not unknown.

I know nothing about how spellcheckers work, but I was puzzled that they do not seem to be able to detect the simple addition of 's' to pluralise a word, or an apostrophe to denote possession. I have deleted a lot from my list that were just that.

From what you have said, it appears that trying to fix this might open up a huge can of worms. Perhaps it would be best for all just to leave things as they are and carry on adding words to a default user dictionary as they are encountered.

I will probably abandon the checking for the UK. It's a very time consuming operation checking each word in the list against a dictionary. But if you do manage to find a way of amending the word lists I will pick it up again.

Thanks for taking a look anyway.
Old 01-18-2022, 04:57 AM   #4
Sarmat89
Evangelist
Posts: 482
Karma: 2267928
Join Date: Nov 2015
Device: none
Not all Hunspell dictionaries can be unmunched because of compounding. Even without compounding, some languages have extremely productive affixes (for example, some Slavic languages use a separate adjective for each ordinal numeral, each of which can take 10-15 different adjectival endings).
Sadly, munching and unmunching were left behind in the ispell days of simple affix files (like the English one from the ispell documentation).
Old 01-18-2022, 08:33 AM   #5
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
@Sarmat89
Understood, but munch and unmunch in MySpell were much better than in ispell, which did not handle cross products well at all. But many languages, as you said, need much more.

@Ashjuk
As for simple plurals and possessives, they should in fact be added to the wordlist corpus. When affix compressed (munched) the plurals are properly detected and the suffixes are stripped and replaced with a flag which keeps the root word list (.dic) small but word coverage large. That is the whole point of affix compression.
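
To make that concrete, here is a minimal sketch in Python of how a root word plus flags expands back into the words it covers. The rule table is made up for illustration and is much simpler than the real en_US.aff; the names SUFFIX_RULES and expand are hypothetical and not part of any MySpell or Hunspell tool.

```python
import re

# Hypothetical, simplified suffix rules in the spirit of a MySpell-style .aff file.
# Each flag maps to a list of (strip, add, condition) triples; the condition is a
# regular expression the root word must match before the rule may be applied.
SUFFIX_RULES = {
    "S": [
        ("y", "ies", r"[^aeiou]y$"),    # pony -> ponies
        ("", "s", r"[aeiou]y$"),        # day -> days
        ("", "es", r"(s|x|z|ch|sh)$"),  # box -> boxes
        ("", "s", r"[^sxzhy]$"),        # window -> windows
    ],
    "M": [
        ("", "'s", r".$"),              # window -> window's
    ],
}


def expand(entry):
    """Expand one .dic-style entry such as 'window/SM' into all of its word forms."""
    root, _, flags = entry.partition("/")
    forms = {root}
    for flag in flags:
        for strip, add, condition in SUFFIX_RULES.get(flag, []):
            if re.search(condition, root):
                # Remove the 'strip' characters from the end, then append 'add'.
                stem = root[: len(root) - len(strip)]
                forms.add(stem + add)
    return sorted(forms)
```

With these assumed rules, expand("window/SM") yields window, window's and windows, while an entry with no flags, such as "aflame", yields only the root itself. So a plural is accepted only when the root carries the matching flag.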

Please do generate your UK wordlist. There is no need to check each word; we can do that via scowl. Include all variations of the words you have encountered.

We can fix the en based dictionaries.

Last edited by KevinH; 01-18-2022 at 09:41 AM. Reason: fixed my typos and made it clearer who I was responding to
Old 01-18-2022, 09:08 AM   #6
Ashjuk
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
Quote:
Originally Posted by KevinH View Post
As for simple plurals and possessives, they should in fact be added to the wordlist corpus. When affix compressed (munched) the plurals are properly detected and the suffixes are stripped and replaced with a flag which keeps the root word list (.dic) small but word coverage large. That is the whole point of affix compression.
That is what I would have expected, but not what I experience in practice -
as you can see from these screen shots.


The word 'swines' has been highlighted as misspelled.


Removing the 's' corrects the problem


But (as you can see) swines is the valid plural of swine.

Quote:
Originally Posted by KevinH View Post
Please do generate your UK wordlist. There is no need to check each word; we can do that via scowl. Include all variations of the words you have encountered.

We can fix the en based dictionaries.
OK, I will upload it later.
Attached Thumbnails: Sigil 001.jpg, Sigil 002.jpg, Sigil 003.jpg
Old 01-18-2022, 10:15 AM   #7
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
In common US English, the plural form of "swine" is in fact "swine", not "swines". I would guess your online dictionary is not quite correct in regard to this particular word, especially for US English.

But this may also be a point of divergence between GB vs US based dictionaries.

Please understand, how a dictionary is built or rebuilt is different from what words it actually recognizes as correct. To build a typical western dictionary, you start with a giant list of correctly spelled words, with every variation counted as its own entry ("windows" is different from "window", etc.). These can be gathered from edited material of intelligent scholars, edited newspapers, etc. (the corpus).

Once you have this list (call it the corpus word list), you then define how your language makes plural forms, possessive forms, what suffixes are typically attached to the end, what prefixes are prepended to the beginning, what rules or conditions need to be met for those to be added, etc. This forms the bulk of the .aff file. Please note that after stripping prefixes and suffixes following the rules and conditions, the resulting root word *MUST* exist in the wordlist on its own (this is not true for general compression, just affix compression).

As a next step you look through the list removing any "rare" words that are similar enough to "common" words that they could be generated by a simple typo (short edit distance). This is what scowl excels at (it groups words based on usage frequency). The remaining list of words is a "working set" for your language.
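
A rough sketch of that filtering step, in Python, might look like the following. The frequencies, the rare_below threshold and the function names are all made up for illustration; this is not how scowl itself is implemented, only the general idea of dropping a rare word when it is one typo away from a common one.

```python
def edit_distance(a, b):
    """Levenshtein distance between two words (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def working_set(frequencies, rare_below=1.0, max_distance=1):
    """Keep common words, plus rare words that are not near any common word."""
    common = [w for w, f in frequencies.items() if f >= rare_below]
    kept = set(common)
    for word, freq in frequencies.items():
        if freq >= rare_below:
            continue
        # A rare word close to a common word is more likely a typo than intended.
        if not any(edit_distance(word, c) <= max_distance for c in common):
            kept.add(word)
    return kept


# Made-up frequencies (per million words), for illustration only.
FREQUENCIES = {"calendar": 40.0, "window": 80.0, "calender": 0.02, "quokka": 0.05}
```

Here working_set(FREQUENCIES) keeps "calendar", "window" and "quokka" but drops "calender", a real but rare word that is a single substitution away from "calendar".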

This working set is used in conjunction with the .aff (affix rules) to "compress" the working set to a set of "root" words plus flags to mark which affix rules are allowed to be applied to the root word. If for example a suffix fits the rules for a word but the variation of the word with that suffix was not in the "working set" then no flag is added. This process was called "munching" the "working set".

The result is a .aff file (with lots of extra pieces added to help it make better suggestions, handle parts of speech id, phonetic based spelling error corrections) and a .dic file (which is a list of root words plus any flags).
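
As an illustration only, the munch step can be sketched in Python like this. The suffix rules are hypothetical and far simpler than a real .aff file, there are no prefixes, and this is not the actual MySpell munch code. The point is the one rule that matters: a flag is only set when every form it would generate is already in the working set.

```python
import re

# Hypothetical, simplified suffix rules (not the real en_US.aff).
# flag -> list of (strip, add, condition regex tested against the root word)
SUFFIX_RULES = {
    "S": [("y", "ies", r"[^aeiou]y$"),
          ("", "es", r"(s|x|z|ch|sh)$"),
          ("", "s", r"[^sxzhy]$")],
    "M": [("", "'s", r".$")],
}


def forms_for(root, flag):
    """All forms the given flag would generate for this root."""
    forms = []
    for strip, add, condition in SUFFIX_RULES[flag]:
        if re.search(condition, root):
            forms.append(root[: len(root) - len(strip)] + add)
    return forms


def munch(working_set):
    """Compress a working set into {root: flags} without adding unseen words."""
    words = set(working_set)
    covered = set()
    roots = {}
    # Shorter words first, so a root is seen before its suffixed forms.
    for word in sorted(words, key=lambda w: (len(w), w)):
        if word in covered:
            continue  # already generated by an earlier root plus flag
        flags = ""
        for flag in sorted(SUFFIX_RULES):
            forms = forms_for(word, flag)
            # Only set the flag if every form it generates is in the working set.
            if forms and all(form in words for form in forms):
                flags += flag
                covered.update(forms)
        roots[word] = flags
    return roots
```

For the working set window, windows, window's, aflame, pony, ponies this gives window with flags MS, pony with flag S, and aflame with no flags, so "aflames" and "pony's" are never accepted because they were never in the working set.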

This works well for the English family of languages and many others originally covered by MySpell or ispell dictionaries. It does not work on languages that allow any combination of words to be itself a word, any combination of prefixes to make new prefixes, any combination of suffixes to make new suffixes, any combinations with pieces that are not actually a word in and of itself, etc.

Hungarian has this issue which is why when I retired from MySpell and the OpenOffice lingucomponent project, Hunspell absorbed my old MySpell codebase that was used in OpenOffice and Mozilla. They then had to greatly extend the basic affix compression approach (and the format of the .aff file as a result) to try to do better than MySpell/ispell ever could with those languages. In doing so they broke the original ability to unmunch a dictionary and really have no way to do that now.

This is great for many many languages that MySpell/ispell never supported (or did not support well), but really does not help "en" and other original MySpell languages where the approach of a wordlist corpus built from edited and scholarly texts, books and other materials is the right approach.

This is where we are at now.

I can still unpack "en" based languages, build a wordlist. I can look at where these words are now in common usage frequency and build an improved "working set" and affix compress them back to improve the en dictionaries. This approach would help with any other language that started out life as a MySpell based dictionary.

Hope this explains things better.

Old 01-18-2022, 11:39 AM   #8
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
For the record, it took a while but I was able to unpack the current Hunspell en_US.aff and en_US.dic into its "working list" of words using repeated use of Hunspell's "wordforms" one root word at a time. It took quite a while to do that.

As it turns out the current Hunspell dictionary covers 124,340 different words with 52890 root words.

My old MySpell en_US dictionary based on Kevin Atkinson's aspell wordlists and things actually covers:

152468 different words with 62072 root words.

Now coverage isn't everything, but the older wordlist, based on what the author of scowl used at that time, has greater coverage.

So I am not sure why the Hunspell dictionary has regressed so much (from a spell check perspective only). I would have thought as new words are constantly being created, that the coverage of the latest hunspell en_US dictionary would be larger than the 124k words.

I will compare the two lists to each other and scowl at different frequency levels to try to come up with a good compromise.
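
For anyone who wants to try the same comparison, a small Python sketch follows. It assumes each unpacked wordlist is a plain text file with one word per line; the function names and the +/- convention are my own choices for illustration, not output from any Hunspell or scowl tool.

```python
def load_wordlist(path):
    """Read one word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def diff_wordlists(first, second):
    """Return diff-style lines: '-word' only in first, '+word' only in second."""
    first, second = set(first), set(second)
    removed = ["-" + word for word in sorted(first - second)]
    added = ["+" + word for word in sorted(second - first)]
    return removed + added
```

For example, diff_wordlists(["aback", "abet", "zebra"], ["aback", "abet", "Aarhus"]) returns ["-zebra", "+Aarhus"].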

Update:

It seems the differences in coverage are many. The Hunspell en_US dictionary includes lots of proper first names (where spelling differences typically abound), and some rare forms of words that may not merit inclusion, when compared to the older MySpell dictionary.

Here are a few examples:

+Aachen's
+Aaren
+Aaren's
+Aarhus
+Aarhus's
+Aarika
+Aarika's
+Abagael
+Abagael's
+Abagail
+Abagail's
...
+Yasmeen
+Yasmeen's
+Yasmin
+Yasmin's

and things like

+allegoricalness
+allegoricalness's

So the question remains, does anyone expect a spelling dictionary to know all of the variations of people's first names? I would think not.

I would think those are things better suited for the User Dictionary not the main en_US dictionary.

Quite the mess of things indeed. Perhaps using scowl and beginning from scratch would be better.

Any thoughts on what words are best suited for a main dictionary given all of the above, welcome from anyone.

Last edited by KevinH; 01-18-2022 at 12:28 PM.
Old 01-18-2022, 12:37 PM   #9
Ashjuk
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
Thanks for the explanation, Kevin. I think I have a slightly better understanding of how spellchecking works now.

There is always going to be this issue with US vs UK - as Shaw is reputed to have said "England and America are two countries separated by a common language".

I know you said not to bother but I have opened my UK word list in MS Word and I am using that to spellcheck it. Using that it has returned less than 20% of them as misspelled, which I am checking against a couple of on-line dictionaries.

It would be great to be able to use the OED as a reference, but unfortunately my budget does not run to that.

I would never expect a dictionary to include people's names. I put them in a separate user Names dictionary as I come across them.
Old 01-18-2022, 12:40 PM   #10
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
Just to test if scowl includes coverage of common first names:

Columns are (the last three fall under Google Books Stats[*]):

Word        In en_US   Found In      Notes   Should Include   Frequency (per million)   Newness
Aachen's    YES
Aaren       NO                               *                0.0043                    1.7
Aaren's     NO
Aarhus      NO         en_US-large           ***              0.5090                    1.0
Aarhus's    NO
Aarika      NO                               *                0.0002                    0.5
Aarika's    NO
Abagael     NO                               *                0
Abagael's   NO
Abagail     NO                               **               0.0163                    1.1
Abagail's   NO
...
Yasmeen     NO                               **               0.0576                    1.9
Yasmeen's   NO
Yasmin      NO                               ***              0.2611                    1.5
Yasmin's    NO

So unless your first name overlaps with a city name or region name, scowl does not include it although they are detected in Google's search.

I would think common first names that do not coincide with rivers, counties, countries, states, regions, etc should not be in a spelling tool like hunspell's en_US dictionary.
Old 01-18-2022, 01:16 PM   #11
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
Actually, I have that backwards! It was the old MySpell raw word list that has so many first names not the Hunspell one.

Either way, I think restarting with the scowl lists and then carefully adding curated words makes the most sense.
Old 01-18-2022, 01:38 PM   #12
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
Even worse ... I pasted the long list of first names that MySpell has and Hunspell does not into Apple's Pages app, and most of them are actually marked as correct by the official Apple spellchecker built into Pages!

So it appears that people's first names are included in many official spellcheckers.

So I am really at a loss here. Should they be kept or removed? They are not part of scowl but official Apple spellcheckers say they are okay. I wonder how platform specific spellchecking is. I do not have Word to compare it and LibreOffice uses hunspell.

Then, of course, there are the differences in word lists attributed to urban slang. For example "zorch" or "zorched". I had to look that one up, and the only place I found it was an "urban dictionary", where it meant "ruined" or "burnt out", as in you "zorched your iphone".

Based on official commercial spellcheckers in Word and Pages, there are major differences. So spell checker dictionary building is quite subjective and so comparisons of "quality" are very hard to make. For example, if I am writing a formal document or dissertation, "zorch" should probably be marked wrong as it is just one character away from "porch". But if I am writing modern fiction, "zorch" being correct might be okay!

Last edited by KevinH; 01-18-2022 at 02:17 PM.
Old 01-19-2022, 04:47 AM   #13
Ashjuk
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
It appears that the word base in applications such as Pages and Word is far more comprehensive than the Hunspell one.

Pasting your list in #10 into my copy of Word 2010, only Aaren, Aarika and Abagael are shown as misspelled. This is also borne out by the fact that when I opened my list of words flagged by Hunspell as misspelled in Word, probably less than 20% were showing as being so.

Personally I don't think it's a good idea to include people's names in a dictionary. These days people seem to want to change the spelling of their 'common' name just to stand out from the crowd - and some are just completely bizarre.

As for 'zorch' I would have put that into my slang dictionary. Its use is probably quite common in certain groups but not a word in wide usage (at least here in my part of the UK).

Just going back a few posts to my example of how removing the 's' affects the spellchecker, where we debated the use of swine vs swines as a plural: I was thinking further about this and it is all dependent on how you use the word swine.

1. Swine as a pig: In that situation I would use swine as the plural - 'a herd of swine'.

2. Swine as in a person behaving badly: In that situation I would use swines - 'you bunch of swines!'

Anyway, I really had not intended it to become this complicated. I thought that it would have been a simple matter of adding some new words to the default dictionary. If it is going to take a lot of work then forget the whole thing, and we can just carry on using our user-defined dictionaries.
Old 01-19-2022, 08:00 AM   #14
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Ashjuk View Post
Following on from my post in the future items thread regarding dictionaries I have now worked my way through my default user dictionary [...].
Thanks for the list. I'll take a closer look.

But there's things such as:
  • Columbians

You can see why that's not commonly accepted, because of the typo with:
  • Colombians
    • People from the South American country of "Colombia".
    • Notice the 'o'!!!

That typo was so sneaky, it was even sitting in Sigil forever until I spotted it:

I even wrote a whole LanguageTool bug/request about this:

Here's a few of the common phrases I jotted down:

Columbia (with a 'u')
  • British Columbia
  • District of Columbia
  • Columbia River

but almost everything else that's popular is actually speaking about the country:

Colombia (with an 'o')
  • Colombian peso
  • Colombian government
  • Colombian women
  • Colombian military
  • Colombian drugs
  • [...]

You can see how the spellchecker might want to err on the side of showing that common error... where the grammarchecker can take into account the surrounding words!

If you accidentally wrote:
  • Columbian peso

the grammarchecker will go: "Uhh, did you mean the country?"

Or if you accidentally wrote:
  • The Columbians wallet fell to the floor.

you'd want spellcheck to say: "Uhh, did you mean Colombian + Colombian's + Colombians + Columbia's".

These words are WAY more likely (see Google n-grams).

Quote:
Originally Posted by KevinH View Post
So I am really at a loss here. Should they be kept or removed?
See lots of the fantastic discussion 6 years ago in Firefox Bug #1235506: "en-US dictionary: Additional Mozilla words need to be cleaned up".

Quote:
Originally Posted by KevinH View Post
Even worse ... I pasted the long list of first names that MySpell has and Hunspell does not into Apple's Pages app, and most of them are actually marked as correct by the official Apple spellchecker built into Pages!

So it appears that people's first names are included in many official spellcheckers.
I side with SCOWL.

Names that are extremely common, like:
  • Einstein
  • Newton
  • Aristotle
  • Beethoven

Yes.

Names of famous cities/places:
  • Everest
  • Paris
  • Berlin
  • Washington

Yes.

But getting into rarer and more extreme names? And every name under the sun?

No.

That's why SCOWL's "size 60" list is used as the default. SCOWL has already gone through and included the most popular names/places.

As you rise up through "size 70" (Large) and "size 80", the list of "correctly spelled names" explodes.

In most cases though, these rarer names/spellings only make sense in very specific contexts.

- - -

Side Note: There is also the case of:

Company Names

"Are company names words?" Most dictionaries say NO.

A word like "Facebook" isn't an actual, definable word, and shouldn't belong in the actual dictionary.

... but in the context of spellchecking, yes, some famous companies such as:
  • Microsoft
  • Google
  • Facebook
  • Coca-Cola
  • IBM
  • NVIDIA
  • Qualcomm

or programs:
  • Firefox
  • Photoshop
  • Linux

should be included as exceptions.

(This is where some of LanguageTool's lists help... but they go too far and begin accepting TOO MANY company names. Again, I agree/side with SCOWL's assessment. See some discussion about LT's lists like wordlist Issue #181.)

Acronyms

Similar situation with acronyms. Super common ones that exist in dictionaries?
  • FBI
  • CIA
  • USA
  • JPG
  • [...]

... but accepting every acronym under the sun? No!

(LT leans VERY far in that direction. They accept as much as they can, because they're worried mostly about the grammar squigglies, not the spelling.)

- - -

Side Note #2: Anyway, some of this is also described in detail in the:

- SCOWL Readme

especially the section on "proper-names".

There's also a list of how many new Words vs. Names (+ Total) are added in each list:

Code:
  Size   Words       Names    Running Total  %
   10    4,425          13        4,438     0.7
   20    8,126           0       12,564     1.9
   35   37,260         220       50,044     7.6
   40    6,858         489       57,391     8.7
   50   25,289      18,683      101,363    15.4
   55    6,487           0      107,850    16.4
   60   14,552         850      123,252    18.7
   70   35,294       7,897      166,443    25.3
   80  144,164      33,368      343,975    52.3
   95  227,630      86,630      658,235   100.0
You can see that by "size 50", the most common ~20k names found in actual dictionaries are already included, like:
  • Einstein
  • Newton
  • Hawking
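To make the "~20k names" figure concrete, here is a small Python tally of the Names column. The numbers are copied from the table above; the `ROWS` list and the `cumulative_names` helper are just illustrative names, not part of SCOWL.

```python
# (size, new words, new names) rows copied from the table above.
ROWS = [
    (10, 4425, 13),
    (20, 8126, 0),
    (35, 37260, 220),
    (40, 6858, 489),
    (50, 25289, 18683),
    (55, 6487, 0),
    (60, 14552, 850),
    (70, 35294, 7897),
    (80, 144164, 33368),
    (95, 227630, 86630),
]

def cumulative_names(rows):
    """Return {size: running count of names up to and including that size}."""
    totals = {}
    running = 0
    for size, _words, names in rows:
        running += names
        totals[size] = running
    return totals

totals = cumulative_names(ROWS)
for size, count in totals.items():
    print(f"size {size}: {count:,} names so far")
```

By this count there are 19,405 names at size 50 and 20,255 at size 60, against 148,150 at size 95, so the full list carries roughly seven times as many names as the default.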

But beyond the defaults ("size 60"), the names begin exploding, leading to MUCH more chances of false positives.

Like size 70 begins introducing:
  • Addressograph
  • Adelbert
  • Adigranth
  • Beaverboard
  • Benedicite
  • Blackmun
  • Pianolas
  • [...]

size 80 begins introducing smaller cities/towns (I believe everything over 10k population?):
  • Alstead
  • Altaloma
  • Amburgey
  • Amherstdale
  • Plumtree
  • Spitalfields
  • [...]

and by "size 95", you're getting all these obscure animal/biology terms too (Genuses):
  • Heterodontus
  • Hexamita
  • [...]

... Again, SCOWL has already done the "most commonly used words" legwork!!! Stick with the defaults.

Everything beyond that point would be the very rare exceptions! (Although you might catch stuff like "Facebook", etc.)

Quote:
Originally Posted by KevinH View Post
They are not part of scowl but official Apple spellcheckers say they are okay. I wonder how platform specific spellchecking is. I do not have Word to compare it and LibreOffice uses hunspell.
Word misses many things Sigil catches.

Sigil misses many things Word catches.

InDesign misses things Sigil/Word catch.

(This is why I recommend a layered approach when spellchecking! 1 (or more) rounds of spellchecking in multiple programs.)

Quote:
Originally Posted by KevinH View Post
Then of course, there are the differences in word lists attributed to urban slang. For example "zorch" or "zorched". I had to look that one up and the only place I found it was an "urban dictionary", and it meant "ruined" or "burnt out" as in you "zorched your iphone".
Slang, "Hacker" words, 1337speak, and all this other stuff gets relegated to other "variant" lists (or not at all).

Again, these things are mostly obscure subcultures, or not "actual" English!

Perhaps one day, the terms rise in popularity and become "actual words" in the general language... but until then, they definitely shouldn't be polluting default spellcheck lists. :P

These spellchecking dictionaries have to lean much more towards the conservative side, because it's much better to:
  • CATCH the typo (and recommend ACTUAL WORDS in the right-click)

than to:
  • MISS the error (or recommend junk like "zorched")

The default lists should be "size 60", leaning more towards the conservative side, with very rare exceptions added on top.

Quote:
Originally Posted by KevinH View Post
Based on official commercial spellcheckers in Word and Pages, there are major differences. So spell checker dictionary building is quite subjective and so comparisons of "quality" are very hard to make.
Yep. There's the balancing act between:
  • "red squigglies on too many words" vs. "missing too many actual errors"

This is why SCOWL strongly bases itself on actual English popularity+usage, and heavily curates new additions.

(Like we discussed in the previous topic [and I went into detail in my Reddit posts]... a small fraction of all possible words covers more than 90% of real-life usage.)

And again, I wouldn't worry too much about Sigil's default lists, because we have the fantastic Spellcheck Lists. This is the ultimate tool, and allows you to spellcheck an entire book WAY WAY faster than those one-by-one methods.

(You could even use it to quickly find "misspelled words" + Add to Dictionary or Ignore. Similar to the trick I did back in 2019 to catch "foreign words".)
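The whole-book idea can be sketched in a few lines of Python. To be clear, this is NOT Sigil's code or how Spellcheck Lists is implemented; `unknown_word_counts` is a hypothetical helper, `known_words` stands in for whatever dictionary you load, and everything is lowercased for simplicity (a real spellchecker treats names as case-sensitive).

```python
import re
from collections import Counter

def unknown_word_counts(text, known_words):
    """Count every word in `text` that is not in `known_words` (case-insensitive)."""
    # Words are runs of letters, optionally joined by apostrophes (e.g. "don't").
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    known = {w.lower() for w in known_words}
    return Counter(w.lower() for w in words if w.lower() not in known)

sample = "The teh cat sat. Teh end, teh."
counts = unknown_word_counts(sample, {"the", "cat", "sat", "end"})
for word, count in counts.most_common():
    print(word, count)
```

Seeing each unknown word once, with its count, is what makes the list approach so much faster than stepping through the book one squiggly at a time.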

Quote:
Originally Posted by KevinH View Post
For example, if I am writing a formal document or dissertation, "zorch" should probably be marked wrong as it is just one character away from "porch". But if I am writing modern fiction, "zorch" being correct might be okay!
And "scorch".

(The 's' is extremely close to the 'z'.)
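The "one character away" idea is usually measured with edit distance. Below is a plain Levenshtein sketch in Python (the `edit_distance` name is mine; I'm not claiming this is what hunspell does internally). Note that plain edit distance knows nothing about which keys sit next to each other, so by this measure "porch" is 1 edit from "zorch" while "scorch" is 2 (swap the 'z' for 's', then insert a 'c').

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, and substitutions."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]

for candidate in ("porch", "torch", "scorch"):
    print(candidate, edit_distance("zorch", candidate))
```

The more near-neighbours a junk word has, the more real typos it can hide, which is the argument for keeping it out of the default list.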

More Side Note: This kind of spelling (+autocorrecting) mess is also becoming much more prevalent with the keyboards+swiping on phones.

Do you know how many actual typos occur because of the virtual keyboard... and then how many autocorrect typos get introduced? Way, way too many.

Especially frustrating are the valid words where it magically adds a space too! (away -> a way).

(This has been angering me so much, that for the last year I've been compiling a big ol' list to submit to LT... soon... soon.)

Last edited by Tex2002ans; 01-19-2022 at 09:11 AM.
Old 01-19-2022, 08:53 AM   #15
Ashjuk
I have now uploaded the UK word list to my Google Drive (link as previous).

I checked these by opening the file in Word 2010 (set with UK spelling) and checking those that Word highlighted as misspelled.

I used the following as references to confirm validity.
https://www.lexico.com/
https://www.collinsdictionary.com/
and a digital copy of the OED