Old 01-21-2022, 04:09 AM   #31
Ashjuk
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
Quote:
Originally Posted by Tex2002ans
Marco Pinto is the one who takes care of most en_GB lists nowadays:

https://github.com/marcoagpinto/aoo-mozilla-en-dict

From a quick look at his dictionary though, he also tends towards including nearly every word under the sun.

He also seems to be releasing monthly updates. (Compared to SCOWL's much slower, but thoroughly vetted releases.)

Another nice thing is his changelogs show exactly which words were added when:

https://raw.githubusercontent.com/ma...LO_2013%2B.txt


Definitely report many of those errors to Marco's github and get those fixed!
Going through that dictionary pulling out the errors will probably keep me busy for the rest of my life.
Old 01-21-2022, 10:38 AM   #32
KevinH
Sigil Developer
Posts: 7,506
Karma: 5433350
Join Date: Nov 2009
Device: many
@Ashjuk,
Does the current Sigil en_GB dictionary support "ise" or "ize" or both? I will do what the current Sigil en_GB dictionary does in that regard, following the rule of least surprise.

Thanks!
Old 01-21-2022, 12:08 PM   #33
Ashjuk
Fanatic
Kevin,

As far as I am aware, 'ise' is the default for the current en_GB dictionary. When I run a spellcheck on a book set to English - Great Britain in the metadata, 'ize' is normally picked up as misspelled.

I have now checked my UK list of words against the Google GB dictionary and have uploaded a new file to my Google drive of those that are still missing. I have also uploaded a complete list of the words in the Google file in alphabetical order as plain text if that is of any use.
Checked file - https://drive.google.com/file/d/18C8...ew?usp=sharing
Full list - https://drive.google.com/file/d/1bmK...ew?usp=sharing
Old 01-21-2022, 12:13 PM   #34
KevinH
Sigil Developer
In your opinion, is the Google en_GB dictionary suitable as a starting point for Sigil (unlike the libreoffice/openoffice ones)?
Old 01-21-2022, 12:43 PM   #35
Ashjuk
Fanatic
From what I have seen of it I would tentatively say yes. I would hazard a guess that less than 5% of the words it contains were flagged as misspelled by Word, and probably a lot of those are OK being real names and new words.

There are a few problems that I spotted early on that could possibly be addressed. One being the inclusion of Gray. Whilst this is probably meant to be a person's name it is also the US spelling of grey.
So if one were to start a sentence with the words "Gray clouds covered the sky" and what you meant to write was "Grey clouds covered the sky" it would not be flagged as misspelt.
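A minimal Python sketch of why this slips through, assuming the usual hunspell-style case rule: a lower-case entry also matches Capitalised and ALL-CAPS text, while a Capitalised entry never matches lower-case text. The function name `accepts` and the two-entry word list are made up for illustration; this is a simplified model, not hunspell's actual code.

```python
def accepts(token, dictionary):
    """Simplified model of case handling in a hunspell-style spellchecker."""
    if token in dictionary:
        return True
    if token.isupper():
        # ALL-CAPS text may match a lower-case or a Capitalised entry.
        candidates = (token.lower(), token.capitalize())
    elif token[:1].isupper() and token[1:].islower():
        # Capitalised text (e.g. at the start of a sentence) may match
        # a lower-case entry.
        candidates = (token.lower(),)
    else:
        candidates = ()
    return any(c in dictionary for c in candidates)


# Hypothetical two-entry en_GB word list: the colour plus the surname.
en_gb = {"grey", "Gray"}

for token in ("Grey", "Gray", "grey", "gray"):
    print(token, accepts(token, en_gb))
```

In this model only lower-case "gray" is rejected, so a sentence-initial "Gray" passes because of the surname entry.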

Also there is one huge error - Scotchman/Scotchwoman. Scotch is a drink! If you were to call a Scotsman a Scotchman I doubt you would be standing long.

Hopefully I can find the time to have a better look to see if I can spot any other words that might cause a problem.
Old 01-21-2022, 01:07 PM   #36
DNSB
Bibliophagist
Posts: 34,557
Karma: 144552660
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by Ashjuk
Also there is one huge error - Scotchman/Scotchwoman. Scotch is a drink! If you were to call a Scotsman a Scotchman I doubt you would be standing long.
Actually, scotchman is correct but it's an old variant of scotsman or scot and can be considered insulting today. Does the dictionary you were looking at also mention the nautical use for scotchman?
Old 01-21-2022, 03:02 PM   #37
KevinH
Sigil Developer
It seems starting with old dictionaries is fraught with danger one way or the other. Based on all of this, and based on scowl being the only one being vetted in a consistent manner, and based on Tex2002ans's comments, I think we should probably stick with scowl plus some obvious additions.

So I am going to create dictionaries based on scowl 60 and 70 with proper accents, with the addition of the checked words Ashjuk found specific to US and UK, and with the new words as well.

Once I have those I can post them here and people can evaluate them.

If they appear to be a clear improvement over what we have now, I will push them to master.

How does that sound to everyone?

Last edited by KevinH; 01-21-2022 at 05:08 PM.
Old 01-21-2022, 04:25 PM   #38
KevinH
Sigil Developer
new en_* Test Dictionaries

Hi All,

Attached to this post are two zip archives which contain the latest scowl-based "en" hunspell dictionaries, extended to cover both some verified words and some common company and product names (iPhone, etc).

The en_scowl_size_60.zip has the en_US, en_CA, en_AU, en_GB, and en_GB-oed .aff and .dic files based on scowl size 60.

Similarly, the en_scowl_size_70.zip has the en_US, en_CA, en_AU, en_GB, and en_GB-oed .aff and .dic files based on scowl size 70.

If you have a chance please give these a try and let me know of any issues you run into.

Special thanks to Ashjuk for checking so many words for both GB and US dictionaries and posting them so we could improve our internal Sigil hunspell dictionaries.

Here is the number of "root word" entries in each of these dictionaries, followed by the "total words" covered when counting every unique string.

size_60
--------
Code:
 - en_AU:        51106, 125043 + 78 no suggest words
 - en_CA:        50999, 124839 + 78 no suggest words
 - en_GB-oed:    50930, 124368 + 78 no suggest words
 - en_GB:        51527, 125264 + 78 no suggest words
 - en_US:        51412, 125475 + 78 no suggest words
size_70
--------
Code:
 - en_AU:        81065, 168300 + 78 no suggest words
 - en_CA:        80888, 168061 + 78 no suggest words
 - en_GB-oed:    80752, 167543 + 78 no suggest words
 - en_GB:        81159, 168128 + 78 no suggest words
 - en_US:        81121, 168592 + 78 no suggest words
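For anyone wanting to sanity-check the "root word" figures, here is a rough Python sketch. It assumes the standard hunspell .dic layout, where the first line is a word-count header and every later non-empty line is one root entry. The helper name `count_root_entries` and the sample text are illustrative only, and the "total words" figure is not attempted because it needs the affix rules from the .aff file.

```python
def count_root_entries(dic_text):
    """Count root entries in the text of a hunspell .dic file."""
    lines = dic_text.splitlines()
    # Skip the first line (the count header) and ignore blank lines.
    return sum(1 for line in lines[1:] if line.strip())


# Tiny made-up .dic content, not taken from the real dictionaries.
sample = "3\nScotsman/M\ngrey/SM\niPhone\n"
print(count_root_entries(sample))
```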
To test any dictionary after unzipping, copy the .dic file and its matching .aff file from the unzipped folder into the existing "hunspell_dictionaries" folder inside your Sigil Preferences folder.

After restarting Sigil, the dictionaries there will take precedence over the Sigil-installed dictionaries with the same name until you delete them.
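The copy step can be scripted. This is only a sketch: the function name `install_test_dictionary` is made up, and the location of the Sigil Preferences folder is passed in by the caller rather than guessed, since it differs between systems.

```python
import shutil
from pathlib import Path


def install_test_dictionary(unzipped_dir, prefs_dir, name):
    """Copy name.dic and name.aff into <prefs_dir>/hunspell_dictionaries."""
    target = Path(prefs_dir) / "hunspell_dictionaries"
    if not target.is_dir():
        # The folder is expected to exist already; do not create it silently.
        raise FileNotFoundError(f"{target} does not exist")
    copied = []
    for ext in (".dic", ".aff"):
        src = Path(unzipped_dir) / (name + ext)
        copied.append(Path(shutil.copy2(src, target / src.name)))
    return copied
```

A call would look like `install_test_dictionary(unzipped_folder, prefs_folder, "en_GB")`, where both folder paths are whatever applies on your machine. Deleting the two copied files restores the dictionary shipped with Sigil.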

Edit: Removed the now outdated zipped dictionaries. See later posts in this thread for updated versions.

Last edited by KevinH; 01-24-2022 at 11:40 AM. Reason: remove now outdated zipped dictionaries
Old 01-22-2022, 04:13 AM   #39
Ashjuk
Fanatic
Quote:
Originally Posted by DNSB
Actually, scotchman is correct but it's an old variant of scotsman or scot and can be considered insulting today. Does the dictionary you were looking at also mention the nautical use for scotchman?
Actually I did not look it up - it's just something we are aware of here in the UK.

Whilst you are correct in saying that it is an historical name for Scots, it really should not be used these days. Having lived in Scotland for a while, I can assure you they get extremely offended if you refer to them as Scotchmen.

I checked on scotchman and found the nautical reference you mentioned. So perhaps scotchman should be included for that reason, but the Google dictionary had it listed as:

Scotchman
Scotchmen
Scotchwoman
Scotchwomen

So I assume it is referring to the race and not the nautical use.
Old 01-22-2022, 04:28 AM   #40
Ashjuk
Fanatic
@Kevin - Thank you too for all your hard work.

The LibreOffice dictionary is, in my opinion, a complete mess, and the Google one would require a good deal of checking. So basing Sigil's dictionaries on a known vetted source (scowl) is probably the best way forward.

I will test out the new dictionaries and report back if I discover any issues.

Perhaps we ought to have an annual review where everyone submits a list of verified new words from their user dictionary for inclusion in the next release.
Old 01-22-2022, 08:52 AM   #41
KevinH
Sigil Developer
FWIW, we can also remove words from the dictionary or keep them in the dictionary but mark them as "no suggest" if they are now considered offensive.

Everything I have looked at says Scotchman and Scotchwoman (and their variations) are at their worst offensive and at their best obsolete.

Unfortunately, they are part of the current scowl wordlists. In fact I think the google dictionaries are probably scowl based. Perhaps someone should open a bug report on the scowl github site and suggest the removal of that word, or raising it to level 80 (lower frequency) so it is no longer part of most spelling dictionaries.

Given the word is obsolete at best, perhaps we should remove it or at least set it as no suggest in our dictionaries before their release?

All thoughts welcome.

Last edited by KevinH; 01-22-2022 at 10:45 AM.
Old 01-22-2022, 03:53 PM   #42
BetterRed
null operator (he/him)
Posts: 20,459
Karma: 26645808
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by KevinH
FWIW, we can also remove words from the dictionary or keep them in the dictionary but mark them as "no suggest" if they are now considered offensive.

Everything I have looked at says Scotchman and Scotchwoman (and their variations) are at their worst offensive and at their best obsolete.

Unfortunately, they are part of the current scowl wordlists. In fact I think the google dictionaries are probably scowl based. Perhaps someone should open a bug report on the scowl github site and suggest the removal of that word, or raising it to level 80 (lower frequency) so it is no longer part of most spelling dictionaries.

Given the word is obsolete at best, perhaps we should remove it or at least set it as no suggest in our dictionaries before their release?

All thoughts welcome.
As well as being a wrapper for a shroud**, a Scotchman is a Pacific Ocean game fish.

Can the end-user set/unset a word to 'no suggest', if so how?

Then those who write about cricket for public broadcasters etc could mark 'batsman/men' as 'no suggest' and use 'batter' instead.

And thanks a lot for the OED spelling dictionary.

** a shroud is part of the standing rigging that holds a mast aloft.

BR

Last edited by BetterRed; 01-22-2022 at 04:44 PM. Reason: define shroud
Old 01-22-2022, 06:27 PM   #43
KevinH
Sigil Developer
There is no easy way for the user to mark something as no suggest.

The only way to do it is to open the .dic file in an editor that accepts utf-8 text with no carriage returns (unix line ends) and add an ! mark to the existing flags for that root word or add /! if no flags exist.

That approach will only work with these dictionaries as ! is set as the no suggest flag.
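As a rough illustration of that manual edit, here is a Python sketch. It assumes each .dic line is just `word` or `word/FLAGS` with single-character flags and nothing after the flags, and that `!` is the no suggest flag as stated above. The function names are made up, and it is worth backing up the .dic file before trying anything like this.

```python
def mark_no_suggest(dic_text, word, flag="!"):
    """Return dic_text with `flag` added to the entry for `word`."""
    out = []
    for line in dic_text.split("\n"):
        root, _, flags = line.partition("/")
        if root == word and flag not in flags:
            # Append to existing flags, or produce "word/!" if there were none.
            line = f"{root}/{flags}{flag}"
        out.append(line)
    return "\n".join(out)


def mark_no_suggest_in_file(path, word):
    """Apply mark_no_suggest to a UTF-8 .dic file, keeping unix line ends."""
    with open(path, encoding="utf-8", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(mark_no_suggest(text, word))


sample = "3\nScotchman/M\nbatsman\ngrey/SM"
print(mark_no_suggest(sample, "batsman"))
```

Running the function twice on the same word leaves the entry unchanged, since the flag is only added when it is missing.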
Old 01-23-2022, 03:58 AM   #44
Ashjuk
Fanatic
Given that scotchman has meanings other than that of referring to a male of Scots origin perhaps it would be better to leave that in the dictionary. I doubt it will be encountered much, if at all, so I don't think it's going to be an issue.

As for Scotchmen/Scotchwoman/Scotchwomen. Personally I think they could be removed.
Old 01-23-2022, 10:44 AM   #45
KevinH
Sigil Developer
A spellchecker dictionary is very different from a regular dictionary. Its role is to help catch common spelling errors, not to define a language or a word. Therefore keeping obsolete or very rarely used words in a spellchecker dictionary is just not appropriate given they help hide spelling errors on more commonly used words.

I will remove Scotchman and Scotchwoman and their variants, but leave scotchman in the final release. If people have a historical text that uses those words and do not want to update them to their modern equivalents, they can easily ignore those words or simply add them to their User dictionary.

This is the same reason a spellchecker dictionary should not be based on scowl size 80 or larger (and many say 70 or larger).

Unfortunately, vetting the scowl word lists really requires a team of dedicated people not just one or two.

Thanks,

KevinH




Quote:
Originally Posted by Ashjuk
Given that scotchman has meanings other than that of referring to a male of Scots origin perhaps it would be better to leave that in the dictionary. I doubt it will be encountered much, if at all, so I don't think it's going to be an issue.

As for Scotchmen/Scotchwoman/Scotchwomen. Personally I think they could be removed.