#16
Sigil Developer
Posts: 8,439
Karma: 5703082
Join Date: Nov 2009
Device: many
@Ashjuk,
Thank you! I will grab it. I am thinking of creating an all-inclusive dictionary that combines the good words from both Hunspell and MySpell, and then one based purely on SCOWL.
@Tex2002ans, BTW, I am a lot less worried than KevinA about adding too many proper names, as they must begin with an initial uppercase letter, whereas both cases are allowed if the base root word is in lower case, unless flagged otherwise. So the chance of randomly hiding a more common word is much lower in the general case. It is definitely a trade-off.
I will try to get something together this week or next.
KevinH
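
The case rule described above can be sketched roughly as follows. This is a simplified model only: the function name `accepts` is hypothetical, and it ignores affixes and any dictionary flags that change case handling. It shows why a capitalised proper name cannot hide a lower-case misspelling, while a lower-case root still matches at the start of a sentence.

```python
def accepts(entry: str, word: str) -> bool:
    """Simplified model of dictionary case matching (ignores affixes and flags)."""
    if word == entry:
        return True
    if word.isupper() and word == entry.upper():
        # An ALL CAPS spelling is accepted for any entry.
        return True
    if entry.islower():
        # A lower-case entry also matches its initial-capital form.
        return word == entry[:1].upper() + entry[1:]
    # Capitalised or mixed-case entries (proper names) match only as written.
    return False


if __name__ == "__main__":
    print(accepts("apple", "Apple"))    # lower-case root, capitalised use
    print(accepts("Surrey", "surrey"))  # proper name, lower-case use
```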

#17
Sigil Developer
I have built a tentative new en_US .dic and .aff file that is meant to rival what Pages and Word accept, by merging the unmunched Hunspell dictionary with the unmunched MySpell dictionary and Ashjuk's new additions for en_US.
I will then build various SCOWL-based (60, 70, 80) wordlists and we can compare them.
I must say KevinA's SCOWL repo build process is not well designed, to say the least. It uses symlinks everywhere, which is a major no-no, and then strips out all accents so that he can just rename the .aff and .dic to utf-8 when they are really latin-1 based, and the accented characters could have been properly converted and kept. I do not want an "eclair" in the wordlist! So even SCOWL has its drawbacks.
A 6 line Python program could have done the conversion from one encoding to another. He has to keep the latin-1 encoded files for munch and unmunch to work (as they need the 1 byte = 1 char rule for speed). That was the original design to reduce dictionary sizes, as many languages are based around one 8-bit encoding: latin-1, latin-2, etc.
Last edited by KevinH; 01-20-2022 at 12:00 PM.
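
A minimal sketch of the kind of conversion meant here, assuming the source files really are latin-1 and that the .aff file declares its encoding on a `SET` line. The function names `latin1_to_utf8` and `convert_file` are made up for illustration and are not part of any SCOWL or Hunspell tooling.

```python
import re
from pathlib import Path


def latin1_to_utf8(data: bytes, is_aff: bool = False) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8, keeping accented characters."""
    text = data.decode("latin-1")
    if is_aff:
        # Point the first SET directive of the .aff file at the new encoding.
        text = re.sub(r"(?m)^SET\s+\S+", "SET UTF-8", text, count=1)
    return text.encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Convert one dictionary file; .aff files also get their SET line updated."""
    data = Path(src).read_bytes()
    Path(dst).write_bytes(latin1_to_utf8(data, is_aff=src.endswith(".aff")))
```

The latin-1 originals would still be kept for munch and unmunch; only the converted copies would be shipped as utf-8.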

#18
Sigil Developer
Hi Ashjuk,
For the new en_GB dictionary, I should add the short list of new words you posted earlier to it, correct?
KevinH

#19
Fanatic
Posts: 500
Karma: 3498633
Join Date: May 2011
Location: Surrey, UK
Device: Kobo Aura One, Sony PRS 600/650
The list of new words could probably be added to both the US and UK dictionaries. They were 'created' words that I felt have come into common usage in recent times. Photoshopped, for example, is regularly used to refer to an image that has been manipulated regardless of whether Photoshop was used or not.

#20
Sigil Developer
FWIW, the LibreOffice site has a link to all of the en_* dictionaries in one .oxt (a LibreOffice extension, but really just a zip). That en_GB dictionary has over 90k root words in it. It might be worth testing your checked UK words and new words list with it just to see how well it performs. Our current en_GB dictionary has only 30k root words, whereas most en_US dictionaries have 50k to 60k root words. So our en_GB dictionary may not be a good starting point at all.
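
For comparing dictionaries by root word count, or checking a word list against one, a rough sketch along these lines may help. It assumes the usual .dic layout (an entry count on the first line, then one `word/FLAGS` entry per line) and only looks at root words, so forms produced by affix rules are not recognised. The names `root_words` and `missing_words` are hypothetical.

```python
def root_words(dic_text: str) -> set[str]:
    """Return the root words of a .dic file, dropping the count line and flags."""
    lines = dic_text.splitlines()
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]  # the first line is the approximate entry count
    roots = set()
    for line in lines:
        fields = line.split()
        if fields:
            # Keep the word itself; drop affix flags after the slash.
            roots.add(fields[0].split("/", 1)[0])
    return roots


def missing_words(candidates: list[str], dic_text: str) -> list[str]:
    """List candidates that are not root words (affix-derived forms are not checked)."""
    roots = root_words(dic_text)
    return [word for word in candidates if word not in roots]
```

`len(root_words(text))` gives the root word count for a dictionary, and `missing_words` gives a first pass over a list of new words before testing them in Sigil itself.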

#21
Fanatic
Sounds like a good idea. With 90k root words in it I suspect a lot of the words in my list may well be present there.
Is it possible to use the LibreOffice dictionaries in Sigil?

#22
Sigil Developer
Yes,
Use the download link (top one) from here: https://extensions.libreoffice.org/e...h-dictionaries
You can rename it to end with .zip and just unzip it. In your Sigil Preferences folder you can put the en_GB.aff and en_GB.dic inside the existing hunspell_dictionaries folder and they should take precedence over the Sigil installed ones. You will probably need to restart Sigil so it finds your new dictionary and puts it first in the list.
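
The same steps can be scripted. The sketch below is only an illustration: `install_dictionary` is a hypothetical helper, and the location of the Sigil Preferences folder is passed in by the caller because it differs per platform. Since an .oxt is a zip archive, Python's `zipfile` module can read it without renaming.

```python
import zipfile
from pathlib import Path


def install_dictionary(oxt_path: str, prefs_dir: str, lang: str = "en_GB") -> list[Path]:
    """Copy <lang>.aff and <lang>.dic from an .oxt archive into hunspell_dictionaries."""
    target = Path(prefs_dir) / "hunspell_dictionaries"
    target.mkdir(parents=True, exist_ok=True)
    wanted = {f"{lang}.aff", f"{lang}.dic"}
    installed = []
    with zipfile.ZipFile(oxt_path) as archive:
        for member in archive.namelist():
            name = Path(member).name
            if name in wanted:
                # Write the file flat into the target folder, ignoring archive subfolders.
                dest = target / name
                dest.write_bytes(archive.read(member))
                installed.append(dest)
    return installed
```

Sigil would still need a restart afterwards to pick up the new files.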

#23
Sigil Developer
I tried that on my machine with the new tentative en_US dictionary with your new words and the following line spellchecked as all correct:
Code:
<p>I bought a new iPhone and I use it to access Facebook all of the time.</p>

#24
Fanatic
I downloaded and unzipped the dictionary files from LibreOffice.
Having had a quick look at the en_GB file in Word, I am shocked. I would hazard a guess that Word classifies nearly 50% of the entries as misspelled, and from just the 'A's alone there are so many glaring errors that I really don't think it is fit for purpose. I kid you not, there are entries like annefrank, which surely should be Anne Frank. It also looks like someone has just thrown a UK gazetteer at it, as it's littered with obscure place names, some of which are clearly wrong. I came across Balcombe-Horley as an entry. As someone who lives near Balcombe and Horley, I can assure you there is no such place as Balcombe-Horley; they are two separate towns about 10 miles apart.
I've no idea who edits the LibreOffice dictionaries, but if this is an example of what they are like then I would not advise anyone to use them.

#25
Sigil Developer
Wow! That is good to know. I will *not use* the latest LibreOffice (OpenOffice) en_GB dictionary as a starting point. I will instead use Sigil's en_GB, augment it with SCOWL, and add in from there.
When I get something useful, I will let you know for testing purposes. Thanks!
Last edited by KevinH; 01-20-2022 at 12:08 PM.

#26
Fanatic
Yes, definitely start with what you have now.
Merging the LibreOffice files would, in my opinion, be a big mistake given the number of errors I have come across having only just started to look at the en_GB file.

#27
Sigil Developer
Okay, here is one other official source of spellchecking dictionaries. This one is used by Google itself for its Chromium project.
https://chromium.googlesource.com/ch...s/heads/master
You can grab the en_GB.aff and the en_GB.dic, and there is even a delta file called en_GB.dic_delta (extra words to add to the official dictionary). I checked the sizes and the en_CA, en_US, and en_GB all have about 50,000 root words and all use pretty much the same .aff file with slight differences. These word lists seem much better in root word size. So if you get a chance, please try out that en_GB set and see if it would be a better starting point than our older en_GB. It is also interesting to look at the words in the .dic_delta file to see recent additions that are not in the main dictionary. Please let me know what you think. Thanks
Update: See the Google version of en_GB zipped up and attached for ease of access.
Last edited by KevinH; 01-20-2022 at 02:18 PM.
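
Folding a delta file into the main word list could look something like the sketch below. It assumes the .dic_delta file holds one entry per line in the same `word/FLAGS` form as the .dic file, which has not been verified here, and `merge_delta` is a hypothetical name. Entries whose root word is already present are skipped, and the count line is rewritten to match.

```python
def merge_delta(dic_text: str, delta_text: str) -> str:
    """Append delta entries whose root word is new, then rewrite the count line."""
    lines = [ln.strip() for ln in dic_text.splitlines() if ln.strip()]
    if lines and lines[0].isdigit():
        lines = lines[1:]  # drop the old entry count
    seen = {ln.split("/", 1)[0] for ln in lines}
    for entry in delta_text.splitlines():
        entry = entry.strip()
        root = entry.split("/", 1)[0]
        if entry and root not in seen:
            lines.append(entry)
            seen.add(root)
    # The first line of a .dic file is the number of entries that follow.
    return "\n".join([str(len(lines))] + lines) + "\n"
```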

#28
Sigil Developer
@Ashjuk,
One more question. It seems the en_GB can be built to support "ise" endings (à la The Times), or "ize" endings (à la the OED), or both. Which would be best for general purpose use in Sigil?

#29
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Marco Pinto is the one who takes care of most en_GB lists nowadays:
https://github.com/marcoagpinto/aoo-mozilla-en-dict
From a quick look at his dictionary though, he also tends towards including nearly every word under the sun. He also seems to be releasing monthly updates. (Compared to SCOWL's much slower, but thoroughly vetted releases.)
Another nice thing is his changelogs show exactly which words were added when:
https://raw.githubusercontent.com/ma...LO_2013%2B.txt
See:
especially the specs for:
This was out of BCP47:
- - -
From everything that I can recall, what typically happens across programs/apps is...
When you select your language, you'd have the big 2 choices:
1. English (American)
2. English (British) -- -ise
Beyond that point, programs might include many of the main variants (Australian, Canadian, etc.).
... and then (very rarely included by default):
- English (Oxford/OED) -- -ize
- - -
LibreOffice has theirs listed as:
- - -
Word 2016 only has:
No Oxford by default. (No clue if this has changed in newer versions. I believe if you wanted an Oxford dict, you'd have to grab third-party dictionaries.)
From a quick test, it looks like "British" Word may accept all -ise + -ize endings. (But I think that's a poor idea. Again, see SCOWL with popularity+usage+levels-of-accepted-variants.)
- - -
Antidote, when you're selecting between English, gives 4 options:
Note: I agree strongly with this separation. When trying to spellcheck/proof actual texts, books typically stick with a single spelling variant throughout (based on author/publisher location + Style Guide). Mashing all endings together will cause you to MISS inconsistencies within a single text.
So Sigil, if deciding to go with the big 2 + Oxford, should:
And that's also why it's important to... thoroughly double-check against real-life popularity/usage.
(Not like that guy in the Reddit post who said "Why not just accept everything from Wiktionary?" !!!)
Definitely report many of those errors to Marco's github and get those fixed!
Last edited by Tex2002ans; 01-20-2022 at 10:11 PM. Reason: Whoops, accidentally posted a WIP smaller version.
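
To illustrate the kind of inconsistency that a combined -ise/-ize dictionary would let through, here is a crude sketch that flags stems appearing with both spellings in one text. The name `mixed_endings` and the regular expression are assumptions made for this example. It is a heuristic, not a spellchecker, and unrelated word pairs such as prise/prize will show up as false positives.

```python
import re
from collections import defaultdict

# Matches words with -is-/-iz- followed by a common suffix, e.g. organise, organized.
PATTERN = re.compile(r"\b([a-z]+?)i([sz])(e|es|ed|ing|ation|ations)\b", re.IGNORECASE)


def mixed_endings(text: str) -> dict[str, set[str]]:
    """Return stems that occur with both an s and a z spelling in the text."""
    variants = defaultdict(set)
    for match in PATTERN.finditer(text):
        stem = match.group(1).lower()
        variants[stem].add(match.group(2).lower())
    return {stem: forms for stem, forms in variants.items() if forms == {"s", "z"}}
```

With separate -ise and -ize dictionaries the spellchecker catches these directly, because only one of the two spellings is accepted at a time.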

#30
Fanatic
As for the -ise vs -ize debate, I have always used -ise. This is probably because (being nearly 72) that is the way I was taught. Having said that, I do understand that -ize is slowly creeping into common usage. This is probably due to the fact that youngsters here in the UK nowadays are subjected to far more American content than my generation was at that age. That is one thing I noted about the LibreOffice UK dictionary: both versions appeared to be included. Personally I would prefer Sigil to stick with the -ise convention, but I would understand if others would prefer to have both s and z.