Old 07-09-2019, 03:47 PM   #16
Doitsu
Wizard
Posts: 4,620
Karma: 14578553
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by KevinH View Post
This process seems to have been lost over the years as people do not understand the affix rules and affix compression.
IMHO, the main problem is that there aren't any user-friendly tools for editing/generating dictionary and suffix files.
Old 07-09-2019, 05:48 PM   #17
BetterRed
null operator
Posts: 14,065
Karma: 11423372
Join Date: Mar 2012
Location: Sydney Australia
Device: none
↑ ↑ ↑ ✔️

Several years ago I tried, and failed, to edit the Kracked Press en GB hunspell dictionary. I also tried and failed to create a domain specific dictionary. I was surprised there were no tools specific to the task - no demand I guess.

Today, I could possibly create an epub from scratch with notepad and pkzip - but only because of what I've learnt from using Sigil. On reflection, that's a big 'possibly' if they were all I had.

BR

Last edited by BetterRed; 07-09-2019 at 05:50 PM.
Old 07-09-2019, 10:20 PM   #18
KevinH
Wizard
Posts: 3,567
Karma: 2200024
Join Date: Nov 2009
Device: many
MySpell 2 or 3 had both munch and unmunch tools that worked for the dictionaries used at that time (including English, German, French, Spanish, etc.). Unfortunately, Hunspell needed compound prefixes, compound suffixes, and compound words to handle Hungarian and other languages. The standard munch and unmunch tools were never really modified for those changes and nothing was ever documented.

MySpell dictionaries still work in Hunspell and work for most western languages. I can probably dig up a copy of MySpell-3 source someplace and walk anyone through it.

Last edited by KevinH; 07-09-2019 at 10:56 PM.
Old 07-10-2019, 11:41 AM   #19
elchamaco
Zealot
Posts: 108
Karma: 500
Join Date: Aug 2011
Device: kindle, boox
Quote:
Originally Posted by KevinH View Post
Please note that for Hunspell dictionaries that properly use affix detection and compression, you should not add unflagged words to the dictionary. The proper way to handle that for en is to expand the dictionary (by reversing affix flag usage) to recreate a plain word list, add your new words, being sure to add all versions of each word with prefixes and suffixes, and then re-crunch the word list.

This process seems to have been lost over the years as people do not understand the affix rules and affix compression.

For example, the en US dict that Sigil used to use had no affix compression at all. Being the original author of MySpell (predecessor of Hunspell) and one-time head of OpenOffice's lingucomponent project, I find it sad that the information on how to properly create dictionaries that are not giant word lists has been lost.

In addition, the role of a spellcheck dictionary is not the same as that of an online dictionary or a real dictionary. Spellcheck dictionaries should be designed to focus on the "working set" of a language and NOT try to be all-encompassing, as this actually leads to fewer incorrect words being detected: common mistakes turn out to be real but not typically used words, or slang, or abbreviations, or whatnot.

You are better off creating additional user dictionaries that catch common words you use that are not covered by the spellcheck dictionaries, to expand your personal "working set" of the language.

Some time ago I created a Spanish Hunspell dictionary. I had to dig around to create a good one, and now it's used with Sigil by a lot of people. Now the idea is to improve it.

I also want to improve a real dictionary with definitions.

It's hard to find documentation about dictionaries, or a good program to edit them and export them to different formats.
Old 07-10-2019, 12:00 PM   #20
KevinH
Wizard
I will grab a copy of the Spanish Hunspell dictionary and take a look to see what features are being used. If it sticks to things that MySpell groks, we can use the MySpell tools to expand the Spanish dictionary and then remunch it for use in Hunspell. If it uses any of the newer Hunspell features, the older munch and unmunch tools will not be of any help.
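A rough way to make that check is to scan the .aff file for directive keywords outside a MySpell-level set. The sketch below is only an illustration: the allow-list `MYSPELL_DIRECTIVES` is an assumption (not confirmed anywhere in this thread) and should be adjusted against the MySpell version actually in use, and the function name and sample text are made up for the example.

```python
# ASSUMPTION: keywords treated here as "MySpell-level". This allow-list is a
# guess for illustration; verify it against the MySpell release you build.
MYSPELL_DIRECTIVES = {"SET", "TRY", "PFX", "SFX", "REP", "MAP"}


def non_myspell_directives(aff_text):
    """Return the sorted directive keywords that are not in the allow-list."""
    found = set()
    for line in aff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword = line.split()[0]
        if keyword not in MYSPELL_DIRECTIVES:
            found.add(keyword)
    return sorted(found)


# Made-up sample .aff content, not taken from the real es.aff.
sample_aff = """\
SET ISO8859-1
TRY aeiou
FLAG long
NEEDAFFIX zz
SFX S Y 1
SFX S 0 s .
"""

print(non_myspell_directives(sample_aff))
```

An empty result would suggest (under the assumed allow-list) that munch and unmunch have a chance of working; anything else is worth a closer look by hand.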

KevinH




Old 07-10-2019, 12:54 PM   #21
KevinH
Wizard
Okay, the version of the Spanish dictionary shipped inside Sigil on Windows and Mac is a straight MySpell-level dictionary, and as such the munch and unmunch tools will work.

I found an old copy of MySpell-3 stored on a google code archive and was able to easily build and run it on my Mac. This included munch and unmunch tools as well.

So with unmunch, I can take the es.aff file (which describes prefixes and suffixes commonly used in Spanish along with the rules for when they apply) and the es.dic file and create one long list of every recognized word in all of its forms.
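For anyone unfamiliar with what that expansion looks like, here is a toy sketch in Python. It is not the MySpell unmunch tool: the rules and words below are invented for illustration (they are not taken from es.aff), flags are assumed to be single characters, and prefix/suffix cross-products are ignored.

```python
import re

# Toy MySpell-style rules: <PFX|SFX> <flag> <strip> <add> <condition>.
# Header lines such as "SFX S Y 2" only announce how many rules follow.
AFF = """\
SFX S Y 2
SFX S 0 s [aeiou]
SFX S 0 es [^aeiou]
PFX R Y 1
PFX R 0 re .
"""

# Toy .dic content: first line is the entry count, then word/FLAGS.
DIC = """\
3
casa/S
papel/S
hacer/R
"""


def parse_aff(text):
    """Return {flag: [(kind, strip, add, compiled_condition), ...]}."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] not in ("PFX", "SFX"):
            continue  # skips the 4-field header lines and anything else
        kind, flag, strip, add, cond = parts[:5]
        strip = "" if strip == "0" else strip  # "0" means "nothing"
        add = "" if add == "0" else add
        # Suffix conditions anchor at the end of the word, prefix ones at the start.
        pattern = re.compile((cond + "$") if kind == "SFX" else ("^" + cond))
        rules.setdefault(flag, []).append((kind, strip, add, pattern))
    return rules


def unmunch(dic_text, rules):
    """Expand every flagged base word into its full set of forms."""
    words = set()
    for entry in dic_text.splitlines()[1:]:  # skip the count line
        entry = entry.strip()
        if not entry:
            continue
        base, _, flags = entry.partition("/")
        words.add(base)
        for flag in flags:
            for kind, strip, add, pattern in rules.get(flag, []):
                if not pattern.search(base):
                    continue
                if kind == "SFX":
                    words.add(base[: len(base) - len(strip)] + add)
                else:
                    words.add(add + base[len(strip):])
    return sorted(words)


print(unmunch(DIC, parse_aff(AFF)))
```

Three flagged base words become six word forms here, which is the same effect the real tool has on the full dictionary, just at a much smaller scale.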

You can then add lots of new words. Or even create a new Prefixes or Suffixes flag if you know which ones might be missing and the rules for applying them.

Once we have that, we can run munch to create the new .dic file. We can also add charmaps and replacement tables, along with phonetic sound-alike rules, to help improve the suggestions generated.
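The munch direction can be sketched in the same spirit. This is a simplified illustration, not the actual munch algorithm: the suffix rules are hypothetical, and a flag is attached to a word only when every form the flag would generate is already present in the word list.

```python
import re

# Hypothetical suffix rules: (flag, strip, add, condition on the base word).
RULES = [
    ("S", "", "s", re.compile(r"[aeiou]$")),
    ("S", "", "es", re.compile(r"[^aeiou]$")),
]


def expand(base, flag):
    """All forms that `flag` generates from `base` (empty if no rule applies)."""
    return {
        base[: len(base) - len(strip)] + add
        for f, strip, add, cond in RULES
        if f == flag and cond.search(base)
    }


def munch(wordlist):
    """Compress a plain word list into .dic-style lines (count line first)."""
    words = set(wordlist)
    covered = set()  # derived forms that some flagged base already generates
    entries = {}
    for word in sorted(words):
        flags = ""
        for flag in sorted({f for f, *_ in RULES}):
            forms = expand(word, flag)
            if forms and forms <= words:  # only flag if nothing new is invented
                flags += flag
                covered |= forms
        entries[word] = flags
    lines = [
        word + ("/" + flags if flags else "")
        for word, flags in sorted(entries.items())
        if flags or word not in covered  # drop unflagged words a base covers
    ]
    return [str(len(lines))] + lines


print(munch(["casa", "casas", "papel", "papeles", "sol"]))
```

Note that "sol" stays unflagged because "soles" is not in the input list; adding a flag there would make the spellchecker accept a word nobody asked for, which is why new words need to be added in all their forms before re-munching.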

So if this is something you would like to do, I would be happy to help. Once you get into Hunspell-only features, munch and unmunch will no longer work and you are on your own, so to speak.
Old 07-10-2019, 01:31 PM   #22
KevinH
Wizard
Just for laughs, I ran unmunch on the en_US.dic and en_US.aff files, and the 62,074 base words with affix flags expanded to a word list of 152,469 unique words.

I tried the same thing for es.dic and es.aff and the 58,154 base words with affix flags expanded to a word list of 689,751 unique words.

So Spanish must make use of prefixes and suffixes much more than English!
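To put numbers on that, the expansion ratio (unique word forms per base word) can be worked out directly from the counts above. A quick sketch; the function and variable names are just for illustration:

```python
# (base words with affix flags, unique words after unmunch), from the counts above
counts = {
    "en_US": (62_074, 152_469),
    "es": (58_154, 689_751),
}


def expansion_ratio(base_words, expanded_words):
    """Average number of word forms generated per base entry."""
    return expanded_words / base_words


for name, (base, expanded) in counts.items():
    print(f"{name}: {expansion_ratio(base, expanded):.2f} forms per base word")
```

That works out to roughly 2.5 forms per base word for English and nearly 12 for Spanish.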

Also, if you look at the working-set vocabulary used by Shakespeare, for example, it was something like 35,000 words. Most average people have working sets of 10,000 to 20,000 words.

Any way you look at it, having 689,751 unique words seems to be huge coverage.

Has anyone validated the universe of words the Spanish dictionary already covers?
Old 07-11-2019, 12:34 PM   #23
KevinH
Wizard
@elchamaco
If I were to zip up the unmunched Spanish wordlist and post it here, would you be willing to download it and look at it to see if it makes sense at all? Having over 600,000 unique letter combinations that a spellcheck dictionary would deem correct just seems too big to be true without compound words.

Thanks,

KevinH
Old 07-18-2019, 04:01 AM   #24
elchamaco
Zealot
The one I created had a base list of nearly 1 million words (980-990k) and a munched list of 234k. I used the .aff from the LibreOffice Spanish dictionary, if I remember correctly.
Old 07-18-2019, 10:22 AM   #25
KevinH
Wizard
The problem is that more words do not necessarily make a spellcheck dictionary better (unlike an online dictionary).

As I tried to explain earlier, a spellcheck dictionary is meant to cover the "working set" of a language. It is not meant to be exhaustive such as an online or paper copy dictionary would attempt to be.

The reason is that many times common mistakes and typos turn out to be actual but very infrequently used "words" and not what the author intended. It also results in words being suggested for replacement that the author would never use. Both lower the effectiveness of the spellchecker.
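A tiny illustration of that effect, using two plain Python sets as stand-ins for a small "working set" dictionary and an exhaustive one. The word lists are made up for the example; "thew" and "calender" are real but rarely used English words that are far more likely to be typos for "the" and "calendar".

```python
# Stand-in for a working-set dictionary (illustrative only).
WORKING_SET = {"mark", "the", "calendar", "from", "form"}

# Real but rarely used words that an exhaustive dictionary would also accept.
RARE_WORDS = {"thew", "calender"}


def misspelled(text, dictionary):
    """Return the words in `text` that the dictionary does not recognize."""
    return [word for word in text.lower().split() if word not in dictionary]


draft = "mark thew calender"

print(misspelled(draft, WORKING_SET))               # both typos are flagged
print(misspelled(draft, WORKING_SET | RARE_WORDS))  # nothing is flagged
```

With the larger dictionary the same draft passes cleanly, so the typos go unnoticed.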

The idea is that more rarely used or more esoteric words can and should be looked up in online dictionaries.

One of the nice features of spellcheck dictionaries is that authors can add their own list of less common words that they actually use to augment the "working set", making the spellcheck function fine-tuned to that particular person and their writing.

That was and continues to be the concept behind the design of spell check dictionaries.

Hope something here helps.