Old 05-26-2016, 03:11 PM   #1
jcsalomon
Zealot
Posts: 100
Karma: 1204
Join Date: Jun 2012
Device: Bookari (née Mantano Reader) on Android; Kindle Fire HD
Spellcheck and apostrophes

This follows up on this thread from January 2014 and this thread from February 2015. (Yes, this means I’m two months late with this thread.)

As the other folk have noted, Sigil’s spellcheck has trouble with words with Unicode curly apostrophes (at least on Windows). I’ve found a partial work-around: when I want to add a word to a personal dictionary, I first replace the curly apostrophe with a straight one—this lets Sigil add the word to the dictionary, and thereafter it will not flag the word even with a Unicode apostrophe. To those familiar with the spellcheck-interface code, this could probably suffice to point to a patch: when adding a word to the dictionary, replace curly apostrophes with the straight ones.
Old 05-26-2016, 04:17 PM   #2
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
That is strange. Are you sure your smart quotes are truly Unicode smart quotes, and not Microsoft code-page smart quotes or Latin-1 smart quotes? And what dictionary are you testing this with? Is it the en_US dictionary, or one of your own wordlist dictionaries?

The Hunspell en_US dictionary we use always converts words with all smart quotes to dumb ones, before attempting to check the word, as this is how the words in the en_US dictionary are stored. So if you add your own words to a wordlist, you should be doing the same.
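
The rule described above (store straight apostrophes, and normalize a copy of each word before checking it) can be sketched roughly like this. This is only an illustration in Python, not Sigil's actual code; the names `to_dumb_apostrophes`, `add_word`, `is_known` and the in-memory `wordlist` set are made up for the example.

```python
# Hypothetical sketch: store and look up words with straight apostrophes only.
CURLY_APOSTROPHE = "\u2019"  # ’ RIGHT SINGLE QUOTATION MARK


def to_dumb_apostrophes(word: str) -> str:
    """Return a copy of word with curly apostrophes replaced by straight ones."""
    return word.replace(CURLY_APOSTROPHE, "'")


def add_word(wordlist: set, word: str) -> None:
    """Normalize before storing, so the list never holds a curly variant."""
    wordlist.add(to_dumb_apostrophes(word))


def is_known(wordlist: set, word: str) -> bool:
    """Normalize a copy of the word before the lookup; the text itself is untouched."""
    return to_dumb_apostrophes(word) in wordlist


wordlist = set()
add_word(wordlist, "y\u2019all")           # added with a curly apostrophe
print(is_known(wordlist, "y'all"))         # checked in the straight form
print(is_known(wordlist, "y\u2019all"))    # checked in the curly form
```

Because both the add path and the check path go through the same helper, it does not matter which apostrophe the word had when it was added.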

Also what version of Sigil are you using?

And are the epub's xhtml files properly UTF-8 encoded? (If you import from Windows text files, this is rarely the case, as Windows is stuck with legacy code-page text encoding instead of straight UTF-8.)
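
One way to check the encoding question is to look at the raw bytes. The sketch below is an assumption about how you might test it yourself, not something Sigil provides: U+2019 is three bytes in UTF-8 but a single byte in Windows-1252, and the hypothetical helper `looks_like_utf8` simply tries a strict decode.

```python
# U+2019 is three bytes in UTF-8 but a single byte (0x92) in Windows-1252.
APOSTROPHE = "\u2019"
utf8_bytes = APOSTROPHE.encode("utf-8")      # b'\xe2\x80\x99'
cp1252_bytes = APOSTROPHE.encode("cp1252")   # b'\x92'


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8(b"don" + utf8_bytes + b"t"))    # True
print(looks_like_utf8(b"don" + cp1252_bytes + b"t"))  # False
```

To test a real file, pass it the bytes of an XHTML file extracted from the epub, for example `Path("chapter.xhtml").read_bytes()` (the file name is just a placeholder).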
Old 05-26-2016, 04:53 PM   #3
KevinH
I just checked and if you hit ignore word and that word uses the curly version of the apostrophe, it is automatically converted to the dumb version before being added to the Sigil dictionary.

That said, user wordlists are always UTF-8 encoded, so non-dumb apostrophes will be left alone there. I will look into whether that should be changed and, if so, how.
Old 05-29-2016, 03:48 AM   #4
jcsalomon
Most recent version, on Windows; not sure about Ignore, but Add Word did not work for words with curly apostrophes in UTF-8 encoded files. (Unlikely that this would matter, but I have two user dictionaries active.)
Old 06-01-2016, 09:23 PM   #5
jcsalomon
Confirming: the files are UTF-8 encoded, and neither Ignore nor Add to Dictionary works for words with U+2019 (’) curly apostrophes. Sigil 0.9.5 64-bit on Windows 10.
Old 01-17-2017, 08:40 AM   #6
AnselmD
Zealot
Posts: 105
Karma: 10
Join Date: Oct 2013
Device: none
I am working with Sigil 0.9.7 on 32-bit Windows.

As far as I can see, the words are added to the dictionary, but they are still flagged as misspelled in the text.

This (’) is the German Auslassungszeichen (the apostrophe that marks omitted letters), so I would rather not change it to a (').

Apostroph – Wikipedia
https://de.wikipedia.org/wiki/Apostr...afisch_korrekt
Old 01-17-2017, 11:36 AM   #7
KevinH
Have you set the proper language code for the document? Have you installed the German Hunspell dictionary, and if so, which "wordchars" are set for that language?

I could never recreate the issue the original poster had when I used a correct dictionary and set the language correctly.

Please verify.

Also please create a small public domain epub testcase that shows the error and attach it here.

KevinH
Old 01-17-2017, 04:30 PM   #8
Doitsu
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
@AnselmD: The German dictionaries that Sigil is bundled with (de_De) are optimized for Windows and might cause problems with some characters such as curly apostrophes.

If you didn't install a custom German dictionary, try the following:

1. Locate your Sigil dictionary folder. (Default: C:\Program Files\Sigil\hunspell_dictionaries).

2. Make a backup copy of de_De.dic and de_De.aff and replace them with the files in this attachment.

3. Retest the spellcheck of the words that you had problems with.

If replacing the dictionary files doesn't solve your problem, please attach your custom dictionary and give some specific examples of words that you have problems with.
(To locate the custom dictionaries, select Edit > Preferences > Open Preferences Location > user_dictionaries.)
Old 01-17-2017, 05:35 PM   #9
varlog
actually it is /var/log
Posts: 341
Karma: 2994236
Join Date: Sep 2012
Location: usually Europa
Device: prs t1
There is this piece of code in Sigil:
Code:
QString Utility::getSpellingSafeText(const QString &raw_text)
{
    // There is currently a problem with Hunspell if we attempt to pass
    // words with smart apostrophes from the CodeView encoding.
    // There are likely better ways to solve this, but this one does
    // get the job done until someone can implement something better.
    QString text(raw_text);
    return text.replace(QString::fromUtf8("\u2019"), "'");
}
Don't know if it's related, it is used three times in SpellCheck.cpp. Just info.
Old 01-17-2017, 06:27 PM   #10
KevinH
Yes, and it is there to prevent dictionaries from having to have duplicate entries for every valid word that uses an apostrophe (one for dumb and one for smart). Its use causes problems, but ....

We think another possible cause of the bug is that user dictionaries and ignored words are added to the hunspell dictionary in the encoding specified by the SET line of the dictionary's .aff file. Unfortunately, the rsquo character is technically not part of the ISO8859-1 encoding used by the German dictionary, and on some platforms encoding it to match that encoding produces a missing-character '?' marker.

Hence Doitsu's modified German dictionary, to see whether it makes a difference to that issue.
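
The '?' substitution is easy to reproduce outside Sigil. The sketch below uses Python's ISO-8859-1 codec as a stand-in for whatever conversion the platform performs; it illustrates the encoding gap only and makes no claim about Sigil's or Qt's exact code path. The sample word is just an example.

```python
word = "geht\u2019s"  # "geht’s" written with U+2019

# ISO-8859-1 has no code point for U+2019, so a strict encode fails...
try:
    word.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...and a lossy encode substitutes a '?' marker instead.
lossy = word.encode("iso-8859-1", errors="replace")

print(strict_ok)  # False
print(lossy)      # b'geht?s'
```

A word stored as `geht?s` can never match `geht’s` or `geht's` in the text, which would fit the symptom of words that are added but still flagged.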

Thanks,

KevinH
Old 01-17-2017, 07:48 PM   #11
BetterRed
null operator (he/him)
Posts: 20,572
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by KevinH
The Hunspell en_US dictionary we use always converts words with all smart quotes to dumb ones, before attempting to check the word, as this is how the words in the en_US dictionary are stored.
Who are 'we'? Or are you claiming the royal prerogative?

Quote:
Originally Posted by KevinH
I could never recreate the issue the original poster had when I used a correct dictionary and set the language correctly.
The 'correct dictionary' is more likely to be a function of where the book was published, not the language in which it is written.

Assuming a book is written in English then: if it was published in the UK I use a British dictionary, if it was published in the US I use a US dictionary, if it's an Australian book I use an Australian dictionary. If it was published elsewhere and I don't have a country specific English dictionary I use the British English dictionary.

If I discover that the book, whilst being published (reprinted) outside the US, is using US spelling I switch to the US dictionary. And vice versa, yes it happens, particularly if the book was originally written in a language other than English (yes, that happens too), translated in the UK, and republished (reprinted) in the US.

This is not a situation peculiar to English, other languages also have country specific dictionaries, e.g. French (France, Belgium, Canada etc), Spanish (Spain, Argentina, Mexico etc), Arabic (Morocco, Egypt, Syria etc), even Nepali (Nepal and India), and Chinese of course!

I've learnt to live with hunspell's inability to deal with apostrophes in English. I'm saying it's a hunspell deficiency because I have two other hunspell based spell checkers that exhibit the same misbehaviour.

If there's a workaround that is only applied when a specific US English dictionary is used, then perhaps it could be made country and dictionary-source agnostic; most English variants have similar rules regarding the use of apostrophes. Except maybe their use in possessives ending with 'ess', 'ecs' and 'zed/zee' <sigh>

BR'

Last edited by BetterRed; 01-17-2017 at 09:17 PM.
Old 01-17-2017, 09:40 PM   #12
KevinH
The "we" here is actually just "me", as I am the one who designed and created the MySpell spell checker that later became the basis for hunspell. At its root, a dictionary is simply a list of all of the words in the working set of a language. To make the list workable in size, you end up needing prefix and suffix compression as well as limiting the set of usable letters in the dictionary. It makes no sense to include every single possessive word in the wordlist twice, so dictionaries standardized on using a normal apostrophe in the wordlist. To spellcheck a word, you make a copy of it and replace any fancy single quotes with an apostrophe so it can be checked efficiently in the dictionary word list. I also produced the first en_US dictionary for MySpell from established wordlists to make that rule work. Numerous other dictionaries have followed that rule, and hunspell inherited that behaviour from MySpell.

It really isn't much of a limitation for efficient dictionary wordlist lookup or for the suggestions provided, as a simple regex or find and replace can easily convert any punctuation and apostrophes to their smart equivalents if that is something the end user wants.
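
As a rough idea of what such a regex could look like, here is a minimal sketch. The pattern and the function name `smarten_apostrophes` are my own assumptions, not anything shipped with Sigil or hunspell, and it only handles an apostrophe sitting between two word characters.

```python
import re

# A straight apostrophe between two word characters (don't, Kevin's) becomes U+2019.
# Leading or trailing marks are left alone because they may be real quotation marks.
_INNER_APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")


def smarten_apostrophes(text: str) -> str:
    """Replace word-internal straight apostrophes with curly ones."""
    return _INNER_APOSTROPHE.sub("\u2019", text)


print(smarten_apostrophes("Kevin's dictionary doesn't store 'quoted' words twice."))
```

Trailing possessives such as boys' and leading elisions such as 'tis are deliberately not touched by this pattern; they need a human decision or a smarter rule.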

As for "correct" dictionaries, in this case it means the user has chosen a dictionary that is encoded in a charset that actually supports the characters he or she uses in the language. The ISO-8859-1 charset does not actually have a smart single quote in it, and that is the encoding the de_DE dictionary uses.

KevinH

Last edited by KevinH; 01-17-2017 at 09:48 PM.
Old 01-19-2017, 03:30 PM   #13
AnselmD
Quote:
Originally Posted by KevinH
Have you set the proper language code for the document? Have you installed the German Hunspell dictionary and if so
Hi Kevin,

I installed this dictionary:
German (de-DE-1901) old spelling dictionaries-2016.04.03 | Apache OpenOffice Extensions
http://extensions.openoffice.org/de/...aries-20160403

C:\Users\xyz\AppData\Local\sigil-ebook\sigil\hunspell_dictionaries

03.04.2016 16:00 33.103 de_DE_OLDSPELL.aff
03.04.2016 16:00 4.253.179 de_DE_OLDSPELL.dic
03.04.2016 16:00 82.926 hyph_de_DE_OLDSPELL.dic

Preferences->Spellcheck Dictionaries->Dictionary: de_DE_OLDSPELL
Quote:
Originally Posted by KevinH
which "wordchars" are set for that language.
What do you mean by this?
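
For context on the question: in a Hunspell affix (.aff) file, the `SET` line names the character encoding of the dictionary, and the optional `WORDCHARS` line lists extra characters that the tokenizer should treat as part of a word. How Sigil uses `WORDCHARS` internally is not spelled out in this thread. The sketch below shows one way to read both lines from a file; the helper `read_aff_settings` and the sample content are made up for illustration.

```python
def read_aff_settings(aff_bytes: bytes) -> dict:
    """Collect the first SET and WORDCHARS lines from the raw bytes of an .aff file."""
    settings = {}
    # Decode as Latin-1: it never fails, and the option names are plain ASCII.
    # A non-ASCII WORDCHARS value from a UTF-8 file will look garbled here.
    for line in aff_bytes.decode("iso-8859-1").splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in ("SET", "WORDCHARS"):
            settings.setdefault(parts[0], parts[1].strip())
    return settings


# Made-up sample, not copied from any real dictionary.
sample = b"SET ISO8859-1\nTRY esianrtolcdugmphbyfvkwz\nWORDCHARS 0123456789'.-\n"
settings = read_aff_settings(sample)
print(settings["SET"])
print(settings["WORDCHARS"])
```

For an installed dictionary you would pass in the bytes of the .aff file from the hunspell_dictionaries folder instead of the sample.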
Old 01-19-2017, 03:40 PM   #14
AnselmD
Quote:
Originally Posted by KevinH
Also please create a small public domain epub testcase that shows the error and attach it here.
KevinH
ok, I attached it
Attached Files
File Type: epub testcase.epub (1.8 KB, 285 views)
Old 01-19-2017, 03:55 PM   #15
AnselmD
Quote:
Originally Posted by Doitsu
@AnselmD: The German dictionaries that Sigil is bundled with (de_De) are optimized for Windows and might cause problems with some characters such as curly apostrophes.

If you didn't install a custom German dictionary, try the following:

1. Locate your Sigil dictionary folder. (Default: C:\Program Files\Sigil\hunspell_dictionaries).

2. Make a backup copy of de_De.dic and de_De.aff and replace them with the files in this attachment.

3. Retest the spellcheck of the words that you had problems with.

If replacing the dictionary files doesn't solve your problem, please attach your custom dictionary and give some specific examples of words that you have problems with.
(To locate the custom dictionaries, select Edit > Preferences > Open Preferences Location > user_dictionaries.)
Hi Doitsu,

Yes, this works!!!

Where can I get a working version of:
German (de-DE-1901) old spelling dictionaries-2016.04.03 | Apache OpenOffice Extensions
http://extensions.openoffice.org/de/...aries-20160403

or how can I convert it?
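
The thread does not say how the replacement files were produced. One plausible approach, offered purely as an assumption, is to re-encode the .aff and .dic files from the encoding named on the `SET` line to UTF-8 and change that line to `SET UTF-8`. The sketch below does this for one file; `convert_to_utf8` is a hypothetical helper, the demo content is invented, and a real dictionary may need further adjustments, so keep a backup of the originals.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode one Hunspell file to UTF-8, rewriting its SET line if it has one."""
    out = []
    # bytes.splitlines only splits on \n, \r and \r\n, so other bytes stay intact.
    for raw in src.read_bytes().splitlines(keepends=True):
        line = raw.decode(source_encoding)
        if line.startswith("SET "):
            ending = line[len(line.rstrip("\r\n")):]  # keep the original line ending
            line = "SET UTF-8" + ending
        out.append(line.encode("utf-8"))
    dst.write_bytes(b"".join(out))


# Demo on a tiny invented .aff file written in ISO-8859-1.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "demo.aff").write_bytes("SET ISO8859-1\nWORDCHARS ß-.\n".encode("iso-8859-1"))
    convert_to_utf8(folder / "demo.aff", folder / "demo_utf8.aff")
    converted = (folder / "demo_utf8.aff").read_text(encoding="utf-8")

print(converted)
```

Both the .aff and the .dic file would need the same treatment so that they agree on the encoding; the .dic file has no `SET` line, so only its bytes change.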