#1 |
Zealot
Posts: 100
Karma: 1204
Join Date: Jun 2012
Device: Bookari (née Mantano Reader) on Android; Kindle Fire HD
|
Spellcheck and apostrophes
This follows up on this thread from January 2014 and this thread from February 2015. (Yes, this means I’m two months late with this thread.)
As the other folk have noted, Sigil’s spellcheck has trouble with words with Unicode curly apostrophes (at least on Windows). I’ve found a partial work-around: when I want to add a word to a personal dictionary, I first replace the curly apostrophe with a straight one—this lets Sigil add the word to the dictionary, and thereafter it will not flag the word even with a Unicode apostrophe. To those familiar with the spellcheck-interface code, this could probably suffice to point to a patch: when adding a word to the dictionary, replace curly apostrophes with the straight ones. |
#2 |
Sigil Developer
Posts: 8,439
Karma: 5702578
Join Date: Nov 2009
Device: many
|
That is strange. Are you sure your smart quotes are truly Unicode smart quotes, and not Microsoft code-page or Latin-1 smart quotes? And what dictionary are you testing this with? Is it the en_US dictionary, or one of your own wordlist dictionaries?
With the Hunspell en_US dictionary we use, Sigil always converts smart quotes in a word to dumb ones before attempting to check the word, as this is how the words in the en_US dictionary are stored. So if you add your own words to a wordlist, you should do the same. Also, what version of Sigil are you using? And are the epub's XHTML files properly UTF-8 encoded? (If you import from Windows text files this is rarely the case, as Windows is stuck with legacy code-page text encodings instead of straight UTF-8.) |
#3 |
Sigil Developer
Posts: 8,439
Karma: 5702578
Join Date: Nov 2009
Device: many
|
I just checked, and if you hit Ignore Word and that word uses the curly version of the apostrophe, it is automatically converted to the dumb version before being added to the Sigil dictionary.
That said, user wordlists are always UTF-8 encoded, so non-dumb apostrophes are left alone there. I will look into whether that should be changed and, if so, how. |
#4 |
Zealot
Posts: 100
Karma: 1204
Join Date: Jun 2012
Device: Bookari (née Mantano Reader) on Android; Kindle Fire HD
|
Most recent version, on Windows; not sure about Ignore, but Add Word did not work for words with curly apostrophes in UTF-8 encoded files. (Unlikely that this would matter, but I have two user dictionaries active.)
|
#5 |
Zealot
Posts: 100
Karma: 1204
Join Date: Jun 2012
Device: Bookari (née Mantano Reader) on Android; Kindle Fire HD
|
Confirming: the files are UTF-8 encoded, and neither Ignore nor Add to Dictionary works for words with U+2019 (’) curly apostrophes. Sigil 0.9.5 64-bit on Windows 10.
|
#6 |
Zealot
Posts: 105
Karma: 10
Join Date: Oct 2013
Device: none
|
I am working with Sigil 0.9.7, 32-bit, on Windows.
As far as I can see, the words are added to the dictionary but are still flagged as misspelled in the text. This (’) is the German Auslassungszeichen (the apostrophe that marks an omission), so I would rather not change it to ('). Apostroph – Wikipedia https://de.wikipedia.org/wiki/Apostr...afisch_korrekt |
#7 |
Sigil Developer
Posts: 8,439
Karma: 5702578
Join Date: Nov 2009
Device: many
|
Have you set the proper language code for the document? Have you installed the German Hunspell dictionary, and if so, which "wordchars" are set for that language?
I could never recreate the issue the original poster had when I used a correct dictionary and set the language correctly. Please verify. Also please create a small public domain epub testcase that shows the error and attach it here. KevinH |
#8 |
Grand Sorcerer
Posts: 5,680
Karma: 23983815
Join Date: Dec 2010
Device: Kindle PW2
|
@AnselmD: The German dictionaries that Sigil is bundled with (de_De) are optimized for Windows and might cause problems with some characters such as curly apostrophes.
If you didn't install a custom German dictionary, try the following:
1. Locate your Sigil dictionary folder. (Default: C:\Program Files\Sigil\hunspell_dictionaries)
2. Make a backup copy of de_De.dic and de_De.aff and replace them with the files in this attachment.
3. Retest the spellcheck of the words that you had problems with.
If replacing the dictionary files doesn't solve your problem, please attach your custom dictionary and give some specific examples of words that you have problems with. (To locate the custom dictionaries, select Edit > Preferences > Open Preferences Location > user_dictionaries.) |
#9 |
actually it is /var/log
Posts: 341
Karma: 2994236
Join Date: Sep 2012
Location: usually Europa
Device: prs t1
|
There is this piece of code in Sigil:
Code:
QString Utility::getSpellingSafeText(const QString &raw_text)
{
    // There is currently a problem with Hunspell if we attempt to pass
    // words with smart apostrophes from the CodeView encoding.
    // There are likely better ways to solve this, but this one does
    // get the job done until someone can implement something better.
    QString text(raw_text);
    return text.replace(QString::fromUtf8("\u2019"), "'");
}
|
#10 |
Sigil Developer
Posts: 8,439
Karma: 5702578
Join Date: Nov 2009
Device: many
|
Yes, and it is there to prevent dictionaries from having to carry duplicate entries for every valid word that uses an apostrophe (one for dumb and one for smart). Its use causes problems, but ....
We think another possible cause of the bug is that user dictionaries and ignored words are added to the Hunspell dictionary in the encoding specified by the SET line of the dictionary's .aff file. Unfortunately, the rsquo character (U+2019) is technically not part of the ISO8859-1 encoding used by the German dictionary, and on some platforms converting it to that encoding yields a '?' missing-character marker. Hence Doitsu's modified German dictionary, to see whether it affects the issue. Thanks, KevinH |
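To make the encoding point concrete, here is a minimal sketch of what a lossy conversion to ISO 8859-1 does to a word containing U+2019. This is not Sigil or Hunspell code; `toLatin1WithFallback` is a made-up name, and the '?' substitution is assumed to mirror what a typical converter does when a character has no slot in the target encoding.

```cpp
#include <string>

// Hypothetical helper (not part of Sigil or Hunspell): encode Unicode code
// points as ISO 8859-1 bytes, substituting '?' for anything above U+00FF.
std::string toLatin1WithFallback(const std::u32string &text)
{
    std::string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        if (cp <= 0xFF) {
            // Code points U+0000..U+00FF map one-to-one onto Latin-1 bytes.
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
        } else {
            // U+2019 lands here: ISO 8859-1 has no curly apostrophe.
            out.push_back('?');
        }
    }
    return out;
}
```

With this sketch, `toLatin1WithFallback(U"geht\u2019s")` yields `geht?s`, so the entry stored in the dictionary can no longer match the word in the text, while `geht's` with a straight apostrophe passes through unchanged.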
#11 | ||
null operator (he/him)
Posts: 21,600
Karma: 29709834
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
Assuming a book is written in English: if it was published in the UK I use a British dictionary, if it was published in the US I use a US dictionary, and if it's an Australian book I use an Australian dictionary. If it was published elsewhere and I don't have a country-specific English dictionary, I use the British English dictionary. If I discover that the book, whilst being published (reprinted) outside the US, is using US spelling, I switch to the US dictionary. And vice versa; yes, it happens, particularly if the book was originally written in a language other than English (yes, that happens too), translated in the UK, and republished (reprinted) in the US.
This is not a situation peculiar to English. Other languages also have country-specific dictionaries, e.g. French (France, Belgium, Canada etc), Spanish (Spain, Argentina, Mexico etc), Arabic (Morocco, Egypt, Syria etc), even Nepali (Nepal and India), and Chinese of course!
I've learnt to live with hunspell's inability to deal with apostrophes in English. I'm saying it's a hunspell deficiency because I have two other hunspell-based spell checkers that exhibit the same misbehaviour. If there's a workaround that is only applied when a specific US English dictionary is used, then perhaps it could be made country and dictionary-source agnostic, since most English variants have similar rules regarding the use of apostrophes. Except maybe their use in possessives ending with 'ess', 'ecs' and 'zed/zee' <sigh>
BR
Last edited by BetterRed; 01-17-2017 at 09:17 PM. |
||
#12 |
Sigil Developer
Posts: 8,439
Karma: 5702578
Join Date: Nov 2009
Device: many
|
The "we" here is actually just "me", as I am the one who designed and created the MySpell spell checker that later became the basis for hunspell. At its root, a dictionary is simply a list of all of the words in the working set of a language. To keep the list workable in size, you end up needing prefix and suffix compression, as well as a limit on the set of letters used in the dictionary. It makes no sense to include every single possessive word in the wordlist twice, so dictionaries standardized on using a normal apostrophe in the wordlist. To spellcheck a word, you make a copy of it and replace any fancy single quotes with an apostrophe so it can be checked efficiently against the dictionary wordlist. I also produced the first en_US dictionary for MySpell from established wordlists to make that rule work. Numerous other dictionaries have followed that rule, and hunspell inherited that behaviour from MySpell.
It really isn't much of a limitation for efficient wordlist lookup or for the suggestions provided, as a simple regex or find-and-replace can easily convert any punctuation and apostrophes to their smart equivalents if that is something the end user wants. As for "correct" dictionaries, in this case it means the user has chosen a dictionary that is encoded in a charset that actually supports the characters he or she uses in the language. The ISO-8859-1 charset does not actually have a smart single quote in it, and that is the encoding the de_DE dictionary uses. KevinH Last edited by KevinH; 01-17-2017 at 09:48 PM. |
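As a rough illustration of the rule described in this post (store one spelling, normalize a copy of the word before lookup), here is a small sketch. It is not MySpell, hunspell, or Sigil code: the names `straightenApostrophes` and `isKnownWord` are invented for the example, the word list is a plain set rather than a compressed dictionary, and the input is assumed to be UTF-8.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical helper: return a copy of a UTF-8 word in which every
// U+2019 (bytes E2 80 99) is replaced by an ASCII apostrophe.
std::string straightenApostrophes(const std::string &word)
{
    static const std::string curly = "\xE2\x80\x99";
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = word.find(curly, pos);
        if (hit == std::string::npos) {
            out.append(word, pos, std::string::npos); // copy the remainder
            break;
        }
        out.append(word, pos, hit - pos); // copy text before the apostrophe
        out.push_back('\'');
        pos = hit + curly.size();
    }
    return out;
}

// The word list stores only the straight-apostrophe spelling, so the
// lookup normalizes a copy of the word and leaves the document untouched.
bool isKnownWord(const std::unordered_set<std::string> &wordlist,
                 const std::string &word)
{
    return wordlist.count(straightenApostrophes(word)) > 0;
}
```

With a word list containing only `don't`, both `don't` and `don’t` are accepted, and the list needs no second entry for the curly variant. The same normalization would have to be applied when a word is added to a user wordlist for the two sides to stay consistent.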
#13 | |
Zealot
Posts: 105
Karma: 10
Join Date: Oct 2013
Device: none
|
Quote:
I installed this dictionary: German (de-DE-1901) old spelling dictionaries-2016.04.03 | Apache OpenOffice Extensions http://extensions.openoffice.org/de/...aries-20160403
C:\Users\xyz\AppData\Local\sigil-ebook\sigil\hunspell_dictionaries
03.04.2016 16:00 33.103 de_DE_OLDSPELL.aff
03.04.2016 16:00 4.253.179 de_DE_OLDSPELL.dic
03.04.2016 16:00 82.926 hyph_de_DE_OLDSPELL.dic
Preferences->Spellcheck Dictionaries->Dictionary: de_DE_OLDSPELL
What do you mean by this? |
|
#14 |
Zealot
Posts: 105
Karma: 10
Join Date: Oct 2013
Device: none
|
ok, I attached it
|
#15 | |
Zealot
Posts: 105
Karma: 10
Join Date: Oct 2013
Device: none
|
Quote:
Yes, this works!!! Where can I get a working version of German (de-DE-1901) old spelling dictionaries-2016.04.03 | Apache OpenOffice Extensions http://extensions.openoffice.org/de/...aries-20160403, or how can I convert it? |
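On the conversion question, one possible approach is to re-encode the .dic and .aff files as UTF-8 and change the SET line in the .aff to match. The sketch below is only that, a sketch: it assumes the dictionary's .aff file currently declares SET ISO8859-1 (not verified for de_DE_OLDSPELL), the helper names `latin1ToUtf8` and `convertAffLine` are made up, and other directives in the .aff file may need manual attention after conversion.

```cpp
#include <string>

// Hypothetical helper: re-encode ISO 8859-1 bytes as UTF-8.
// Bytes below 0x80 are copied; the rest become two-byte sequences.
std::string latin1ToUtf8(const std::string &latin1)
{
    std::string out;
    out.reserve(latin1.size());
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            out.push_back(static_cast<char>(byte));
        } else {
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}

// Hypothetical helper: convert one line of an .aff file. The SET line is
// rewritten to declare UTF-8; every other line is simply re-encoded.
std::string convertAffLine(const std::string &line)
{
    if (line.compare(0, 4, "SET ") == 0) {
        return "SET UTF-8";
    }
    return latin1ToUtf8(line);
}
```

Each line of the .dic file would go through `latin1ToUtf8` and each line of the .aff file through `convertAffLine`. A general-purpose converter such as iconv can do the re-encoding step as well; either way, keep a backup of the original files and retest the words that were being flagged.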
|
Tags |
bug report, feature request, punctuation, sigil, unicode |