11-12-2017, 08:38 PM | #1 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Suggestion: Spellcheck Enhancement (Numbers)
Currently, Sigil does not consider Numbers as "words".
This means that Spellcheck can't catch entire classes of errors, because many "words" don't display in the Spellcheck List (Tools > Spellcheck > Spellcheck). I've attached an Example EPUB to show the issue.

The Problem

This includes the "Current Spellcheck List" vs. "Proposed Spellcheck List" in Spoilers.

Example 1: Centuries or Years:
Code:
In the 21st century, [...]
In the 1800’s, there was [...]
Spoiler:

Example 2: Pounds/Shillings/Pence/Money
Code:
The device cost £14 8s 2d.
Spoiler:

Example 3: Hyphenated Years or Age:
Code:
In the 10-year period between [...]
The 10-year-old girl [...]
Spoiler:

Example 4: Weights/Measures
Code:
It weighs 100.5lbs.
The length is 100.5km and 2ft.
Spoiler:

Example 5: Indexes/Footnotes
Code:
Dogs, 123n., 125, 130n.
See p. 123ff.
Spoiler:

Example 6: A very common typo (especially because of OCR):
Code:
In the 196os, the president was [...]
In l941, the samples were [...]
Good argument, h0wever, you are [...]
Spoiler:

It in Action

Calibre already includes numbers in their Spellcheck, and it is extremely helpful.

Proposal

There is one downside to the Calibre-method though, because the Spellcheck List gets flooded with numbers. Especially when dealing with HTML tables full of data (or in Indexes):

To get around that issue:
|
11-13-2017, 01:53 AM | #2 |
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
I also think that a Numbers as Words checkbox would be a good idea, in particular for OCRed text.
I looked into this some time ago and found out from KevinH that a toggle would need to change line #143 in /src/Misc/HTMLSpellCheck.cpp from:
Code:
if (c.isLetter()) {
to:
Code:
if (c.isLetterOrNumber()) {

@KevinH, @DiapDealer: Since this line appears to define what a "letter" is in terms of spell-checking, it should be possible to add curly right apostrophes (’ U+2019) to the list of "letters" via an additional if clause. This would fix another frequently reported spell check problem. To keep it simple, the algorithm would only have to accept curly right apostrophes in the middle of words.

EDIT: I must have misremembered this.
Last edited by Doitsu; 11-13-2017 at 10:17 AM. |
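To make the effect of that one-line change concrete, here is a minimal Python sketch of a word splitter with the same kind of toggle. It is not Sigil's code: `split_words` and its two flags are hypothetical names, and Python's `str.isalpha` / `str.isalnum` merely stand in for Qt's `isLetter()` / `isLetterOrNumber()`.

```python
def split_words(text, numbers_as_words=False, allow_mid_apostrophe=False):
    """Split text into spellcheck candidates (hypothetical helper, not Sigil code)."""
    # Assumption: isalpha/isalnum approximate QChar::isLetter/isLetterOrNumber.
    is_word_char = str.isalnum if numbers_as_words else str.isalpha
    words, current = [], []
    for i, c in enumerate(text):
        if is_word_char(c):
            current.append(c)
        elif (allow_mid_apostrophe and c == "\u2019" and current
              and i + 1 < len(text) and is_word_char(text[i + 1])):
            # Accept a curly right apostrophe only in the middle of a word.
            current.append(c)
        else:
            # Any other character ends the current word, if there is one.
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("In the 196os, h0wever"))
print(split_words("In the 196os, h0wever", numbers_as_words=True))
```

With the toggle off, "196os" and "h0wever" are broken into harmless-looking fragments ("os", "h", "wever"); with it on, they survive as single candidates that a dictionary lookup can flag.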
11-13-2017, 02:00 AM | #3 |
Unicycle Daredevil
Posts: 13,923
Karma: 185041098
Join Date: Jan 2011
Location: Planet of the Pudding Brains
Device: Aura HD (R.I.P. After six years the USB socket died.) tolino shine 3
|
I would also appreciate such a feature. I'm working with a scan right now where the OCR has produced tons of "l" vs. "1" mix-ups. They are very hard to detect with the spellcheck as it is now.
|
11-13-2017, 02:39 AM | #4 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Then I usually just type the numbers 0-9 in the search box one-by-one, and do a quick scan through the list to see if anything strange pops out.

And I have a few Regex that I use to try to minimize the impact:
Search: [lo]\d
Search: \d[lo]
That tries to catch things like "19l0" or "l910" or "It was 8.o5cm long".

I also try to use this:
Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. [lo]
Search: (January|February|March|April|August|September|October|November|December) [lo]
to try to catch the odd dates: "Jan. i5, 2017" or "March i, 1910" or "August i982".

Last edited by Tex2002ans; 11-13-2017 at 02:44 AM. |
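For anyone who wants to run these searches outside Sigil, the four patterns can be tried with Python's `re` module. This is only a sketch: `suspicious_lines` is a hypothetical helper, and the patterns are copied from the post as written. Note that the character class `[lo]` only covers lowercase "l" and "o", so a mix-up involving "i" would need `[ilo]` instead.

```python
import re

# Patterns from the post: a letter l/o touching a digit, on either side.
DIGIT_CONFUSION = [
    re.compile(r"[lo]\d"),
    re.compile(r"\d[lo]"),
]
# Patterns from the post: a month name followed by a stray l/o.
DATE_CONFUSION = [
    re.compile(r"(Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. [lo]"),
    re.compile(r"(January|February|March|April|August|September|October"
               r"|November|December) [lo]"),
]


def suspicious_lines(lines):
    """Return the lines matched by at least one pattern (hypothetical helper)."""
    patterns = DIGIT_CONFUSION + DATE_CONFUSION
    return [line for line in lines if any(p.search(line) for p in patterns)]


sample = [
    "In 19l0 the samples were taken.",
    "It was 8.o5cm long.",
    "March l, 1910",
    "A clean line from 2017.",
]
for line in suspicious_lines(sample):
    print(line)
```

The first three sample lines are reported and the last one is not.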
|
11-13-2017, 02:59 AM | #5 |
Unicycle Daredevil
Posts: 13,923
Karma: 185041098
Join Date: Jan 2011
Location: Planet of the Pudding Brains
Device: Aura HD (R.I.P. After six years the USB socket died.) tolino shine 3
|
Thanks. I'll try that.
|
11-13-2017, 08:29 AM | #6 | |
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Quote:
That is not a spellcheck bug, but a bug in the design of the German Hunspell dictionaries used. |
|
11-13-2017, 10:18 AM | #7 |
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
|
11-13-2017, 10:28 AM | #8 |
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Understood.
For the record, spellchecking works with utf-8 encoded dictionaries that actually have words with apostrophes (contractions and the like) in the dictionary wordlist.

The problem with the German dictionary in question is twofold:
1. It was encoded in iso-8859-1, not utf-8, and the smart single quote/apostrophe does not exist in that single-byte encoding.
2. The dictionary wordlist itself did not include any words with apostrophes (single quotes) at all.

These are problems the dictionary owner should fix.

That said, I will see about adding support and a preference setting for numbers. But please note: not all dictionaries add mixed letter-number words to their wordlist, so perfectly valid things will then get marked as bad. Grepping for all digits and examining them is probably a safer approach.

Last edited by KevinH; 11-14-2017 at 10:20 AM. |
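The first point is easy to check outside Sigil. The short Python sketch below (with `can_encode` as a hypothetical helper name) shows that the curly apostrophe U+2019 has no representation in iso-8859-1, while the straight apostrophe does:

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


smart = "TJ\u2019s"   # curly right apostrophe, U+2019
dumb = "TJ's"         # straight apostrophe, U+0027

print(can_encode(smart, "utf-8"))
print(can_encode(smart, "iso-8859-1"))
print(can_encode(dumb, "iso-8859-1"))
```

A dictionary stored in that single-byte encoding therefore cannot contain a word spelled with the curly apostrophe, whatever the spellchecker does.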
11-13-2017, 09:09 PM | #9 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
I attached a Sample EPUB to show the problem. Sample Code: Quote:
If you Right Click on the first one + Ignore... nothing happens. If you Right Click on the second one + Ignore... both lose their red squigglies. This is pretty frustrating. I have a Keyboard Shortcut set to the "Ignore" function, and it becomes painstaking to work through larger books, because the "smart quote" words never get properly Ignored. This is an issue on Windows (not too sure about other OSes). |
||
11-14-2017, 12:43 AM | #10 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
I also thought of a few extra samples where numbers are "words".
Names of companies: Code:
23andme tests your DNA. Spoiler:
Misc.: Code:
Write this on A4 paper. You are in Room B9. This is a B-17 Bomber. Spoiler:
Having these show up in the Spellcheck List makes it very easy to jump to the location in the EPUB with a double-click. Currently, the "A" in "A4" would be impossible to spot easily (there could be thousands of "A"s in the book). And B9 + B-17 wouldn't stand apart.

Oh yeah, and another common OCR error you might run across:
Search: 8c
Replace: &
(or in the case of HTML)
Replace: &amp;
8c is also something that sticks out easily in Calibre's Spellcheck List. |
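As a rough illustration of the "8c" fix, here is a small Python sketch. The post's search is the plain string "8c"; restricting it to a standalone token surrounded by whitespace is an added assumption, made so that measurements such as "8cm" are left alone. `fix_8c` is a hypothetical helper name.

```python
import re

# Assumption: only a whitespace-delimited "8c" is a misread ampersand.
STRAY_8C = re.compile(r"(?<=\s)8c(?=\s)")


def fix_8c(text, html=True):
    """Replace a stray OCR '8c' with an ampersand (the entity form for HTML)."""
    return STRAY_8C.sub("&amp;" if html else "&", text)


print(fix_8c("Smith 8c Jones"))
print(fix_8c("Smith 8c Jones", html=False))
print(fix_8c("It is 8cm long"))
```

Each hit is still worth reviewing by hand before replacing, since a token like "8c" can be legitimate in some books.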
11-14-2017, 09:17 AM | #11 | |
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
|
I just tried this on Mac OS X with stock Sigil (most recent) and all of the TJ's and smart versions are marked as incorrectly spelled in CodeView on first start-up.
I then modified your test case to include a duplicate of each of your 2 lines so that I could see if later versions are properly ignored or not (see below). Code:
<body>
<p>The draft which TJ submitted to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ’s comments on them.</p>
<p>The draft which TJ submitted to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ's comments on them.</p>
<p>The draft which TJ submitted v2 to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ’s comments on them.</p>
<p>The draft which TJ submitted v2 to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ's comments on them.</p>
</body>

And all later versions (both smart and dumb) are now correctly not marked as wrong.

So I cannot recreate your issue at all. So what exact version of Sigil are you using? What version of Qt is it using? Have you installed your own dictionaries and if so which?

Quote:
|
|
11-14-2017, 10:09 AM | #12 |
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
@KevinH:
This is most likely a Windows issue. On my Linux machine, right-clicking TJ’s in the first line and selecting Ignore removed all red squiggly lines. (Right-clicking TJ's in the second line and selecting Ignore had the same effect.) However, when I repeated the test on my Windows machine, right-clicking TJ’s in the first line and selecting Ignore didn't remove the red squiggly lines from any line. When I right-clicked TJ's in the second/fourth line and selected Ignore, all red squiggly lines were gone. (I used Sigil 0.9.8 and the stock en_US and en_GB dictionaries for all tests.) |
11-14-2017, 10:23 AM | #13 |
Sigil Developer
Posts: 7,635
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Since it is neither Sigil-version specific nor Qt-version specific, I wonder if this is Locale dependent? What Locale is set for Windows? What happens if you try other Locales? What is your default Windows encoding as well?
I guess it could be a Windows Qt specific bug? Perhaps they are confusing unicode smart quotes with Windows specific encoding smart quotes? Something funny is going on. Last edited by KevinH; 11-14-2017 at 10:26 AM. |
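As background for that guess (and not as a diagnosis of the bug), the Python sketch below shows how the same apostrophe looks in Unicode and in the Windows-1252 code page, and what happens when the Windows-1252 byte is read as iso-8859-1 instead. The variable names are illustrative only.

```python
smart = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# In Windows-1252 the curly apostrophe is the single byte 0x92.
cp1252_bytes = smart.encode("cp1252")
print(cp1252_bytes)

# In UTF-8 the same character takes three bytes.
utf8_bytes = smart.encode("utf-8")
print(utf8_bytes)

# Reading the cp1252 byte as iso-8859-1 yields the control character U+0092,
# which no longer compares equal to the real apostrophe.
misread = cp1252_bytes.decode("iso-8859-1")
print(misread == smart)
```

If two code paths disagreed about the encoding in this way, a word added to an ignore list through one path would not match the same word looked up through the other.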
11-14-2017, 12:45 PM | #14 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Windows 10, 64-bit.
Sigil 0.9.8
Qt 5.6.2

Never touched a dictionary file in my life. It's just whatever comes in default Sigil. (This is just the default Sigil install, right from the site, no funny business.)

Quote:
Locale: English (United States)

Where can I check the "default Windows encoding"? I'm assuming whatever the defaults are in a Windows 7/10 install. Never messed with those settings.

Last edited by Tex2002ans; 11-14-2017 at 12:49 PM. |
||
11-14-2017, 01:57 PM | #15 |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
My test on Windows Vista mirrors Doitsu's results: clicking Ignore on the one with the smart quote had no effect, while clicking Ignore on the one with the straight quote removed the red misspelled line for all four occurrences (smart and dumb).
|