Suggestion: Spellcheck Enhancement (Numbers)
3 Attachment(s)
Currently, Sigil does not consider Numbers as "words".
This means that Spellcheck can't catch entire classes of errors, because many "words" don't display in the Spellcheck List (Tools > Spellcheck > Spellcheck). I've attached an Example EPUB to show the issue.

The Problem

This includes the "Current Spellcheck List" vs. "Proposed Spellcheck List" in Spoilers.

Example 1: Centuries or Years:
Code:
In the 21st century, [...]
Spoiler:

Example 2: Pounds/Shillings/Pence/Money
Code:
The device cost £14 8s 2d.
Spoiler:

Example 3: Hyphenated Years or Age:
Code:
In the 10-year period between [...]
Spoiler:

Example 4: Weights/Measures
Code:
It weighs 100.5lbs.
Spoiler:

Example 5: Indexes/Footnotes
Code:
Dogs, 123n., 125, 130n.
Spoiler:

Example 6: A very common typo (especially because of OCR):
Code:
In the 196os, the president was [...]
Spoiler:

It in Action

Calibre already includes numbers in their Spellcheck: Attachment 159978 and it is extremely helpful.

Proposal

There is one downside to the Calibre-method though, because the Spellcheck List gets flooded with numbers. Especially when dealing with HTML tables full of data (or in Indexes): Attachment 159979

To get around that issue:
|
I also think that a Numbers as Words checkbox would be a good idea, in particular for OCRed text.
I looked into this some time ago and found out from KevinH that a toggle would need to change line #143 in /src/Misc/HTMLSpellCheck.cpp from:
Code:
if (c.isLetter()) {
to:
Code:
if (c.isLetterOrNumber()) {
@KevinH, @DiapDealer: Since this line appears to define what a "letter" is in terms of spell-checking, it should be possible to add curly right apostrophes (’ U+2019) to the list of "letters" via an additional if clause. This would fix another frequently reported spell check problem. To keep it simple, the algorithm would only have to accept curly right apostrophes in the middle of words. EDIT: I must have misremembered this. |
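To make the effect of that one-line change concrete, here is a minimal sketch of a word splitter with a "numbers as words" toggle. It is not Sigil's code: `split_words` and its `numbers_as_words` parameter are hypothetical names, and it uses the standard library's `std::isalpha` / `std::isalnum` as ASCII-only stand-ins for Qt's `isLetter()` / `isLetterOrNumber()`.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper, not Sigil's actual code: split ASCII text into
// "words" for a spellcheck list. When numbers_as_words is true, digits
// count as word characters; otherwise only letters do.
std::vector<std::string> split_words(const std::string &text, bool numbers_as_words)
{
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : text) {
        const bool is_word_char = numbers_as_words ? std::isalnum(c) != 0
                                                   : std::isalpha(c) != 0;
        if (is_word_char) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            // A non-word character ends the current word.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

With the toggle off, "196os" only contributes the fragment "os" to the list; with it on, the whole token "196os" becomes a single entry that can be spotted and corrected.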
I would also appreciate such a feature. I'm working with a scan right now where the OCR has produced tons of "l" vs. "1" mix-ups. They are very hard to detect with the spellcheck as it is now.
|
Quote:
Then I usually just type the numbers 0-9 in the search box one-by-one, and do a quick scan through the list to see if anything strange pops out. And I have a few Regex that I use to try to minimize the impact:
Search: [lo]\d
Search: \d[lo]
That tries to catch things like "19l0" or "l910" or "It was 8.o5cm long". I also try to use this:
Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. [lo]
Search: (January|February|March|April|August|September|October|November|December) [lo]
to try to catch the odd dates: "Jan. i5, 2017" or "March i, 1910" or "August i982". |
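For anyone who wants to run the first two searches outside of Sigil, here is a small sketch that combines them into one pattern using `std::regex`. The function name `find_ocr_suspects` is made up for this example, and the input is assumed to be plain ASCII text rather than a full EPUB.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return every place where a lowercase "l" or "o"
// sits directly next to a digit. This combines the two searches
// [lo]\d and \d[lo] into a single alternation.
std::vector<std::string> find_ocr_suspects(const std::string &text)
{
    static const std::regex pattern(R"([lo]\d|\d[lo])");
    std::vector<std::string> hits;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        hits.push_back(it->str());
    }
    return hits;
}
```

Each hit is the two-character pair that matched, so "19l0" reports "9l" and "It was 8.o5cm long" reports "o5". A clean year like "1910" produces no hits.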
Thanks. I'll try that.
|
Quote:
That is not a spellcheck bug, but a bug in the design of the German Hunspell dictionaries used. |
Quote:
|
Understood.
For the record, spellchecking works with utf-8 encoded dictionaries that actually have words with apostrophes (contractions and the like) in that dictionary wordlist. The problem with the German dictionary in question is twofold:
1. It was encoded in iso-8859-1, not utf-8, and the smart single quote/apostrophe does not exist in that single-byte encoding.
2. The dictionary wordlist itself did not include any words with apostrophes (single quotes) at all.
These are problems the dictionary owner should fix. That said, I will see about adding support and a preference setting for numbers. But please note: not all dictionaries add mixed letter-number words to their dictionary wordlist, and so perfectly valid things will then get marked as bad. Grepping for all digits and examining them is probably a safer approach. |
1 Attachment(s)
Quote:
I attached a Sample EPUB to show the problem. Sample Code: Quote:
If you Right Click on the first one + Ignore... nothing happens. If you Right Click on the second one + Ignore... both lose their red squigglies. This is pretty frustrating. I have a Keyboard Shortcut set to the "Ignore" function, and it becomes painstaking to work through larger books, because the "smart quote" words never get properly Ignored. This is an issue on Windows (not too sure about other OSes). Quote:
|
I also thought of a few extra samples where numbers are "words".
Names of companies:
Code:
23andme tests your DNA.
Spoiler:

Misc.:
Code:
Write this on A4 paper.
Spoiler:
It showing up in the Spellcheck List makes it very easy to jump to the location in the EPUB with a double-click. Currently, the "A" in "A4" would be impossible to spot easily (there could be thousands of "A" in the book). And B9 + B-17 wouldn't stand apart. Quote:
Search: 8c
Replace: &
(or in the case of HTML)
Replace: &amp;
8c is also something that sticks out easily in Calibre's Spellcheck List. :) |
I just tried this on Mac OS X with stock Sigil (most recent) and all of the TJ's and smart versions are marked as incorrectly spelled in CodeView on first start-up.
I then modified your test case to include a duplicate of each of your 2 lines so that I could see if later versions are properly ignored or not (see below). Code:
<body>
And correctly, all later versions (both smart and dumb) are now properly not marked as wrong. So I cannot recreate your issue at all. So what exact version of Sigil are you using? What version of Qt is it using? Have you installed your own dictionaries and if so which? Quote:
|
@KevinH:
This is most likely a Windows issue. On my Linux machine, right-clicking TJ’s in the first line and selecting Ignore removed all red squiggly lines. (Right-clicking TJ's in the second line and selecting Ignore had the same effect.) However, when I repeated the test on my Windows machine, right-clicking TJ’s in the first line and selecting Ignore didn't remove the red squiggly lines from any line. When I right-clicked TJ's in the second/fourth line and selected Ignore, all red squiggly lines were gone. (I used Sigil 0.9.8 and the stock en_US and en_GB dictionaries for all tests.) |
Since it is not Sigil-version specific and not Qt-version specific, I wonder if this is Locale dependent? What Locale is set for Windows? What happens if you try other Locales? What is your default Windows encoding as well?
I guess it could be a Windows-specific Qt bug? Perhaps they are confusing Unicode smart quotes with Windows-specific encoding smart quotes? Something funny is going on. |
Quote:
Windows 10, 64-bit. Sigil 0.9.8 Qt 5.6.2 Never touched a dictionary file in my life. It's just whatever comes in default Sigil. (This is just the default Sigil install, right from the site, no funny business.) Quote:
Locale: English (United States) Where can I check the "default Windows encoding"? I'm assuming whatever the defaults are in a Windows 7/10 install. Never messed with those settings. |
My test on Windows Vista mirrors Doitsu's results: clicking ignore on the one with the smart-quote had no effect ... clicking ignore on the one with the straight-quote removed the red misspelled line for all four occurrences (smart and dumb).
|
It would be interesting to see the QChar values of the smart right single quoted word when it reaches the spellcheck code on Windows. This must be either a Qt specific bug in Windows or an encoding issue at some point as it works on both Linux and Mac.
I will eye-ball the code to see if I can find a suspect. |
I am betting the problem is here:
Code:
QString Utility::getSpellingSafeText(const QString &raw_text)
u2019 in utf-8 is a 3-byte sequence: 0xE2 0x80 0x99, so the fromUtf8 routine should be passed that byte sequence, or we load a QChar with u2019 and then use toUtf8 to generate the input, or better yet use the QChar directly. |
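The byte sequence quoted above can be checked by hand. Below is a minimal sketch of a UTF-8 encoder written against the standard library only (no Qt); `encode_utf8` is a hypothetical helper for illustration, and it assumes the input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes a valid scalar value (no surrogate or range checks).
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

U+2019 falls into the three-byte branch and comes out as 0xE2 0x80 0x99, while the straight apostrophe U+0027 stays a single byte.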
Let me know if there's anything you need me to try compiling and/or testing on Windows.
|
So a better way to write this might be:
return text.replace(QChar(0x2019),QChar(0x27));
DiapDealer, when you get a free moment, would you try that change in Misc/Utility.cpp in getSpellingSafeText and see if it makes any difference? Thanks |
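The same idea can be tried without a Qt build. This is only a rough standard-library analogue of the replace call above, not the code in Utility.cpp: it works on a `std::u32string` (one element per code point) and the name `to_spelling_safe` is invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-in for the Qt one-liner: swap every right single
// quotation mark (U+2019) for a straight apostrophe (U+0027) by comparing
// code points directly, so no byte-level encoding is involved.
std::u32string to_spelling_safe(std::u32string text)
{
    std::replace(text.begin(), text.end(), U'\u2019', U'\'');
    return text;
}
```

Because the comparison is done on code points rather than on encoded bytes, the result does not depend on how the source file or the platform encodes the smart quote, which is the point of using the QChar directly.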
Do you want me to push that change? It may not help, but certainly should not hurt.
|
Quote:
It also fixes the similar problem of adding words with smart-apostrophes to a user word-list (only adding a straight apos char would work previously). |
Glad to hear it! I will push it later this evening once I am back at my developer box.
|
Just pushed that fix to master.
|
Also, I have just pushed support for spellchecking words with numbers, controlled by a Sigil preference setting. That small change actually forced changes in many files and a UI dialog.
Please note, if your particular dictionary does not have any words with digits in them in their wordlist, this feature will not be of much help. This feature should appear in the next release unless I messed something up. |
Quote:
The only thing in the above mentioned situations that isn't covered (that I've noticed) is: Quote:
|
Words that have an internal normal dash (hyphen) should be spell checked properly given how the code handles them. If not, something is funny.
|
Quote:
|
The individual letters A, B, etc. and the numbers after the hyphen are all valid standalone words, so they are legal when hyphenated. That said, Gbh-17 should show up as wrong since Gbh is not a valid word. This also depends on the wordchar list provided in the en_US.aff file (or whatever dictionary aff file you are using).
|
Quote:
Quote:
This wasn't necessarily about showing up as misspelled, it was about showing up in the list at all. For example: Code:
The Letter B, B-17 Bomber, and Room B9.
When in reality, there is only 1 "B" + 1 "B-17" + 1 "B9". This becomes a serious issue when it happens to something common, like "A", or the Index/Footnote Example, where there can be hundreds of "A" + "n" + "ff" + "f" within the EPUB. It becomes impossible to use the Spellcheck List to locate/find and correct these. Or in the case of "l92l". That shows up as 2 "l". Good luck searching through every lowercase 'l' in the book trying to find it! |
Quote:
Quote:
|
I believe it's now working in the way that those to whom this request is important are wishing it would work. So I'm going to shut up, now. :D
|
Quote:
|
Quote:
|
MobileRead.com is a privately owned, operated and funded community.