MobileRead Forums

MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Sigil (https://www.mobileread.com/forums/forumdisplay.php?f=203)
-   -   Suggestion: Spellcheck Enhancement (Numbers) (https://www.mobileread.com/forums/showthread.php?t=292086)

Tex2002ans 11-12-2017 09:38 PM

Suggestion: Spellcheck Enhancement (Numbers)
 
Currently, Sigil does not consider Numbers as "words".

This means that Spellcheck can't catch entire classes of errors, because many "words" don't display in the Spellcheck List (Tools > Spellcheck > Spellcheck).

I've attached an Example EPUB to show the issue.

The Problem

Each example below shows the "Current Spellcheck List" vs. the "Proposed Spellcheck List" in a Spoiler.

Example 1: Centuries or Years:

Code:

In the 21st century, [...]
In the 1800’s, there was [...]

Spoiler:
Code:

Current    Proposed
In         In
the        the
st         21st
century    century
s          1800’s
there      there
was        was



Example 2: Pounds/Shillings/Pence/Money

Code:

The device cost £14 8s 2d.
Spoiler:
Code:

Current    Proposed
The        The
device     device
cost       cost
           14
s          8s
d          2d



Example 3: Hyphenated Years or Age:

Code:

In the 10-year period between [...]
The 10-year-old girl [...]

Spoiler:
Code:

Current    Proposed
In         In
the        the
year       10-year
period     period
between    between
The        The
year-old   10-year-old



Example 4: Weights/Measures

Code:

It weighs 100.5lbs.
The length is 100.5km and 2ft.

Spoiler:
Code:

Current    Proposed
It         It
weighs     weighs
lbs        100.5lbs
The        The
length     length
is         is
km         100.5km
and        and
ft         2ft



Example 5: Indexes/Footnotes

Code:

Dogs, 123n., 125, 130n.
See p. 123ff.

Spoiler:
Code:

Current    Proposed
Dogs       Dogs
n          123n
           125
n          130n
See        See
p          p
ff         123ff



Example 6: A very common typo (especially because of OCR):

Code:

In the 196os, the president was [...]
In l941, the samples were [...]
Good argument, h0wever, you are [...]

Spoiler:
Code:

Current    Proposed
os         196os
l          l941
h0wever    h0wever



In Action

Calibre already includes numbers in its Spellcheck:

Attachment 159978

and it is extremely helpful.

Proposal

There is one downside to the Calibre method, though: the Spellcheck List gets flooded with numbers, especially when dealing with HTML tables full of data (or with Indexes):

Attachment 159979

To get around that issue:
  • Ignore all "words" made completely of numbers + punctuation
    • Although I could see a usage for still keeping this (catching typos)
  • Include the "Numbers as Words" as a checkbox/toggle.
    • Similar to "Show All Words"
    • Probably default to OFF.
    • This could allow the user to choose whether they want to display those full-numbers or not.
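
The toggle proposed above could be sketched roughly as follows. This is only an illustration in Python, not Sigil code: `spellcheck_list`, `show_pure_numbers` and the `PURE_NUMBER` pattern are made-up names, and the set of punctuation treated as "numeric" is an assumption.

```python
import re

# Hypothetical pattern: a "word" made only of digits plus common numeric
# punctuation, with at least one digit, e.g. "125", "1,000.5" or "7-69".
PURE_NUMBER = re.compile(r"[\d.,:/-]*\d[\d.,:/-]*")


def spellcheck_list(words, show_pure_numbers=False):
    """Return the words to display, hiding pure numbers unless asked."""
    if show_pure_numbers:
        return list(words)
    return [w for w in words if not PURE_NUMBER.fullmatch(w)]


index_words = ["Dogs", "123n", "125", "130n", "100.5km", "7-69"]
print(spellcheck_list(index_words))
print(spellcheck_list(index_words, show_pure_numbers=True))
```

With the toggle off, mixed entries such as "123n" and "100.5km" stay in the list while "125" and "7-69" are hidden; with it on, everything is shown.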

Doitsu 11-13-2017 02:53 AM

I also think that a Numbers as Words checkbox would be a good idea, in particular for OCRed text.

I looked into this some time ago and found out from KevinH that a toggle would need to change line #143 in /src/Misc/HTMLSpellCheck.cpp from:

Code:

    if (c.isLetter()) {
to

Code:

    if (c.isLetterOrNumber()) {
I.e., in terms of programming this shouldn't be too complicated to implement.
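
To make the effect of that one-line change concrete, here is a rough Python sketch. It is not the code from HTMLSpellCheck.cpp: `split_words` and `include_digits` are hypothetical names, `str.isalpha` stands in for `isLetter()`, `str.isalnum` stands in for `isLetterOrNumber()`, and apostrophe and hyphen handling is left out entirely.

```python
def split_words(text, include_digits=False):
    """Split text into candidate words for spell checking.

    include_digits=False mimics a letters-only test;
    include_digits=True mimics a letters-or-numbers test.
    """
    is_word_char = str.isalnum if include_digits else str.isalpha
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            # A non-word character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("In the 21st century"))
print(split_words("In the 21st century", include_digits=True))
```

With the letters-only test "21st" collapses to "st"; with digits allowed it survives as one word, which is what makes OCR errors such as "l941" visible.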

@KevinH, @DiapDealer:

Since this line appears to define what a "letter" is in terms of spell-checking, it should be possible to add curly right apostrophes (’ U+2019) to the list of "letters" via an additional if clause. This would fix another frequently reported spell check problem.
To keep it simple, the algorithm would only have to accept curly right apostrophes in the middle of words.


EDIT: I must have misremembered this.

doubleshuffle 11-13-2017 03:00 AM

I would also appreciate such a feature. I'm working with a scan right now where the OCR has produced tons of "l" vs. "1" mix-ups. They are very hard to detect with the spellcheck as it is now.

Tex2002ans 11-13-2017 03:39 AM

Quote:

Originally Posted by doubleshuffle (Post 3610799)
I would also appreciate such a feature. I'm working with a scan right now where the OCR has produced tons of "l" vs. "1" mix-ups. They are very hard to detect with the spellcheck as it is now.

Currently, as a workaround, I use Calibre's Spellcheck (like pictured above).

Then I usually just type the numbers 0-9 in the search box one-by-one, and do a quick scan through the list to see if anything strange pops out.

I also have a few regexes that I use to try to catch these:

Search: [lo]\d
Search: \d[lo]

That tries to catch things like "19l0" or "l910" or "It was 8.o5cm long".
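
For anyone who wants to try these two searches outside of Sigil or Calibre, a small Python sketch follows. Combining them into one alternation and the name `SUSPECT` are my own choices, and the last two sample strings are made up to show a clean case and a likely false positive.

```python
import re

# The two searches from the post, combined into one alternation:
# an "l" or "o" directly before a digit, or directly after one.
SUSPECT = re.compile(r"[lo]\d|\d[lo]")

samples = ["19l0", "l910", "It was 8.o5cm long", "In 1941", "100.5lbs"]
hits = {s: SUSPECT.findall(s) for s in samples}

for sample, found in hits.items():
    print(sample, "->", found)
```

Note that "100.5lbs" is also reported (the "5l"), so the results still need a quick manual scan.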

I also try to use this:

Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. [lo]
Search: (January|February|March|April|August|September|October|November|December) [lo]

to try to catch the odd dates: "Jan. l5, 2017" or "March l, 1910" or "August l982".
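
A quick Python check of the full-month search is sketched below. The `tightened` variant is not from the post; it is a hypothetical refinement that refuses a match when the "l" or "o" is followed by another letter, which avoids hits on ordinary phrases. The sample strings are made up.

```python
import re

MONTHS = "January|February|March|April|August|September|October|November|December"

# As posted: month name, a space, then a lowercase "l" or "o".
posted = re.compile(rf"({MONTHS}) [lo]")

# Hypothetical tightening: the "l"/"o" must not be followed by a letter.
tightened = re.compile(rf"({MONTHS}) [lo](?![A-Za-z])")

for sample in ["March l, 1910", "August l982", "the March of time"]:
    print(sample, bool(posted.search(sample)), bool(tightened.search(sample)))
```

The posted search also matches "the March of time"; the tightened one only matches the two date-like strings.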

doubleshuffle 11-13-2017 03:59 AM

Thanks. I'll try that.

KevinH 11-13-2017 09:29 AM

Quote:

Originally Posted by Doitsu (Post 3610797)
Since this line appears to define what a "letter" is in terms of spell-checking, it should be possible to add curly right apostrophes (’ U+2019) to the list of "letters" via an additional if clause. This would fix another frequently reported spell check problem.
To keep it simple, the algorithm would only have to accept curly right apostrophes in the middle of words.

As explained in that thread....

That is not a spellcheck bug, but a bug in the design of the German Hunspell dictionaries used.

Doitsu 11-13-2017 11:18 AM

Quote:

Originally Posted by KevinH (Post 3611023)
As explained in that thread....

That is not a spellcheck bug, but a bug in the design of the German Hunspell dictionaries used.

Sorry. I must've misremembered this. I thought this problem also occurred with English spellcheck dictionaries. :o

KevinH 11-13-2017 11:28 AM

Understood.
For the record, spellchecking works with utf-8 encoded dictionaries that actually have words with apostrophes (contractions and the like) in that dictionary wordlist.

The problem with the German dictionary in question is twofold:

1. It was encoded in ISO-8859-1, not UTF-8, and the smart single quote/apostrophe does not exist in that single-byte encoding.
2. The dictionary wordlist itself did not include any words with apostrophes (single quotes) at all.

These are problems the dictionary owner should fix.
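
The first point is easy to confirm with a few lines of Python. This only demonstrates the encoding limitation itself; `can_encode` is a throwaway helper name and nothing here touches Hunspell or the dictionary files.

```python
APOSTROPHE = "\u2019"  # right single quotation mark (curly apostrophe)


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


word = "TJ" + APOSTROPHE + "s"
print(can_encode(word, "iso-8859-1"))  # False
print(can_encode(word, "utf-8"))       # True
print(can_encode("TJ's", "iso-8859-1"))  # True, the straight apostrophe is plain ASCII
```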

That said, I will see about adding support and a preference setting for numbers. But please note: not all dictionaries include mixed letter-and-number words in their wordlists, so perfectly valid things will then get marked as bad.

Grepping for all digits and examining them is probably a safer approach.
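
One possible way to do that kind of grep outside the editor is sketched here in Python. The function name `digit_tokens` and the sample text are made up, and `\w*\d\w*` is just one reasonable reading of "all digits"; it splits on periods, so "100.5km" would come back as "100" and "5km".

```python
import re
from collections import Counter


def digit_tokens(text):
    """Count every run of word characters that contains at least one digit."""
    return Counter(re.findall(r"\w*\d\w*", text))


sample = "In the 196os, the president was elected. In l941, h0wever, it was 1941."
for token, count in sorted(digit_tokens(sample).items()):
    print(count, token)
```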

Tex2002ans 11-13-2017 10:09 PM

Quote:

Originally Posted by Doitsu (Post 3610797)
Since this line appears to define what a "letter" is in terms of spell-checking, it should be possible to add curly right apostrophes (’ U+2019) to the list of "letters" via an additional if clause. This would fix another frequently reported spell check problem.
To keep it simple, the algorithm would only have to accept curly right apostrophes in the middle of words.

Actually, that reminds me of another Spellchecking bug that has been bothering me for a while.

I attached a Sample EPUB to show the problem.

Sample Code:

Quote:

<p>The draft which TJ submitted to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ’s comments on them.</p>

<p>The draft which TJ submitted to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ's comments on them.</p>
In Code View, you can see that both TJ’s (smart quote) + TJ's (dumb quote) are marked as misspelled words.

If you Right Click on the first one + Ignore... nothing happens.

If you Right Click on the second one + Ignore... both lose their red squigglies.

This is pretty frustrating. I have a Keyboard Shortcut set to the "Ignore" function, and it becomes painstaking to work through larger books, because the "smart quote" words never get properly Ignored.

This is an issue on Windows (not too sure about other OSes).

Quote:

Originally Posted by KevinH (Post 3611088)
That said, I will see about adding support and a preference setting for numbers.

:thumbsup:

Tex2002ans 11-14-2017 01:43 AM

I also thought of a few extra samples where numbers are "words".

Names of companies:

Code:

23andme tests your DNA.
Spoiler:
Code:

Current    Proposed
andme      23andme
tests      tests
your       your
DNA        DNA



Misc.:

Code:

Write this on A4 paper.
You are in Room B9.
This is a B-17 Bomber.

Spoiler:
Code:

Current    Proposed
A          A4
B          B9
B          B-17



Having these show up in the Spellcheck List makes it very easy to jump to their location in the EPUB with a double-click.

Currently, the "A" in "A4" would be impossible to spot easily (there could be thousands of A in the book). And B9 + B-17 wouldn't stand apart.

Quote:

Originally Posted by doubleshuffle (Post 3610815)
Thanks. I'll try that.

Oh yeah, and another common OCR error you might run across:

Search: 8c
Replace: &
(or in the case of HTML) Replace: &amp;

8c is also something that sticks out easily in Calibre's Spellcheck List. :)
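
If this replacement is scripted rather than done by hand, something like the Python sketch below could work. The word boundaries (`\b`) are my own assumption to avoid touching things like "18c" or "8cm"; the post itself searches for plain "8c". `fix_ampersand` is a made-up name.

```python
import re


def fix_ampersand(html):
    """Replace a standalone OCR "8c" with an HTML-escaped ampersand."""
    return re.sub(r"\b8c\b", "&amp;", html)


print(fix_ampersand("Smith 8c Sons"))
print(fix_ampersand("an 18c print, 8cm wide"))
```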

KevinH 11-14-2017 10:17 AM

I just tried this on Mac OS X with stock Sigil (most recent) and all of the TJ's and smart versions are marked as incorrectly spelled in CodeView on first start-up.

I then modified your test case to include a duplicate of each of your 2 lines so that I could see if later versions are properly ignored or not (see below).

Code:

<body>
  <p>The draft which TJ submitted to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ’s comments on them.</p>

  <p>The draft which TJ submitted to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ's comments on them.</p>

  <p>The draft which TJ submitted v2 to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ’s comments on them.</p>

  <p>The draft which TJ submitted v2 to Hamilton is in Ford, VI, 7-69, with Hamilton’s notes and TJ's comments on them.</p>
</body>

I then right clicked on the red squiggly line of the first smart version and selected "Ignore".

After that, all later occurrences (both smart and dumb) were correctly no longer marked as wrong.

So I can not recreate your issue at all.

So what exact version of Sigil are you using? What version of Qt is it using? Have you installed your own dictionaries and if so which?



Doitsu 11-14-2017 11:09 AM

@KevinH:

This is most likely a Windows issue. On my Linux machine, right-clicking TJ’s in the first line and selecting Ignore removed all red squiggly lines.
(Right-clicking TJ's in the second line and selecting Ignore had the same effect.)

However, when I repeated the test on my Windows machine, right-clicking TJ’s in the first line and selecting Ignore didn't remove the red squiggly lines from any line.
When I right-clicked TJ's in the second/fourth line and selected Ignore, all red squiggly lines were gone. (I used Sigil 0.9.8 and the stock en_US and en_GB dictionaries for all tests.)

KevinH 11-14-2017 11:23 AM

Since it is not Sigil versions specific and not Qt specific, I wonder if this is Locale dependent? What Locale is set for Windows? What happens if you try other Locales? What is your default Windows encoding as well?

I guess it could be a Windows Qt specific bug? Perhaps they are confusing unicode smart quotes with Windows specific encoding smart quotes? Something funny is going on.

Tex2002ans 11-14-2017 01:45 PM

Quote:

Originally Posted by KevinH (Post 3611611)
So I can not recreate your issue at all.

So what exact version of Sigil are you using? What version of Qt is it using? Have you installed your own dictionaries and if so which?

(This also affected my Windows 7 machine for a looong time. This has been in Sigil for a while now.)

Windows 10, 64-bit.
Sigil 0.9.8
Qt 5.6.2

Never touched a dictionary file in my life. It's just whatever comes in default Sigil.

(This is just the default Sigil install, right from the site, no funny business.)

Quote:

Originally Posted by KevinH (Post 3611632)
Since it is not Sigil versions specific and not Qt specific, I wonder if this is Locale dependent? What Locale is set for Windows? What happens if you try other Locales? What is your default Windows encoding as well?

America! :P

Locale: English (United States)

Where can I check the "default Windows encoding"?

I'm assuming whatever the defaults are in a Windows 7/10 install. Never messed with those settings.

DiapDealer 11-14-2017 02:57 PM

My test on Windows Vista mirrors Doitsu's results: clicking ignore on the one with the smart-quote had no effect ... clicking ignore on the one with the straight-quote removed the red misspelled line for all four occurrences (smart and dumb).

KevinH 11-14-2017 03:15 PM

It would be interesting to see the QChar values of the smart right single quoted word when it reaches the spellcheck code on Windows. This must be either a Qt specific bug in Windows or an encoding issue at some point as it works on both Linux and Mac.

I will eye-ball the code to see if I can find a suspect.

KevinH 11-14-2017 03:33 PM

I am betting the problem is here:
Code:

QString Utility::getSpellingSafeText(const QString &raw_text)
{
    // There is currently a problem with Hunspell if we attempt to pass
    // words with smart apostrophes from the CodeView encoding.
    // There are likely better ways to solve this, but this one does
    // get the job done until someone can implement something better.
    QString text(raw_text);
    return text.replace(QString::fromUtf8("\u2019"), "'");
}

Windows source files are probably read with an encoding other than UTF-8, so the Unicode constant \u2019 is not being properly converted to a UTF-8 string in this function.

U+2019 in UTF-8 is a 3-byte sequence: 0xE2 0x80 0x99. So either the fromUtf8 routine should be passed that byte sequence, or we load a QChar with 0x2019 and then use toUtf8 to generate the input, or, better yet, use the QChar directly.
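
The byte values are easy to double-check in Python. The second half of the sketch is only an illustration of the suspected failure mode: it assumes, without confirmation from this thread, that the Windows build ends up with the literal stored in a single-byte Windows code page such as cp1252.

```python
SMART_APOSTROPHE = "\u2019"

# UTF-8 encoding of U+2019: three bytes.
utf8_bytes = SMART_APOSTROPHE.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 80 99

# Assumption for illustration: the same character in Windows-1252
# is the single byte 0x92.
cp1252_bytes = SMART_APOSTROPHE.encode("cp1252")
print(cp1252_bytes.hex())  # 92

# Treating that lone byte as UTF-8 does not give back U+2019, so a
# search-and-replace for the apostrophe would never find anything.
decoded = cp1252_bytes.decode("utf-8", errors="replace")
print(decoded == SMART_APOSTROPHE)  # False
```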

DiapDealer 11-14-2017 03:43 PM

Let me know if there's anything you need me to try compiling and/or testing on Windows.

KevinH 11-14-2017 03:46 PM

So a better way to write this might be:

return text.replace(QChar(0x2019),QChar(0x27));

DiapDealer, when you get a free moment, would you try that change in Misc/Utility.cpp in getSpellingSafeText and see if it makes any difference?

Thanks
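
The idea behind that replacement, normalizing the smart apostrophe to a straight one before the word reaches the spellchecker or the ignore list, can be sketched in Python like this. The names `spelling_safe`, `ignore` and `is_ignored` are hypothetical and the ignore list is just a set; Sigil's real bookkeeping is not shown in this thread.

```python
def spelling_safe(word):
    """Replace the curly apostrophe (U+2019) with a straight one (U+0027)."""
    return word.replace("\u2019", "'")


ignored = set()


def ignore(word):
    # Store the normalized form so both spellings share one entry.
    ignored.add(spelling_safe(word))


def is_ignored(word):
    return spelling_safe(word) in ignored


ignore("TJ\u2019s")            # ignore the smart-quote spelling
print(is_ignored("TJ's"))       # True
print(is_ignored("TJ\u2019s"))  # True
```

As long as both the "ignore" path and the lookup path go through the same normalization, it no longer matters which spelling was clicked.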

KevinH 11-14-2017 04:11 PM

Do you want me to push that change? It may not help, but certainly should not hurt.

DiapDealer 11-14-2017 04:49 PM

Quote:

Originally Posted by KevinH (Post 3611815)
Do you want me to push that change? It may not help, but certainly should not hurt.

Yes, please do! It certainly seems to do the trick in my testing so far.

It also fixes the similar problem of adding words with smart-apostrophes to a user word-list (only adding a straight apos char would work previously).

KevinH 11-14-2017 06:09 PM

Glad to hear it! I will push it later this evening once I am back at my developer box.

KevinH 11-14-2017 07:32 PM

Just pushed that fix to master.

KevinH 11-15-2017 01:50 PM

Also, I have just pushed support for spellchecking words with numbers as controlled by a Sigil preference setting. That small change actually forced changes in many files and a ui dialog.

Please note, if your particular dictionary does not have any words with digits in them in their wordlist, this feature will not be of much help.

This feature should appear in the next release unless I messed something up.

DiapDealer 11-15-2017 06:58 PM

Quote:

Originally Posted by KevinH (Post 3612255)
Also, I have just pushed support for spellchecking words with numbers as controlled by a Sigil preference setting. That small change actually forced changes in many files and a ui dialog.

Please note, if your particular dictionary does not have any words with digits in them in their wordlist, this feature will not be of much help.

This feature should appear in the next release unless I messed something up.

Seems to work as intended so far. :thumbsup:

The only thing in the above mentioned situations that isn't covered (that I've noticed) is:

Quote:

This is a B-17 Bomber.
No hyphenated words show up as misspelled that I can see. Whether they contain numbers or not isn't really relevant.

KevinH 11-15-2017 07:33 PM

Words that have an internal normal dash (hyphen) should be spell checked properly given how the code handles them. If not, something is funny.

DiapDealer 11-15-2017 08:16 PM

Quote:

Originally Posted by KevinH (Post 3612373)
Words that have an internal normal dash (hyphen) should be spell checked properly given how the code handles them. If not, something is funny.

My bad. You're right. Questionable words on either side of the hyphen will mark the hyphenated word as misspelled. I was just tripped up by the fact that B-17 doesn't show up as a misspelling. Neither does A-14 F-70 Z-29 or D-11, regardless of the new number preference setting. Shouldn't things like that be flagged as potential misspellings?

KevinH 11-15-2017 10:11 PM

The individual letters A, B, etc. and the numbers after the hyphen are all valid standalone words, so they are legal when hyphenated. That said, Gbh-17 should show up as wrong since Gbh is not a valid word. This also depends on the wordchar list provided in the en_US.aff file (or whatever dictionary aff file you are using).
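
The rule described above (a hyphenated word passes if every part passes on its own) might look like this as a toy Python sketch. It is not how Hunspell or Sigil actually implement it: `TOY_WORDS` is a made-up mini dictionary, and treating any all-digit part as valid is an assumption based on the explanation above.

```python
TOY_WORDS = {"a", "b", "bomber", "year", "old"}  # hypothetical mini dictionary


def part_ok(part):
    """A part is fine if it is all digits or a known word."""
    return part.isdigit() or part.lower() in TOY_WORDS


def hyphenated_ok(word):
    """A hyphenated word is fine only if every part is fine by itself."""
    return all(part_ok(part) for part in word.split("-"))


for word in ["B-17", "10-year-old", "Gbh-17"]:
    print(word, hyphenated_ok(word))
```

Here "B-17" and "10-year-old" pass, while "Gbh-17" fails because "Gbh" is not in the toy wordlist.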

Tex2002ans 11-16-2017 01:58 AM

Quote:

Originally Posted by DiapDealer (Post 3612363)
Seems to work as intended so far. :thumbsup:

Fantastic. Can't wait for the next version.

Quote:

Originally Posted by DiapDealer (Post 3612363)
No hyphenated words show up as misspelled that I can see. Whether they contain numbers or not isn't really relevant.

Edit: Whoops, read Diap's post wrong. Ignore what I posted below. :rofl:

This wasn't necessarily about showing up as misspelled, it was about showing up in the list at all.

For example:

Code:

The Letter B, B-17 Bomber, and Room B9.
Would show up in the Spellcheck List as 3 "B".

When in reality, there is only 1 "B" + 1 "B-17" + 1 "B9".

This becomes a serious issue when it happens to something common, like "A", or the Index/Footnote Example, where there can be hundreds of "A" + "n" + "ff" + "f" within the EPUB. It becomes impossible to use the Spellcheck List to locate/find and correct these.

Or in the case of "l92l". That shows up as 2 "l". Good luck searching through every lowercase 'l' in the book trying to find it!
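
The counts in the "B" example can be reproduced with two simplified tokenizers in Python. Both regexes are stand-ins I made up for illustration (ASCII only, no apostrophes); they are not the rules Sigil or Calibre actually use.

```python
import re
from collections import Counter

# Letters only: digits and hyphens split words apart.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")
# Letters and digits, with internal hyphens kept inside the word.
WITH_DIGITS = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

text = "The Letter B, B-17 Bomber, and Room B9."
current = Counter(LETTERS_ONLY.findall(text))
proposed = Counter(WITH_DIGITS.findall(text))

print(current["B"])                                     # 3
print(proposed["B"], proposed["B-17"], proposed["B9"])  # 1 1 1
```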

Doitsu 11-16-2017 11:34 AM

Quote:

Originally Posted by KevinH (Post 3612255)
Also, I have just pushed support for spellchecking words with numbers as controlled by a Sigil preference setting. That small change actually forced changes in many files and a ui dialog.

Thanks!

Quote:

Originally Posted by Tex2002ans (Post 3612504)
Or in the case of "l92l". That shows up as 2 "l". Good luck searching through every lowercase 'l' in the book trying to find it!

In the latest pre-release version, "l92l" will be marked as misspelled if the new Check Numbers option is enabled. This should make it easier to find numbers with letters in them and vice versa, because all words that contain both numbers and letters will then be flagged as misspelled.
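
Outside of Sigil, the same class of words can be picked out with a very small Python check. This is only a sketch of the idea "contains both letters and digits"; the function name is made up, and it says nothing about which of those words a given dictionary would accept.

```python
def mixes_letters_and_digits(word):
    """True if the word contains at least one digit and at least one letter."""
    has_digit = any(ch.isdigit() for ch in word)
    has_letter = any(ch.isalpha() for ch in word)
    return has_digit and has_letter


for word in ["l92l", "1921", "h0wever", "century", "21st"]:
    print(word, mixes_letters_and_digits(word))
```

Legitimate words such as "21st" are caught as well, which is the trade-off mentioned earlier in the thread for dictionaries that lack such entries.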

DiapDealer 11-16-2017 11:45 AM

I believe it's now working in the way that those to whom his request is important are wishing it would work. So I'm going to shut up, now. :D

doubleshuffle 11-16-2017 12:32 PM

Quote:

Originally Posted by DiapDealer (Post 3612733)
I believe it's now working in the way that those to whom his request is important are wishing it would work.

Brilliant. Thanks!

Tex2002ans 11-16-2017 02:03 PM

Quote:

Originally Posted by DiapDealer (Post 3612733)
I believe it's now working in the way that those to whom his request is important are wishing it would work. So I'm going to shut up, now. :D

We shall see if the way it was implemented is to the satisfaction of the person who initially reported the issue! He will thank you profusely if it works as imagined! I will shut up, now. :D


All times are GMT -4.