Spellcheck filter - upper/lower case

Binchen · 08-15-2021, 07:21 AM

The filter in the spellcheck does not distinguish between upper and lower case: searching for U shows all words with u as well.

Bug or feature?

Binchen

KevinH · 08-15-2021, 07:38 AM

Not a bug. It works as designed. The code, just like other filters used in Sigil, is purposely made case-insensitive by lowercasing both the filter text and each word before comparing them.

Code:

void SpellcheckEditor::FilterEditTextChangedSlot(const QString &text)
{
    const QString lowercaseText = text.toLower();
    QModelIndex root_index = m_SpellcheckEditorModel->indexFromItem(m_SpellcheckEditorModel->invisibleRootItem());

    for (int row = 0; row < m_SpellcheckEditorModel->invisibleRootItem()->rowCount(); row++) {
        QStandardItem *item = m_SpellcheckEditorModel->item(row, 0);
        bool hidden = !(text.isEmpty() || item->text().toLower().contains(lowercaseText));
        ui.SpellcheckEditorTree->setRowHidden(item->row(), root_index, hidden);
    }
}
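For readers who don't read C++/Qt, here is a rough Python sketch of the same filtering logic. The `case_sensitive` flag is hypothetical: it is not part of Sigil and only illustrates what an optional case-sensitive mode could look like.

```python
def visible_rows(words, filter_text, case_sensitive=False):
    """Return the words that stay visible for a given filter string.

    Mirrors the slot above: an empty filter shows everything, otherwise
    a row is shown only if it contains the filter text.
    `case_sensitive` is a hypothetical option, not an existing Sigil setting.
    """
    if not filter_text:
        return list(words)
    if case_sensitive:
        return [w for w in words if filter_text in w]
    needle = filter_text.lower()
    return [w for w in words if needle in w.lower()]


words = ["Uhr", "und", "Haus", "URL"]
print(visible_rows(words, "U"))                       # all four words
print(visible_rows(words, "U", case_sensitive=True))  # ['Uhr', 'URL']
```

With the default (case-insensitive) behaviour, searching for U also shows "und" and "Haus", which is exactly what the first post describes.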

Turtle91 · 08-15-2021, 08:44 AM

In the find/replace section there is an option for Case Sensitive searches. Is it possible to have a checkbox to turn that on or off in the spellcheck function?

I don't know of many words in English that would be mis-spelled with a capital that wouldn't also be mis-spelled as lower case, but I am certainly not fluent in all the languages available to Sigil. Also, as a work-around, the user could certainly use the find/replace function to find words with an aberrant caPital.

KevinH · 08-15-2021, 09:34 AM

Actually all spellchecking itself can be case sensitive since many languages like German capitalize nouns (not just proper nouns) and they are incorrect if not properly capitalized. The dictionary used determines if capitalization matters, not Sigil.

Turtle91 · 08-15-2021, 10:03 AM

Here's a basic regex that can check for capitals inside a word, although the DOCTYPE statement gives a bunch of matches:

find: (?<=\w)[A-Z]
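A quick way to see why DOCTYPE produces so many hits is to try the pattern in Python's `re` module. The second pattern below is only a suggestion, not something from the thread: it requires a lowercase letter before the capital, so all-caps words are skipped (it will still flag legitimate words like "iPhone" or "McDonald").

```python
import re

# Pattern from the post: any capital preceded by a word character.
ANY_INNER_CAPITAL = re.compile(r"(?<=\w)[A-Z]")

# Suggested narrower variant (an assumption, not from the thread):
# a capital that directly follows a lowercase letter.
CAPITAL_AFTER_LOWER = re.compile(r"(?<=[a-z])[A-Z]")

sample = "<!DOCTYPE html> The caPital was in the wrong place."

print(ANY_INNER_CAPITAL.findall(sample))    # ['O', 'C', 'T', 'Y', 'P', 'E', 'P']
print(CAPITAL_AFTER_LOWER.findall(sample))  # ['P']
```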

caseym54 · 10-23-2021, 04:07 PM

Is there a way to use regex inside the spellcheck search?

Example: Often misspelled words that erroneously end in "l" are supposed to have exclamation points. Using the normal search has too many false positives, but after filtering for spelling, it would speed up things. Full regex find/replace would be nice within the spellcheck, too.

Is there any plugin that does this?

Tex2002ans · 10-23-2021, 05:23 PM

Quote:

Originally Posted by caseym54

Is there a way to use regex inside the spellcheck search?

Example: Often misspelled words that erroneously end in "l" are supposed to have exclamation points.

The way that I currently handle this is in multiple passes.

In Tools > Spellcheck > Spellcheck (Alt+Q):

1. I type in lowercase 'l' (or whatever letter I'm looking for).

Then I toggle the "Show All Words" checkbox.

Very likely those "l instead of !" words will appear in the "misspelled words" spellcheck list.

- - -

Side Note: In Calibre's Spellcheck List, there's also a "Case sensitive search" checkbox.

Extremely helpful in this case, because you don't want capital 'L' words clogging up your list.

- - -

2. After I correct, I toggle the checkbox again (to show all the "correctly spelled words"), then scroll through and see if I can spot any oddities.

3. Then one more pass at the "misspelled" list.

Note: I do similar passes for 'l' -> '1' or 'o' -> '0' OCR errors like:

Code:

l98l
198o
196os
h0wever

That's one of the reasons why I requested Spellcheck Lists to support numbers back in 2017.

Calibre has always supported numbers. In Sigil, you need to enable it in Edit > Preferences > Spellcheck Dictionaries > "Check Numbers" (tiny checkbox in the very upper right corner).
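Outside the Spellcheck List, the same kind of OCR error can be hunted with a short script. This is a minimal sketch, assuming the book text is already available as a plain string; the pattern and function name are illustrative only. It flags any token that mixes digits and letters, so legitimate tokens such as "1960s" or "3rd" will show up too and need a human eye.

```python
import re

# A "word" that contains at least one digit AND at least one letter.
MIXED_TOKEN = re.compile(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b")


def suspicious_tokens(text):
    """Return tokens that mix digits and letters (possible l/1 or o/0 swaps)."""
    return MIXED_TOKEN.findall(text)


sample = "In l98l and 198o, h0wever, the 196os were over. It was 1984."
print(suspicious_tokens(sample))  # ['l98l', '198o', 'h0wever', '196os']
```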

Side Note #2: You can also use a similar trick to catch accidental/inconsistent hyphens.

I wrote about it in 2013!

Quote:

Originally Posted by caseym54

Using the normal search has too many false positives, but after filtering for spelling, it would speed up things. Full regex find/replace would be nice within the spellcheck, too.

Is there any plugin that does this?

No, but I did recently discuss/brainstorm an "Advanced Find/Replace" concept about a month ago in a random Calibre topic.

So you'd have a Spellcheck List-type menu with:

Find
Replace
Filter

and 3 sortable columns:

Found
Replace
# Hits

You'd be able to selectively apply Find/Replace ONLY on specific rows.

(Currently, you can only Find/Replace one-by-one OR Replace All... Similar to the slowness of spellchecking/grammarchecking documents one-by-one vs. mass checking in list form!)

The Technical Details

Here are the relevant posts from that thread:

Spoiler:

Quote:

Originally Posted by Tex2002ans

2. A tool like Bulk Rename Utility allows you to mass search/replace filenames:

Attachment 189406

You fill out your parameters below.

Then you select which files you want to apply it to (Ctrl+Click/Shift+Click).

It puts green highlight on the files that'll actually change, and shows you the before/after in 2 columns.

Quote:

Originally Posted by Tex2002ans

I also believe this would be helpful in the normal large Find/Replaces (with a handful of edge cases).

Like this thread. A giant Find/Replace to switch all "123" -> "spelled-out numbers" form.

100 replaces were fine:

Chapter 21 -> Chapter Twenty-One
I was 2 years old -> I was two years old

[...]

A Sortable/Searchable (List-Based?) Differ (Advanced Find/Replace?)

When the number of changes is overwhelming (in the hundreds/thousands).

Similar to the Spellcheck List, you'd be able to type in a:

- Find
- Replace

Run this on a book (like pressing "Count All") and generate a list:

- Find: Chapter \d+

You'd get a list of all hits:

Code:

Found       |  Replace   |  Hits
Chapter 1   |            |     1
Chapter 2   |            |     1
Chapter 3   |            |     1
Chapter 4   |            |     1
[...]
Chapter 100 |            |     1

You'd be able to double-click on any entry and jump to its location.

And, similar to the Spellcheck List, you can search/sort through this:

- Search: 1

Code:

Found       |  Replace   |  Hits
Chapter 1   |            |     1
Chapter 10  |            |     1
Chapter 11  |            |     1
Chapter 12  |            |     1
[...]
Chapter 100 |            |     1

- Search: 10

Code:

Found       |  Replace   |  Hits
Chapter 10  |            |     1
Chapter 100 |            |     1

You'd also be able to do a Replace:

- Find: Chapter (\d+)
- Replace: Chap. \1

Code:

Found       |  Replace   |  Hits
Chapter 1   |  Chap. 1   |     1
Chapter 2   |  Chap. 2   |     1
Chapter 3   |  Chap. 3   |     1
Chapter 4   |  Chap. 4   |     1
[...]
Chapter 100 |  Chap. 100 |     1

Here, I can also scroll through the list and accept/reject certain replaces.

Maybe, sorting by Hits, there would be a:

Code:

Chapter 5   |  Chap. 5   |     5

so you scratch your head, take a closer look, and maybe the book has a few:

See Chapter 5 for more information.

You may want to treat that differently than:

<h2>Chapter 5</h2>

so you'd apply the change to all 99 other replaces first, then you can dig in to that oddity in more detail.
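The "Found | Replace | Hits" list described above can be prototyped in a few lines of Python. This is only a sketch of the concept, not an existing Sigil or Calibre feature; `preview_replacements` and `filter_rows` are made-up names, and the text is assumed to be one plain string.

```python
import re
from collections import Counter


def preview_replacements(text, pattern, replacement=None):
    """Build (found, proposed replacement, hits) rows without changing the text."""
    regex = re.compile(pattern)
    hits = Counter()
    proposed = {}
    for match in regex.finditer(text):
        found = match.group(0)
        hits[found] += 1
        if replacement is not None:
            # expand() applies backreferences such as \1 to this match only.
            proposed[found] = match.expand(replacement)
    return [(found, proposed.get(found, ""), count) for found, count in hits.items()]


def filter_rows(rows, needle):
    """Keep only the rows whose 'Found' column contains the search text."""
    return [row for row in rows if needle in row[0]]


book = "<h2>Chapter 5</h2> See Chapter 5 for more information. <h2>Chapter 10</h2>"
rows = preview_replacements(book, r"Chapter (\d+)", r"Chap. \1")
for found, repl, count in rows:
    print(f"{found:<12}| {repl:<10}| {count:>5}")
```

Sorting `rows` by the hit count would surface the "Chapter 5" oddity (2 hits here) so it can be inspected before anything is replaced.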

And then over the following days, I discussed even better use-cases + concepts via PMs:

PM #1

You see that "Chapter (\d+)" example I gave?

Anyway, when I woke up, I thought of a few other cases where I'd find that type of workflow extremely useful.

One search/replace where you do thousands, but want to deny a few exceptions, is EN DASHES:

* * *

Current Method

What I typically do is this:

Search: (\d+)-(\d+)
Replace: \1–\2

but then I have to be very careful with URLs, ISBNs, etc.

So what I currently do is split it into separate, smaller steps.

Anything with a "pp." or "p." before it? Replace All.
Open up the Index? "Current File" -> Replace All.
Then I go through step-by-step and have to manually do the rest.
- (Or I "hack" the Spellcheck List with numbers, then search for a hyphen to see what I'm looking at. :P)
- If I catch any oddities at that step, I make sure to NOT Replace All, and may tackle chapters one-at-a-time with "Current File".

* * *

Sortable Advanced Find & Replace List

I could do something like:

Search: \b(p+)\.* (\d+)-(\d+)
Replace: pp. \2–\3

Running that, you'd get a giant sortable list of:

Code:

pp. 123-125
pp. 125-127
p 123-125
p. 125-127
pp 130-135
pp. 123-5
pp 125-7

so, at a glance, I can already see errors in the book (missing periods + some not in 3-digit form).

Then I'd be able to do multiple passes:

Filter: \.

Code:

pp. 123-125
pp. 125-127
p. 125-127 <--- Single "p." error
pp. 123-5 <--- Inconsistency

Great. Replace the first 3. Then double-click on the "pp. 123-5" and/or manually correct to "pp. 123–125".

Blank the Filter, and now I'm left with:

Code:

p 123-125
pp 130-135
pp 125-7 <--- Inconsistency

Again, the first 2 can be replaced, but the 3rd one needs the 3-digit form.
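The multi-pass review above could be approximated with a script that lists each page-range hit together with the inconsistencies it shows. A minimal sketch, reusing the search pattern from the post; the function name and the three notes are hypothetical, and nothing is replaced automatically.

```python
import re

# Same search as above: p / pp, optional period(s), a space, then N-N.
PAGE_RANGE = re.compile(r"\b(p+)\.* (\d+)-(\d+)")


def review_page_ranges(text):
    """Return (hit, notes) pairs so oddities stand out before any Replace All."""
    rows = []
    for match in PAGE_RANGE.finditer(text):
        prefix, start, end = match.groups()
        notes = []
        if prefix == "p":
            notes.append("single p")
        if "." not in match.group(0):
            notes.append("missing period")
        if len(end) < len(start):
            notes.append("abbreviated end page")
        rows.append((match.group(0), notes))
    return rows


sample = "See pp. 123-125, p. 125-127, pp 130-135 and pp. 123-5."
for hit, notes in review_page_ranges(sample):
    print(f"{hit:<12} {', '.join(notes)}")
```

Rows with an empty notes list are the safe ones to convert to "pp. N–N"; the rest are the ones to double-click and fix by hand.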

Being able to see some Advanced "Count All", at a glance, in a sortable list... I think this would be some ultimate power move. :P

(Although yes, yes,... what if someone puts in some insane Regex that grabs entire paragraphs... how would that get shoved/displayed in the lists... lol.)

Anyway, I don't currently know of any tool that does this. As I explained in those Calibre posts, I see bits and pieces here and there, but nothing that displays them in easy-to-read lists like the Spellcheck Lists!

* * *

Side Note:

Non-Linear Editing

Another fantastic thing I've been doing lately is editing using Regex.

1. A common thing in Fiction is "creative dialogue tags".

Instead of saying "said", authors may write things like:

opined
accused
agreed
beseeched

Code:

Found              |  Replace       |  Hits
,” Alex opined     | ,” Alex said   |    10
,” Suzie accused   | ,” Suzie said  |     9
,” Joanne agreed   | ,” Joanne said |     4
,” she beseeched   | ,” she said    |     1

2. Or Normalizing "said he" -> "he said"

Search: ,” (said) \b(Alex|Bob|Joanne|Suzie|s?he|they)\b
Replace: ,” \2 \1

Being able to run a Regex like that across an entire book, see a generated list of all usages... it would be GLORIOUS.
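As a small illustration of the "said he" -> "he said" swap, here is the same idea in Python. The names in the alternation are just the examples from the post, and the sketch writes the pronoun alternative as `s?he` so that it matches only "he" or "she".

```python
import re

SAID_NAME = re.compile(r",” (said) \b(Alex|Bob|Joanne|Suzie|s?he|they)\b")


def normalize_said(text):
    """Swap 'said X' to 'X said' after a closing quote."""
    return SAID_NAME.sub(r",” \2 \1", text)


sample = "“Fine,” said Alex. “No,” said she. “Maybe,” said Shelby."
print(normalize_said(sample))
# “Fine,” Alex said. “No,” she said. “Maybe,” said Shelby.
```

"Shelby" is left alone because it is not in the list of names, which is the kind of case a generated list would let you spot at a glance.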

3. Or something like:

Search: ([!\?]”) ([A-Z])(\w+) (\w+)
Replace: \1 \L\2\3 \4

to catch accidentally capitalized letters after '!' or '?' when they should be lowercase! Example:

✗ “What did you say?” He asked.
✓ “What did you say?” he asked.
✗ “Time to die!” He yelled.
✓ “Time to die!” he yelled.
“Attack! Fight for your life!” Alex jumped onto the ship, swinging his sword.
- The vast majority fall into this "don't change" category.
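Because most hits fall into the "don't change" category, it helps to list candidates instead of replacing blindly. A rough Python sketch follows; Python's `re` has no `\L` in replacement strings, so the lowercasing is done in code. The function name is made up.

```python
import re

CAP_AFTER_QUOTE = re.compile(r"([!?]”) ([A-Z])(\w+) (\w+)")


def capital_candidates(text):
    """Return (before, after) pairs; the user decides which ones to apply."""
    rows = []
    for m in CAP_AFTER_QUOTE.finditer(text):
        after = f"{m.group(1)} {m.group(2).lower()}{m.group(3)} {m.group(4)}"
        rows.append((m.group(0), after))
    return rows


sample = (
    "“What did you say?” He asked. "
    "“Attack! Fight for your life!” Alex jumped onto the ship."
)
for before, after in capital_candidates(sample):
    print(f"{before:<18} -> {after}")
```

The second row ("Alex jumped" -> "alex jumped") is a false positive, which is why selectively applying rows matters here.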

* * * * * * * * *

PM #3

[...]

Or something similar to that image I showed in Bulk Rename Utility.

You'd have Before/After columns.

Out of those rows, you select which ones you want to apply to. It highlights those rows differently, so then you can see what the heck it'll actually change it to.

If you're satisfied, then you press the button and it mass replaces those.

So let's say I run something like:

Search: (\w+)</p>\s+<p>([a-z])
Replace: \1 \2

you'd get a giant list of:

Code:

Before                           | After
_________________________________|___________________
And how</p>                      | And how are you?
<p>are you?                      |
                                 |
And another</p>                  | And another one is here.</p>
<p>one is here.</p>              |

Heh, but kind of like I mentioned in that PM to KevinH... no idea how to display list-forms when the person shoves in huge regex (like capturing entire HTML files or enormous paragraphs).
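For what it's worth, the Before/After idea is easy to mock up in Python for short matches like this one. A minimal sketch, assuming the HTML is available as a single string; `before_after` is a hypothetical helper, and the display problem for huge matches is not addressed.

```python
import re

# A paragraph that ends without punctuation, followed by one starting lowercase.
SPLIT_PARAGRAPH = re.compile(r"(\w+)</p>\s+<p>([a-z])")


def before_after(html):
    """List each match next to what the replacement would turn it into."""
    return [(m.group(0), m.expand(r"\1 \2")) for m in SPLIT_PARAGRAPH.finditer(html)]


html = "<p>And how</p>\n<p>are you?</p>\n<p>Fine.</p>\n<p>Next one.</p>"

for before, after in before_after(html):
    print(repr(before), "->", repr(after))

# Applying the replacement to everything, once the rows have been reviewed:
print(SPLIT_PARAGRAPH.sub(r"\1 \2", html))
```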

caseym54 · 10-29-2021, 03:26 AM

Thanks for that very helpful reply.

Eventually I got around to this (and some variations)

(?!tm)([a-z][a-z])l\s or maybe a quote. Problem with "html in headers"
\1\2

Tex2002ans · 10-29-2021, 09:49 PM

Quote:

Originally Posted by caseym54

Eventually I got around to this (and some variations)

(?!tm)([a-z][a-z])l\s or maybe a quote.

You may want to change that \s -> \b.

\s = "any whitespace character" (space, tab, line break)
\b = "Word Boundary" = The "beginning" or the "end" of a word

so this regex:

l\s = A word that ends in an 'l', followed by a space
l\b = A word that ends in an 'l', followed by any non-word character (a space, period, comma, colon, quotation mark, bracket, etc.) or by the end of the text.

(For more info on \b, see Regular-Expressions.info: "Word Boundaries".)
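The difference is easy to check in Python's `re` module, which treats `\s` and `\b` the same way for this purpose. The sample sentence is made up for illustration.

```python
import re

sample = "Run, Carl. Carl ran! Call Paul"

# \s needs an actual whitespace character after the 'l' (and consumes it).
print(re.findall(r"\w+l\s", sample))  # ['Carl ', 'Call ']

# \b only asserts that the word ends there, so punctuation and
# end-of-text count too, and nothing extra is consumed.
print(re.findall(r"\w+l\b", sample))  # ['Carl', 'Carl', 'Call', 'Paul']
```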

Anyway, to tackle the "l exclamation point" error, I would probably handle it this way:

Finding Lowercase L Words

In Calibre:

Method A. Tools > Check Spelling.

You can use whatever search criteria you need. ("Show only misspelled words", etc.)

Then you can highlight all the words (Ctrl+A) + Right-Click > "Copy Selected Words to Clipboard":

[Image: Calibre.Spellcheck.-.Right-Click.Copy.Selected.Words.png]

Method B. Tools > Reports > Words.

Press the "Save" button in the bottom right. Then you can save a CSV file:

[Image: Calibre.Reports.-.Words.png]

From there, you can export to another program (like Notepad++ or LibreOffice Calc), where you can run regex or do more analysis.

Side Note: I believe Sigil will be getting more CSV/export functionality in the future.

* * *

I ran Method A on a 130k word book:

237 "misspelled words" had a lowercase 'l' inside.
Only 25 ended with a lowercase 'l'.

Code:

Bobbs-Merrill
Bucknell
Jouvenel
Kozol
Kristol
Mandel
Passell
Samual
Shaull
Stargell
Wittfogel
Wohl
al
calculational
eft-liberal
marshall
nonexponential
nonideological
nonrenewal
ntil
pre-Civil
preindustrial
proindustrial
quotal
warall

Now that list is MUCH easier to look through.

In an instant, you can tell most of these are just people's names.

Then you can see:

"eft-liberal" + "ntil" = missing first letter.
- This ebook had "dropcap" first letter of chapter.
"al" = "et al."
- Common in Non-Fiction/bibliographies. Latin for "and others".
The rest are spelled correctly.
- Except "warall", which was an actual typo (missing an EM DASH between).

This method should catch most of that "l exclamation point" error.

- - - - -

Side Note: Finding Words Ending With Lowercase L

After getting the list of words out of Calibre...

This is the regex I use in Notepad++:

Search: ^(.+)(l)$
Replace: #\1\2

In English, this searches for:

^ = Beginning of line
.+ = One or more of any characters
l = the letter lowercase L
$ = End of line

replace with a '#' at the beginning of that word:

pre-Civil -> #pre-Civil
calculational -> #calculational

Then I sort alphabetically, and poof, all "words with a #" appear up top.
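If you'd rather skip the Notepad++ step, the same selection can be done with a few lines of Python on the copied word list. A sketch only, assuming one word per entry; instead of tagging with '#' and sorting, it simply keeps the words that end in a lowercase 'l'.

```python
def words_ending_in_l(words):
    """Keep words matching ^(.+)(l)$, i.e. at least one character before a final 'l'."""
    return sorted(w for w in words if len(w) > 1 and w.endswith("l"))


copied = ["Kozol", "ntil", "warall", "Hayek", "al", "preindustrial", "liberty"]
print(words_ending_in_l(copied))
# ['Kozol', 'al', 'ntil', 'preindustrial', 'warall']
```

Note that Python's default sort puts capitalized words first, which conveniently groups the names together.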

- - - - -

Usage Note: When I ran Method A on "all words":

4126 words had a lowercase 'l'.
549 ended with a lowercase 'l'.

Here's a piece:

Spoiler:

Still reasonable to look through, but you can see how you'd have to have the perfect storm of:

1. A word that is correctly spelled without an 'l'.
2. The 'l' -> exclamation point error occurring.
3. The word also correctly spelled with an extra 'l'.

You can see how rare it would be to land in that category. Three such examples would be:

Car -> Carl -> Car!
- Although a lowercase "carl" would show up in the misspelled list. How often is "Car" capitalized + followed by an exclamation?
Capita (as in "per capita") -> Capital -> Capita!
sea -> seal -> sea!
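That "perfect storm" can also be checked mechanically: a word is only a hidden risk if it is valid both with and without the final 'l'. A minimal sketch, assuming you have some set of known-good words to test against; the tiny `dictionary` below is a stand-in, not a real spelling dictionary.

```python
def ambiguous_l_words(words, dictionary):
    """Words ending in 'l' that are valid AND still valid with the 'l' removed."""
    return sorted(
        w for w in set(words)
        if w.endswith("l") and w in dictionary and w[:-1] in dictionary
    )


# Stand-in dictionary for illustration only.
dictionary = {"sea", "seal", "capita", "capital", "car", "until", "liberal"}
book_words = ["seal", "capital", "until", "liberal", "Carl", "seal"]

print(ambiguous_l_words(book_words, dictionary))  # ['capital', 'seal']
```

Only the words on this short list can hide an "l instead of !" error from the misspelled list, so they are the ones worth searching for in context.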

Grammarchecker

From there, you may want to run the text through a grammarchecker... This may be able to catch:

oddly capitalized words in the middle of sentences.
correctly spelled words that don't quite fit.

Example:

Florida has the least COVID cases per capital The administration didn't comment on the latest good news.
- capital -> capita!
- Grammarcheck may hit on "per capital" OR "The" OR point out something odd in this sentence (missing comma, period, etc.).
The boat was on the seal And the car was on the land!
- seal -> sea!

Quote:

Originally Posted by caseym54

(?!tm)([a-z][a-z])l\s or maybe a quote.

Also a good idea if working in Fiction (or heck, even Non-Fiction).

Very likely the "l exclamation point" error will occur before the close quote, so you'd:

Search: l”
Replace: !”

That would catch things like:

“Carl” Alex yelled as he dove out of the street.
“The term is per capital” the statistics professor said. “Every 100,000 people.”
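To preview those hits before doing a Replace All, the search can be run in Python and shown as before/after pairs. A sketch under the assumption that the book uses curly quotes, as in the examples; the function name is made up.

```python
import re

# A word ending in 'l' directly before a closing curly quote.
L_BEFORE_CLOSE_QUOTE = re.compile(r"(\w+)l”")


def quote_candidates(text):
    """Return (before, after) pairs for 'l”' -> '!”' without changing the text."""
    return [(m.group(0), f"{m.group(1)}!”") for m in L_BEFORE_CLOSE_QUOTE.finditer(text)]


sample = (
    "“Carl” Alex yelled as he dove out of the street. "
    "“The term is per capital” the statistics professor said. "
    "“Tell me all,” he said."
)
for before, after in quote_candidates(sample):
    print(before, "->", after)
```

The third sentence is not listed because its 'l' is followed by a comma, not the closing quote.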

Anyway, those methods would get you 99%+ of the way there, very quickly, without having to check ALL thousands of hits one-by-one-by-one.
