Old 08-15-2021, 08:21 AM   #1
Binchen
Connoisseur
Posts: 56
Karma: 10
Join Date: Jul 2021
Device: Abakus
Spellcheck filter - upper/lower case

The filter in the spellcheck does not distinguish between upper and lower case: searching for U shows all words with u as well.

Bug or feature?

Binchen
Old 08-15-2021, 08:38 AM   #2
KevinH
Sigil Developer
Posts: 5,855
Karma: 3571874
Join Date: Nov 2009
Device: many
Not a bug. It works as designed. The code, just like the other filters used in Sigil, is deliberately made case-insensitive by lowercasing both the filter text and each word before comparing them.

Code:
void SpellcheckEditor::FilterEditTextChangedSlot(const QString &text)
{
    const QString lowercaseText = text.toLower();
    QModelIndex root_index = m_SpellcheckEditorModel->indexFromItem(m_SpellcheckEditorModel->invisibleRootItem());

    for (int row = 0; row < m_SpellcheckEditorModel->invisibleRootItem()->rowCount(); row++) {
        QStandardItem *item = m_SpellcheckEditorModel->item(row, 0);
        bool hidden = !(text.isEmpty() || item->text().toLower().contains(lowercaseText));
        ui.SpellcheckEditorTree->setRowHidden(item->row(), root_index, hidden);
    }
}
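
For anyone who doesn't read Qt code, here is a rough Python sketch of the same filtering logic. The function name and the `case_sensitive` flag are illustrative assumptions only (nothing like that flag exists in the quoted Sigil code); it just shows what a toggle would have to change.

```python
def visible_words(words, filter_text, case_sensitive=False):
    """Return the words that a substring filter would leave visible.

    Mirrors the quoted slot: an empty filter shows everything; otherwise a
    word stays visible only if it contains the filter text. The
    case_sensitive flag is hypothetical and does not exist in Sigil.
    """
    if not filter_text:
        return list(words)
    if case_sensitive:
        return [w for w in words if filter_text in w]
    needle = filter_text.lower()
    return [w for w in words if needle in w.lower()]


words = ["Uhr", "unten", "Haus", "USB"]
print(visible_words(words, "U"))                       # current behaviour
print(visible_words(words, "U", case_sensitive=True))  # hypothetical toggle
```

With the default, "U" keeps all four words because each contains a "u" in some case; with the hypothetical toggle only "Uhr" and "USB" remain.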

Last edited by KevinH; 08-15-2021 at 08:46 AM.
Old 08-15-2021, 09:44 AM   #3
Turtle91
A Hairy Wizard
Posts: 2,350
Karma: 13611111
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2 & Air/Surface Pro/Kindle PW
In the find/replace section there is an option for Case Sensitive searches. Is it possible to have a checkbox to turn that on or off in the spellcheck function?

I don't know of many words in English that would be mis-spelled with a capital that wouldn't also be mis-spelled as lower case, but I am certainly not fluent in all the languages available to Sigil. Also, as a work-around, the user could certainly use the find/replace function to find words with an aberrant caPital.
Old 08-15-2021, 10:34 AM   #4
KevinH
Actually, spellchecking itself can be case sensitive, since many languages like German capitalize nouns (not just proper nouns), and those words are incorrect if not properly capitalized. The dictionary used determines whether capitalization matters, not Sigil.
Old 08-15-2021, 11:03 AM   #5
Turtle91
Here's a basic regex that can check for capitals inside a word, although the DOCTYPE statement gives a bunch of matches:

find: (?<=\w)[A-Z]
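
A small Python sketch of what that pattern does, in case it helps. The helper name is made up for illustration, and Python's `re` engine is only standing in for Sigil's regex engine here; this particular pattern behaves the same way in both as far as I know.

```python
import re

# The pattern from the post: an uppercase A-Z letter directly preceded by a word character.
INNER_CAPITAL = re.compile(r"(?<=\w)[A-Z]")

def words_with_inner_capitals(text):
    """Return whole words that contain a capital after their first character."""
    return [w for w in re.findall(r"\w+", text) if INNER_CAPITAL.search(w)]

print(words_with_inner_capitals("an aberrant caPital in the text"))
print(INNER_CAPITAL.findall("<!DOCTYPE html>"))
```

The second call shows why the DOCTYPE line is noisy: in an all-caps word every letter after the first one is "a capital inside a word", so DOCTYPE alone produces six hits.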
Old 10-23-2021, 05:07 PM   #6
caseym54
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2014
Device: Kindle Paperwhite 2nd gen
Is there a way to use regex inside the spellcheck search?

Example: Often, misspelled words that erroneously end in "l" are supposed to end in an exclamation point. Using the normal search gives too many false positives, but searching only after filtering for spelling would speed things up. Full regex find/replace would be nice within the spellcheck, too.

Is there any plugin that does this?
Old 10-23-2021, 06:23 PM   #7
Tex2002ans
Wizard
Posts: 1,948
Karma: 8877587
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by caseym54
Is there a way to use regex inside the spellcheck search?

Example: Often, misspelled words that erroneously end in "l" are supposed to end in an exclamation point.
The way that I currently handle this is in multiple passes.

In Tools > Spellcheck > Spellcheck (Alt+Q):

1. I type in lowercase 'l' (or whatever letter I'm looking for).

Then I toggle the "Show All Words" checkbox.

Very likely those "l instead of !" words will appear in the "misspelled words" spellcheck list.

- - -

Side Note: In Calibre's Spellcheck List, there's also a "Case sensitive search" checkbox.

Extremely helpful in this case, because you don't want capital 'L' words clogging up your list.

- - -

2. After I correct those, I toggle the checkbox again (so all the "correctly spelled words" are listed), then scroll through and see if I can spot any oddities.

3. Then one more pass at the "misspelled" list.

Note: I do similar passes for 'l' -> '1' or 'o' -> '0' OCR errors like:

Code:
l98l
198o
196os
h0wever
That's one of the reasons why I requested Spellcheck Lists to support numbers back in 2017.

Calibre has always supported numbers. In Sigil, you need to enable it in Edit > Preferences > Spellcheck Dictionaries > "Check Numbers" (tiny checkbox in the very upper right corner).
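
If you export the text (or a word list) and want to hunt for these digit/letter mix-ups outside the editor, something like this Python sketch works. The regex and the function name are my own illustration, not anything built into Sigil or Calibre.

```python
import re

# A token that contains at least one digit AND at least one ASCII letter.
MIXED_TOKEN = re.compile(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b")

def ocr_suspects(text):
    """Return tokens mixing letters and digits, in order of appearance."""
    return MIXED_TOKEN.findall(text)

sample = "In l98l and 198o, h0wever, the 196os ended in 1981."
print(ocr_suspects(sample))
```

It only produces candidates: legitimate tokens such as "1960s" or "3rd" would be listed too, so the result still needs a human pass, just like the Spellcheck List.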

Side Note #2: You can also use a similar trick to catch accidental/inconsistent hyphens.

I wrote about it in 2013!

Quote:
Originally Posted by caseym54
Using the normal search has too many false positives, but after filtering for spelling, it would speed up things. Full regex find/replace would be nice within the spellcheck, too.

Is there any plugin that does this?
No, but I did recently discuss/brainstorm an "Advanced Find/Replace" concept about a month ago in a random Calibre topic.

So you'd have a Spellcheck List-type menu with:
  • Find
  • Replace
  • Filter

and 3 sortable columns:
  • Found
  • Replace
  • # Hits

You'd be able to selectively apply Find/Replace ONLY on specific rows.

(Currently, you can only Find/Replace one-by-one OR Replace All... Similar to the slowness of spellchecking/grammarchecking documents one-by-one vs. mass checking in list form!)

The Technical Details

Here's the relevant posts from that thread:

Spoiler:
Quote:
Originally Posted by Tex2002ans
2. A tool like Bulk Rename Utility allows you to mass search/replace filenames:

Attachment 189406

You fill out your parameters below.

Then you select which files you want to apply it to (Ctrl+Click/Shift+Click).

It puts green highlight on the files that'll actually change, and shows you the before/after in 2 columns.
Quote:
Originally Posted by Tex2002ans
I also believe this would be helpful in the normal large Find/Replaces (with a handful of edge cases).

Like this thread. A giant Find/Replace to switch all "123" -> "spelled-out numbers" form.

100 replaces were fine:
  • Chapter 21 -> Chapter Twenty-One
  • I was 2 years old -> I was two years old

[...]

A Sortable/Searchable (List-Based?) Differ (Advanced Find/Replace?)

When the number of changes is overwhelming (in the hundreds/thousands).

Similar to the Spellcheck List, you'd be able to type in a:

- Find
- Replace

Run this on a book (like pressing "Count All") and generate a list:

- Find: Chapter \d+

You'd get a list of all hits:

Code:
Found       |  Replace   |  Hits
Chapter 1   |            |     1
Chapter 2   |            |     1
Chapter 3   |            |     1
Chapter 4   |            |     1
[...]
Chapter 100 |            |     1
You'd be able to double-click on any entry and jump to its location.

And, similar to the Spellcheck List, you can search/sort through this:

- Search: 1

Code:
Found       |  Replace   |  Hits
Chapter 1   |            |     1
Chapter 10  |            |     1
Chapter 11  |            |     1
Chapter 12  |            |     1
[...]
Chapter 100 |            |     1
- Search: 10

Code:
Found       |  Replace   |  Hits
Chapter 10  |            |     1
Chapter 100 |            |     1
You'd also be able to do a Replace:

- Find: Chapter (\d+)
- Replace: Chap. \1

Code:
Found       |  Replace   |  Hits
Chapter 1   |  Chap. 1   |     1
Chapter 2   |  Chap. 2   |     1
Chapter 3   |  Chap. 3   |     1
Chapter 4   |  Chap. 4   |     1
[...]
Chapter 100 |  Chap. 100 |     1
Here, I can also scroll through the list and accept/reject certain replaces.

Maybe, sorting by Hits, there would be a:

Code:
Chapter 5   |  Chap. 5   |     5
so you scratch your head, take a closer look, and maybe the book has a few:
  • See Chapter 5 for more information.

You may want to treat that differently than:
  • <h2>Chapter 5</h2>

so you'd apply the change to all 99 other replaces first, then you can dig in to that oddity in more detail.
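
To make the list idea concrete, here is a minimal Python sketch of how such a Found | Replace | Hits table could be generated. No such tool exists in Sigil or Calibre as far as this thread knows; `replace_table` and `filter_rows` are hypothetical names, and the replacement uses Python's backreference syntax.

```python
import re
from collections import Counter

def replace_table(text, find, replace):
    """Build (found, replacement, hits) rows without changing the text.

    `replace` uses Python's backreference syntax (\\1, \\g<name>).
    """
    pattern = re.compile(find)
    hits = Counter()
    preview = {}
    for m in pattern.finditer(text):
        found = m.group(0)
        hits[found] += 1
        # Remember what the first occurrence would turn into.
        preview.setdefault(found, m.expand(replace))
    return [(found, preview[found], n) for found, n in hits.items()]

def filter_rows(rows, needle):
    """Keep only rows whose 'found' text contains the search string."""
    return [row for row in rows if needle in row[0]]

book = "<h2>Chapter 1</h2> See Chapter 5. <h2>Chapter 5</h2> <h2>Chapter 10</h2>"
rows = replace_table(book, r"Chapter (\d+)", r"Chap. \1")
for found, repl, n in sorted(rows, key=lambda r: -r[2]):
    print(f"{found:<12}| {repl:<10}| {n}")
```

Sorting by hits puts "Chapter 5" on top with a count of 2, which is exactly the "scratch your head" row from the example.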


And then over the following days, I discussed even better use-cases + concepts via PMs:

PM #1

You see that "Chapter (\d+)" example I gave?

Anyway, when I woke up, I thought of a few other cases where I'd find that type of workflow extremely useful.

One search/replace where you do thousands, but want to deny a few exceptions, is EN DASHES:

* * *

Current Method

What I typically do is this:

Search: (\d+)-(\d+)
Replace: \1–\2

but then I have to be very careful with URLs, ISBNs, etc.

So what I currently do is split it into separate, smaller steps.
  • Anything with a "pp." or "p." before it? Replace All.
  • Open up the Index? "Current File" -> Replace All.
  • Then I go through step-by-step and have to manually do the rest.
    • (Or I "hack" the Spellcheck List with numbers, then search for a hyphen to see what I'm looking at. :P)
    • If I catch any oddities at that step, I make sure to NOT Replace All, and may tackle chapters one-at-a-time with "Current File".

* * *

Sortable Advanced Find & Replace List

I could do something like:

Search: \b(p+)\.* (\d+)-(\d+)
Replace: pp. \2–\3

Running that, you'd get a giant sortable list of:

Code:
pp. 123-125
pp. 125-127
p 123-125
p. 125-127
pp 130-135
pp. 123-5
pp 125-7
so, at a glance, I can already see errors in the book (missing periods + some not in 3-digit form).

Then I'd be able to do multiple passes:

Filter: \.

Code:
pp. 123-125
pp. 125-127
p. 125-127 <--- Single "p." error
pp. 123-5 <--- Inconsistency
Great. Replace the first 3. Then double-click on the "pp. 123-5" and/or manually correct to "pp. 123–125".

Blank the Filter, and now I'm left with:

Code:
p 123-125
pp 130-135
pp 125-7 <--- Inconsistency
Again, the first 2 can be replaced, but the 3rd one needs the 3-digit form.
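
Here is the same multi-pass idea as a rough Python sketch. The search pattern is the one above; the function name, the "note" column, and the rule for flagging abbreviated ranges are assumptions added for illustration.

```python
import re

# The search from the post, unchanged: optional periods, then a hyphenated range.
PAGE_RANGE = re.compile(r"\b(p+)\.* (\d+)-(\d+)")

def page_range_rows(text):
    """Return (found, proposed, note) for every page-range hit."""
    rows = []
    for m in PAGE_RANGE.finditer(text):
        start, end = m.group(2), m.group(3)
        proposed = m.expand("pp. \\2\u2013\\3")  # \u2013 is the en dash
        # Flag ranges like 125-7, where the end page has fewer digits.
        note = "abbreviated end page" if len(end) < len(start) else ""
        rows.append((m.group(0), proposed, note))
    return rows

sample = "See pp. 123-125, p 123-125 and pp 125-7."
rows = page_range_rows(sample)
with_period = [r for r in rows if "." in r[0]]   # the "Filter: \." pass
for row in rows:
    print(row)
```

The proposed text is only a preview. Rows carrying a note are the ones you would fix by hand rather than replace in bulk.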

Being able to see some Advanced "Count All", at a glance, in a sortable list... I think this would be some ultimate power move. :P

(Although yes, yes,... what if someone puts in some insane Regex that grabs entire paragraphs... how would that get shoved/displayed in the lists... lol.)

Anyway, I don't currently know of any tool that does this. As I explained in those Calibre posts, I see bits and pieces here and there, but nothing that displays them in easy-to-read lists like the Spellcheck Lists!

* * *

Side Note:

Non-Linear Editing

Another fantastic thing I've been doing lately is editing using Regex.

1. A common thing in Fiction is "creative dialogue tags".

Instead of saying "said", authors may write things like:
  • opined
  • accused
  • agreed
  • beseeched

So what I've been doing is similar to this Regex:

Search: ,” \b(Alex|Bob|Joanne|Suzie|s?he|they)\b (\w+)
Replace: ,” \1 said

Code:
Found              |  Replace       |  Hits
,” Alex opined     | ,” Alex said   |    10
,” Suzie accused   | ,” Suzie said  |     9
,” Joanne agreed   | ,” Joanne said |     4
,” she beseeched   | ,” she said    |     1
2. Or Normalizing "said he" -> "he said"

Search: ,” (said) \b(Alex|Bob|Joanne|Suzie|s?he|they)\b
Replace: ,” \2 \1

Being able to run a Regex like that across an entire book, see a generated list of all usages... it would be GLORIOUS.

3. Or something like:

Search: ([!\?]”) ([A-Z])(\w+) (\w+)
Replace: \1 \L\2\3 \4

to catch accidentally capitalized letters after '!' or '?' when they should be lowercase! Example:
  • “What did you say?” He asked.
  • “What did you say?” he asked.
  • “Time to die!” He yelled.
  • “Time to die!” he yelled.
  • “Attack! Fight for your life!” Alex jumped onto the ship, swinging his sword.
    • The vast majority fall into this "don't change" category.
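
Item 3 as a Python sketch that only lists candidates instead of replacing them, since most hits should be left alone. One difference worth noting: Python's `re` module has no `\L` in replacement strings, so the lowercasing is done in code, and only on the first letter. The function name is illustrative.

```python
import re

CLOSE_QUOTE = "\u201d"  # the curly closing quote used in the posts

# Item 3's search: '!' or '?' + closing quote, then a capitalized word and one more word.
CAPITAL_AFTER_QUOTE = re.compile(r"([!?]" + CLOSE_QUOTE + r") ([A-Z])(\w+) (\w+)")

def lowercase_candidates(text):
    """Return (before, after) pairs for review; nothing is applied."""
    pairs = []
    for m in CAPITAL_AFTER_QUOTE.finditer(text):
        # No \L in Python replacements, so lowercase the captured letter here.
        after = f"{m.group(1)} {m.group(2).lower()}{m.group(3)} {m.group(4)}"
        pairs.append((m.group(0), after))
    return pairs

sample = ("\u201cWhat did you say?\u201d He asked. "
          "\u201cAttack!\u201d Alex jumped onto the ship.")
for before, after in lowercase_candidates(sample):
    print(before, "->", after)
```

The "Alex jumped" hit is the kind you would reject, which is why a reviewable list beats Replace All for this search.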

* * * * * * * * *

PM #3

[...]

Or something similar to that image I showed in Bulk Rename Utility.

You'd have Before/After columns.

Out of those rows, you select which ones you want to apply it to. It highlights those rows differently, so then you can see what the heck it'll actually change them to.

If you're satisfied, then you press the button and it mass replaces those.

So let's say I run something like:

Search: (\w+)</p>\s+<p>([a-z])
Replace: \1 \2

you'd get a giant list of:

Code:
Before                           | After
_________________________________|___________________
And how</p>                      | And how are you?
<p>are you?                      |
                                 |
And another</p>                  | And another one is here.</p>
<p>one is here.</p>              |
Heh, but kind of like I mentioned in that PM to KevinH... no idea how to display list-forms when the person shoves in huge regex (like capturing entire HTML files or enormous paragraphs).
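
A quick Python sketch of that Before/After preview for the broken-paragraph search. The function names are made up, and this works on a plain string rather than on an EPUB; it is only meant to show the shape of the data such a list would hold.

```python
import re

# The search from the post: a word ending a paragraph, whitespace, then a
# new paragraph that starts with a lowercase letter.
BROKEN_PARAGRAPH = re.compile(r"(\w+)</p>\s+<p>([a-z])")

def merge_preview(html):
    """Return (before, after) snippets for each suspicious paragraph break."""
    return [(m.group(0), m.expand(r"\1 \2")) for m in BROKEN_PARAGRAPH.finditer(html)]

def merge_all(html):
    """Apply the replacement everywhere (the 'Replace All' case)."""
    return BROKEN_PARAGRAPH.sub(r"\1 \2", html)

page = "<p>And how</p>\n<p>are you?</p>\n<p>Fine.</p>\n<p>Good.</p>"
print(merge_preview(page))
print(merge_all(page))
```

Because this pattern captures just one word and one letter, each preview row stays short; a regex that grabs whole paragraphs would not be so kind.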

Last edited by Tex2002ans; 10-23-2021 at 09:53 PM.
Old 10-29-2021, 04:26 AM   #8
caseym54
Thanks for that very helpful reply.

Eventually I got around to this (and some variations):

Find: (?!tm)([a-z][a-z])l\s (or maybe a quote instead of the \s). The problem is "html" in headers.
Replace: \1\2
Old 10-29-2021, 10:49 PM   #9
Tex2002ans
Quote:
Originally Posted by caseym54
Eventually I got around to this (and some variations)

(?!tm)([a-z][a-z])l\s or maybe a quote.
You may want to change that \s -> \b.
  • \s = "any space character"
  • \b = "Word Boundary" = The "beginning" or the "end" of a word

so this regex:
  • l\s = A word that ends in an 'l', followed by a space
  • l\b = A word that ends in an 'l', followed by any non-word character (a space, period, comma, colon, quotation mark, bracket, etc.).

(For more info on \b, see Regular-Expressions.info: "Word Boundaries".)
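
A tiny Python demonstration of the difference, using a made-up sentence. Python's `re` treats `\s` and `\b` the same way for this purpose.

```python
import re

text = "well, that was cool. Carl ran"

ends_l_then_space = re.findall(r"\w+l\s", text)     # needs whitespace after the 'l'
ends_l_at_boundary = re.findall(r"\w+l\b", text)    # any word end will do

print(ends_l_then_space)
print(ends_l_at_boundary)
```

The `\s` version finds only "Carl " (and swallows the space), while the `\b` version also catches "well" before the comma and "cool" before the period.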

Anyway, to tackle the "l exclamation point" error, I would probably handle it this way:

Finding Lowercase L Words

In Calibre:

Method A. Tools > Check Spelling.

You can use whatever search criteria you need. ("Show only misspelled words", etc.)

Then you can highlight all the words (Ctrl+A) + Right-Click > "Copy Selected Words to Clipboard":

[Image: Calibre.Spellcheck.-.Right-Click.Copy.Selected.Words.png]

Method B. Tools > Reports > Words.

Press the "Save" button in the bottom right. Then you can save a CSV file:

[Image: Calibre.Reports.-.Words.png]

From there, you can export to another program (like Notepad++ or LibreOffice Calc), where you can run regex or do more analysis.

Side Note: I believe Sigil will be getting more CSV/export functionality in the future.

* * *

I ran Method A on a 130k word book:
  • 237 "misspelled words" had a lowercase 'l' inside.
  • Only 25 ended with a lowercase 'l'.

Code:
Bobbs-Merrill
Bucknell
Jouvenel
Kozol
Kristol
Mandel
Passell
Samual
Shaull
Stargell
Wittfogel
Wohl
al
calculational
eft-liberal
marshall
nonexponential
nonideological
nonrenewal
ntil
pre-Civil
preindustrial
proindustrial
quotal
warall
Now that list is MUCH easier to look through.

In an instant, you can tell most of these are just people's names.

Then you can see:
  • "eft-liberal" + "ntil" = missing first letter.
    • This ebook had a "dropcap" first letter at the start of each chapter.
  • "al" = "et al."
    • Common in Non-Fiction/bibliographies. Latin for "and others".
  • The rest are spelled correctly.
    • Except "warall", which was an actual typo (missing an EM DASH between).

This method should catch most of that "l exclamation point" error.

- - - - -

Side Note: Finding Words Ending With Lowercase L

After getting the list of words out of Calibre...

This is the regex I use in Notepad++:

Search: ^(.+)(l)$
Replace: #\1\2

In English, this searches for:
  • ^ = Beginning of line
  • .+ = One or more of any characters
  • l = the letter lowercase L
  • $ = End of line

and replaces each match with itself plus a '#' at the beginning of the word:

pre-Civil -> #pre-Civil
calculational -> #calculational

Then I sort alphabetically, and poof, all "words with a #" appear up top.
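
If you'd rather script it than use Notepad++, the same mark-and-sort trick looks roughly like this in Python. The function name is illustrative; the regex and replacement are the ones above.

```python
import re

def mark_l_endings(words):
    """Prefix '#' to every word ending in lowercase 'l', then sort."""
    marked = [re.sub(r"^(.+)(l)$", r"#\1\2", w) for w in words]
    return sorted(marked)

print(mark_l_endings(["pre-Civil", "Smith", "calculational", "Jones"]))
```

The trick works because '#' sorts before all letters in a plain character-code sort, so the marked words float to the top.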

- - - - -

Usage Note: When I ran Method A on "all words":
  • 4126 words had a lowercase 'l'.
  • 549 ended with a lowercase 'l'.

Here's a piece:

Spoiler:
Code:
Agricultural
All
Annual
Appeal
April
Baikal
Bail
Bengal
Bill
Bobbs-Merrill
Bucknell
Caldwell
Canal
Capital
Carl
Causal
Central
Chapel
Civil
Classical
Colonial
Commercial
[...]


Still reasonable to look through, but you can see how you'd have to have the perfect storm of:

1. A word that is correctly spelled without an 'l'.
2. The 'l' -> exclamation point error occurring.
3. The word also correctly spelled with an extra 'l'.

You can see how rare it would be to land in that category. Three such examples would be:
  • Car -> Carl -> Car!
    • Although a lowercase "carl" would show up in the misspelled list. How often is "Car" capitalized + followed by an exclamation?
  • Capita (as in "per capita") -> Capital -> Capita!
  • sea -> seal -> sea!
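
You could even list those "perfect storm" words up front. A minimal sketch, assuming you have some word list to check against (the tiny vocabulary below is made up for the example):

```python
def l_bang_ambiguities(vocabulary):
    """Return (with_l, without_l) pairs where both forms are known words."""
    known = set(vocabulary)
    return sorted((w, w[:-1]) for w in known if w.endswith("l") and w[:-1] in known)

vocab = ["Car", "Carl", "capita", "capital", "sea", "seal", "Kozol", "land"]
print(l_bang_ambiguities(vocab))
```

Any pair it returns is a word where a stray 'l' would slip past the spellchecker, so those are the ones worth searching for by hand.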

Grammarchecker

From there, you may want to run the text through a grammarchecker... This may be able to catch:
  • oddly capitalized words in the middle of sentences.
  • correctly spelled words that don't quite fit.

Example:
  • Florida has the least COVID cases per capital The administration didn't comment on the latest good news.
    • capital -> capita!
    • Grammarcheck may hit on "per capital" OR "The" OR point out something odd in this sentence (missing comma, period, etc.).
  • The boat was on the seal And the car was on the land!
    • seal -> sea!

Quote:
Originally Posted by caseym54
(?!tm)([a-z][a-z])l\s or maybe a quote.
Also a good idea if working in Fiction (or heck, even Non-Fiction).

Very likely the "l exclamation point" error will occur before the close quote, so you'd:

Search: l”
Replace: !”

That would catch things like:
  • “Carl” Alex yelled as he dove out of the street.
  • “The term is per capital” the statistics professor said. “Every 100,000 people.”
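
As a Python sketch, here is that search turned into a review list with a bit of context. The function name and the `context` parameter are assumptions for illustration. Correct words such as “Well” would show up as well, so treat the output as candidates only.

```python
import re

CLOSE_QUOTE = "\u201d"  # curly closing quote, as in the search above

def l_before_quote(text, context=15):
    """Return (snippet, proposed) for each lowercase 'l' right before a closing quote."""
    results = []
    for m in re.finditer("l" + CLOSE_QUOTE, text):
        start = max(0, m.start() - context)
        snippet = text[start:m.end()]
        # Swap the trailing 'l' + quote for '!' + quote in the preview.
        results.append((snippet, snippet[:-2] + "!" + CLOSE_QUOTE))
    return results

line = "\u201cCarl\u201d Alex yelled as he dove out of the street."
print(l_before_quote(line))
```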

Anyway, those methods would get you 99%+ of the way there, very quickly, without having to check ALL thousands of hits one-by-one-by-one.

Last edited by Tex2002ans; 10-29-2021 at 11:25 PM.