Old 06-21-2014, 05:47 AM   #1
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Suggestion: Spell Check Tool Enhancement

This topic spurred a thought in my brain for an enhancement to the Spell Check Tool:

https://www.mobileread.com/forums/sho...d.php?t=241283

I made a post (#6):

https://www.mobileread.com/forums/sho...51&postcount=6

Long story short, it would be a nice feature enhancement to have a checkbox to allow a Case Sensitive SEARCH.

For example, searching for a capital letter 'O' will allow "Octopus" to be shown in the list, but not "octopus". Searching for a capital 'I' will show "McIver", but not "Mciver".

Having the checkbox off will be the default (current) implementation (not case sensitive).

As I mentioned in that topic, I believe a Case Sensitive Search would be extremely helpful for catching many of these hard-to-find OCR errors (capital 'I' instead of lowercase 'l', etc.).
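
To make the request concrete, here is a minimal Python sketch of the two search modes. It is not calibre's code; `filter_words` and its `case_sensitive` flag are hypothetical names used only for illustration.

```python
def filter_words(words, query, case_sensitive=False):
    """Return the words that contain query as a substring."""
    if case_sensitive:
        # Checkbox on: the query must match with exactly the same case.
        return [w for w in words if query in w]
    # Checkbox off (the default): compare case-folded copies.
    q = query.casefold()
    return [w for w in words if q in w.casefold()]


words = ["Octopus", "octopus", "McIver", "Mciver"]
print(filter_words(words, "O", case_sensitive=True))  # ['Octopus']
print(filter_words(words, "I", case_sensitive=True))  # ['McIver']
print(filter_words(words, "O"))                       # ['Octopus', 'octopus']
```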
Old 06-21-2014, 09:51 AM   #2
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
https://github.com/kovidgoyal/calibr...1ed242a5a7e6f6
Old 07-01-2014, 09:59 PM   #3
Tex2002ans
Fantastic work Kovid, thank you for implementing this tweak so quickly.

Just used it to catch a bunch of words where a capital 'I' stands in for a lowercase 'l', which is a common OCR error:

couId
concIusions
falI
faiI
goodwiII
piIgramage
uItimate
welI
[...]

EXTREMELY helpful addition.
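
For anyone who wants to run the same kind of check on an exported word list, here is a rough Python sketch. It assumes English text, and the `suspicious_words` helper is hypothetical, not part of calibre. It flags any word where a capital 'I' directly follows a lowercase letter.

```python
import re

# A capital 'I' right after a lowercase letter is rare in ordinary English words.
SUSPECT_I = re.compile(r"[a-z]I")


def suspicious_words(words):
    """Return the words containing a lowercase letter followed by capital 'I'."""
    return [w for w in words if SUSPECT_I.search(w)]


sample = ["couId", "could", "falI", "goodwiII", "McIver", "Illinois", "uItimate"]
print(suspicious_words(sample))
# ['couId', 'falI', 'goodwiII', 'McIver', 'uItimate']
```

Note that a legitimate name such as "McIver" is flagged as well, so the result is a list to review by eye, not a list to fix blindly.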
Old 07-06-2014, 05:42 AM   #4
Tex2002ans
Hmmm... so today I was fiddling around some more with the Calibre Spell Check tool, and I stumbled across this problem.

The hyphen '-' should be considered a legitimate character for a word. Example of how it currently works:

The word "non-fiction" is seen as two words, "non" and "fiction".
The word "micro-economics" is seen as two words, "micro" and "economics".
The word "anti-establishment" is seen as two words, "anti" and "establishment".

A few reasons why this fix would be extremely useful:

1. I use this ALL THE TIME in Sigil in order to catch usages of non-hyphenated and hyphenated versions of words. It is QUITE a common OCR error, where you might have mixes of "nonfiction" + "non-fiction", "co-operating" + "cooperating", "counter-clockwise" + "counterclockwise", "short-term" + "shortterm" in the same book. These typically then have to be made consistent/normalized throughout the book.

2. It makes it much easier to catch accidental hyphens in authors' first/last names. For example, "Black-well" -> "Blackwell", "How-den" -> "Howden", "Lach-mann" -> "Lachmann", "Lee-son" -> "Leeson".
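
A minimal Python sketch of the behaviour requested above, assuming plain ASCII text: hyphens inside a word are kept as part of the word, and hyphenated forms are then paired with their unhyphenated twins. The regex and the `hyphen_variants` helper are illustrative assumptions, unrelated to how calibre or ICU actually split words.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by single internal hyphens ("non-fiction").
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def hyphen_variants(text):
    """Return (hyphenated, count, joined, count) for forms that both occur."""
    counts = Counter(WORD.findall(text))
    pairs = []
    for word, n in counts.items():
        if "-" in word:
            joined = word.replace("-", "")
            if joined in counts:
                pairs.append((word, n, joined, counts[joined]))
    return sorted(pairs)


text = "A non-fiction shelf, a nonfiction list, and more non-fiction. Short-term plans."
print(hyphen_variants(text))
# [('non-fiction', 2, 'nonfiction', 1)]
```

The comparison is case-sensitive, so "Non-fiction" at the start of a sentence would be counted separately from "non-fiction".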

Last edited by Tex2002ans; 07-06-2014 at 06:59 AM.
Old 07-08-2014, 02:10 AM   #5
kovidgoyal
This is a limitation of ICU, it breaks words on hyphens, even though its documentation claims it shouldn't. It is on my TODO list to see if I can implement an efficient workaround.
Old 07-08-2014, 05:11 AM   #6
BetterRed
null operator (he/him)
Posts: 20,568
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by kovidgoyal
This is a limitation of ICU, it breaks words on hyphens, even though its documentation claims it shouldn't. It is on my TODO list to see if I can implement an efficient workaround.
Would be nice if you could - Sigil's hyphenated word list is a reason why I stick with it. Today I found Earls-Court and Bays-water in the one document

BR
Old 07-08-2014, 07:30 PM   #7
Tex2002ans
Quote:
Originally Posted by BetterRed
Would be nice if you could - Sigil's hyphenated word list is a reason why I stick with it. Today I found Earls-Court and Bays-water in the one document
Currently I am using both at the same time. The Case Sensitive Search is an INCREDIBLE addition and helps catch a particular type of very hard-to-spot error, while Sigil's hyphenated list catches a completely different set of OCR errors/inconsistencies.

I am using Calibre's list to point out/narrow down the errors, and then just doing all my fixes in Sigil.

Suggestion: Another odd thing I noticed in Calibre's Spell Check List is numbers.

I believe that "words" that are completely made of numbers + periods + commas should not be included in the list at all.

I believe the way that Sigil handles it, a "word" with ANY numbers is removed. But after seeing Calibre's list, I still think it is useful if "words" with SOME numbers are still left there. For example, these can then be caught/stand out like sore thumbs:

Seeing these in list form + the amount of times they occur in the book is extremely helpful for spotting inconsistencies.

Perhaps you can safely remove "words" that are FULLY numbers, but still keep the ones that are SOME numbers?
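
As a rough Python sketch of that rule: drop "words" made only of digits, periods, and commas, and keep anything that mixes digits with letters. The sample words and the `drop_pure_numbers` helper are made up for illustration.

```python
import re

# Digits, periods, and commas only, with at least one digit somewhere.
PURE_NUMBER = re.compile(r"[.,]*[0-9][0-9.,]*")


def drop_pure_numbers(words):
    """Remove fully numeric 'words' but keep ones that mix digits and letters."""
    return [w for w in words if not PURE_NUMBER.fullmatch(w)]


words = ["1,234.56", "2014", "3rd", "l984", "B2B", "word"]
print(drop_pure_numbers(words))
# ['3rd', 'l984', 'B2B', 'word']
```

Here "l984" (with a lowercase 'l' in place of the digit 1) stays in the list, which is exactly the sort of entry that is worth seeing.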

Perhaps it can be another toggle? Include numbers, not include numbers? (Or perhaps this would make the UI too cluttered?).

Side Note: I am currently working on digitizing 12 years of a journal (~ 2 million words). The perfect size to put Calibre's Editor through some serious testing!

Now, all we need is the fantastic Reports functionality to come over to Calibre's Editor.

Last edited by Tex2002ans; 07-08-2014 at 07:41 PM.
Old 07-08-2014, 08:38 PM   #8
davidfor
Grand Sorcerer
Posts: 24,907
Karma: 47303748
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by Tex2002ans
Perhaps it can be another toggle? Include numbers, not include numbers? (Or perhaps this would make the UI too cluttered?).
Speaking of the toggles, the current build has them in a fixed line across the bottom. That makes the minimum width of the check spelling dialog about half of my screen. They need to be split over two lines or wrap somehow. Putting the word count and the "Show only misspelled words" on one line and the others on the second would make sense.

And I would like to have a way to ignore the words with numbers. An alternative to an option is to have the "Ignore" button work on multiple words. Select all the words you want to ignore and press the button once. At the moment it seems to work only on the last word selected.
Old 07-08-2014, 10:09 PM   #9
BetterRed
Talking of numbers. If a book has an index, all the page number links are flagged as errors - see attachment. If I ignore all those 'numbers' I get to watch an hourglass for anything up to 10 minutes. Also if a book has a long list of references many of the author names will be flagged as errors.

IMO an index or reference list should be in separate files, and they normally are. So, if one could exclude files from spell checking then one could deal with the index and reference files separately.

Be nice to have the ability to exclude paragraphs too - to avoid checking quotes in the original vernacular - eg Chaucer, Shakespeare etc

BR
[Attachment: Capture.JPG]

Last edited by BetterRed; 07-08-2014 at 10:11 PM.
Old 07-08-2014, 10:26 PM   #10
Tex2002ans
Quote:
Originally Posted by BetterRed
IMO an index or reference list should be in separate files, and they normally are. So, if one could exclude files from spell checking then one could deal with the index and reference files separately.
This is actually a fantastic idea.... ALTHOUGH, Indexes typically have a large amount of typos in my experience, so you may not want to ignore the file completely.

(Typically names spelled wrong, missing accents in names/words, etc. etc.)

Quote:
Originally Posted by BetterRed
Be nice to have the ability to exclude paragraphs too - to avoid checking quotes in the original vernacular - eg Chaucer, Shakespeare etc
Maybe something along the lines of Sigil's "sigil_not_in_toc", maybe you could mark that p or blockquote with a class like "calibre_ignore_spellcheck".
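
To show what such a marker class could mean in practice, here is a minimal sketch using Python's standard `html.parser`. The class name "calibre_ignore_spellcheck" is only the hypothetical one proposed above, not an existing calibre feature, and the code assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

IGNORE_CLASS = "calibre_ignore_spellcheck"  # hypothetical marker class
# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class SpellcheckText(HTMLParser):
    """Collect text, skipping everything inside elements with IGNORE_CLASS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an ignored element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or IGNORE_CLASS in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def text_for_spellcheck(markup):
    parser = SpellcheckText()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


html = ('<p>Plain text.</p>'
        '<blockquote class="calibre_ignore_spellcheck">Whan that <i>Aprille</i></blockquote>'
        '<p>More text.</p>')
print(text_for_spellcheck(html))  # Plain text. More text.
```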
Old 07-08-2014, 10:40 PM   #11
kovidgoyal
@davidfor: You can ignore multiple words by selecting them and right clicking. The buttons only operate on a single word at a time.
Old 07-08-2014, 10:56 PM   #12
BetterRed
Quote:
Originally Posted by Tex2002ans
This is actually a fantastic idea.... ALTHOUGH, Indexes typically have a large amount of typos in my experience, so you may not want to ignore the file completely.

(Typically names spelled wrong, missing accents in names/words, etc. etc.)
I wouldn't ignore - I'd initially exclude index, reference, content.opf, toc, list of figures etc - ie focus on the body of the book

And then deal with the others in a separate pass(es) and maybe exclude the body of the book. My thinking is that the file exclusions would not persist between sessions.

Quote:
Originally Posted by Tex2002ans
Maybe something along the lines of Sigil's "sigil_not_in_toc", maybe you could mark that p or blockquote with a class like "calibre_ignore_spellcheck".
I hadn't considered how 'exclude some paragraphs' would be done; for me it would be a nice-to-have.

Currently I ignore 'misspellings' in Shakespeare et al quotes; but t'would be most felicitous to do otherwise. In Word you can exclude blocks from its spell checker; I think they persist until you do a spelling check reset on the document.

But I repeat, for me it's a nice to have.

BR
Old 07-08-2014, 11:01 PM   #13
eschwartz
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
Originally Posted by kovidgoyal
This is a limitation of ICU, it breaks words on hyphens, even though its documentation claims it shouldn't. It is on my TODO list to see if I can implement an efficient workaround.
Fixed! https://github.com/kovidgoyal/calibr...4d019425b20677
Old 07-08-2014, 11:21 PM   #14
Tex2002ans
While we are discussing minor tweaks to Spellcheck:

Suggestion: Would it be possible to skip spell checking of the text inside links?

Quote:
For additional discussion, see Sowell (2004). The Forbes 400 can be found online at <a href="http://www.forbes.com/lists/2006/54/biz_06rich400_The-400-Richest-Americans_Rank.html">http://www.forbes.com/lists/2006/54/biz_06rich400_The-400-Richest-Americans_Rank.html</a>. It is also notable that many of the members of the Forbes 400 are also innovators in service-oriented businesses.
Again, maybe a toggle (or I am really liking this class idea)? hahaha
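
A simpler alternative to a toggle or a class, sketched in Python under the assumption that the problem text is a visible URL like the one quoted above: strip anything that looks like a URL from the text before it is split into words. The `strip_urls` helper is hypothetical.

```python
import re

# Anything starting with http:// or https:// up to the next whitespace.
URL = re.compile(r"https?://\S+")


def strip_urls(text):
    """Replace URL-like runs with a space so their pieces are never spell checked."""
    return URL.sub(" ", text)


text = ("The Forbes 400 can be found online at "
        "http://www.forbes.com/lists/2006/54/biz_06rich400_The-400-Richest-Americans_Rank.html. "
        "It is also notable that many of the members are innovators.")
print(strip_urls(text))
```

This also swallows the punctuation attached to the end of the URL, which does not matter for a word list but would matter if the text were written back to the book.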
Old 07-08-2014, 11:57 PM   #15
davidfor
Quote:
Originally Posted by kovidgoyal
@davidfor: You can ignore multiple words by selecting them and right clicking. The buttons only operate on a single word at a time.
Yes, I know. But I've had cases where there were groups of words I wanted to ignore. I'm just being lazy.