Old 07-11-2016, 05:42 AM   #22
Tex2002ans
Quote:
Originally Posted by AlexBell
Thanks for all your responses - I've been so busy proofreading that I haven't been back on the forum. I obviously have a lot to read through and digest.
You still have to tell us all those errors in your books!

Quote:
Originally Posted by Doitsu
If you want to experiment with the ngram spellcheck feature, you'll need to create a folder with an en subfolder in it and extract the ngram data files to that en folder.
I'll have to do that some time in the future. Will definitely keep your plugin on my radar and run it on old books + see if I can point out any errors that it misses.

Quote:
Maybe more folks can share their "little lists" for the edification of us all.
I have been meaning to put together one of my "lists" for so long. Maybe in the coming weeks I will have to gather the info and actually do something about it this time.

Most of the information I have directly on hand is all of the actual book typos I have come across over the years.

I stopped writing down OCR errors many years ago, and now I could probably only gather them by comparing the code of the EPUB versions as I worked on them.

Quote:
1 l I i ! <--> each other
{digit One, lowercase L, uppercase i, lowercase i, exclamation mark}
Speaking of my "I963" -> "1963" example, yesterday I caught "J969". There was a speck of dust in the PDF scan at the bottom left of the "1", which caused it to OCR as "J". It reminded me that I have seen this just due to normal OCR, although it is quite rare.
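If you wanted to hunt for this class outside of Sigil/Calibre, here is a rough Python sketch. The pattern (a capital I, capital J, or lowercase L followed by exactly three digits) is my own guess based on the "I963" and "J969" examples, and the function name is made up for illustration. It only covers four-character years, so treat it as a starting point.

```python
import re

# Assumption: a year-like token is one look-alike letter (I, J, l)
# followed by exactly three digits, e.g. "I963" or "J969".
YEAR_LIKE = re.compile(r"\b[IJl](\d{3})\b")

def find_suspect_years(text):
    """Return (found, suggested) pairs such as ("I963", "1963")."""
    # The suggestion simply swaps the leading letter for the digit "1".
    return [(m.group(0), "1" + m.group(1)) for m in YEAR_LIKE.finditer(text)]

if __name__ == "__main__":
    for found, suggested in find_suspect_years("Published in I963, reprinted J969."):
        print(found, "->", suggested)
```

The suggestions still need a human look, since something like a part number "J969" could be legitimate.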

Quote:
U = double ell, li, il
WeU = Well
Ufe = life
untU = until
Typically when you OCR a book this entire "class" pops up, so you can easily spot it. If this occurs, I typically just type a capital "U" into the Sigil/Calibre Spellcheck search.

There probably aren't many actual words in the book with a capital U in them, so they stick out like a sore thumb... especially if you sort the Spellcheck List by Case Sensitive Sort. Anything that starts with a lowercase letter and has an uppercase "U" in it is a mistake 99% of the time.

Side Note: That type of search is better in Calibre's Spellcheck because you can do a Case Sensitive Search.
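The "starts lowercase, contains a capital U" rule of thumb can also be sketched as a small Python script if you have the text outside the editor. The pattern and function name below are my own illustration, not something built into Sigil or Calibre. It deliberately ignores words that start with a capital (so "WeU" and "Ufe" still need the Spellcheck list), because those cannot be told apart from real words like "United" by case alone.

```python
import re

# Assumption: a suspect word starts with a lowercase letter and has an
# uppercase "U" somewhere after it (e.g. "untU" for "until").
SUSPECT_U = re.compile(r"\b[a-z][A-Za-z]*U[A-Za-z]*\b")

def find_suspect_u_words(text):
    """Return the unique suspect words, sorted for easy review."""
    return sorted(set(SUSPECT_U.findall(text)))

if __name__ == "__main__":
    sample = "WeU, Ufe went on untU the faU of the United States."
    print(find_suspect_u_words(sample))
```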

Quote:
Space following opening quote mark
Space preceding closing quote or punctuation mark.
He did this ; then he did that ; then he said : “ You aren’t ready ! ”
You also want to pay attention to spaces before/after slashes. Quite often an error creeps in such as "and /or" or "and/ or".

Side Note: I even caught this in quite a few InDesign files as well. This is an easy error to slip by even in purely digital files.
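A rough Python sketch of a check for the slash case is below. The pattern is my own assumption about what counts as suspicious: a word, a slash, and another word, with whitespace on at least one side of the slash. It will also flag a deliberately spaced "and / or", so the hits are meant for review, not for a blind replace.

```python
import re

# Assumption: flag "word /word", "word/ word", and "word / word",
# but leave a tight "word/word" alone.
SLASH_SPACE = re.compile(r"\w+\s+/\s*\w+|\w+\s*/\s+\w+")

def find_spaced_slashes(text):
    """Return every word/slash/word run that has stray whitespace."""
    return SLASH_SPACE.findall(text)

if __name__ == "__main__":
    sample = "Use and /or here, and/ or there, but and/or is fine."
    print(find_spaced_slashes(sample))
```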

Quote:
Apostrophe goes missing, stranding the last letter
I m = I’m, don t = don’t, Bob s = Bob’s
I typically run this Regex to catch all lowercase letters that are by themselves that are not "a":

Search: \s[b-z]\s

Similarly, I run this one too to catch all capital letters that are by themselves that are not "A" or "I":

Search: \s[B-HJ-Z]\s

Those basic Regexes do miss the odd case where the stray letter sits right next to an HTML tag, because the tag's "<" or ">" takes the place of the whitespace the pattern expects. So it would miss:

<p>B ob said to go outside!</p>

or:

<p><i>Then </i>S uzy told Bob to jump over the fence.</p>

But if the book is riddled with them, then I make sure to look much more closely (those typically get caught in other passes, or I just write up a custom Regex to catch that error).

Side Note: I don't use the capitals one too often because many of the books I work on have text along these lines: "Product C and Product D" + "Person X and Y".
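As one possible shape for such a custom Regex, here is a Python sketch that treats a tag edge (">" before the letter, "<" after it) the same as whitespace. This is my own variation on the two searches above, not a pattern from Sigil or Calibre, and the function name is made up. It will not catch a stray letter at the very start of the file, since the lookbehind needs a preceding character.

```python
import re

# Assumption: a stray letter is bounded by whitespace OR a tag edge.
# Lookarounds keep the match to the single letter itself.
LONE_LOWER = re.compile(r"(?<=[\s>])[b-z](?=[\s<])")
LONE_UPPER = re.compile(r"(?<=[\s>])[B-HJ-Z](?=[\s<])")

def find_lone_letters(html, context=10):
    """Return (letter, surrounding snippet) pairs for manual review."""
    hits = []
    for pattern in (LONE_LOWER, LONE_UPPER):
        for m in pattern.finditer(html):
            start = max(0, m.start() - context)
            hits.append((m.group(0), html[start:m.end() + context]))
    return hits

if __name__ == "__main__":
    for letter, snippet in find_lone_letters("<p><i>Then </i>S uzy told Bob.</p>"):
        print(repr(letter), "in", repr(snippet))
```

The capitals pattern has the same "Product C and Product D" false positives as the plain search, so I would only run it on books where that kind of text is rare.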

Quote:
Reversed single and double quotes in nested quotations:
“And I said to him, ‘Quit that!”’
‘“O what a tangled web we weave,’” she said.
This is also a Search/Replace that I use:

Search: ‘“
Replace: “‘

Search: ”’
Replace: ’”

Use those on a case-by-case basis, though (don't just do a huge Replace All).
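To make the case-by-case review easier outside the editor, a small Python sketch like the one below could list each hit with some context and the suggested swap, without changing anything. The function name and the context width are my own choices for illustration.

```python
import re

# The two suspect pairs from the searches above, mapped to the suggested fix.
SUSPECT_PAIRS = {"‘“": "“‘", "”’": "’”"}

def list_suspect_quotes(text, context=20):
    """Return (found, suggestion, snippet) tuples; nothing is replaced."""
    hits = []
    for found, suggestion in SUSPECT_PAIRS.items():
        for m in re.finditer(re.escape(found), text):
            start = max(0, m.start() - context)
            hits.append((found, suggestion, text[start:m.end() + context]))
    return hits

if __name__ == "__main__":
    sample = "“And I said to him, ‘Quit that!”’"
    for found, suggestion, snippet in list_suspect_quotes(sample):
        print(found, "->", suggestion, "|", snippet)
```

A hit is only a candidate: a book that uses single quotes as the outer level can legitimately open with ‘“.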

Side Note: Quotation marks typically require some scrutiny, because a huge number of actual book typos have crept in due to wrong nesting. As a related note, I found that parentheses + brackets follow the same rules, and also have a relatively large number of nesting errors. This was an entire class of errors that I missed until I used Toxaris's "Dialogue Check" (pure Regex is not as good).

Quote:
’ Right single quote should replace "straight" apostrophe, not ‘ Left single quote. Happens often at start of a word:
‘em should be ’em, ‘tis should be ’tis
This is the Regex I use:

Search: ‘(Em|em|Til|til|Tis|tis|Twas|twas)
Replace: ’\1

Related is the RIGHT single quote before shortened years:

Search: ‘([0-9])
Replace: ’\1

or the RIGHT single quote before + after the "n":

Rock ’n’ Roll
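For anyone scripting this, the two Search/Replace pairs above translate to Python roughly as follows. The patterns are copied as-is from the post; the function name is made up. Keep in mind that neither pattern has a word boundary, so a genuine opening quote before a word like ‘emergency’ or before a number would be changed too. I would check the results the same way as with a manual Replace.

```python
import re

# Same patterns as the Search lines above, with the left single quote
# swapped for a right single quote in the replacement.
ELISIONS = re.compile(r"‘(Em|em|Til|til|Tis|tis|Twas|twas)")
SHORT_YEARS = re.compile(r"‘([0-9])")

def fix_leading_apostrophes(text):
    """Apply both replacements and return the corrected text."""
    text = ELISIONS.sub(r"’\1", text)
    return SHORT_YEARS.sub(r"’\1", text)

if __name__ == "__main__":
    print(fix_leading_apostrophes("Get ‘em back in ‘69, ‘tis true."))
```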

Last edited by Tex2002ans; 07-11-2016 at 05:52 AM.