Quote:
Originally Posted by AlexBell
Thanks for all your responses - I've been so busy proofreading that I haven't been back on the forum. I obviously have a lot to read through and digest.
|
You still have to tell us all those errors in your books!
Quote:
Originally Posted by Doitsu
If you want to experiment with the ngram spellcheck feature, you'll need to create a folder with an en subfolder in it and extract the ngram data files to that en folder.
|
I'll have to do that some time in the future. Will definitely keep your plugin on my radar and run it on old books + see if I can point out any errors that it misses.
Quote:
Maybe more folks can share their "little lists" for the edification of us all.
|
I have been meaning to put together one of my "lists" for so long. Maybe in the coming weeks I will have to gather the info and actually do something about it this time.
Most of the information I have directly on hand is all of the actual book typos I have come across over the years.
I stopped writing down OCR errors so many years ago, and now could probably only gather them with code comparisons between EPUB versions as I worked on them.
Quote:
1 l I i ! <--> each other
{digit One, lowercase L, uppercase i, lowercase i, exclamation mark}
|
Speaking of my "I963" -> "1963" example, yesterday I caught "J969". There was a speck of dust in the PDF scan at the bottom left of the "1", which caused it to OCR as "J". It reminded me that I have seen this just due to normal OCR, although it is quite rare.
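For anyone who wants to hunt for this class outside of Sigil/Calibre, here is a minimal Python sketch. It assumes the only lookalikes worth checking are "I", "l" and "J" standing in for the leading "1" of a four-digit year; the function name find_suspect_years is made up for this example.

```python
import re

# Assumption: a "year" is a lookalike letter (I, l, J) followed by exactly
# three digits, standing alone as its own token.
YEAR_LOOKALIKE = re.compile(r"\b[IlJ]([0-9]{3})\b")

def find_suspect_years(text):
    """Return (found, suggested) pairs such as ("I963", "1963")."""
    return [(m.group(0), "1" + m.group(1)) for m in YEAR_LOOKALIKE.finditer(text)]

if __name__ == "__main__":
    sample = "Born in I963, moved in J969, retired in 1999."
    for found, suggested in find_suspect_years(sample):
        print(found, "->", suggested)
```

It only lists candidates, so each one still needs a human look before replacing.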
Quote:
U = double ell, li, il
WeU = Well
Ufe = life
untU = until
|
Typically when you OCR a book this entire "class" pops up, so you can easily spot it. If this occurs, I typically just type a capital "U" into Sigil/Calibre Spellcheck.
There probably aren't many actual words in the book with a capital U in them, so they stick out like a sore thumb... especially if you sort the Spellcheck List by Case Sensitive Sort. Anything that starts with a lowercase letter and has an uppercase "U" in it is a mistake 99% of the time.
Side Note: That type of search is better in Calibre's Spellcheck because you can do a Case Sensitive Search.
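The "starts lowercase, contains an uppercase U" rule can also be sketched in Python if you are working on the raw text. This is just an illustration, not what Sigil or Calibre do internally, and the name find_suspect_u_words is invented for the example.

```python
import re

# A word that begins with one or more lowercase letters and then contains
# an uppercase "U" somewhere after them.
SUSPECT_U = re.compile(r"\b[a-z]+U[A-Za-z]*\b")

def find_suspect_u_words(text):
    """Return the unique suspect words, sorted case-sensitively."""
    return sorted(set(SUSPECT_U.findall(text)))

if __name__ == "__main__":
    sample = "He waited untU the beU rang. The USA and Ufe."
    print(find_suspect_u_words(sample))
```

Note that this deliberately skips words like "WeU" and "Ufe" that start with a capital, so those still need the plain capital "U" search.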
Quote:
Space following opening quote mark
Space preceding closing quote or punctuation mark.
He did this ; then he did that ; then he said : “ You aren’t ready ! ”
|
You also want to pay attention to spaces before/after slashes. Quite often an error might creep in such as "and /or" + "and/ or".
Side Note: I even caught this in quite a few InDesign files as well. This is an easy error to slip by even in purely digital files.
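A rough Python sketch of that slash check, assuming you only care about a single space on one side of a slash between two words (the name find_spaced_slashes is hypothetical):

```python
import re

# Either "word /word" or "word/ word". A slash with no spaces is left alone.
SPACED_SLASH = re.compile(r"\w+ /\w+|\w+/ \w+")

def find_spaced_slashes(text):
    """Return every word pair with a stray space next to the slash."""
    return SPACED_SLASH.findall(text)

if __name__ == "__main__":
    sample = "Use and /or here, and/ or there, but and/or is fine."
    print(find_spaced_slashes(sample))
```

Slashes that are intentionally spaced on both sides (like line breaks in quoted poetry) are not matched, which is usually what you want.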
Quote:
Apostrophe goes missing, stranding the last letter
I m = I’m, don t = don’t, Bob s = Bob’s
|
I typically run this Regex to catch every lowercase letter standing by itself, other than "a":
Search: \s[b-z]\s
Similarly, I run this one to catch every capital letter standing by itself, other than "A" or "I":
Search: \s[B-HJ-Z]\s
Those basic Regexes do miss the odd case where the stray letter sits right next to an HTML tag, though. So it would miss:
<p>B ob said to go outside!</p>
or:
<p><i>Then </i>S uzy told Bob to jump over the fence.</p>
But if the book is riddled with them, then I make sure to look much more closely (those typically get caught in other passes, or I just write up a custom Regex to catch that error).
Side Note: I don't use the capitals one too often because many of the books I work on have text along these lines: "Product C and Product D" + "Person X and Y".
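One possible "custom Regex" for the tag case, sketched in Python. The idea (my assumption, not a tested production pattern) is to accept either whitespace, a closing ">" of a tag, or the start of the text in front of the stray letter, while still requiring whitespace after it. The name find_stray_letters is made up.

```python
import re

# (?<![^\s>]) means: the previous character, if any, is whitespace or ">".
# That lets the letter sit directly after a tag such as <p> or </i>.
STRAY_LOWER = re.compile(r"(?<![^\s>])[b-z](?=\s)")      # everything but "a"
STRAY_UPPER = re.compile(r"(?<![^\s>])[B-HJ-Z](?=\s)")   # everything but "A" and "I"

def find_stray_letters(html):
    """Return (lowercase_hits, uppercase_hits) as two lists of letters."""
    return STRAY_LOWER.findall(html), STRAY_UPPER.findall(html)

if __name__ == "__main__":
    sample = (
        "<p>B ob said to go outside!</p>\n"
        "<p><i>Then </i>S uzy told Bob to jump over the fence.</p>\n"
        "<p>I m sure he didn t.</p>"
    )
    lower, upper = find_stray_letters(sample)
    print("lowercase:", lower)
    print("uppercase:", upper)
```

Like the original searches, this needs whitespace after the letter, so "didn t." (letter followed by a period) still slips through and would need its own pattern.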
Quote:
Reversed single and double quotes in nested quotations:
“And I said to him, ‘Quit that!”’
‘“O what a tangled web we weave,’” she said.
|
This is also a Search/Replace that I use:
Search: ‘“
Replace: “‘
Search: ”’
Replace: ’”
Although use those on a case-by-case basis (don't just do a huge Replace All).
Side Note: Quotation marks typically require some scrutiny, because there are a huge number of actual book typos that have crept in due to wrong nesting. As a related note, I found that parentheses + brackets follow the same rules, and also have a relatively large number of nesting errors. This was an entire class of errors that I missed until I used Toxaris's "Dialogue Check" (Pure Regex is not as good).
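Since these swaps should be reviewed one at a time, a small Python sketch that only reports the candidates (and the suggested replacement) might be safer than a blind replace. The function name find_reversed_quotes is hypothetical, and the two pairs are just the ones from the Search/Replace above.

```python
import re

# Suspect pair -> suggested replacement. Books that use single quotes as the
# OUTER quotes will produce false positives, so nothing is replaced here.
SWAPS = {
    "‘“": "“‘",
    "”’": "’”",
}

def find_reversed_quotes(text):
    """Return (position, found, suggested) tuples, sorted by position."""
    hits = []
    for wrong, right in SWAPS.items():
        for m in re.finditer(re.escape(wrong), text):
            hits.append((m.start(), wrong, right))
    return sorted(hits)

if __name__ == "__main__":
    sample = "“And I said to him, ‘Quit that!”’"
    for pos, wrong, right in find_reversed_quotes(sample):
        print(pos, wrong, "->", right)
```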
Quote:
’ Right single quote should replace "straight" apostrophe, not ‘ Left single quote. Happens often at start of a word:
‘em should be ’em, ‘tis should be ’tis
|
This is the Regex I use:
Search: ‘(Em|em|Til|til|Tis|tis|Twas|twas)
Replace: ’\1
Related is the RIGHT single quote before shortened years:
Search: ‘([0-9])
Replace: ’\1
or the RIGHT single quote before + after the "n":
Rock ’n’ Roll
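The two Search/Replace pairs above (elided words and shortened years) translate fairly directly to Python. This is only a sketch using the same patterns; the function name fix_leading_apostrophes is invented, and the "n" case is left out because it needs a quote on both sides.

```python
import re

# Same patterns as the Search boxes above. There is no word boundary after
# the group, so a quotation that opens with a word like "Tissue" or
# "emerald" would also be changed: review the results.
ELISION = re.compile(r"‘(Em|em|Til|til|Tis|tis|Twas|twas)")
SHORT_YEAR = re.compile(r"‘([0-9])")

def fix_leading_apostrophes(text):
    """Swap a LEFT single quote for a RIGHT one before elisions and years."""
    text = ELISION.sub(r"’\1", text)
    return SHORT_YEAR.sub(r"’\1", text)

if __name__ == "__main__":
    print(fix_leading_apostrophes("‘Tis the class of ‘63, so give ‘em a hand."))
```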