03-28-2021, 06:44 PM   #5
Tex2002ans
Quote:
Originally Posted by phossler
I think I'll need two passes
Yep, that sounds about right.

Quote:
Originally Posted by retiredbiker
For the "Thomas J. Beale" sort of case, ([a-z])\. ([A-Z])\. ([A-Z]) with match case checked should work OK. Your other searches look OK, but I'd do "find and replace" throughout the book for any of them rather than "replace all". In your "A. B. Charles" case, I think you want another nbsp to go between the B and C in the replace string.
Agreed.

That's ~ the regex I use as well... except I use regex to normalize "First. Middle." into a single chunk:

F. A. Hayek -> F.A. Hayek
W. E. B. Du Bois -> W.E.B. Du Bois

or normalizing states/acronyms/times:

C. A. -> C.A.
N. Y. C. -> N.Y.C.
A. M. -> A.M.
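
A rough Python sketch of that kind of normalization, in case it helps. The pattern and the `join_initials` name are my own illustration (the exact regex isn't given above), so treat it as one possible way to do it: drop the space after an initial only when another initial follows.

```python
import re

# Assumed pattern (not from the post): a capital letter + period + space,
# but only when the next thing is another "X." initial.
INITIAL_GAP = re.compile(r"\b([A-Z])\. (?=[A-Z]\.)")

def join_initials(text: str) -> str:
    """Collapse 'F. A.' into 'F.A.'; the space before the surname is kept."""
    return INITIAL_GAP.sub(r"\1.", text)

for sample in ["F. A. Hayek", "W. E. B. Du Bois", "N. Y. C.", "A. M."]:
    print(sample, "->", join_initials(sample))
# F. A. Hayek -> F.A. Hayek
# W. E. B. Du Bois -> W.E.B. Du Bois
# N. Y. C. -> N.Y.C.
# A. M. -> A.M.
```

A single initial like "Thomas J. Beale" is left alone, because nothing after "J. " looks like another initial. It will still misfire on things like a sentence ending in a capital letter followed by another initial, so step through the matches rather than replacing blindly.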

Quote:
Originally Posted by retiredbiker
You are on the right track, but in this case I can't imagine a search where you would want to run a "replace all". The possibilities of non-name text fitting the search are pretty large no matter what. No regex string can tell if a word is a person's name.
Exactly. Needs to be looked at on a case-by-case basis.*

Regex alone is "too dumb". To lower the errors, you'd need something that can actually parse the sentence structure.

Antidote is a grammar checker, and is the only one I know of that can detect/combine First + Middle + Last Name (along with units + dates/times + [...]).

See their list of space detections:

https://documentation.antidote.info/...s/spaces-panel

Antidote was designed for French first, where "non-breaking thin spaces" are used all over the place around punctuation.

Side Note: I wrote a detailed analysis of Antidote in:

I also discussed a few similar regexes over the years (like ALL CAPS->Smallcaps or Roman Numerals):

Side Note #2: You may also be able to hackishly use Spellcheck Lists:

I explained multiple methods to combine "e m p h a s i s" into "emphasis".

* Note: What I wrote in Post #12 in the topic above still applies:

Quote:
Originally Posted by Tex2002ans
You can use Regex to do the vast bulk of the corrections, then manually fix the edge cases.

Better/faster to do:
  • 95% correct with a 2-step regex.
  • 5% manually find/correct/fix.

than:
  • 100% manually fix.
So it's up to you where you want to spend your time and do your fixing.
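
For the "5% manual" part, here is a minimal sketch of how you could list the candidates with some surrounding context and eyeball them, instead of running "replace all". The `NAME_LIKE` pattern and the `candidates` helper are hypothetical, made up for illustration only:

```python
import re

# Hypothetical pattern: one or more initials followed by a capitalized word.
NAME_LIKE = re.compile(r"\b(?:[A-Z]\. ?)+[A-Z][a-z]+")

def candidates(text, width=20):
    """Yield (match, context) pairs so each hit can be checked by hand."""
    for m in NAME_LIKE.finditer(text):
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        yield m.group(0), text[start:end]

sample = "F.A. Hayek met Thomas J. Beale. See Plan B. Then they left."
for hit, context in candidates(sample):
    print(repr(hit), "in", repr(context))
# Hits: 'F.A. Hayek', 'J. Beale', 'B. Then'
```

The last hit ("B. Then") is not a name at all, which is the kind of edge case the regex can't tell apart and a human has to skip.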

Quote:
Originally Posted by phossler
Eventually I'd like to extend this to joining dates. Maybe some others

March 22, 2021 --> Marchnbs22,nbs2021
To detect dates, I use these:

Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. (\d)

Search: (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}),

They can be adjusted as needed.
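
As a sketch of how those two searches could be paired with a replace, here they are in Python. The replacement strings are my assumption (only the Search side is given above), and this only joins the month to the day; the space between the comma and the year is not touched.

```python
import re

NBSP = "\u00a0"  # non-breaking space

# The two searches from above, compiled as-is.
ABBREV = re.compile(r"(Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. (\d)")
FULL = re.compile(
    r"(January|February|March|April|May|June|July|August|"
    r"September|October|November|December) (\d{1,2}),"
)

def join_dates(text: str) -> str:
    """Swap the space between month and day for a non-breaking space."""
    # Assumed replacements: keep both captured groups, change only the space.
    text = ABBREV.sub(lambda m: f"{m.group(1)}.{NBSP}{m.group(2)}", text)
    text = FULL.sub(lambda m: f"{m.group(1)}{NBSP}{m.group(2)},", text)
    return text

print(repr(join_dates("March 22, 2021")))  # 'March\xa022, 2021'
print(repr(join_dates("Sept. 5")))         # 'Sept.\xa05'
```

In an EPUB you would probably write `&nbsp;` or `&#160;` in the replace field rather than the raw character, so it stays visible in the code view.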

Last edited by Tex2002ans; 03-28-2021 at 07:32 PM.