Quote:
Originally Posted by phossler
I think I'll need two passes
|
Yep, that sounds about right.
Quote:
Originally Posted by retiredbiker
For the "Thomas J. Beale" sorts of case, ([a-z])\. ([A-Z])\. ([A-Z]) with match case checked should work OK. Your other searches look OK, but I'd do "find and replace" throughout the book for any of them rather than "replace all". In your "A. B. Charles" case, I think you want another nbsp to go between the B and C in the replace string.
|
Agreed.
That's roughly the regex I use as well, except I use it to normalize "First. Middle." initials into a single chunk:
F. A. Hayek -> F.A. Hayek
W. E. B. Du Bois -> W.E.B. Du Bois
or normalizing states/acronyms/times:
C. A. -> C.A.
N. Y. C. -> N.Y.C.
A. M. -> A.M.
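A rough sketch of that kind of normalization, in Python purely for illustration. The pattern and the name join_initials are assumptions, not the exact regex from this thread: it drops the space after a single capital + period only when another capital + period follows, so the last initial keeps its space before the surname.

```python
import re

# Hypothetical pattern: one capital letter + ". ", but only when the next
# thing is another single capital followed by a period (checked via lookahead).
INITIAL_BEFORE_INITIAL = re.compile(r"\b([A-Z])\. (?=[A-Z]\.)")

def join_initials(text):
    """Collapse 'F. A. Hayek' into 'F.A. Hayek' (names, acronyms, A.M./P.M.)."""
    return INITIAL_BEFORE_INITIAL.sub(r"\1.", text)

for sample in ["F. A. Hayek", "W. E. B. Du Bois", "N. Y. C.", "A. M."]:
    print(sample, "->", join_initials(sample))
```

It will still misfire on things like a sentence ending in a single capital letter right before another initial, so the same "look at each hit" caveat applies.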
Quote:
Originally Posted by retiredbiker
You are on the right track, but in this case I can't imagine a search where you would want to run a "replace all". The possibilities of non-name text fitting the search are pretty large no matter what. No regex string can tell if a word is a person's name.
|
Exactly. Needs to be looked at on a case-by-case basis.*
Regex alone is "too dumb". To lower the errors, you'd need something that can actually parse the sentence structure.
Antidote is a grammar checker, and is the only one I know of that can detect/combine First + Middle + Last Name (along with units + dates/times + [...]).
See their list of space detections:
https://documentation.antidote.info/...s/spaces-panel
Antidote was designed for French first, where "non-breaking thin spaces" are used all over the place around punctuation.
Side Note: I wrote a detailed analysis of Antidote in:
I also discussed a few similar regexes over the years (like ALL CAPS->Smallcaps or Roman Numerals):
Side Note #2: You may also be able to hackishly use Spellcheck Lists:
I explained multiple methods to combine "e m p h a s i s" into "emphasis".
* Note: What I wrote in Post #12 in the topic above still applies:
Quote:
Originally Posted by Tex2002ans
You can use Regex to do the vast bulk of the corrections, then manually fix the edge cases.
Better/faster to do:
- 95% correct with a 2-step regex.
- 5% manually find/correct/fix.
than:
|
So it's up to you where you want to spend your time and do your fixing.
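One way to keep the manual part manageable is to list every hit with some surrounding text and decide on each one, instead of running "replace all". A minimal sketch, assuming Python and a made-up pattern for "two initials + capitalized word" (NAME_INITIALS and candidates are hypothetical names, not something from this thread):

```python
import re

# Hypothetical pattern: two spaced initials followed by a capitalized word.
NAME_INITIALS = re.compile(r"\b([A-Z])\. ([A-Z])\. ([A-Z][a-z]+)")

def candidates(text, pattern=NAME_INITIALS, width=20):
    """Yield (offset, snippet) for each match so a human can accept or skip it."""
    for m in pattern.finditer(text):
        lo = max(0, m.start() - width)
        hi = min(len(text), m.end() + width)
        yield m.start(), text[lo:hi]

sample = "The economist F. A. Hayek wrote it."
for offset, snippet in candidates(sample):
    print(offset, repr(snippet))
```

The regex only finds the candidates; whether a hit is really a person's name is still a judgment call made by whoever reads the snippet.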
Quote:
Originally Posted by phossler
Eventually I'd like to extend this to joining dates. Maybe some others
March 22, 2021 --> Marchnbs22,nbs2021
|
To detect dates, I use these:
Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. (\d)
Search: (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}),
They can be adjusted as needed.
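As a hedged sketch of how those two searches could be turned into replacements, here is a Python version that swaps the ordinary spaces for non-breaking spaces (U+00A0). The optional four-digit year group and the names join_dates, FULL_DATE and ABBR_DATE are assumptions added for illustration; the searches above only detect the dates.

```python
import re

NBSP = "\u00a0"  # non-breaking space

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
ABBREVS = "Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec"

# Full month + day + comma, with an optional 4-digit year (the year is an assumption).
FULL_DATE = re.compile(r"\b(" + MONTHS + r") (\d{1,2}),(?: (\d{4})\b)?")
# Abbreviated month + period + first digit of the day.
ABBR_DATE = re.compile(r"\b(" + ABBREVS + r")\. (\d)")

def join_dates(text):
    """Replace the spaces inside dates with non-breaking spaces."""
    def fix_full(m):
        out = m.group(1) + NBSP + m.group(2) + ","
        if m.group(3):  # year present: glue it on as well
            out += NBSP + m.group(3)
        return out

    text = FULL_DATE.sub(fix_full, text)
    return ABBR_DATE.sub(lambda m: m.group(1) + "." + NBSP + m.group(2), text)

print(repr(join_dates("March 22, 2021")))
```

In an EPUB you would more likely write the entity &#160; into the replace field than a raw character, but the grouping logic is the same.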