MobileRead Forums - View Single Post

Tex2002ans · 03-28-2021, 06:44 PM

Quote:

Originally Posted by phossler

I think I'll need two passes

Yep, that sounds about right.

Quote:

Originally Posted by retiredbiker

For the "Thomas J. Beale" sort of case, ([a-z])\. ([A-Z])\. ([A-Z]) with match case checked should work OK. Your other searches look OK, but I'd do "find and replace" throughout the book for any of them rather than "replace all". In your "A. B. Charles" case, I think you want another nbsp to go between the B and C in the replace string.

Agreed.

That's roughly the regex I use as well, except I also use regex to normalize the "First. Middle." initials into a single chunk:

F. A. Hayek -> F.A. Hayek
W. E. B. Du Bois -> W.E.B. Du Bois

or normalizing states/acronyms/times:

C. A. -> C.A.
N. Y. C. -> N.Y.C.
A. M. -> A.M.
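
For illustration, a minimal Python sketch of this kind of normalization. The function name `join_initials` and the lookahead pattern are assumptions made for this example, not the exact regex used in the post. It removes the space after a single capital letter plus period only when another "capital letter plus period" follows, so the last initial keeps its space before the surname:

```python
import re

# Hypothetical pattern: a lone capital + period + space,
# followed (lookahead, not consumed) by another capital + period.
INITIALS = re.compile(r"\b([A-Z])\. (?=[A-Z]\.)")

def join_initials(text: str) -> str:
    """Collapse 'F. A. Hayek' style initials into 'F.A. Hayek'."""
    return INITIALS.sub(r"\1.", text)

for sample in ["F. A. Hayek", "W. E. B. Du Bois", "N. Y. C.", "A. M."]:
    print(sample, "->", join_initials(sample))
```

Like any regex here, it cannot tell a name from a sentence that happens to end in a capital letter ("vitamin C. A. Smith said..."), so the matches still need a look before replacing.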

Quote:

Originally Posted by retiredbiker

You are on the right track, but in this case I can't imagine a search where you would want to run a "replace all". The possibilities of non-name text fitting the search are pretty large no matter what. No regex string can tell if a word is a person's name.

Exactly. Needs to be looked at on a case-by-case basis.*

Regex alone is "too dumb". To lower the error rate, you'd need something that can actually parse the sentence structure.

Antidote is a grammar checker, and is the only one I know of that can detect/combine First + Middle + Last Name (along with units + dates/times + [...]).

See their list of space detections:

https://documentation.antidote.info/...s/spaces-panel

Antidote was designed for French first, where "non-breaking thin spaces" are used all over the place around punctuation.

Side Note: I wrote a detailed analysis of Antidote in:

"Does Tool Exist to Spellcheck/Grammarcheck by Category?"

I also discussed a few similar regexes over the years (like ALL CAPS->Smallcaps or Roman Numerals):

"Alternate glyph support (font-variant-alternates)" (Posts #36 + #38)
"Regex examples" (Posts #588 + #590)

Side Note #2: You may also be able to hackishly use Spellcheck Lists:

Regular expression for removing blanks between letters (Posts #5 + #12)

I explained multiple methods to combine "e m p h a s i s" into "emphasis".

* Note: What I wrote in Post #12 in the topic above still applies:

Quote:

Originally Posted by Tex2002ans

You can use Regex to do the vast bulk of the corrections, then manually fix the edge cases.

Better/faster to do:

95% correct with a 2-step regex.
5% manually find/correct/fix.

than:

100% manually fix.

So it's up to you where you want to spend your time and do your fixing.
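
A rough Python sketch of the "manually find/correct" half of that split: instead of a blind "replace all", list every match with some surrounding text so each one can be checked. The helper name `review_candidates`, the `width` parameter, and the sample pattern are all assumptions for this example:

```python
import re

def review_candidates(text, pattern, width=15):
    """Return (offset, snippet) pairs for every match, for manual review."""
    hits = []
    for m in re.finditer(pattern, text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append((m.start(), text[start:end]))
    return hits

sample = "See vitamin C. A. Smith wrote it."
for offset, snippet in review_candidates(sample, r"\b[A-Z]\. [A-Z]"):
    print(offset, repr(snippet))
```

This is the scripted equivalent of stepping through "Find Next" in the editor: the regex narrows down where to look, and a person decides whether each hit is really a name.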

Quote:

Originally Posted by phossler

Eventually I'd like to extend this to joining dates. Maybe some others

March 22, 2021 --> Marchnbs22,nbs2021

To detect dates, I use these:

Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. (\d)

Search: (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}),

They can be adjusted as needed.
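
As a sketch of how those two searches could feed the date-joining replacement, here is a small Python version. The month alternations are written with no spaces around the `|` separators. The optional year group, the `join_dates` name, and the use of U+00A0 as the non-breaking space are assumptions added for this example, not part of the original searches:

```python
import re

NBSP = "\u00a0"  # non-breaking space (what &nbsp; stands for)

ABBREV = re.compile(r"(Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. (\d)")
FULL = re.compile(
    r"(January|February|March|April|May|June|July|August|September"
    r"|October|November|December) (\d{1,2}),(?: (\d{4}))?"
)

def _join_full(m):
    # Month + nbsp + day + comma, then nbsp + year only if a year was found.
    out = f"{m.group(1)}{NBSP}{m.group(2)},"
    if m.group(3):
        out += NBSP + m.group(3)
    return out

def join_dates(text: str) -> str:
    """Replace the ordinary spaces inside dates with non-breaking spaces."""
    text = ABBREV.sub(rf"\1.{NBSP}\2", text)
    return FULL.sub(_join_full, text)

print(repr(join_dates("Posted on March 22, 2021 and again on Sept. 5.")))
```

The same caveat applies as with names: "May 5," could be the start of a list rather than a date, so stepping through the matches is safer than replacing them all at once.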