02-11-2023, 04:30 PM   #1
enuddleyarbl
Guru
Posts: 777
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
Finding Series of Capitalized Words?

Well, I thought I remembered a thread here about finding sequences of all-caps words, but I can't find it. So, I'll just start one.

In the Calibre Editor, I'm trying to find and select those runs of all-caps words that publishers sometimes stick in the first line of a chapter (or perhaps after some kind of scene break). Here's my current best shot (the Case-Sensitive box has to be checked for this):
Code:
([A-Z0-9]+(?:\s[A-Z0-9\.,…’“”!?—-]+)+\b)
This is what I think that does:
  • [A-Z0-9]+ - The sequence of words should start with either a capital letter or a digit. Keep matching capital letters or digits until you can't. ASSUME THAT'S A WORD.
  • (?:\s[A-Z0-9\.,…’“”!?—-]+) - Non-capturing group: match one whitespace character (assumed to be where the previous word ended), then one or more capital letters, digits, or the listed punctuation marks. Keep going until you hit something not in that list (most likely a space), i.e., look for one more word.
  • + - Repeat the non-capturing group 1 or more times, picking up further words until we hit a lowercase letter or some, so far, unspecified punctuation. IOW, we've run out of the sequence of capitalized words. Because the group has to match at least once, a single all-caps word on its own is not selected.
  • \b - End on a word boundary. Otherwise, the selection could run on into the capitalized first letter of a following mixed-case word.
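To check that reading outside the editor, here's a minimal Python sketch. I'm assuming Python's built-in re module treats this pattern the same way the editor's search does (re is case-sensitive by default, which stands in for the Case-Sensitive box). The function name find_caps_runs and the sample sentences are made up for illustration.

```python
import re

# The pattern from the post, unchanged. Python's re is case-sensitive
# unless re.IGNORECASE is passed, matching the Case-Sensitive checkbox.
PATTERN = re.compile(r"([A-Z0-9]+(?:\s[A-Z0-9\.,…’“”!?—-]+)+\b)")


def find_caps_runs(text):
    """Return every run of two or more all-caps 'words' found in text.

    Hypothetical helper for illustration only; not part of Calibre.
    """
    return [m.group(1) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    samples = [
        "THE QUICK BROWN fox jumped",    # plain run of capitalized words
        "IT WAS THE First day",          # \b stops the run before "First"
        "CHAPTER 12 BEGINS, AND so on",  # digits and inner punctuation are fine
        "Hello WORLD again",             # a lone all-caps word is not a run
    ]
    for sample in samples:
        print(repr(sample), "->", find_caps_runs(sample))
```

The second sample is the one that needs the \b: without it, the match would swallow the "F" of "First".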
This MOSTLY works. The problem is that if the first word is followed by a punctuation mark instead of a space, it doesn't get picked up, and the selection starts at the second word. If I change the selector for the first word to include punctuation, then it picks up things like ". C" and ", C" as the START of the sequence of words. Similarly, if the first word contains punctuation (like an apostrophe), the first part of the word doesn't get picked up. The selection starts after that punctuation.
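Those failure cases can be reproduced with a small Python sketch, again assuming Python's re module behaves like the editor's search for these patterns. PUNCT_FIRST is my variant with punctuation allowed in the first word; the helper name first_match and the test sentences are invented for illustration.

```python
import re

# Original pattern: the first word may only contain capitals and digits.
ORIGINAL = re.compile(r"[A-Z0-9]+(?:\s[A-Z0-9\.,…’“”!?—-]+)+\b")

# Variant: the first word may also contain the listed punctuation.
PUNCT_FIRST = re.compile(
    r"[A-Z0-9\.,…’“”!?—-]+(?:\s[A-Z0-9\.,…’“”!?—-]+)+\b"
)


def first_match(pattern, text):
    """Return the first match of pattern in text, or None (illustrative helper)."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    # First word followed by a comma: the original skips "WELL," entirely.
    print(first_match(ORIGINAL, "WELL, THAT WAS odd"))      # THAT WAS
    # First word contains an apostrophe: the match starts after it.
    print(first_match(ORIGINAL, "DON’T LOOK BACK now"))     # T LOOK BACK
    # The variant fixes the comma case...
    print(first_match(PUNCT_FIRST, "WELL, THAT WAS odd"))   # WELL, THAT WAS
    # ...but lets stray punctuation from ordinary prose start a match.
    print(first_match(PUNCT_FIRST, "he said, I AM here"))   # , I AM
```

The last line is the unwanted behavior: the comma that ends "said" qualifies as a "first word" all by itself.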

Can anyone come up with a way to include first words with punctuation? This is better than a sharp stick in the eye (i.e., better than manually finding and retyping the start of every chapter). But, I'd like to work through this.