02-11-2023, 04:30 PM   #1
enuddleyarbl
Guru
Posts: 734
Karma: 1077122
Join Date: Sep 2013
Device: Kobo Forma
Finding Series of Capitalized Words?

Well, I thought I remembered a thread here about finding sequences of all capitalized words. But, I can't find it. So, I'll just start one.

In the Calibre Editor, I'm trying to find and select those sequences of all capitalized words that publishers sometimes stick in the first line of a chapter (or perhaps after some kind of scene break). Here's my current best shot (the Case-Sensitive box has to be checked for this):
Code:
([A-Z0-9]+(?:\s[A-Z0-9\.,…’“”!?—-]+)+\b)
This is what I think that does:
  • [A-Z0-9]+ - The sequence of words should start with either a capital letter or a digit. Keep matching capital letters or digits until you can't. ASSUME THAT'S A WORD.
  • (?:\s[A-Z0-9\.,…’“”!?—-]+) - Non-capturing group: match one whitespace character (assumed to be where the previous word ended), then one or more capital letters, digits, or the listed punctuation marks, stopping at the first character not in that list (most likely a space), i.e., look for one more word.
  • + - Repeat the non-capturing group one or more times, finding further such words until we hit a lowercase letter or some, so far, unspecified punctuation. IOW, we've run out of the sequence of capitalized words.
  • \b - End on a word boundary. Otherwise, the selection could continue on through to a single capitalized letter in a following multi-case word.
This MOSTLY works. The problem is that if the first word is followed directly by a punctuation mark instead of a space, it doesn't get picked up. If I change the character class for the first word to include punctuation, then it picks up things like ". C" and ", C" as the START of the sequence of words. Similarly, if the first word contains punctuation (like an apostrophe), the first part of the word doesn't get picked up: the selection starts after that punctuation.
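The behaviour described above can be reproduced outside the editor. This is a minimal sketch using Python's standard `re` module, on the assumption that it behaves like the Calibre Editor's engine for this pattern; the sample sentences and the `first_match` helper are made up for illustration.

```python
import re

# The pattern from the post. Python's re is case-sensitive by default,
# which corresponds to ticking the Case-Sensitive box.
PATTERN = re.compile(r"([A-Z0-9]+(?:\s[A-Z0-9\.,…’“”!?—-]+)+\b)")

def first_match(text):
    """Return the first run the pattern selects, or None if there is none."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

samples = [
    "THE QUICK BROWN fox jumped.",  # plain case: the whole run is selected
    "WAIT, WHAT IS that?",          # first word followed by a comma: dropped
    "DON’T LOOK NOW, he said.",     # apostrophe inside the first word
]
for s in samples:
    print(repr(s), "->", repr(first_match(s)))
# 'THE QUICK BROWN fox jumped.' -> 'THE QUICK BROWN'
# 'WAIT, WHAT IS that?' -> 'WHAT IS'
# 'DON’T LOOK NOW, he said.' -> 'T LOOK NOW'
```

The second and third samples show the two failure modes: the leading `[A-Z0-9]+` must be followed immediately by whitespace, so a first word that carries any punctuation cannot start the match.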

Can anyone come up with a way to include first words with punctuation? This is better than a sharp stick in the eye (i.e., better than manually finding and retyping the start of every chapter). But, I'd like to work through this.
02-11-2023, 05:24 PM   #2
JSWolf
Resident Curmudgeon
Posts: 74,045
Karma: 129333562
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Or just learn to deal with it as it's done a lot.
02-12-2023, 02:58 AM   #3
Jellby
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
So, what is a word (for your purposes)? I think it's something like:

Code:
[A-Z0-9][A-Z0-9\.,…’“”!?—-]*
A letter/digit followed by any number (as many as possible, possibly zero) of letter/digit/punctuation. You may want to include “ and ‘ in the "initial" class, and maybe & along with all the letters.

Now you want a number of words separated by spaces. How about (untested):

Code:
({word}\s+)+\b
i.e.
Code:
([A-Z0-9][A-Z0-9\.,…’“”!?—-]*\s+)+\b
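For anyone who wants to try the word-based idea before pasting it into the editor, here is a rough sketch in Python's `re` module. It assumes the word-plus-whitespace group is meant to repeat one or more times, and it wraps the repetition in an outer capturing group so the whole run can be read back. The names `WORD`, `RUN`, and `find_runs` are invented for this example.

```python
import re

# "Word" as defined in the post: a capital letter or digit, then any number
# of capitals, digits, or the listed punctuation marks.
WORD = r"[A-Z0-9][A-Z0-9\.,…’“”!?—-]*"

# One or more words, each followed by whitespace, ending at a word boundary.
RUN = re.compile(rf"((?:{WORD}\s+)+)\b")

def find_runs(text):
    """Return every matched run with the trailing whitespace stripped."""
    return [m.group(1).rstrip() for m in RUN.finditer(text)]

print(find_runs("WAIT, WHAT IS that?"))       # ['WAIT, WHAT IS']
print(find_runs("DON’T LOOK NOW, he said."))  # ['DON’T LOOK NOW,']
print(find_runs("I saw the FBI agent."))      # ['I', 'FBI']
```

Because punctuation is now allowed inside and at the end of every word, the first word is no longer lost. The trade-offs are that the raw match includes the trailing whitespace (and any trailing punctuation), and that lone capitals such as "I" or a single acronym also match, so each hit still needs a manual look.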
02-12-2023, 09:09 AM   #4
Quoth
the rook, bossing Never.
Posts: 11,173
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
And there are edge cases like FBI, NATO etc.

I personally gave up and just find each chapter start and manually edit. I'll remove stuff for small caps and drop caps at the start of a chapter (or anywhere) automatically as that seems safe.

I found in the word processor that applying Sentence Case to first paragraphs doesn't work, due to proper names and similar.
02-12-2023, 11:18 AM   #5
enuddleyarbl
I'll add in some of those punctuation marks and see if the false positives from them outweigh the false negatives of not having them.

I don't have a problem with it picking up acronyms and other capitalized words. I've got a search to go through the book looking for individual capitalized words so I can check if they should be smallcapped or otherwise formatted. This will just supplement that.

And, as you found in your word processor, I'll still have to manually check every sequence of all-cap words. Calibre doesn't have a regex function to Sentence Case things, so I use its Lower Case and Title Case functions and look for proper names and similar issues to correct. This will just make it easier to find and do the initial conversion.
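As an illustration of what a sentence-case step could look like, here is a small plain-Python sketch. It is not tied to any Calibre API; the `KEEP` table of words to restore is purely hypothetical and would have to be filled in per book, which is exactly the manual checking described above.

```python
import re

# Hypothetical table of words that must keep their capitals after lower-casing.
KEEP = {"fbi": "FBI", "nato": "NATO", "i": "I"}

def sentence_case(run, keep=KEEP):
    """Lower-case a run of capitals, restore known words, capitalize the start."""
    lowered = run.lower()

    def restore(m):
        word = m.group(0)
        return keep.get(word, word)

    # Look at each run of letters and put back acronyms and similar words.
    lowered = re.sub(r"[^\W\d_]+", restore, lowered)

    # Upper-case the first letter, skipping any opening quote marks or digits.
    for i, ch in enumerate(lowered):
        if ch.isalpha():
            return lowered[:i] + ch.upper() + lowered[i + 1:]
    return lowered

print(sentence_case("THE FBI AND NATO AGREED"))  # The FBI and NATO agreed
print(sentence_case("“WAIT,” I SAID"))           # “Wait,” I said
```

Proper names that are not in the table come out in lowercase, so the result still needs proofreading.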
02-12-2023, 12:18 PM   #6
Quoth
Good Luck!