#1 |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
RegEx questions for Acronyms - small issue
Current book I'm cleaning up has 100s of instances of acronyms, 2 - 6 letters, with periods between them and I'd like to just have the letters without the periods.
I made 5 pairs of saved searches, one pair each for 2, 3, 4, 5, and 6 char acronyms. Each pair has one version with a trailing space and one without, since those seem to be the common constructions, e.g. A.B.C.D.E.F.space and just A.B.C.D.E.F.

Case 1 - If there's a space, then I figure it's inside a paragraph and I just want ABCDEF blah blah
Case 2 - If there's NOT a space, then I figure it's at the end of a paragraph and I want ABCDEF. Blah blah

So for Case 1
Find: (has trailing space)
Code:
([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.
Replace:
Code:
\1\2\3\4\5\6

and for Case 2
Find: (NO trailing space)
Code:
([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.
Replace:
Code:
\1\2\3\4\5\6.

I figure that there HAS to be a more intelligent way to set up these searches, and hopefully someone can suggest an idea.

Last edited by phossler; 12-22-2021 at 09:46 PM.
#2 |
Running with scissors
Posts: 1,584
Karma: 14328510
Join Date: Nov 2019
Device: none
|
Regex has a count/number thing where you can specify how many times it matches. I've never used it and can't remember how it works off the top of my head but I think it might be something like the following for the letters and periods for 1 to 6 repetitions:
Code:
([A-Z]\.){1,6}
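For anyone who wants to try the counted repetition outside the editor, here is a minimal sketch using Python's standard `re` module. The helper name `first_match` and the sample strings are made up for illustration. Note that `{1,6}` also accepts a single initial such as "J.", which is why a lower bound of 2 may be safer for acronyms.

```python
import re

# {m,n} repeats the preceding group between m and n times (inclusive).
ACRONYM = re.compile(r"(?:[A-Z]\.){1,6}")
AT_LEAST_TWO = re.compile(r"(?:[A-Z]\.){2,6}")

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(first_match(ACRONYM, "the A.B.C. report"))
print(first_match(ACRONYM, "Mr. J. Smith"))       # a lone initial also matches
print(first_match(AT_LEAST_TWO, "Mr. J. Smith"))  # needs two letter+period pairs
```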
#3 |
Well trained by Cats
Posts: 30,988
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
|
#4 |
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
|
You want to see if your acro is at the end of a paragraph (A.B.</p>) or of a sentence (A.B. Then something), or if it is followed by a word in lowercase.
I suggest a regex function, working with this regex:
Code:
((?:\p{Lu}\.){2,})(?:(<(?:/p|/div|br/|/blockquote)>)|( \p{Lu})|( \p{Ll}))
((?:\p{Lu}\.){2,}) will capture in match.group(1) every acro with at least 2 letters (put 3 if you want to start with acros of 3 letters). The function adds a period or not, depending on what comes after your acronym. It may not cover all cases; it's up to you to check. It would be wise to polish the book first, to avoid an unexpected end of line, a space before </p>, etc.
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    acro = match.group(1).replace('.', '')
    if end := match.group(2):      # </p> or <br/> etc.
        period = '.'
    elif end := match.group(3):    # <space>[A-Z]
        period = '.'
    elif end := match.group(4):    # <space>[a-z]
        period = ''
    else:
        end = ''
        period = ''
    return acro + period + end
Last edited by lomkiri; 12-22-2021 at 04:25 PM.
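To see the three outcomes without opening the editor, here is a rough standalone sketch of the same idea using only Python's standard `re` module. It makes two assumptions: `[A-Z]` and `[a-z]` stand in for `\p{Lu}` and `\p{Ll}` (the standard module has no `\p{..}` classes), and `replace` takes just the match instead of the editor's full argument list. The wrapper `strip_acronym_periods` is a hypothetical name.

```python
import re

# ASCII stand-in for the pattern above: [A-Z] for \p{Lu}, [a-z] for \p{Ll}.
PATTERN = re.compile(
    r"((?:[A-Z]\.){2,})"
    r"(?:(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( [a-z]))"
)

def replace(match):
    acro = match.group(1).replace(".", "")
    if end := match.group(2):      # closing tag: end of paragraph
        period = "."
    elif end := match.group(3):    # space + uppercase: end of sentence
        period = "."
    elif end := match.group(4):    # space + lowercase: middle of sentence
        period = ""
    else:
        end = period = ""
    return acro + period + end

def strip_acronym_periods(text):
    """Apply the replacement to every acronym found in text."""
    return PATTERN.sub(replace, text)

print(strip_acronym_periods("<p>He joined the U.S.A.F. in 1950.</p>"))
print(strip_acronym_periods("<p>She worked for N.A.S.A. Then she left.</p>"))
print(strip_acronym_periods("<p>Made in the U.S.A.</p>"))
```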
#5 |
Running with scissors
Posts: 1,584
Karma: 14328510
Join Date: Nov 2019
Device: none
|
Doesn't the asterisk mean 0 or more? I was thinking about the space after I replied and was thinking he could use something like \s{0,1}. But I guess it doesn't matter if there is more than 1 space.
Last edited by hobnail; 12-22-2021 at 04:36 PM. |
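A quick way to check the quantifiers discussed here, sketched with Python's standard `re` module (the sample text is made up): `*` is zero or more, `?` is zero or one, and `{0,1}` behaves the same as `?`.

```python
import re

text = "A.B.C.  next"   # two spaces after the acronym

star = re.match(r"(?:[A-Z]\.)+\s*", text).group(0)         # zero or more spaces
optional = re.match(r"(?:[A-Z]\.)+\s?", text).group(0)     # zero or one space
counted = re.match(r"(?:[A-Z]\.)+\s{0,1}", text).group(0)  # same as \s?

# With no space at all, \s* still matches (zero repetitions).
no_space = re.match(r"(?:[A-Z]\.)+\s*", "A.B.C.</p>").group(0)

print(repr(star), repr(optional), repr(counted), repr(no_space))
```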
#6 |
Well trained by Cats
Posts: 30,988
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
|
#7 | |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
Quote:
Code:
([A-Z]\.){2,6}
and
([A-Z]\.){2,6}
That would Find A.B.C.<space> but I couldn't figure out how to do the Replace (trailing space) since it depends on the number of letters in the acronym, 3 in this case
Code:
\1\2\3
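The reason `\1\2\3` cannot work with this Find is that the pattern contains only one capturing group, and a repeated group keeps only its last repetition. A small sketch with Python's standard `re` module (the sample sentence and the helper `drop_periods` are made up for illustration) shows the behaviour and one possible way around it: capture the whole run of letters and periods, then strip the periods in a function.

```python
import re

text = "Call the F.B.I. today"

# A repeated capturing group remembers only its final repetition.
m = re.search(r"([A-Z]\.){2,6} ", text)
last_only = m.group(1)
group_count = m.re.groups   # the pattern has one group, so \2 and \3 do not exist

# Capture the whole acronym instead and remove the periods in a function.
def drop_periods(match):
    return match.group(1).replace(".", "") + match.group(2)

fixed = re.sub(r"((?:[A-Z]\.){2,6})( )", drop_periods, text)

print(last_only, group_count, fixed)
```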
|
#8 | |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
Quote:
This looks a LOT easier than the brute force way I tried. I can single step through and address the A.B. Lda case.
Code:
<!-- CASE 1 TRAILING SPACE FOLLOWED BY CAPITAL LETTER SO ASSUME END OF SENTENCE AND LEAVE LAST PERIOD -->
<!-- CASE 2 NO TRAILING SPACE SO ASSUME END OF PARAGRAPH AND LEAVE LAST PERIOD -->
<!-- CASE 3 HAS TRAILING SPACE SO ASSUME IN MIDDLE OF SENTENCE WITH NO PERIODS -->
Last edited by phossler; 12-22-2021 at 05:40 PM.
|
#9 | |
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
|
Quote:
Code:
((?:\p{Lu}\.)+)(?:(</(?:p|div|b/|blockquote)>)|( \p{Lu})|(' ')|(.))
(note: the regex above is wrong and was corrected in msg #13 to the one below:)
((?:\p{Lu}\.){2,})(?:\s*(<(?:/p|/div|br/|/blockquote)>)|( \p{Lu})|(' ')|(.))
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    acro = match.group(1).replace('.', '')
    if end := match.group(2):      # </p> or <br/> etc.
        period = '.'
    elif end := match.group(3):    # <space>[A-Z]
        period = '.'
    elif end := match.group(4):    # <space>
        period = ''
    elif end := match.group(5):    # anything else
        period = ''
    return acro + period + end
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    acro = match.group(1).replace('.', '')
    if end := (match.group(2) or match.group(3)):
        period = '.'
    elif end := (match.group(4) or match.group(5)):
        period = ''
    return acro + period + end
Last edited by lomkiri; 12-23-2021 at 08:19 PM. Reason: correction of the regex
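A rough standalone check of the corrected five-group pattern, again sketched with Python's standard `re` module under stated assumptions: `[A-Z]` stands in for `\p{Lu}`, `replace` takes only the match, and the fourth group is written `( )` because `(' ')` as printed would match a literal quote, space, quote rather than a bare space (in that case a plain space simply falls through to the catch-all `(.)`, with the same result). The sample sentences are made up.

```python
import re

PATTERN = re.compile(
    r"((?:[A-Z]\.){2,})"
    r"(?:\s*(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( )|(.))"
)

def replace(match):
    acro = match.group(1).replace(".", "")
    if end := (match.group(2) or match.group(3)):
        period = "."    # end of paragraph or end of sentence
    else:
        end = match.group(4) or match.group(5)
        period = ""     # plain space or any other character
    return acro + period + end

samples = [
    "<p>the U.N. voted</p>",
    "<p>the U.N., however</p>",
    "<p>Made in the U.S.A. </p>",
    "<p>Ask the F.B.I. Agents came.</p>",
]
results = [PATTERN.sub(replace, s) for s in samples]
for line in results:
    print(line)
```

Note that in the third sample the stray space before `</p>` is swallowed by `\s*` and does not come back in the output.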
|
#10 |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
That is really, super-duper, extra special terrific
Works great and SO much cleaner than the way I was trying. Thanks again.
PS - I'm going with the wordy version - I like wordy
#11 | |
Running with scissors
Posts: 1,584
Karma: 14328510
Join Date: Nov 2019
Device: none
|
Quote:
|
|
#12 |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
Almost there
@lomkiri --
Me again - small problem came up on my first real run. All of a sudden I got lots of errors and tracked it down to the before / after below: "www.w3" and "www.idpf.org" lost their periods.
Code:
b: <html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en">
a: <html xmlns="http://wwww3.org/1999/xhtml" xmlns:epub="http://wwwidpforg/2007/ops" xml:lang="en">
Last edited by phossler; 12-22-2021 at 09:45 PM.
#13 |
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
|
You didn't check "Case sensitive". It is mandatory, since you're looking for uppercase letters. Without it, \p{Lu} (or [A-Z]) selects any letter, not only uppercase.
Another mistake (mine, this time): there was a mix-up in the history box of the searches, and I posted an old version of the string. It is not
Code:
((?:\p{Lu}\.)+)(?:(</(?:p|div|b/|blockquote)>)|( \p{Lu})|(' ')|(.))
but
Code:
((?:\p{Lu}\.){2,})(?:\s*(<(?:/p|/div|br/|/blockquote)>)|( \p{Lu})|(' ')|(.))
Last edited by lomkiri; 12-23-2021 at 08:16 PM. Reason: including <br/> in the EoL tags
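The effect of the missing "Case sensitive" tick can be imitated outside the editor. The sketch below uses Python's standard `re` module with a simplified stand-in for the older `+` pattern (`[A-Z]` for `\p{Lu}`, and only the catch-all `(.)` group), and `re.IGNORECASE` plays the role of the unticked box. Under those assumptions the result is consistent with the `wwww3.org` damage reported in #12.

```python
import re

URL = 'xmlns="http://www.w3.org/1999/xhtml"'
PATTERN = r"((?:[A-Z]\.)+)(.)"   # simplified stand-in for the older `+` pattern

def drop_periods(match):
    return match.group(1).replace(".", "") + match.group(2)

# Case-insensitive: the lowercase "w." is treated as a one-letter acronym.
insensitive = re.sub(PATTERN, drop_periods, URL, flags=re.IGNORECASE)

# Case-sensitive: nothing in the URL is uppercase, so it is left alone.
sensitive = re.sub(PATTERN, drop_periods, URL)

print(insensitive)
print(sensitive)
```

With the `{2,}` lower bound a single "w." would no longer match, but lowercase runs such as "e.g." would still be hit if case is ignored, so the box needs ticking either way.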
#14 |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
Never mind -- I had the wrong Case Sensitive box checked
Everything works very well. Thanks again.
Last edited by phossler; 12-22-2021 at 11:13 PM. Reason: Delete a PEBKAC problem message