Old 12-22-2021, 10:52 AM   #1
phossler
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
RegEx questions for Acronyms - small issue

The book I'm currently cleaning up has hundreds of acronyms, 2 to 6 letters long, with periods between the letters, and I'd like to keep just the letters without the periods.

I made 5 pairs of saved searches (for 2-, 3-, 4-, 5-, and 6-character acronyms). Each pair has one version with a trailing space and one without, since both constructions are common.

e.g. A.B.C.D.E.F.<space> and just A.B.C.D.E.F.

Case 1 - If there's a space, then I figure it's inside a paragraph and I just want ABCDEF blah blah

Case 2 - If there's NOT a space, then I figure it's at the end of a paragraph and I want ABCDEF. Blah blah

So for Case 1

Find: (has trailing space)

Code:
([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.
Replace: (also with trailing space)

Code:
\1\2\3\4\5\6

and for Case 2

Find: (NO trailing space)

Code:
([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.([A-Z])\.
Replace: (NO trailing space, but period instead)

Code:
\1\2\3\4\5\6.
I have to be careful to run them in order (6 + space, 6 no space, 5 + space, 5 no space, ...); otherwise the shorter patterns would match substrings of longer acronyms.

I figure that there HAS to be a more intelligent way to set up these searches, and hopefully someone can suggest an idea.

Last edited by phossler; 12-22-2021 at 09:46 PM.
Old 12-22-2021, 02:57 PM   #2
hobnail
Running with scissors
Posts: 1,584
Karma: 14328510
Join Date: Nov 2019
Device: none
Regex has a count/number thing where you can specify how many times it matches. I've never used it and can't remember how it works off the top of my head but I think it might be something like the following for the letters and periods for 1 to 6 repetitions:
Code:
([A-Z]\.){1,6}
But check the usual helpful regex web sites for how to do it.
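For anyone who wants to try the counted repetition outside calibre, here is a minimal sketch using Python's standard re module. The sample sentence is made up, and the outer capturing group is my addition so the whole acronym comes back in one piece.

```python
import re

# {2,6} repeats the preceding group between 2 and 6 times.
# The inner group is non-capturing; the outer group captures the whole acronym.
pattern = re.compile(r'((?:[A-Z]\.){2,6})')

text = 'The U.S.A. and the U.N. met; A. Smith attended.'

# With a single capturing group, findall returns that group for each match.
found = pattern.findall(text)
print(found)
```

A lone initial such as "A." is skipped because the lower bound is 2.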
Old 12-22-2021, 03:17 PM   #3
theducks
Well trained by Cats
Posts: 30,988
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by hobnail View Post
Code:
([A-Z]\.){1,6}\s*
0 or 1 space but NOT in the capture
Old 12-22-2021, 03:30 PM   #4
lomkiri
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
You want to see whether your acronym is at the end of a paragraph (A.B.</p>), at the end of a sentence (A.B. Then something), or followed by a lowercase word.

I suggest a regex function, used with this regex:
Code:
((?:\p{Lu}\.){2,})(?:(<(?:/p|/div|br/|/blockquote)>)|( \p{Lu})|( \p{Ll}))
\p{Lu} is an uppercase letter, \p{Ll} is a lowercase letter, and (?: opens a non-capturing group.
((?:\p{Lu}\.){2,}) captures in match.group(1) every acronym with at least 2 letters (use 3 if you want to start with 3-letter acronyms).

The function adds a period or not, depending on what follows your acronym. It may not cover all cases; it's up to you to check. It would be wise to polish the book first, to avoid unexpected line ends, a space before </p>, etc.
Code:
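def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # group(1) is the dotted acronym; drop its periods
    acro = match.group(1).replace('.', '')
    if end := match.group(2):      # </p> or <br/> etc.: keep one period
        period = '.'
    elif end := match.group(3):    # <space>[A-Z]: assume end of sentence, keep one period
        period = '.'
    elif end := match.group(4):    # <space>[a-z]: middle of a sentence, no period
        period = ''
    else:                          # defensive fallback, nothing captured after the acronym
        end = ''
        period = ''

    return acro + period + end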
See if you want to consider other cases. Note that, e.g., A.B. Lda will give AB. Lda, because the next word starts with a capital letter.
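To see which alternative fires for a given context, a small sketch like the following may help. It uses the standard re module, which has no \p{...} support, so [A-Z] and [a-z] stand in for \p{Lu} and \p{Ll} (an ASCII-only assumption). The classify helper and its labels are hypothetical names for illustration, not part of calibre.

```python
import re

# Stand-in for the regex above, with ASCII classes instead of \p{Lu} / \p{Ll}.
ACRONYM = re.compile(
    r'((?:[A-Z]\.){2,})'
    r'(?:(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( [a-z]))'
)

# Only one of groups 2 to 4 is set per match; the others are None.
LABELS = {2: 'closing tag', 3: 'space + uppercase', 4: 'space + lowercase'}


def classify(text):
    """Return (acronym, label) for the first match, or None if nothing matches."""
    m = ACRONYM.search(text)
    if m is None:
        return None
    for index, label in LABELS.items():
        if m.group(index):
            return m.group(1), label
    return None


print(classify('<p>Sent by the F.B.I.</p>'))
print(classify('the F.B.I. Then'))
print(classify('the F.B.I. agent'))
print(classify('the F.B.I., who'))  # a comma is none of the three contexts
```

The last call shows a context this version does not cover: an acronym followed by a comma is not matched at all, so it is left untouched.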

Last edited by lomkiri; 12-22-2021 at 04:25 PM.
Old 12-22-2021, 04:34 PM   #5
hobnail
Running with scissors
Posts: 1,584
Karma: 14328510
Join Date: Nov 2019
Device: none
Quote:
Originally Posted by theducks View Post
0 or 1 space but NOT in the capture
Doesn't the asterisk mean 0 or more? I was thinking about the space after I replied and was thinking he could use something like \s{0,1}. But I guess it doesn't matter if there is more than 1 space.
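A quick way to check the difference, sketched with Python's standard re module. The sample string with three spaces is made up for the test.

```python
import re

text = 'A.B.   Next'  # three spaces after the acronym

star = re.match(r'(?:[A-Z]\.){2,6}(\s*)', text)         # zero or more whitespace
optional = re.match(r'(?:[A-Z]\.){2,6}(\s?)', text)     # zero or one
bounded = re.match(r'(?:[A-Z]\.){2,6}(\s{0,1})', text)  # same meaning as \s?

star_len = len(star.group(1))
optional_len = len(optional.group(1))
bounded_len = len(bounded.group(1))
print(star_len, optional_len, bounded_len)
```

So \s* swallows the whole run of spaces, while \s? and \s{0,1} take at most one.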

Last edited by hobnail; 12-22-2021 at 04:36 PM.
Old 12-22-2021, 04:43 PM   #6
theducks
Well trained by Cats
Posts: 30,988
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by hobnail View Post
Doesn't the asterisk mean 0 or more? I was thinking about the space after I replied and was thinking he could use something like \s{0,1}. But I guess it doesn't matter if there is more than 1 space.
You are correct (on both points)
Old 12-22-2021, 04:56 PM   #7
phossler
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
Quote:
Originally Posted by hobnail View Post
Regex has a count/number thing where you can specify how many times it matches. I've never used it and can't remember how it works off the top of my head but I think it might be something like the following for the letters and periods for 1 to 6 repetitions:

Code:
([A-Z]\.){1,6}
But check the usual helpful regex web sites for how to do it.
Actually I had thought for the Find

Code:
([A-Z]\.){2,6} 

and 

([A-Z]\.){2,6}
where there is a trailing space after the first case

That would Find A.B.C.<space> but I couldn't figure out how to do the Replace (trailing space) since it depends on the number of letters in the acronym, 3 in this case

Code:
\1\2\3
Old 12-22-2021, 05:35 PM   #8
phossler
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
Quote:
Originally Posted by lomkiri View Post
You want to see whether your acronym is at the end of a paragraph (A.B.</p>), at the end of a sentence (A.B. Then something), or followed by a lowercase word.

I suggest a regex function, used with this regex:

See if you want to consider other cases. Notice that, e.g., A.B. Lda will give AB. Lda, because of the capital letter of the next word
Thanks

This looks a LOT easier than the brute force way I tried

I can single step through and address the A.B. Lda case

Code:
 <!-- CASE 1
      TRAILING SPACE FOLLOWED BY CAPITAL LETTER SO ASSUME END OF SENTENCE
      AND LEAVE LAST PERIOD -->
  <!-- CASE 2
        NO TRAILING SPACE SO ASSUME END OF PARAGRAPH 
        AND LEAVE LAST PERIOD -->
  <!-- CASE 3
        HAS TRAILING SPACE SO ASSUME IN MIDDLE OF SENTENCE WITH NO PERIODS -->
I have 3 cases and your function works great on case 1 and case 2. Doesn't seem to do anything for case 3. Can you tweak it a little for me please?
Attached Files
File Type: epub Acronyms.epub (1.9 KB, 108 views)

Last edited by phossler; 12-22-2021 at 05:40 PM.
Old 12-22-2021, 05:57 PM   #9
lomkiri
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Quote:
Originally Posted by phossler View Post
I have 3 cases and your function works great on case 1 and case 2. Doesn't seem to do anything for case 3. Can you tweak it a little for me please?
OK, let's modify the 3rd case from <space>[a-z] to <space>, and add a catch-all for anything else (a comma, for example, or a semicolon). In both cases, it removes the period.

Code:
((?:\p{Lu}\.)+)(?:(</(?:p|div|b/|blockquote)>)|( \p{Lu})|(' ')|(.))
(Note: the regex above is wrong; it was corrected in msg #13 to the one below:)
((?:\p{Lu}\.){2,})(?:\s*(<(?:/p|/div|br/|/blockquote)>)|( \p{Lu})|( )|(.))
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    acro = match.group(1).replace('.', '')
    if  end := match.group(2):	# </p> or <br/> etc.
        period = '.'
    elif end := match.group(3):	# <space>[A-Z]
        period = '.'
    elif end := match.group(4):	# <space>
        period = ''
    elif end := match.group(5):	# anything else
        period = ''
  
    return acro + period + end
I wrote it to be easy to read and modify, but it can be shortened like this; it does exactly the same thing:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    acro = match.group(1).replace('.', '')
    if  end := (match.group(2) or match.group(3)):
        period = '.'
    elif end := (match.group(4) or match.group(5)):
        period = ''
    return acro + period + end
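If you want to try the function outside calibre, here is a rough test harness. It is only a sketch under a few assumptions: the standard re module is used, so [A-Z] stands in for \p{Lu}; the fourth alternative is written as a bare space; and strip_acronym_periods is a hypothetical wrapper that passes dummy values for the arguments calibre normally supplies.

```python
import re

# ASCII stand-in for the corrected pattern ([A-Z] instead of \p{Lu}).
ACRONYM = re.compile(
    r'((?:[A-Z]\.){2,})'
    r'(?:\s*(<(?:/p|/div|br/|/blockquote)>)|( [A-Z])|( )|(.))'
)


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    acro = match.group(1).replace('.', '')
    if end := (match.group(2) or match.group(3)):    # closing tag or next sentence
        period = '.'
    elif end := (match.group(4) or match.group(5)):  # plain space or anything else
        period = ''
    return acro + period + end


def strip_acronym_periods(html):
    # calibre fills in the extra arguments itself; dummies are enough here.
    return ACRONYM.sub(lambda m: replace(m, 0, '', None, None, None, None), html)


samples = [
    '<p>He joined the F.B.I.</p>',         # end of paragraph
    'He joined the F.B.I. Then he left.',  # end of sentence
    'The F.B.I. agent left.',              # middle of a sentence
    'The F.B.I., he said, left.',          # followed by punctuation
]
for sample in samples:
    print(strip_acronym_periods(sample))
```

The four samples correspond to the closing-tag, next-sentence, plain-space, and catch-all branches.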

Last edited by lomkiri; 12-23-2021 at 08:19 PM. Reason: correction of the regex
Old 12-22-2021, 06:18 PM   #10
phossler
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
That is really, super-duper, extra special terrific

Works great and SO much cleaner than the way I was trying

Thanks again

PS - I'm going with the wordy version - I like wordy
Old 12-22-2021, 06:55 PM   #11
hobnail
Running with scissors
Posts: 1,584
Karma: 14328510
Join Date: Nov 2019
Device: none
Quote:
Originally Posted by phossler View Post
Actually I had thought for the Find

Code:
([A-Z]\.){2,6} 

and 

([A-Z]\.){2,6}
where there is a trailing space after the first case

That would Find A.B.C.<space> but I couldn't figure out how to do the Replace (trailing space) since it depends on the number of letters in the acronym, 3 in this case

Code:
\1\2\3
I think it doesn't matter if some of them aren't "filled"? E.g., for a simple example if you were using ([a-z]){1,3} for the search and x\3x for the replacement I'm guessing that you'd get xx when it matched aa. The \3 would be empty.
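That guess can be checked with a short sketch. This uses Python's standard re module; calibre's editor uses a different regex engine, so treat the exact error behaviour there as an assumption. In re, a repeated group is still a single group that keeps only its last repetition, and a backreference to a group that does not exist is rejected rather than treated as empty.

```python
import re

# A quantified group is still ONE group; it holds only the last repetition.
m = re.search(r'([A-Z]\.){2,6}', 'A.B.C. ')
last_only = m.group(1)
group_count = m.re.groups

# Referring to \3 when the pattern has a single group raises an error.
try:
    re.sub(r'([a-z]){1,3}', r'x\3x', 'aa')
    error = None
except re.error as exc:
    error = str(exc)

# One way around it: capture the whole run in one group and strip the periods
# in a replacement function (the trailing space is kept as is).
fixed = re.sub(
    r'((?:[A-Z]\.){2,6}) ',
    lambda match: match.group(1).replace('.', '') + ' ',
    'See A.B.C. here',
)
print(last_only, group_count, error, fixed)
```

This is why a plain \1\2\3 replacement cannot follow a variable-length acronym, and why a function-mode replacement is the easier route.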
Old 12-22-2021, 09:24 PM   #12
phossler
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
Almost there

@lomkiri --

Me again - small problem came up on my first real run

All of a sudden I got lots of errors and tracked it down to the before / after below


"www.w3" and "www.idpf.org" lost their periods

Code:
b: <html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en">

a: <html xmlns="http://wwww3.org/1999/xhtml" xmlns:epub="http://wwwidpforg/2007/ops" xml:lang="en">
Not sure why since "www" is lower case and should not be considered. This is the case that threw the errors, but I'm assuming that the situation could be anywhere.

Last edited by phossler; 12-22-2021 at 09:45 PM.
Old 12-22-2021, 10:05 PM   #13
lomkiri
Groupie
Posts: 167
Karma: 1497966
Join Date: Jul 2021
Device: N/A
You didn't check "Case sensitive". It is mandatory, since you're looking for uppercase letters. Without it, \p{Lu} (or [A-Z]) matches any letter, not only uppercase ones.

Another mistake (mine, this time): there was a mix-up in the search history box, and I posted an old version of the string. It is not
Code:
((?:\p{Lu}\.)+)(?:(</(?:p|div|b/|blockquote)>)|( \p{Lu})|(' ')|(.))
but (as I put in my first msg):
Code:
((?:\p{Lu}\.){2,})(?:\s*(<(?:/p|/div|br/|/blockquote)>)|( \p{Lu})|( )|(.))
The two problems together gave this result. Terribly sorry about this!
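The mangled namespace URL can be reproduced outside calibre with a small sketch. Assumptions: the standard re module is used with [A-Z] in place of \p{Lu}, re.IGNORECASE stands in for leaving "Case sensitive" unchecked, and LOOSE is a simplified version of the old pattern that keeps only the one-or-more repetition and the catch-all alternative.

```python
import re

# Simplified stand-in for the old pattern: one or more "letter + period"
# pairs followed by any single character.
LOOSE = r'((?:[A-Z]\.)+)(.)'


def strip_periods(text, flags=0):
    # Remove the periods from group 1 and keep the following character.
    return re.sub(
        LOOSE,
        lambda m: m.group(1).replace('.', '') + m.group(2),
        text,
        flags=flags,
    )


url = 'http://www.w3.org/1999/xhtml'
case_sensitive = strip_periods(url)
case_insensitive = strip_periods(url, flags=re.IGNORECASE)
print(case_sensitive)
print(case_insensitive)
```

With case sensitivity on, the URL has no uppercase letters and is left alone. With it off, the "w." before "w3" counts as a one-letter acronym and loses its period, which matches the damage reported above.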

Last edited by lomkiri; 12-23-2021 at 08:16 PM. Reason: including <br/> in the EoL tags
Old 12-22-2021, 11:01 PM   #14
phossler
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
Never mind -- I had the wrong Case Sensitive box checked

Everything works very well

Thanks again

Last edited by phossler; 12-22-2021 at 11:13 PM. Reason: Delete a PEBKAC problem message