Old 10-12-2011, 04:52 PM   #1
mmholt
MrsUndertaker
mmholt began at the beginning.
 
Posts: 17
Karma: 10
Join Date: Sep 2011
Device: Galaxy Tab S2 9.7
Help with a search & replace

I want to locate any authors with two or more initials with periods, each separated by a space, like "A. B." or "C. D. E."

I've worked out a regex that will find them, but I can't figure out how to remove the space between two initials. Help?
Old 10-13-2011, 03:53 AM   #2
Manichean
Wizard
Manichean is the 'tall, dark, handsome stranger' all the fortune-tellers are referring to.
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
I don't understand what you want to remove. Could you post some examples of what values you currently have in your library and what you want to change them to?

Spaces can best be matched by using either a space or \s for general whitespace matching.
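To illustrate the difference between a literal space and \s, here is a small sketch using Python's re module. It assumes the Python engine behaves like calibre's search & replace for these two tokens; the sample string is made up for the illustration.

```python
import re

# Made-up sample containing one tab and one ordinary space.
sample = "A.\tB. C."

# A literal space in the pattern matches only the space character.
literal_space_hits = re.findall(r" ", sample)

# \s matches any whitespace character, so it also finds the tab.
any_whitespace_hits = re.findall(r"\s", sample)

print(literal_space_hits)   # [' ']
print(any_whitespace_hits)  # ['\t', ' ']
```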
Old 10-13-2011, 03:56 AM   #3
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
 
Posts: 11,739
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Search field: authors
Search for: (\w\.) (?=\w\.)
Replace with: \1
Old 10-14-2011, 01:16 PM   #4
mmholt
MrsUndertaker
mmholt began at the beginning.
 
Posts: 17
Karma: 10
Join Date: Sep 2011
Device: Galaxy Tab S2 9.7
Quote:
Originally Posted by Manichean View Post
I don't understand what you want to remove. Could you post some examples of what values you currently have in your library and what you want to change them to?
If an author's name contains a string like "A. A. A." or "A. A." I wanted to replace those with "A.A.A." or "A.A."


Quote:
Originally Posted by chaley View Post
Search field: authors
Search for: (\w\.) (?=\w\.)
Replace with: \1
That does exactly what I wanted - thank you very much. But I don't understand why it works. Enlighten me, please?
Old 10-14-2011, 02:22 PM   #5
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by mmholt View Post
Enlighten me, please?
Code:
Search for: (\w\.) (?=\w\.)
Replace with: \1
\w is a single word character (like a letter)
\w\. is a single word character followed by a period (the \. means a period, while the dot alone without the escape backslash is a wild card for any character.)
So "(\w\.) " means a single word character followed by a period followed by a space (note the space there).

(?=\w\.) means: only match "a single word character followed by a period followed by a space" if that space is itself followed by "a single word character followed by a period". The pattern (?= is a positive lookahead assertion. It lets the preceding match only when the following matches, but the lookahead part doesn't "eat up" any of the string.

For example, the regex "Isaac (?=Asimov)" will match "Isaac " only if it’s followed by "Asimov".
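The Isaac/Asimov example can be tried in a Python shell. This is only a sketch that assumes Python's re module treats lookaheads the same way calibre's search & replace does; the "Isaac Newton" string is an extra test case made up here.

```python
import re

lookahead = re.compile(r"Isaac (?=Asimov)")

hit = lookahead.search("Isaac Asimov")
miss = lookahead.search("Isaac Newton")

# The match is only "Isaac " (with the trailing space);
# "Asimov" is checked by the lookahead but not included.
print(repr(hit.group(0)))  # 'Isaac '
print(hit.span())          # (0, 6)

# Without "Asimov" after the space there is no match at all.
print(miss)                # None
```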

Last edited by Starson17; 10-14-2011 at 02:27 PM.
Old 10-14-2011, 07:31 PM   #6
mmholt
MrsUndertaker
mmholt began at the beginning.
 
Posts: 17
Karma: 10
Join Date: Sep 2011
Device: Galaxy Tab S2 9.7
Thanks for all the details. I still don't understand the "Replace with". How does that remove the space?
Old 10-14-2011, 08:36 PM   #7
theducks
Well trained by Cats
theducks ought to be getting tired of karma fortunes by now.
Posts: 29,792
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by mmholt View Post
Thanks for all the details. I still don't understand the "Replace with". How does that remove the space?
I think that should be (a back reference for each match)
\1\2

Since the space between other match elements is not inside a reference (), it will be lost when replacing only the 2 back references.
Old 10-14-2011, 09:21 PM   #8
PeterT
Grand Sorcerer
PeterT ought to be getting tired of karma fortunes by now.
Posts: 12,162
Karma: 73448616
Join Date: Nov 2007
Location: Toronto
Device: Nexus 7, Clara, Touch, Tolino EPOS
Quote:
Originally Posted by theducks View Post
I think that should be (a back reference for each match)
\1\2

Since the space between other match elements is not inside a reference (), it will be lost when replacing only the 2 back references.
Actually, I think the way it works is that the entire regex ONLY matches the first occurrence of a single word character followed by a period and space.

Since the (?=\w\.) is, as Starson17 says, "(?= is a positive lookahead assertion. It lets the preceding match only when the following matches, but the lookahead part doesn't "eat up" any of the string.", the only characters "consumed" by the regex are the initial sequence "(\w\.) ", and that is replaced by \1, which is that initial \w\. sequence.
Old 10-14-2011, 09:23 PM   #9
DoctorOhh
US Navy, Retired
DoctorOhh ought to be getting tired of karma fortunes by now.
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by theducks View Post
I think that should be (a back reference for each match)
\1\2
Since he already stated that the S&R worked flawlessly I'm guessing the \2 isn't required.

Quote:
Originally Posted by theducks View Post
Since the space between other match elements is not inside a reference (), it will be lost when replacing only the 2 back references.
Good explanation.
Old 10-15-2011, 03:07 AM   #10
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
 
Posts: 11,739
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by theducks View Post
I think that should be (a back reference for each match)
\1\2
No. The second parenthesized expression (the positive lookahead) does not create a group that can be back referenced. Adding the \2 will generate an error, because there is only one group.
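A quick way to check this outside calibre, as a sketch that assumes Python's re module reports groups the same way calibre does: the compiled pattern has exactly one capturing group, and a replacement that refers to \2 is rejected. The exact error text differs between Python versions, so it is printed rather than quoted.

```python
import re

pattern = re.compile(r"(\w\.) (?=\w\.)")

# The lookahead (?=...) is not a capturing group, so only one group exists.
print(pattern.groups)  # 1

try:
    pattern.sub(r"\1\2", "A. B. Smith")
    outcome = "no error"
except re.error as exc:
    # Raised because the replacement refers to a group that does not exist.
    outcome = f"error: {exc}"

print(outcome)
```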

Quote:
Originally Posted by PeterT View Post
Actually, I think the way it works is that the entire regex ONLY matches the first occurrence of a single word character followed by a period and space.
It matches N occurrences of "letter dot space" -- an "initial". What it does not match is the last initial, which prevents the space between that last initial and the following word from being removed.
Quote:
Since the (?=\w\.) is, as Starson17 says, "(?= is a positive lookahead assertion. It lets the preceding match only when the following matches, but the lookahead part doesn't "eat up" any of the string.", the only characters "consumed" by the regex are the initial sequence "(\w\.) ", and that is replaced by \1, which is that initial \w\. sequence.
One thing to remember: matching and substitution in calibre's search/replace (and generally in regular expressions) is leftmost non-overlapping. This means that the expression will operate on the first string that matches, then start again at the left side of what remains. Because the lookahead assertion does not consume characters, what "remains" is the next initial, and the regexp process is run again on that initial and whatever follows it. This process repeats until the expression fails to match something, which will happen when there are no remaining initials followed by an initial.

Note that "leftmost non-overlapping" does not imply either "adjacent" or "leading". It simply means that the input string is scanned from left to right. For example, regarding adjacent: there is no requirement that there be only one set of initials. Given the rather bizarre author name "A. B. Someword C. D. Lastname", the expression will match the A. and the C., resulting in "A.B. Someword C.D. Lastname". Regarding leading: the name "Joe A. B. Smith" will be changed to "Joe A.B. Smith".
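The examples above can be reproduced with Python's re module. This is a sketch that assumes re.sub scans the same way as calibre's search & replace; join_initials is a hypothetical helper name, and "C. D. E. Lastname" is an extra test string based on the "C. D. E." case from the first post.

```python
import re

pattern = re.compile(r"(\w\.) (?=\w\.)")

def join_initials(author):
    """Hypothetical helper: drop the space between consecutive initials."""
    return pattern.sub(r"\1", author)

print(join_initials("A. B. Someword C. D. Lastname"))  # A.B. Someword C.D. Lastname
print(join_initials("Joe A. B. Smith"))                # Joe A.B. Smith
print(join_initials("C. D. E. Lastname"))              # C.D.E. Lastname

# Each match ends where the next one starts; the last initial "E. " is
# never matched because no further initial follows it.
spans = [m.span() for m in pattern.finditer("C. D. E. Lastname")]
print(spans)  # [(0, 3), (3, 6)]
```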
Old 10-17-2011, 09:10 AM   #11
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by mmholt View Post
Thanks for all the details. I still don't understand the "Replace with". How does that remove the space?
You got lots of excellent information, but in case the answer to this question wasn't clear: it removes the space because there is a space after "(\w\.)" in the "Search for" part. That means the word character, the period, and the space will all be "eaten up" or, as chaley correctly put it, "consumed" by the search. Of course, those three characters will only be eaten up if the positive lookahead assertion is matched (another \w\. follows the first one). However, the "Replace with" part doesn't have a space. It has just \1, a back reference to the group of two characters, (\w\.), the word character followed by a period. So the three-character string "word character-period-space" that is consumed (subject to the lookahead) gets replaced with a two-character string that is the same as the three-character string, minus the space. As chaley said, the process then starts again.

Simple
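If it helps to see the three characters and the two characters side by side, here is a sketch in Python (assuming its re module matches the way calibre does; "A. B. Smith" is just a made-up author):

```python
import re

match = re.search(r"(\w\.) (?=\w\.)", "A. B. Smith")

consumed = match.group(0)  # the whole match: letter, period, space
kept = match.group(1)      # the group that \1 refers to: letter, period

print(repr(consumed), len(consumed))  # 'A. ' 3
print(repr(kept), len(kept))          # 'A.' 2

# Replacing the three consumed characters with the two kept ones
# is what makes the space disappear.
result = re.sub(r"(\w\.) (?=\w\.)", r"\1", "A. B. Smith")
print(result)  # A.B. Smith
```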
Old 10-21-2011, 06:49 PM   #12
mmholt
MrsUndertaker
mmholt began at the beginning.
 
Posts: 17
Karma: 10
Join Date: Sep 2011
Device: Galaxy Tab S2 9.7
Thank you all for the awesome replies. It was all very helpful!