Old 03-01-2023, 05:48 PM   #1
Tillomar
Calibre RegExp search: unexpected results

Hi there...

Sorry to jump in with this rather esoteric question in my first post to this forum -- but everything less complicated already had a solution here, so there was never a need to register... [My thanks for that!]

Caveat: I'm using a localized German version of Calibre 6.13, so my English links/names/... to certain calibre functions and dialogues may not be fully accurate.

I'm trying to narrow down my library with a regexp search so that I can then process the books it finds with a "metadata/search&replace" operation.

Context:
When importing books, the title often contains information about the series and the series number. It would be convenient to separate these attributes using the regexp available at "add books/read metadata". However, these attributes are formatted in so many different ways that I was unable to come up with a single regexp that catches at least most of them. As this dialogue has no way to use more than one regexp, I have to do that myself.
Additionally, I want to shorten series information like "A ... series book 15" to "... 15" while leaving "A ... book 23" as "A ... 23", because in the former case the "A" is not part of the series title.
Obviously, after extracting the series name, I will also extract the series number, and then remove the series name from the title...

One of my regexps to search for a specific class of titles is this:

Code:
title:"~\((?:(An?|The)\s+)(?P<sname>[^\)]*?)(?:[,-:]?\s*)(?:(Small\s+Town|Trilogy|Series|Roman(ce|tic)|Cozy|Crime|Thrillers?|Suspense|Myster(y|ies))([\s:,]*))*Series\s*(?:(No.?|Number|Volume|Book|(Book\s*)\#)\s*)\#?(?P<sno>\d+([.,]\d+)?)\)"
As you can see, the expression tries to match a text surrounded by round brackets. In this case, I search for series information in the form
Code:
(The ... Series Book 1)
(A ... Series Book 2)
(An ... Series Book 3)
which I will later shorten to
Code:
... #
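The intended shortening can be sketched outside calibre in a few lines of Python. This is only a rough illustration under some assumptions: it uses Python's standard `re` module as a stand-in for whatever engine calibre applies, it drops the `title:"~ ... "` search wrapper, and the helper name `shorten` and the sample titles are made up for the example.

```python
import re

# Pattern copied from the search expression above, minus the
# calibre-specific title:"~ ... " wrapper. Split over several
# raw strings only for readability.
PATTERN = re.compile(
    r"\((?:(An?|The)\s+)(?P<sname>[^\)]*?)(?:[,-:]?\s*)"
    r"(?:(Small\s+Town|Trilogy|Series|Roman(ce|tic)|Cozy|Crime|Thrillers?"
    r"|Suspense|Myster(y|ies))([\s:,]*))*"
    r"Series\s*(?:(No.?|Number|Volume|Book|(Book\s*)\#)\s*)"
    r"\#?(?P<sno>\d+([.,]\d+)?)\)"
)


def shorten(title):
    """Hypothetical helper: return 'name number' or None if nothing matches."""
    m = PATTERN.search(title)
    if m is None:
        return None
    # The named groups carry the series name and the series number.
    return f"{m.group('sname')} {m.group('sno')}"


# Made-up titles, for illustration only.
print(shorten("Some Title (The Foo Series Book 1)"))
print(shorten("Other Title (A Foo Cozy Mystery Series Book 3)"))
```

The named groups `sname` and `sno` are the same ones used in the search expression, so the same pattern could in principle feed both the search and the later replace step.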
In my current library, this search returns 1209 books, and many of the returned titles should not have been matched. Some examples of title classes which should not be matched:

Code:
Once Upon A Death (Days Of Death Series Book 1)
BloodGifted: The Dantonville Legacy Series Book 1 (A Paranormal Romance)
Poor Boy Road: A Gritty Hard-Hitting Thriller Series Book # 1 (JAKE CALDWELL)
Alexa O'Brien Huntress Series Book 1-4 Box Set
The Trouble with Bree: The Spotlight Series Book 1.5
#1 does not have "A", "An" or "The" after the opening bracket.
#2 + #3 do not have a series number in front of the closing bracket.
#4 + #5 have no brackets at all.

When I test my expression against the names found by calibre, those names (name classes) are correctly not matched.
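A check of this kind can be reproduced with a short Python sketch. It assumes that Python's standard `re` module is a fair stand-in for calibre's regexp engine, and it does not model anything calibre's search parser may do to the quoted expression before the pattern is applied. The function name `matches` is made up for the example.

```python
import re

# Same pattern as in the search expression, without the title:"~ ... " wrapper.
PATTERN = re.compile(
    r"\((?:(An?|The)\s+)(?P<sname>[^\)]*?)(?:[,-:]?\s*)"
    r"(?:(Small\s+Town|Trilogy|Series|Roman(ce|tic)|Cozy|Crime|Thrillers?"
    r"|Suspense|Myster(y|ies))([\s:,]*))*"
    r"Series\s*(?:(No.?|Number|Volume|Book|(Book\s*)\#)\s*)"
    r"\#?(?P<sno>\d+([.,]\d+)?)\)"
)

# The five example titles listed above.
TITLES = [
    "Once Upon A Death (Days Of Death Series Book 1)",
    "BloodGifted: The Dantonville Legacy Series Book 1 (A Paranormal Romance)",
    "Poor Boy Road: A Gritty Hard-Hitting Thriller Series Book # 1 (JAKE CALDWELL)",
    "Alexa O'Brien Huntress Series Book 1-4 Box Set",
    "The Trouble with Bree: The Spotlight Series Book 1.5",
]


def matches(title, flags=0):
    """Hypothetical helper: True if the pattern is found anywhere in the title."""
    return re.search(PATTERN.pattern, title, flags) is not None


for title in TITLES:
    # Check both case-sensitive and case-insensitive matching.
    print(matches(title), matches(title, re.IGNORECASE), title)
```

In this sketch none of the five titles match, with or without `re.IGNORECASE`, which is consistent with the test result described above and suggests the difference arises somewhere between the search box and the regexp engine.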

Can anyone help me to understand what's going wrong here?

Thanks,
Tillomar