05-18-2009, 10:42 PM | #1 |
Junior Member
Posts: 1
Karma: 10
Join Date: May 2009
Device: PRS-505
|
Metadata extract from Title
Hi,
Well, I have finally given up trying to figure out regular expressions. Hopefully someone can help. I have renamed all my book filenames as Author - Title. All I am trying to figure out is how to write this as a regular expression. Sometimes I do have a series in the middle (Author - Series - Title), but I want to store it so that everything after the first dash is part of the title, just to keep it simple for me. Thanks for your help! |
05-19-2009, 05:47 AM | #2 |
Guru
Posts: 753
Karma: 1496807
Join Date: Jul 2008
Location: The Third World
Device: iLiad + PRS-505 + Kindle 3
|
(?P<authors>.+) - (?P<title>.+)
Should work for both. (I'm not sure about the plural "authors", you can try it without the "s" if it does not work...) |
05-24-2009, 02:50 AM | #3 |
Connoisseur
Posts: 82
Karma: 184
Join Date: Jun 2008
Device: Sony PRS-505
|
When I try that solution, I get
Title = Title
Authors = Author - Series
rather than the desired
Title = Title - Series
Authors = Author
(At least that's what the test data shows.) |
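A likely explanation for the result above, offered as a sketch and not as a confirmed description of what calibre does internally: in Python's re module `.+` is greedy, so the authors group takes everything up to the last " - ". The lazy variant (`.+?`) below was not posted in the thread and is only an assumption about what might give "everything after the first dash is the title"; the helper name `parse` is made up for illustration.

```python
import re

# Pattern as posted in #2: greedy author group.
GREEDY = re.compile(r"(?P<authors>.+) - (?P<title>.+)")
# Assumed alternative: lazy author group stops at the first " - ".
LAZY = re.compile(r"(?P<authors>.+?) - (?P<title>.+)")

def parse(pattern, name):
    """Return (authors, title), or None if the name does not match."""
    m = pattern.match(name)
    return (m.group("authors"), m.group("title")) if m else None

name = "Author - Series - Title"
greedy_result = parse(GREEDY, name)  # ("Author - Series", "Title")
lazy_result = parse(LAZY, name)      # ("Author", "Series - Title")
plain_result = parse(LAZY, "Author - Title")  # ("Author", "Title")
```

With two-field names both patterns agree; they only differ once a second " - " appears.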
05-25-2009, 01:12 AM | #4 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
Kad,
The regular expression that was suggested is operating correctly. Unfortunately, the results it generates are not what you wanted. The inconsistent data set (author - title vs. author - series - title) creates a major problem in getting calibre to import your books accurately.
Off the top of my head, the easiest way to correct for this is to import your books in two separate sets. If you split your books into Series and No Series groups and changed the expression accordingly, you would be fine. While this sounds wonderful in theory, I'm sure that several of the authors have books that follow both file naming formats, so separating them would be a major hassle. So would adding books a group at a time to match whichever regular expression (regex) you're using. Not to mention we might be talking about doing all this for hundreds of books.
What you need is a regex that determines whether the filename currently being tested contains two or three fields. The first field is always the author(s)/editor(s) name, so grabbing that straight off should be fine. But the next field is either the series or the title. If there is a way to determine whether " - " occurs again in the filename, then you can assume that what lies between the first and second " - " is the series (more likely, series & series index). Otherwise, anything remaining automatically becomes the title.
While this seems straightforward enough, potential issues remain. Hyphenated author names or titles, if they specifically contain " - ", will cause the import to be performed incorrectly for that individual book. Of course, with 99% of them entered automatically, you might find the remainder acceptable for manual entry / editing.
While I know what needs to be done, I'm in the same position you are: simply too much of a regex & calibre neophyte to generate something this complex.
Another website I came across has a Regular Expression Tutorial; I've listed it here just in case you might find it helpful. Keep in mind calibre uses the python "flavor" of regex. (The calibre referenced site should always take precedence.) |
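The two-or-three-field check described in this post can be sketched in plain Python without any regex. This is only an illustration of the logic, not something calibre's import dialog accepts; the function name `split_filename` and the sample names are invented.

```python
def split_filename(stem, sep=" - "):
    """Split a filename stem into (author, series, title).

    series is None when only two fields are present. Anything after the
    second separator stays in the title.
    """
    parts = stem.split(sep, 2)  # split on at most two separators
    if len(parts) == 3:
        author, series, title = parts
        return author, series, title
    if len(parts) == 2:
        return parts[0], None, parts[1]
    return None, None, stem  # no separator at all: treat it as a bare title

two_fields = split_filename("Author - Title")
three_fields = split_filename("Author - Series - Title")
extra_dash = split_filename("Author - Series - Title - Subtitle")
```

As the post warns, a name or title that itself contains " - " still lands in the wrong field: here the extra separator only survives because it falls inside the last field.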
05-25-2009, 01:25 PM | #5 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
Eureka!!
It occurred to me that you cannot be the only user with author - [series -] title filenames. I've seen too many references that follow that exact format in the forums. So I decided to do a search...
I found this topic, Regular Expression Help, which specifically addresses your exact needs. It even accounts for the Series Index. The reliance on " - " to separate fields was removed. This means hyphenated names, or titles with dashes, should not be a problem when importing books. Of course, that doesn't mean you won't still have some issues using it. Other potential problem areas are:
Keep in mind, if metadata is defined within the files, it will take precedence over the regex filename info unless you uncheck Preferences>General>Read metadata from files. |
05-25-2009, 01:41 PM | #6 |
Wizzard
Posts: 1,402
Karma: 2000000
Join Date: Nov 2007
Location: UK
Device: iPad 2, iPhone 6s, Kindle Voyage & Kindle PaperWhite
|
This is what I use
Code:
(?P<author>[^-]+) - ((?P<series>.+) (?P<series_index>\d+) - )?(?P<title>.+) |
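For anyone wanting to check what this pattern captures before importing, here is a rough sketch using Python's re module. It applies re.match to bare filename stems, which may not be exactly how calibre applies the expression; the helper `fields` and the sample names are invented.

```python
import re

# Same pattern as posted, split across lines for readability.
PATTERN = re.compile(
    r"(?P<author>[^-]+) - "
    r"((?P<series>.+) (?P<series_index>\d+) - )?"
    r"(?P<title>.+)"
)

def fields(stem):
    """Return the named groups as a dict, or None when nothing matches."""
    m = PATTERN.match(stem)
    return m.groupdict() if m else None

with_series = fields("Author - Series 3 - Title")
without_series = fields("Author - Title")
# [^-]+ cannot cross the hyphen, so matching from the start fails here.
hyphenated = fields("Anne-Marie Smith - Title")
```

The optional group only fires when a number sits directly before the second " - "; otherwise everything after the first separator becomes the title.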
05-25-2009, 05:01 PM | #7 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
Ok, I'm taking back my "Eureka!!" now...
Ouch! Looking at your regex and testing it against the formula from the other topic, I've discovered neither one is perfect.
So far all of the formulas depend on " - " being used only as a field delimiter. It cannot be used in hyphenated author names, nor as part of the series (unlikely) or title (where it is likely to occur). When an extra " - " occurs, the automatic import fails because the parts of the filename are separated incorrectly. So, for instance, this example fails to import correctly in all formulas:
Code:
John D. Smith - Jones - Bibliographic Perfection 1 - The Perfect Book - A Bedtime Story.pdf
For my purposes I would prefer a regex that resolves all of the following correctly:
Last edited by Sabardeyn; 05-25-2009 at 05:10 PM. Reason: Added to the resolution list - large series number example. |
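To see where the extra " - " separators end up, here is a sketch that runs the example filename through the pattern from post #6 with Python's re module. It assumes the extension is stripped first and that matching starts at the beginning of the name; calibre's own handling may differ.

```python
import os
import re

# Pattern from post #6.
PATTERN = re.compile(
    r"(?P<author>[^-]+) - "
    r"((?P<series>.+) (?P<series_index>\d+) - )?"
    r"(?P<title>.+)"
)

filename = ("John D. Smith - Jones - Bibliographic Perfection 1 - "
            "The Perfect Book - A Bedtime Story.pdf")
stem, _ext = os.path.splitext(filename)  # drop ".pdf"
result = PATTERN.match(stem).groupdict()
# author       -> "John D. Smith"
# series       -> "Jones - Bibliographic Perfection"
# series_index -> "1"
# title        -> "The Perfect Book - A Bedtime Story"
```

If "John D. Smith - Jones" is meant as a single hyphenated author (my reading of the example, not something stated outright), then "Jones" has been pushed into the series field.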
05-26-2009, 03:27 PM | #8 |
Connoisseur
Posts: 58
Karma: 12
Join Date: Jan 2009
Device: none
|
This is what I use; I asked around some Unix forums to get the answer. I use the same data format. My regex does not, however, capture the series number.
(?P<author>((?!\s-\s).)*)\s-(?:\s(?P<series>((?!\s-\s).)*)\s-)?\s(?P<title>.*)
EDIT: I just took the regex from the link, and while it wouldn't work as is, after removing the last "?", it worked with all my books.
(?P<author>[^_-]+) -?\s*(?P<series>[^_0-9-]*)(?P<series_index>[0-9]*)\s*-\s*(?P<title>[^_].+)
The only thing I'd like to change in that regex would be to require a space before and after a "-" for it to count as a delimiter. Time to reimport my library! Now, if only .6 was out so I could convert it into epub with perfection!
Last edited by darkmonk; 05-26-2009 at 03:45 PM. |
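The first pattern in this post relies on `((?!\s-\s).)*`, which consumes one character at a time for as long as the next three characters are not whitespace, dash, whitespace. A rough sketch of how it behaves under Python's re module follows, assuming both lookaheads are meant to read `(?!\s-\s)`; the helper `fields` and the sample names are invented.

```python
import re

# ((?!\s-\s).)* stops just before the next " - ", so a bare hyphen
# inside a name is allowed.
PATTERN = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s(?P<series>((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)

def fields(stem):
    """Return (author, series, title), or None when nothing matches."""
    m = PATTERN.match(stem)
    if m is None:
        return None
    return m.group("author"), m.group("series"), m.group("title")

three = fields("Author - Series - Title")
two = fields("Author - Title")
hyphenated = fields("Anne-Marie Smith - Title")
extra = fields("Author - Series - Title - Subtitle")
```

Hyphenated names without surrounding spaces survive, and a third " - " is left inside the title.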
05-26-2009, 03:49 PM | #9 | |
hopeless n00b
Posts: 5,111
Karma: 19597086
Join Date: Jan 2009
Location: in the middle of nowhere
Device: PW4, PW3, Libra H2O, iPad 10.5, iPad 11, iPad 12.9
|
Quote:
This one's a bit of a stretch. You can play with the greediness to get this one to play nice. Don't have Calibre on this PC but I can play with it when I get home. |
|
05-26-2009, 05:37 PM | #10 |
Connoisseur
Posts: 58
Karma: 12
Join Date: Jan 2009
Device: none
|
From what I see, what you have there would require two series, the first taking precedence over the other. As far as I know, the epub standard does not allow for further defining of series, but you can have custom fields, which calibre certainly does use. I wonder if we should allow a subordinate series; that, I think, would be a good idea. An author may create a world containing many subordinate series, for example Star Trek. What do you guys think? |
05-26-2009, 05:41 PM | #11 | |
hopeless n00b
Posts: 5,111
Karma: 19597086
Join Date: Jan 2009
Location: in the middle of nowhere
Device: PW4, PW3, Libra H2O, iPad 10.5, iPad 11, iPad 12.9
|
Quote:
|
|
05-26-2009, 08:30 PM | #12 | |
Connoisseur
Posts: 58
Karma: 12
Join Date: Jan 2009
Device: none
|
Quote:
oh well; my thought remains: it would probably be good to have a subordinate series, so that books like this:
STAR TREK - TOS - 085 - My Brother's Keeper, Book One - Republic.pdf
STAR TREK - TOS - 086 - My Brother's Keeper, Book Two - Constitution.pdf
STAR TREK - TOS - 087 - My Brother's Keeper, Book Three - Enterprise.pdf
would register as
Title: Author: Series: Series #: Subordinate Series: Subordinate Series #
Republic: STAR TREK: TOS: 85: My Brother's Keeper: 1
Constitution: STAR TREK: TOS: 86: My Brother's Keeper: 2
Enterprise: STAR TREK: TOS: 87: My Brother's Keeper: 3
^^ Evil lack of tables...
There are many books that are in a mini series, in another series, or a world or some such. |
|
05-27-2009, 05:23 AM | #13 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
Testing the various regular expressions, so far all of them fail at some aspect. Most notably dealing with either hyphenated names, or dashes in the series or title name. Not to mention extra spaces around dashes munging the whole thing.
Gwynevan's regex has been slightly more robust than the other samples. My attempts to expand on it are not meeting with success, but I'm not a programmer or particularly good at regex. I'm trying to account for hyphenated names first, as accurately importing the author is most important. I have avoided using \s to avoid later issues with making the formula work with leetspeak. But as that is a lesser concern, I can forgo it.
darkmonk & ilovejedd,
My original attempt to post that sample book was clearly defined, but I deleted it as being unnecessary. Now I stand corrected:
I think the idea of a series and subordinate series is great. If I remember correctly, Alan Dean Foster wrote in a Universe & Series (Humanx & Flinx, etc.), as do some of the manga authors (called "circles" - authors write around other members' stories [A uses B's characters, etc.]). Universe/Setting/(Subordinate) Series is feasible. This begs the question of input vs. output. I mean, I'm trying to input what I have, whereas you're establishing a valid output pattern. Ultimately this is not a problem. But it occurs to me that working on the same thing might be more effective! |
05-27-2009, 04:41 PM | #14 | |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
Quote:
Try this combined regex; it should handle almost everything:
(?P<author>((?!\s-\s).)*)\s-(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?\s(?P<title>.*)
Playing around with things, I managed to insert the series index portion of Gwynevan's regex into Darkmonk's regex. So far it meets all of the criteria I posted, with the following exceptions:
Last edited by Sabardeyn; 05-27-2009 at 05:25 PM. Reason: Clarified results of testing |
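A small sketch for trying the combined regex outside calibre, using Python's re module on filename stems. The helper `fields` and the sample names are invented for illustration, and re.match is an assumption about how the pattern gets applied.

```python
import re

# Combined pattern from this post: lookahead-guarded author and title,
# plus an optional "series <number>" field in the middle.
PATTERN = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)

def fields(stem):
    """Return the named groups as a dict, or None when nothing matches."""
    m = PATTERN.match(stem)
    return m.groupdict() if m else None

indexed = fields("Anne-Marie Smith - Series 12 - Title")
no_series = fields("Author - Title")
no_index = fields("Author - Series - Title")
```

Note that the series group only fires when a number precedes the second " - "; a series without an index falls through into the title.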
|
05-29-2009, 03:13 AM | #15 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
We're Star Trekkin' now!
Hey, Darkmonk, you there? ... ... ...
Now that the effort to find a great filename regex seems complete, I had a few minutes to really look at your Star Trek post, above. The lack of tables definitely hurts my understanding of what you want. Nor did I understand what you meant by saying that calibre uses custom fields and that they can be used for sub-series despite ePUB not supporting such a feature. Despite that, how about a scheme something like the following?
I used <universe> and <setting> instead of <subordinate series> because it seems appropriate. These specific words might not suit the "flavor" of other books, though. I avoided re-using <series> because I don't know whether it is a reserved word, in the programming language sense, in either calibre or ePUB.
<Universe> would be the overall series name, including all derivative works. It would not have a <universe index> because it is simply a container for all of the Universe's constituent parts.
<Setting> is the exact branch, or portion, of the Universe that is being written about. In Star Trek terms, this would be TOS, TNG, DS9, VOY, ENT. (If I understand correctly how things are arranged in the Trek books. I haven't ever read any of them.) The <setting index> reflects the publication order of the various books within this Setting (as opposed to timeline / chronological order).
<Series> and <series index> follow calibre's standard fields. I assumed the more finite series name of "My Brother's Keeper" belonged here because series are not very long. Right now I think the Wheel of Time is the largest series. I don't count Trek, Buffy, Babylon 5, etc. because not all of the books are concerned with a single storyline arc.
Dang! I was halfway through generating the formula and something broke the whole thing. Not sure, but I think I might have exceeded the max number of variables for regex.
Last edited by Sabardeyn; 05-29-2009 at 03:21 AM. Reason: Corrected a few things. |
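One way the universe / setting / series idea could be sketched, purely as a Python experiment and not as something calibre's filename import is known to support: the group names universe, setting, setting_index and series_word are hypothetical, and the ", Book <word>" convention is taken only from the three Star Trek filenames posted earlier in the thread.

```python
import os
import re

# Hypothetical group names; as far as I know these are not calibre fields.
PATTERN = re.compile(
    r"(?P<universe>.+?) - (?P<setting>.+?) - (?P<setting_index>\d+) - "
    r"(?P<series>.+?), Book (?P<series_word>\w+) - (?P<title>.+)"
)
# Only the number words that appear in the sample filenames.
WORDS = {"One": 1, "Two": 2, "Three": 3}

def parse(filename):
    """Return a dict of parsed fields, or None when the name does not fit."""
    stem, _ext = os.path.splitext(filename)
    m = PATTERN.match(stem)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "universe": d["universe"],
        "setting": d["setting"],
        "setting_index": int(d["setting_index"]),
        "series": d["series"],
        "series_index": WORDS.get(d["series_word"]),
        "title": d["title"],
    }

book = parse("STAR TREK - TOS - 085 - My Brother's Keeper, Book One - Republic.pdf")
```

Six named groups is nowhere near any limit I know of in Python's re, so the breakage mentioned in the post probably has another cause, though I can't say what.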
Tags |
regex, regular expressions |