Seriously, how to parse metadata from filenames

charlweed · 07-09-2011, 12:22 AM

Hi!
I know what a regular expression is, and GENERALLY how to use them. I don't know Python, but I read the link. What I can't figure out is how to parse a filename into Calibre metadata. I read the tutorial, but it was not too helpful. I clicked the checkbox that made me hope that Calibre would use the filename.
I am trying to parse filenames like:

Code:

tb-2099 California microbial life (john adams) 1999

On various online Python tools, I can verify the expression

Code:

(.*\d\s)(.*)\s\((j.*)\)\s(\d*).*

But when I try to use a symbolic group name like

Code:

(.*\d\s)(.*)\s\((?P<author>.*)\)\s(\d*).*

, then I get nothing from the test button in the "Adding Books" dialog.

How do I really extract the metadata from a filename?
Thanks so much!
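
For reference, a minimal sketch in plain Python (outside Calibre) suggests that the named-group pattern is valid and matches the sample, so the empty test result is probably not caused by the `(?P<author>...)` syntax. The use of `re.match` here is an assumption for illustration; the thread does not say how Calibre applies the pattern.

```python
import re

# Sample filename and the named-group pattern from the post above.
filename = "tb-2099 California microbial life (john adams) 1999.txt"
pattern = r"(.*\d\s)(.*)\s\((?P<author>.*)\)\s(\d*).*"

match = re.match(pattern, filename)

# Named groups are still numbered, so the year is group 4.
author = match.group("author") if match else None
year = match.group(4) if match else None
print(author, year)
```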

chaley · 07-09-2011, 03:24 AM

Does your test file name have an extension? Won't work without it.

charlweed · 07-09-2011, 10:47 PM

Yes, I am working with plain text files with a ".txt" extension. So it would be
tb-2099 California microbial life (john adams) 1999.txt

Is there documentation somewhere for the symbolic names that can be used for expressions? For example, is it "(?P<author>.*)" or "(?P<authors>.*)"?
Does case matter? Does import fail if there is whitespace?

charlweed · 07-09-2011, 11:16 PM

Cool, I just discovered the mouse-over feature.

DoctorOhh · 07-09-2011, 11:37 PM

Quote:

Originally Posted by charlweed

Yes, I am working with plain text files with a ".txt" extension. So it would be
tb-2099 California microbial life (john adams) 1999.txt

He was just pointing out that if you don't include the file extension in the test window then you won't get any results when you press the test button.
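
One way to picture why a missing extension could matter, as a sketch only: assume (this is not confirmed anywhere in the thread) that the name is cut at its last dot before the pattern is applied. The helper `strip_extension` is hypothetical and is not a Calibre function.

```python
def strip_extension(name):
    # Assumption: everything after the last dot is treated as the extension.
    # str.rpartition returns an empty first element when no dot is present.
    return name.rpartition(".")[0]

with_ext = strip_extension("tb-2099 California microbial life (john adams) 1999.txt")
without_ext = strip_extension("tb-2099 California microbial life (john adams) 1999")

print(repr(with_ext))
print(repr(without_ext))
```

Under that assumption, a test name without an extension leaves nothing for the expression to match.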

Manichean · 07-12-2011, 02:15 AM

Quote:

Originally Posted by charlweed

I don't know Python, but I read the link. What I can't figure out, is how to parse a filename into Calibre metadata. I read the tutorial, it was not too helpful.

If you're talking about one of the tutorials/guides in the stickies here, suggestions for improvements would be much appreciated.

charlweed · 07-17-2011, 09:32 PM

As Calibre is great software, I will try to respond with some good, usable suggestions. In the meantime, here is what I eventually settled on.

Code:

((?P<series>\w+)?\W(?P<series_index>\d+).+?)?(?P<title>.*)\s+\((?P<author>.*)\)\s?(?P<published>\d+)?.*

This is very much a Perl thing, and I could not find a web tool that can parse it. Nonetheless, the KEY for me was that if Calibre cannot match the entire expression, it dumps everything into <title>. If <title> is not in the expression, it seems to do nothing.
My first suggestion is that the test functionality do as-you-type validation and matching of the expression, so that the user knows when Calibre is not going to find any data given the expression and sample.
For the tutorial, it should explicitly state that the Calibre regular expressions are an extension of other regular expression ... um, grammars. It should also detail how symbolic grouping works, and how general parenthetical grouping works.
A table of recipes for pulling data out of some sample strings would be great. Maybe I can help with the first of those.
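
The expression above can also be checked with Python's standard `re` module, which supports the `(?P<name>...)` syntax. This is a sketch run outside Calibre: it only shows what each named group captures for the sample filename, and the choice of `search` is an assumption. Whether Calibre accepts every one of these group names as a metadata field is not verified here.

```python
import re

# The final expression from the post, split across two string literals.
FINAL_PATTERN = re.compile(
    r"((?P<series>\w+)?\W(?P<series_index>\d+).+?)?"
    r"(?P<title>.*)\s+\((?P<author>.*)\)\s?(?P<published>\d+)?.*"
)

filename = "tb-2099 California microbial life (john adams) 1999.txt"

match = FINAL_PATTERN.search(filename)
# groupdict() maps each named group to its captured text (None if unmatched).
fields = match.groupdict() if match else {}

for name, value in fields.items():
    print(f"{name}: {value}")
```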

Starson17 · 07-18-2011, 01:58 PM

Quote:

Originally Posted by charlweed

None-the-less, the KEY for me was that if Calibre cannot match the entire expression, it dumps everything into <title>. If <title> is not in the expression, it seems to do nothing.

As to entire matching: There's no requirement to match the entire expression. As long as the title field gets matched, Calibre is happy. You can have subsequent series and series_index fields that match nothing in the filename, followed by an author field that matches something, and Calibre will happily use just the matched fields and ignore text that didn't match a field and ignore fields that didn't match any text.

Calibre must have a title for every book. If the regex you wrote for the title field doesn't match something (or you've omitted it), Calibre gives up on your regex and reverts to using the entire filename as the title and Unknown as the author.
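
The behaviour described above can be modelled roughly as follows. This is a hypothetical sketch, not Calibre's code: the function `parse_filename`, the field names in the returned dictionary, and the use of `re.search` are all assumptions made for illustration.

```python
import re

def parse_filename(pattern, filename):
    """Hypothetical model of the described fallback; not Calibre's implementation."""
    match = re.search(pattern, filename)
    fields = {}
    if match:
        # Keep only the named groups that actually captured some text.
        fields = {
            name: value.strip()
            for name, value in match.groupdict().items()
            if value and value.strip()
        }
    if "title" not in fields:
        # No usable title: fall back to the whole filename and an unknown author.
        return {"title": filename, "author": "Unknown"}
    return fields

# The optional series group matches nothing here, so it is simply left out.
partial = parse_filename(
    r"(?P<title>[^(]+)\((?P<author>[^)]+)\)(?P<series>\[.*\])?",
    "Some Book (jane doe).txt",
)

# No title group in the expression, so the fallback applies.
fallback = parse_filename(r"\((?P<author>.*)\)", "Some Book (jane doe).txt")

print(partial)
print(fallback)
```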
