Old 07-09-2011, 12:22 AM   #1
charlweed
Enthusiast
charlweed began at the beginning.
 
Posts: 27
Karma: 30
Join Date: Jul 2011
Device: none
Seriously, how to parse metadata from filenames

Hi!
I know what a regular expression is, and GENERALLY how to use one. I don't know Python, but I read the link. What I can't figure out is how to parse a filename into Calibre metadata. I read the tutorial, but it was not too helpful. I clicked the checkbox that made me hope that Calibre would use the filename.
I am trying to parse filenames like:
Code:
tb-2099 California microbial life (john adams) 1999
On various online Python tools, I can verify the expression
Code:
(.*\d\s)(.*)\s\((j.*)\)\s(\d*).*
But when I try to use a symbolic group name like
Code:
(.*\d\s)(.*)\s\((?P<author>.*)\)\s(\d*).*
then I get nothing from the test button in the "Adding Books" dialog.

How do I really extract the metadata from a filename?
Thanks so much!
Old 07-09-2011, 03:24 AM   #2
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
 
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Does your test file name have an extension? Won't work without it.
Old 07-09-2011, 10:47 PM   #3
charlweed
Enthusiast
charlweed began at the beginning.
 
Posts: 27
Karma: 30
Join Date: Jul 2011
Device: none
Yes, I am working with plain text files with a ".txt" extension. So it would be
tb-2099 California microbial life (john adams) 1999.txt

Is there documentation somewhere for the symbolic names that can be used in expressions? For example, is it "(?P<author>.*)" or "(?P<authors>.*)"?
Does case matter? Does import fail if there is whitespace?
Old 07-09-2011, 11:16 PM   #4
charlweed
Enthusiast
charlweed began at the beginning.
 
Posts: 27
Karma: 30
Join Date: Jul 2011
Device: none
Cool, I just discovered the mouse-over feature.
Old 07-09-2011, 11:37 PM   #5
DoctorOhh
US Navy, Retired
DoctorOhh ought to be getting tired of karma fortunes by now.
 
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by charlweed View Post
Yes, I am working with plain text files with a ".txt" extension. So it would be
tb-2099 California microbial life (john adams) 1999.txt
He was just pointing out that if you don't include the file extension in the test window then you won't get any results when you press the test button.
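For anyone who wants to check the expression itself outside Calibre, here is a minimal sketch using Python's standard `re` module. It only tests the pattern from the first post against the full sample name; it does not reproduce Calibre's test dialog, and the variable names are just for illustration.

```python
import re

# Pattern from the first post, with the author captured by a named group.
PATTERN = re.compile(r"(.*\d\s)(.*)\s\((?P<author>.*)\)\s(\d*).*")

# Sample name typed in full, including the ".txt" extension.
sample = "tb-2099 California microbial life (john adams) 1999.txt"

match = PATTERN.match(sample)
author = match.group("author") if match else None
print(author)
```

In plain Python this pattern matches the sample with or without the extension, so the empty result in the dialog appears to come from the missing extension rather than from the named group.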
Old 07-12-2011, 02:15 AM   #6
Manichean
Wizard
Manichean is the 'tall, dark, handsome stranger' all the fortune-tellers are referring to.
 
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by charlweed View Post
I don't know Python, but I read the link. What I can't figure out, is how to parse a filename into Calibre metadata. I read the tutorial, it was not too helpful.
If you're talking about one of the tutorials/guides in the stickies here, suggestions for improvements would be much appreciated.
Old 07-17-2011, 09:32 PM   #7
charlweed
Enthusiast
charlweed began at the beginning.
 
Posts: 27
Karma: 30
Join Date: Jul 2011
Device: none
I settled on a solution.


As Calibre is great software, I will try to respond with some good, usable suggestions. In the meantime, here is what I eventually settled on.
Code:
((?P<series>\w+)?\W(?P<series_index>\d+).+?)?(?P<title>.*)\s+\((?P<author>.*)\)\s?(?P<published>\d+)?.*
This is very much a Perl thing, and I could not find a web tool that can parse it. Nonetheless, the KEY for me was that if Calibre cannot match the entire expression, it dumps everything into <title>. If <title> is not in the expression, it seems to do nothing.
My first suggestion is that the test functionality do as-you-type validation and matching of the expression, so that the user knows when Calibre is not going to find any data given the expression and sample.
For the tutorial, it should explicitly state that the Calibre regular expressions are an extension of other regular expression ... um, grammars, and detail how symbolic grouping works and how general parenthetical grouping works.
A table of recipes for pulling data out of some sample strings would be great. Maybe I can help with the first of those.
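As a first recipe of that kind, here is a rough sketch that runs the expression above through Python's `re` module, which understands the `(?P<name>...)` syntax. The pattern is written without any leading whitespace, and the sample is the filename from earlier in the thread; this only shows what the named groups capture in plain Python, not what Calibre does with them afterwards.

```python
import re

# The expression from this post, split across two string literals for readability.
PATTERN = re.compile(
    r"((?P<series>\w+)?\W(?P<series_index>\d+).+?)?"
    r"(?P<title>.*)\s+\((?P<author>.*)\)\s?(?P<published>\d+)?.*"
)

sample = "tb-2099 California microbial life (john adams) 1999.txt"

match = PATTERN.match(sample)
# groupdict() returns one entry per named group; unmatched groups are None.
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name}: {value}")
```

Each named group ends up as one key in `fields`, which makes it easy to see which part of the filename went where before trying the same expression in the "Adding Books" dialog.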
Old 07-18-2011, 01:58 PM   #8
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by charlweed View Post
None-the-less, the KEY for me was that if Calibre cannot match the entire expression, it dumps everything into <title>. If <title> is not in the expression, it seems to do nothing.
As to entire matching: There's no requirement to match the entire expression. As long as the title field gets matched, Calibre is happy. You can have subsequent series and series_index fields that match nothing in the filename, followed by an author field that matches something, and Calibre will happily use just the matched fields and ignore text that didn't match a field and ignore fields that didn't match any text.

Calibre must have a title for every book. If the regex you wrote for the title field doesn't match something (or you've omitted it), Calibre gives up on your regex and reverts to using the entire filename as the title and Unknown as the author.
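To make that rule concrete, here is a small Python sketch of the behaviour as described in this post. It is not Calibre's actual code: `parse_filename` is a hypothetical helper, and the use of `re.search` and the exact fallback values are assumptions taken from the description above.

```python
import re

def parse_filename(pattern, filename):
    """Hypothetical helper that mimics the described rule; not Calibre's code."""
    match = re.search(pattern, filename)
    # Keep only the named groups that actually captured some text.
    fields = {k: v for k, v in match.groupdict().items() if v} if match else {}
    if not fields.get("title"):
        # No usable title: discard the regex results and fall back.
        return {"title": filename, "author": "Unknown"}
    return fields

name = "California microbial life (john adams).txt"

# The optional series group matches nothing here and is simply ignored.
with_title = parse_filename(
    r"(?:\[(?P<series>.+)\] )?(?P<title>.+) \((?P<author>.+)\)", name
)

# No title group at all, so the whole filename becomes the title.
without_title = parse_filename(r"\((?P<author>.+)\)", name)

print(with_title)
print(without_title)
```

The point of the sketch is only that a missing or unmatched title triggers the fallback, while other unmatched fields are dropped without affecting the rest.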