Old 07-26-2011, 03:17 PM   #1
flinkdeldinky
Junior Member
flinkdeldinky began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Jul 2011
Device: none
Need a regex for importing books

Imported a bunch of books into Calibre the normal way. Calibre got the metadata from most book files okay (they're PDF files), but in many cases it pretty much fubar'd a lot of files. My idea is to clear out all the fubar'd files from Calibre and re-import them using a regex.

Unfortunately I'm not a regex guy and I found no useful examples to help me out. I only got a little bit of the way in figuring out a regex.

The file names are formatted as such.

isbn.publisher.title.date.pdf

Just to make things interesting, every word is followed by a period. The publisher is always the same three words, the title has a variable number of words, and the date is a three-letter month followed by a four-digit year.

Examples:
012345678X.This.Is.Publisher.This.is.a.Title.Apr.2007.pdf
876543210x.This.Is.Publisher.A.Different.Title.That.is.Longer.Jan.1997.pdf

This is the best regex I could get and it only gets isbn correct:

(?P<isbn>[0-9]+[A-Za-z])\.(?P<publisher>[A-Za-z]+\.[A-Za-z]+\.[A-Za-z]+)

When run on the second example:

isbn = 876543210x
publisher = This.Is.Publisher
and for some reason
title = 876543210x.This.Is.Publisher.A.Different.Title.That.is.Longer.Jan.1997

I have no idea how to remove the periods from publisher. No idea how to get variable length titles. No idea how to get the dates.

Anybody got good grep out there?
Old 07-26-2011, 04:25 PM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by flinkdeldinky View Post
Anybody got good grep out there?
Try this:
Code:
(?P<isbn>\d+\w)\.(?P<publisher>\w+\.\w+\.\w+)\.(?P<title>.*)\.(?P<published>\w\w\w\.\d+)
It assumes three character month abbreviations. It doesn't remove periods, except between fields. It assumes three word publisher names.
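
To see what the pattern above captures, here is a minimal Python sketch that runs it against the two example filenames from the first post. Calibre's import regexes use Python syntax, but the helper name `parse_filename` is only an illustration and not part of calibre.

```python
import re

# Starson17's pattern from the post above, unchanged.
PATTERN = re.compile(
    r"(?P<isbn>\d+\w)\.(?P<publisher>\w+\.\w+\.\w+)\."
    r"(?P<title>.*)\.(?P<published>\w\w\w\.\d+)"
)


def parse_filename(name):
    """Return the named groups as a dict, or None if the name does not match."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None


fields = parse_filename(
    "876543210x.This.Is.Publisher.A.Different.Title.That.is.Longer.Jan.1997.pdf"
)
print(fields)
```

The greedy `.*` in the title group first swallows the rest of the name and then backs off until a three-character word, a period and some digits can still follow it, which is why the date ends up in its own field.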

Last edited by Starson17; 07-26-2011 at 04:29 PM.
Old 07-26-2011, 04:44 PM   #3
flinkdeldinky
Junior Member
flinkdeldinky began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Jul 2011
Device: none
Thanks Starson17. Works as advertised. As you said, it doesn't remove the periods on the title or publisher. Is that just not possible on file import? Is it possible on a bulk metadata search and replace?

My ultimate idea is to get all the info I can out of the file name and then let calibre do an internet lookup and hope it finds matches.

Again, thanks for the regex. Going to try and figure it out. I'm pretty impressed you could do the variable title lengths. Not too surprised about the periods though, as that's less of a grep and more of an edit. Hopefully the bulk metadata search and replace can do it.
Old 07-26-2011, 04:56 PM   #4
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by flinkdeldinky View Post
Thanks Starson17. Works as advertised.
You're welcome.
Quote:
As you said, it doesn't remove the periods on the title or publisher. Is that just not possible on file import?
No, it's not possible. File import can only capture character strings from the filename; it's not designed to change those characters.
Quote:
Is it possible on a bulk metadata search and replace?
Yes, do it that way.
Quote:
I'm pretty impressed you could do the variable title lengths.
Titles always vary - the trick is to find things on either side of the title to match. In your case, I relied on the year digits and the three-character month abbreviation. Those were stripped off the end, and everything in front was stripped off to match isbn, publisher, etc. What was left was the title.
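
The period cleanup that gets deferred to bulk search and replace amounts to a single substitution. As a rough Python sketch (the function name `periods_to_spaces` is made up for illustration, and the exact settings in calibre's dialog are not shown in this thread), it could look like this:

```python
import re


def periods_to_spaces(value):
    """Replace every period with a single space and trim the ends."""
    return re.sub(r"\.", " ", value).strip()


print(periods_to_spaces("A.Different.Title.That.is.Longer"))
print(periods_to_spaces("This.Is.Publisher"))
```

Note that this replaces every period, so a title that legitimately contains one (an abbreviation, say) would lose it too.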
Old 07-27-2011, 09:41 AM   #5
flinkdeldinky
Junior Member
flinkdeldinky began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Jul 2011
Device: none
Starson17, just wanted to let you know that everything went very well. I'm truly amazed. Your regex worked flawlessly and correcting the period problem with the Publisher and Title fields in the bulk metadata search and replace was super easy. Then I just did a bulk internet metadata search and everything went perfect.

Just want to thank you one last time. I really appreciate your effort.
Old 07-27-2011, 02:14 PM   #6
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by flinkdeldinky View Post
Starson17, I really appreciate your effort.
.
Old 10-25-2011, 08:25 PM   #7
sdspieg
Connoisseur
sdspieg began at the beginning.
 
Posts: 54
Karma: 10
Join Date: Jun 2009
Device: Nook, Kindle 3
Hmmm - I'm still missing a few steps here. I have a similar issue - my titles contain the ISBN, followed by a period and then the publisher (followed by a period - and btw in my case the publisher can have from one to four words), the title (again any combination of words, each word followed by a period) and the date (followed by a period).
All I'd like to do is extract the ISBN from the title and copy it to the identifier/isbn field, so that I can then do an automatic bulk download of metadata and covers.
Can somebody please explain how to do that? Here's what I tried: Edit Metadata|Search and replace; search mode=regex; search field=title; search for=?? [something like your regex, I presume]; replace with=?? [how do I get JUST the actual ISBN here?]; destination field=identifier|isbn...
Thanks a bunch in advance!

-Stephan
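
One way to picture the "just the ISBN" step is a pattern that matches the whole title but captures only the leading ISBN, so that replacing the match with the first group leaves the ISBN alone. The Python sketch below assumes the title starts with digits, optionally ending in X, followed by a period. The pattern and the helper `isbn_from_title` are illustrative guesses, not settings confirmed anywhere in this thread.

```python
import re

# Assumption: the ISBN is the leading run of digits (the last character may be
# X or x), immediately followed by a period. The rest of the title is matched
# but not captured.
ISBN_AT_START = re.compile(r"^(\d+[\dXx])\..*$")


def isbn_from_title(title):
    """Return the leading ISBN, or None if the title does not start with one."""
    match = ISBN_AT_START.match(title)
    return match.group(1) if match else None


print(isbn_from_title("0123456789.Some.Publisher.Some.Title.Jan.1997"))
```

In search-and-replace terms this corresponds to searching for the pattern and replacing with `\1`. Titles that do not match are left unchanged by such a replace, so it is worth checking the result before writing it into the isbn field.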
Old 10-26-2011, 02:36 AM   #8
Serpentine
Evangelist
Serpentine ought to be getting tired of karma fortunes by now.
 
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by sdspieg View Post
Hmmm - I'm still missing a few steps here...
Provide examples of the titles/filenames/whatever you want to capture from.

Makes life a whole lot easier.

Last edited by Serpentine; 10-26-2011 at 02:38 AM. Reason: oops
Old 10-27-2011, 06:04 AM   #9
salines
Zealot
salines will become famous soon enough
 
Posts: 127
Karma: 744
Join Date: Oct 2011
Device: Sony PRS-T1
Hi,
I also have some problems with a regex for importing books.

My books look like this:
"author - title.epub"
"author - series xx - title.epub"

That works fine unless the author looks like: "Brain, Master-Mind"

The "-" within the author splits "Master-Mind" and turns "Mind..." into the series.

BTW: I always use " - " as separator.

Please help me. I tried it alone - but....no success....
Old 10-27-2011, 07:34 AM   #10
theducks
Well trained by Cats
theducks ought to be getting tired of karma fortunes by now.
 
Posts: 29,812
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by salines View Post
Hi,
I also have some problems with a regex for importing books.

My books look like this:
"author - title.epub"
"author - series xx - title.epub"

That works fine unless the author looks like: "Brain, Master-Mind"

The "-" within the author splits "Master-Mind" and turns "Mind..." into the series.

BTW: I always use " - " as separator.

Please help me. I tried it alone - but....no success....
do you use just -
or do you use <space>-<space>

there is a difference (and <space>-<space> avoids splitting hyphenated words)
Old 10-27-2011, 07:42 AM   #11
salines
Zealot
salines will become famous soon enough
 
Posts: 127
Karma: 744
Join Date: Oct 2011
Device: Sony PRS-T1
Quote:
Originally Posted by theducks View Post
do you use just -
or do you use <space>-<space>

there is a difference (and <space>-<space> avoids splitting hyphenated words)
I'm using <space>-<space>.

Last edited by salines; 10-27-2011 at 07:50 AM.
Old 10-27-2011, 06:11 PM   #12
Serpentine
Evangelist
Serpentine ought to be getting tired of karma fortunes by now.
 
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
It would help to have your base regex - it may have something special you use elsewhere.
If not, try this:
Code:
(?P<author>.+?) - (?:(?P<series>.+?) (?:(?P<series_index>\d+(?:\.\d+)?) - )?)?(?P<title>.+)

or with \s (any whitespace) in place of each literal space:
(?P<author>.+?)\s-\s(?:(?P<series>.+?)\s(?:(?P<series_index>\d+(?:\.\d+)?)\s-\s)?)?(?P<title>.+)
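
As a quick check of how the first pattern treats a hyphenated author, the Python sketch below runs it on two made-up filenames (extension omitted, with a single-word series and title chosen for simplicity). The helper `parse` is only for illustration.

```python
import re

# Serpentine's first pattern from the post above, unchanged.
SERPENTINE = re.compile(
    r"(?P<author>.+?) - (?:(?P<series>.+?) "
    r"(?:(?P<series_index>\d+(?:\.\d+)?) - )?)?(?P<title>.+)"
)


def parse(name):
    """Return the named groups as a dict, or None if the name does not match."""
    match = SERPENTINE.match(name)
    return match.groupdict() if match else None


# The lazy author group stops at the first " - " (with spaces on both sides),
# so the bare hyphen inside "Master-Mind" is left alone.
print(parse("Brain, Master-Mind - Title"))
print(parse("Brain, Master-Mind - Series 03 - Title"))
```

Because the separator includes the surrounding spaces, the author comes out as "Brain, Master-Mind" in both cases.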

Last edited by Serpentine; 10-27-2011 at 06:26 PM. Reason: improved series index
Old 10-28-2011, 02:12 AM   #13
salines
Zealot
salines will become famous soon enough
 
Posts: 127
Karma: 744
Join Date: Oct 2011
Device: Sony PRS-T1
Quote:
or with space as whitespace:
(?P<author>.+?)\s-\s(?:(?P<series>.+?)\s(?:(?P<series_index>\d+(?:\.\d+)?)\s-\s)?)?(?P<title>.+)
"Fielding, Joy-ebbes - serie 01 - Tanz, Püppchen, tanz.pdf"
Works fine. Thank you!

But for
"Fielding, Joy-ebbes - Tanz, Püppchen, tanz.pdf"
it doesn't work.
author is ok: "Fielding, Joy-ebbes"
-> Series is here "Tanz,"
title is: "Püppchen, tanz"

Other question:
Should I switch regexes depending on whether I'm adding books that belong to a series or not?
How can I quickly switch the regex used for adding books?
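
The mis-split reported above appears to come from the series index being optional inside the series group: the first word after the author can then be taken as a series even when no number follows. A possible variant, offered only as a sketch and not posted by anyone in this thread, requires the index whenever a series is present. The names `TIGHTER` and `parse` are illustrative, and the filenames are shown without the extension.

```python
import re

# Variant of the pattern under discussion: a series is only recognised when it
# is followed by a number and " - ". Otherwise everything after the author is
# treated as the title.
TIGHTER = re.compile(
    r"(?P<author>.+?) - "
    r"(?:(?P<series>.+?) (?P<series_index>\d+(?:\.\d+)?) - )?"
    r"(?P<title>.+)"
)


def parse(name):
    """Return the named groups as a dict, or None if the name does not match."""
    match = TIGHTER.match(name)
    return match.groupdict() if match else None


print(parse("Fielding, Joy-ebbes - serie 01 - Tanz, Püppchen, tanz"))
print(parse("Fielding, Joy-ebbes - Tanz, Püppchen, tanz"))
```

With this variant one pattern would cover both naming schemes, so there would be no need to switch regexes between imports. The trade-off is that a book without a series whose title itself contains a number followed by " - " would still be split at that point.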
Old 10-28-2011, 06:59 AM   #14
sdspieg
Connoisseur
sdspieg began at the beginning.
 
Posts: 54
Karma: 10
Join Date: Jun 2009
Device: Nook, Kindle 3
Here are some examples:
01505798756X.Silly Press.The.Strange.Professional.Title.Jul.1985
043165591X.Wharton.School.Publishing.The.Delight.of.Very.Silly.Titles.Hidden.Sep.2006

Thanks!
Old 10-28-2011, 07:20 AM   #15
theducks
Well trained by Cats
theducks ought to be getting tired of karma fortunes by now.
 
Posts: 29,812
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by sdspieg View Post
Here are some examples:
01505798756X.Silly Press.The.Strange.Professional.Title.Jul.1985
043165591X.Wharton.School.Publishing.The.Delight.of.Very.Silly.Titles.Hidden.Sep.2006

Thanks!
There is no reasonable way to determine which words belong together (and should be joined with spaces) and which ones start a new field.
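
The ambiguity can be made concrete. In the Python sketch below, the ISBN and the date are still anchored at the two ends, but with a publisher of one to four words every split of the middle part is equally valid as far as a regex can tell. The function `candidate_splits` and the pattern `OUTER` are hypothetical helpers written for this illustration.

```python
import re

# Only the two ends are unambiguous: ISBN at the start, "Mon.YYYY" at the end.
OUTER = re.compile(
    r"^(?P<isbn>\d+\w)\.(?P<middle>.+)\.(?P<published>\w{3}\.\d{4})$"
)


def candidate_splits(name, max_publisher_words=4):
    """List every (publisher, title) split allowed by a 1-4 word publisher."""
    match = OUTER.match(name)
    if match is None:
        return []
    words = match.group("middle").split(".")
    splits = []
    # Always leave at least one word for the title.
    for n in range(1, min(max_publisher_words, len(words) - 1) + 1):
        splits.append((" ".join(words[:n]), " ".join(words[n:])))
    return splits


name = (
    "043165591X.Wharton.School.Publishing.The.Delight.of.Very."
    "Silly.Titles.Hidden.Sep.2006"
)
for publisher, title in candidate_splits(name):
    print(publisher, "|", title)
```

Four candidates come back for this filename, and nothing in the name itself says which one is right. Picking the correct one would need outside knowledge such as a list of known publishers.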