MobileRead Forums - View Single Post - How should file names be parsed and prepared for calibre import? Use cases requested

GlennMaples · 01-09-2011, 12:41 AM

I am cleaning up my library of e-books and wrote a program to do most of the grunt work. I am throwing the file & folder names into a SQL database and running a cleaning program. Then I will use a GUI app to "touch up" the mal-contents before reorganizing the files. Finally I will import into calibre.

Of course this will require some manual work, but I am trying to minimize this with general rules.

I would like to get your opinions on the four questions below:

1) Suppose I have two copies of the same book in the same format. Is there a good way to name them so that both can be imported into calibre (perhaps one copy is better than the other, but it is unknown at this time which one)?

2) Should I store compressed books as compressed (zips...rars) in the final folder to be imported into calibre? Similarly, what is the best way to treat multi-file books (html, jpeg, txt all part of the same book) -- compressed in a zip file or just left in the same parent folder?

3) Should the authors be John Smith or Smith, John in the file names? I am leaning toward the second, as it will make it easier for calibre to recognize the last name of someone like Boris van Welke Jr.

4) What are your thoughts on parentheses in book file names? Right now I am leaning toward parsing them out and rewriting them using dashes.

I am planning on storing each book in its own folder, named after the file name (Smith, John & Davis, Eddie - XXX series - Other note - Title.ext), in preparation for importing into calibre.

Any other recommendation/thoughts?

If anyone would like the program I could package it up -- But it might take a little time--right now it is in C# and using a SQL server DB -- darned if I could get it going with the CE edition.

Here are some of the rules:

Tokenization:

1) A.ext
A = title

2) A-B.ext
A = Authors
B = title

3) A-B-C.ext
A = name
B = series
C = title

4) A-B-C-D.ext
A = Authors
B = series
C = Other
D = title
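
For anyone who wants to experiment, the four tokenization rules could be sketched roughly like this in Python. This is only an illustration, not the C# program described in this post. The names `tokenize` and `FIELDS_BY_COUNT` are made up for the example, rule 3's "name" is assumed to mean the authors field, and names with more than four tokens are simply returned as `None` for manual touch-up. It splits on every dash, so a hyphenated name or title would be mis-split.

```python
import os

# Field names per token count, following the four rules above.
# Rule 3's "name" is assumed here to mean the authors field.
FIELDS_BY_COUNT = {
    1: ("title",),
    2: ("authors", "title"),
    3: ("authors", "series", "title"),
    4: ("authors", "series", "other", "title"),
}

def tokenize(filename):
    """Split 'A-B-C.ext' into named fields; return None if no rule fits."""
    stem, _ext = os.path.splitext(filename)
    tokens = [t.strip() for t in stem.split("-")]
    tokens = [t for t in tokens if t]  # drop empty tokens from doubled dashes
    fields = FIELDS_BY_COUNT.get(len(tokens))
    if fields is None:
        return None  # zero or more than four tokens: leave for manual review
    return dict(zip(fields, tokens))

print(tokenize("Smith, Adam - The Wealth of Nations.txt"))
```

The lookup table keeps the rules in one place, so adding a fifth pattern would only need one more entry.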

I was going to look for long number strings and store them as the ISBN, but none of my files have one, so I commented this out :-)

Standardization:

1) If there are no dashes, look for an excess of periods and convert them to dashes

2) if there are no spaces look for cap letters (make sure not all caps) and add spaces before the caps (except leading letter of filename)

AdamSmith.TheWealthOfNations.txt --> Smith, Adam - The Wealth of Nations

3) trim all strings

4) Eliminate multiple dashes & spaces

5) Convert underscores to spaces if there are dashes in the filename -- otherwise convert them to dashes.
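
A rough Python sketch of the standardization rules, for illustration only (again, not the actual C# code). Several things here are assumptions: the function name `standardize`, the order in which the rules are applied (1, 5, 2, 4, 3), and the reading of "excess of periods" as "any period left in the name once the extension is removed". It does not reorder the author name or fix title capitalization, so the Adam Smith example comes out as "Adam Smith - The Wealth Of Nations".

```python
import os
import re

def standardize(filename):
    """Apply the five standardization rules to a file name (sketch)."""
    stem, _ext = os.path.splitext(filename)

    # Rule 1: with no dashes present, treat periods as field separators.
    # Assumption: a single period already counts as an "excess".
    if "-" not in stem and "." in stem:
        stem = stem.replace(".", "-")

    # Rule 5: underscores become spaces if dashes exist, otherwise dashes.
    if "_" in stem:
        stem = stem.replace("_", " " if "-" in stem else "-")

    # Rule 2: with no spaces and not all caps, put a space before each
    # capital letter, except at the start of the name or right after a dash.
    if " " not in stem and not stem.isupper():
        stem = re.sub(r"(?<=[^\s-])(?=[A-Z])", " ", stem)

    # Rule 4: collapse runs of dashes and spaces into single separators.
    stem = re.sub(r"\s*-[\s-]*", " - ", stem)
    stem = re.sub(r"\s+", " ", stem)

    # Rule 3: trim stray spaces and dashes from both ends.
    return stem.strip(" -")

print(standardize("AdamSmith.TheWealthOfNations.txt"))
```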

Authors:

1) Look for "shorties" (e.g., von, van) and treat as part of last name
2) look for suffixes (e.g., jr.) and treat as part of last name
3) Use and, AND, And, & as name separators
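
Rules 1 and 2 could look something like the Python sketch below when turning a single "First Last" name into "Last, First". It is an illustration under assumptions: the function name `last_first` is made up, the `PARTICLES` and `SUFFIXES` lists go beyond the von/van and jr. mentioned here, and any name that already contains a comma is assumed to be in "Last, First" form and is returned unchanged. Capitalization is left as found.

```python
# Assumed word lists; the post itself only names von, van and jr.
PARTICLES = {"von", "van", "de", "der", "la"}
SUFFIXES = {"jr", "sr", "ii", "iii"}

def last_first(name):
    """Convert 'First [particle] Last [suffix]' to 'Last, First' (sketch)."""
    name = name.strip()
    if "," in name:
        return name  # assumed to be 'Last, First' already
    words = name.split()
    if len(words) < 2:
        return name  # a single word cannot be reordered
    # Rule 2: peel a trailing suffix such as 'Jr.' off the end.
    suffix = []
    if len(words) > 2 and words[-1].rstrip(".").lower() in SUFFIXES:
        suffix = [words.pop()]
    # Rule 1: the last name is the final word plus any particles before it,
    # always leaving at least one word for the first name.
    start = len(words) - 1
    while start > 1 and words[start - 1].lower() in PARTICLES:
        start -= 1
    last = " ".join(words[start:] + suffix)
    first = " ".join(words[:start])
    return f"{last}, {first}"

print(last_first("Boris van Welke Jr."))  # van Welke Jr., Boris
```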

Common multiple name forms to be parsed correctly:

john smith and Jane doe
Smith,john and Doe, jane
Tom Brady, john smith, and paul bunyon
Tom Brady, john smith & paul bunyon
Tom Brady & john smith & paul bunyon...
George and Martha Washington
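
One possible way to split the forms above into individual names, sketched in Python. This is a heuristic illustration, not a tested parser. The name `split_authors` and both guesses it makes are assumptions: a comma-separated pair is kept as one "Last, First" name when either side is a single word, and a bare first name borrows the last word of the final name (the "George and Martha Washington" case). A last name with a particle, or a one-word author next to a two-word one, would fool it.

```python
import re

# 'and' in any case, or '&', optionally preceded by a comma.
SEPARATOR = re.compile(r"\s*,?\s*(?:\band\b|&)\s*", re.IGNORECASE)

def split_authors(text):
    """Split an authors field into individual names (heuristic sketch)."""
    names = []
    for chunk in SEPARATOR.split(text.strip()):
        parts = [p.strip() for p in chunk.split(",") if p.strip()]
        # Guess: two parts with a one-word side means 'Last, First'.
        is_last_first = len(parts) == 2 and any(len(p.split()) == 1 for p in parts)
        if is_last_first:
            names.append(", ".join(parts))  # keep as a single name
        else:
            names.extend(parts)  # commas separate whole names
    # 'George and Martha Washington': lend the final surname to bare first names.
    if len(names) > 1 and "," not in names[-1] and len(names[-1].split()) > 1:
        surname = names[-1].split()[-1]
        names = [n if " " in n or "," in n else f"{n} {surname}" for n in names]
    return names

for sample in ("Smith,john and Doe, jane",
               "Tom Brady, john smith, and paul bunyon",
               "George and Martha Washington"):
    print(split_authors(sample))
```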

If you are interested I would be glad to see some test cases from you!
Thanks again-

-glenn
