Old 01-20-2012, 11:37 AM   #274
Junior Member
kbullkar began at the beginning.
Posts: 6
Karma: 10
Join Date: Jan 2012
Location: Atlanta, GA.
Device: Kindle (2xkb,1xt,xoom,ipad,ipad2,pc,2xmac), nook(kids-ipad), ade(mac)
Data Normalization

I have been using Calibre for a little while now, and like it a lot. There is some functionality I would want that I don't see, so I have been thinking about writing a plug-in that would meet my needs.

Essentially, this is about data normalization with a few extra things... This would be a single plugin that covers a broad range of normalization tasks. It is not meant to do/replace QC or duplicate finder, just normalize the data set.

Unfortunately, when I plug in my Kindle and sync it my tags get all messed up, so this is something I go through periodically.

- Tags: These get to be a mess quickly. In a lot of cases Amazon has multiple variations of the same tag. I would want to create a tag normalization table where I have a set of master tags that I can "map" other tags to. Additionally, it would provide a report of tags that are not mapped to the master list (and would leave them in place so as not to lose the information). I am leaning towards splitting out Tags into Genres and Tags using a custom column (and possibly another field for tags that are less about the book and more about the book in relation to the user, e.g. "read", "to read", "liked", "hated"), but haven't finalized my thoughts there. Also, I have been thinking about tags and series... I always find it bizarre when 2 of the 3 books in a trilogy are marked mystery and the 3rd is marked suspense, so possibly some series-wide normalization (maybe more genre-wise than tag-wise if I split that out).

- Authors: First off, I have multiple e-readers and manage a relatively large library (8k+ books) for myself/parents/sister. Some pull in books as LN, FN, some as FN LN. That is a relatively simple issue, and I have a stored regexp that fixes it, though there are exceptions (", Jr." or ", III") and I need a better way of dealing with those than converting them back at the end (or at least a way of automating that re-conversion). When there are more than 3 authors, I use "multiple authors" as the author name (generally this is an anthology or short story collection) - the same goes for anything that has editor, ed. etc. in the author name. Some authors have written under pseudonyms and I like to standardize them to their "real" or most accepted current name. Some authors are listed both with and without a middle initial.

- Series: Not only do I have books that are in multiple series, but in some cases the name of the series is recorded differently in the metadata and needs to be normalized. I would also like the ability to track series by a single author vs. series with multiple authors, and use that to help with some of the author normalization. Last, I would like a report of which books in the series it thinks I am missing (ideally it would look up the title and author of that book on Goodreads or Amazon, then search my library to see whether I already have the book and it is just missing the series label). I do have some "special" series numbers I use (e.g. 99.xx for extras), and lots of the series have had books added in the middle (so a book is number 2.5 in the series, squeezed between 2 and 3) - I am not planning on attacking that right now (other than not treating a gap between 2 and 99 as missing books 3-98).

- Title: I have a Kindle Touch right now, so collections are broken. One day they might return, but one thing that really annoys me is the way it sorts books: even in a collection, it doesn't order a series by each book's number in the series. When I have a series, I would like the series name, number, then title to show in the title field. I am thinking I will need to move the title into a custom field, then hide the real title field (the one used when creating the books) and make it a composite of those values. Also, there are some titles that are all-caps, and I need to convert those to title case.
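
The Authors, Series, and Title items above could be prototyped as small standalone functions before touching any plugin code. The following is only a rough sketch under stated assumptions: every name in it (`SUFFIXES`, `flip_author`, `normalize_authors`, `missing_series_numbers`, `composite_title`) is hypothetical, it does not use the Calibre API at all, and it assumes series normally start at 1 and that the 99.xx range is reserved for extras as described in the post.

```python
import re
import string

# Hypothetical suffix list; extend it to match the exceptions in your library.
SUFFIXES = {"jr", "jr.", "sr", "sr.", "ii", "iii", "iv"}
MULTI_AUTHOR_LABEL = "multiple authors"  # label used in the post
EDITOR_RE = re.compile(r"\b(?:editor|eds?\.)", re.IGNORECASE)


def flip_author(name):
    """Turn 'LN, FN' into 'FN LN', keeping a suffix such as ', Jr.' at the end."""
    parts = [p.strip() for p in name.split(",") if p.strip()]
    suffix = ""
    if len(parts) > 1 and parts[-1].lower() in SUFFIXES:
        suffix = ", " + parts.pop()
    if len(parts) == 2:
        parts = [parts[1], parts[0]]  # 'Last', 'First' -> 'First', 'Last'
    return " ".join(parts) + suffix


def normalize_authors(authors):
    """More than 3 authors, or any editor credit, collapses to one label."""
    if len(authors) > 3 or any(EDITOR_RE.search(a) for a in authors):
        return [MULTI_AUTHOR_LABEL]
    return [flip_author(a) for a in authors]


def missing_series_numbers(indices, extras_start=99):
    """Whole-number positions that look missing from a series.

    Fractional entries (2.5) and anything at or above extras_start (99.xx)
    are ignored, so a jump from 2 to 99 is not reported as books 3-98.
    """
    have = {int(i) for i in indices
            if i < extras_start and float(i).is_integer()}
    if not have:
        return []
    return [n for n in range(1, max(have) + 1) if n not in have]


def composite_title(title, series=None, index=None):
    """Build 'Series N - Title'; all-caps titles are re-cased first.

    string.capwords is a crude stand-in for real title casing (it also
    capitalizes small words such as 'of'). Numbers are not zero-padded,
    so a device that sorts titles as text may still put 10 before 2.
    """
    if title.isupper():
        title = string.capwords(title)
    if not series:
        return title
    number = f" {index:g}" if index is not None else ""
    return f"{series}{number} - {title}"
```

For example, `flip_author("Vonnegut, Kurt, Jr.")` gives `"Kurt Vonnegut, Jr."`, and `missing_series_numbers([1, 2, 2.5, 4, 99.01])` gives `[3]`. Pseudonyms and middle initials are not covered here; those would probably need a hand-maintained lookup table rather than a rule.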

Right now, this is just a pie-in-the-sky desire, but something I plan on starting to play with. My initial thought is to develop this myself for myself (no UI plugins needed, just customize the python scripts as I go), though I think it might interest other people. If I just haven't found the plugin that is out there that does these things, please point it out to me. If anyone else is interested in the same type of normalization tool and is interested in either tinkering around / learning in tandem or mentoring, I think that would be cool.

Originally Posted by kiwidude View Post
@schuster - I do like the idea of being able to track down books that need larger covers or have invalid metadata. It could indeed be fairly easily done as a plugin. Does anyone else have any other suggestions for criteria by which they would like to isolate books that aren't available with the current search?

I'll add a couple more plugin ideas here so I don't forget them...
  • A "Cover Generator" plugin. Calibre has one built in, but it isn't flexible for people who want to have their own cover images. Quite often, particularly for the likes of short stories, there is no commercial/official cover available. I remember seeing a ticket ages ago for someone wanting the ability to have different covers for different genres. I think a plugin that let you define a bunch of possible images, from which you then choose one when you generate, would be useful to me at least (not a fan of the default image). Potentially there could be more control over the text that appears on it as well - if you have any suggestions for what you might want to see, let me know.
  • I mentioned on another thread an idea for a "Tag Cleaner". It would use a similar approach to that in the Goodreads Metadata download plugin of defining a mapping between input tags and tags you want to use in your library. You would have a gui allowing you to customise and control which tags map to each and discard any you do not have a mapping for. It would work across metadata sources (including the ability to remap the Goodreads tags) as you would do it as a separate step after your normal Ctrl+D download. So if you are frustrated with getting a dozen variants of "sci-fi, scifi, science fiction" this would automatically resolve them rather than manually retyping the tags yourself.
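
A tag mapping of this kind is easy to try out on plain lists of strings before any GUI exists. The sketch below is only an illustration: `TAG_MAP` and `clean_tags` are hypothetical names, the mapping entries are just the "sci-fi" example from the quote, and nothing here is taken from the Goodreads plugin or the Calibre API. The `keep_unmapped` flag covers both behaviours discussed in this thread: discarding tags that have no mapping, or leaving them in place and reporting them.

```python
# Hypothetical mapping table: variant tag (lowercased) -> master tag.
TAG_MAP = {
    "sci-fi": "Science Fiction",
    "scifi": "Science Fiction",
    "science fiction": "Science Fiction",
}


def clean_tags(tags, tag_map=TAG_MAP, keep_unmapped=False):
    """Return (cleaned_tags, unmapped_tags) for one book.

    Mapped variants collapse to their master tag without duplicates.
    Unmapped tags are always reported in the second list; they are kept
    in the cleaned list only when keep_unmapped is True.
    """
    cleaned, unmapped = [], []
    for tag in tags:
        master = tag_map.get(tag.strip().lower())
        if master is None:
            unmapped.append(tag)
            if not keep_unmapped:
                continue  # discard tags that have no mapping
            master = tag
        if master not in cleaned:
            cleaned.append(master)
    return cleaned, unmapped
```

Calling `clean_tags(["sci-fi", "SciFi", "Space Opera"])` returns `(["Science Fiction"], ["Space Opera"])`; with `keep_unmapped=True` the first list also keeps `"Space Opera"`, so no information is lost while the report still flags it.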