MobileRead Forums - View Single Post

kiwidude · 02-09-2011, 10:57 PM

Quote:

Originally Posted by vitalichka

I've seen this thread several times while looking to work out the duplicate issue.
Then was again led here by kiwidude and dwanthny a little while ago.
I've read much of what everyone has posted here and from what I gather there seems to be no real way of fighting duplicates once all of the books have been added.
Am I understanding this right? Outside of manually going through everything yourself which in my case would be impossible.
The other issue in my case is that I had the auto merge setting off for 80% or more of the time and towards the end of the adding process turned it on as per the suggestions of the two guys above.
So this sort of complicates things even more.

As of today in Calibre you are correct, there is no built-in functionality to help you identify the duplicates that you have already imported into your library. All you can do is manual visual inspection.

The recent posts on this thread have been about possible ideas for building a plugin tool for Calibre that *will* attempt to identify duplicates already in the library. What we haven't quite nailed down as yet before I start writing it is exactly how it might work, though I think we are iterating closer to that with recent posts. I've got some other plugins I want to finish developing first over the next week or two and then it will have more of my focus to get stuck into it.

In the meantime, your options are to (a) be patient until we get the development done, (b) manual inspection which I agree isn't practical for a large library, or (c) use tools/scripts outside of Calibre to query against the database, some of which you will find in old threads on these forums if you search.
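
For option (c), here is a minimal sketch in Python of the sort of external script that could be pointed at the library database. The table and column names (`books`, `authors`, `books_authors_link`) are an assumption about how the library's SQLite `metadata.db` file is laid out, so check them against your own file first, and only run this against a backup copy. The `norm` and `find_duplicates` helpers are made up for illustration.

```python
import re
import sqlite3


def norm(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", (text or "").lower())


def find_duplicates(conn):
    """Return groups of book ids that share a normalised author and title.

    Assumes tables books(id, title), authors(id, name) and
    books_authors_link(book, author); verify this against your own database.
    """
    # Expose the Python helper to SQL so the grouping can use it.
    conn.create_function("norm", 1, norm)
    rows = conn.execute(
        """
        SELECT norm(a.name) AS author_key,
               norm(b.title) AS title_key,
               GROUP_CONCAT(b.id) AS ids
        FROM books b
        JOIN books_authors_link l ON l.book = b.id
        JOIN authors a ON a.id = l.author
        GROUP BY author_key, title_key
        HAVING COUNT(*) > 1
        """
    ).fetchall()
    # Each row carries a comma-separated id list; turn it into sorted ints.
    return sorted(sorted(int(i) for i in ids.split(",")) for _, _, ids in rows)
```

To try it, open a copy of the database with `sqlite3.connect(...)` and pass the connection to `find_duplicates`. It only catches titles and authors that match once case and punctuation are ignored, so it is no substitute for proper fuzzy matching.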

Quote:

Also does anyone know if there is a way to have Calibre work with titles in a better way? As in, currently there is the option of gathering based on meta data and file name.

Some of my files have meta and other don't and some would be better with the file name, since they don't contain any meta data but the file name is clear so right now it seems like in some cases simple windows search would do the trick better but since I don't want two copies of the library (large) I am stuck. If this makes sense.

I'd love to hear a better answer too but afaik the answer is no, there is no magic fairy dust option. Garbage in, garbage out as far as Calibre is concerned. If you can't rely on the metadata in the file (which you can't if you import formats like TXT which have none, or if you use the LN, FN author format and the metadata doesn't match it) then you have to switch that option off as I do, and then it is all down to the filename.
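
As a rough illustration of what "all down to filename" means in practice, here is a sketch in Python that pulls author and title out of a filename with a regular expression. The `LN, FN - Title.ext` naming convention, the pattern itself and the `parse_filename` helper are assumptions for the example; the named groups `author` and `title` follow the convention that, as far as I know, Calibre's add-books regular expression uses.

```python
import re
from pathlib import PurePath

# Hypothetical pattern: "<author> - <title>", where the author part has no hyphen.
FILENAME_RE = re.compile(r"^(?P<author>[^-]+?)\s+-\s+(?P<title>.+)$")


def parse_filename(path):
    """Return (author, title) parsed from the filename, or None if it doesn't fit."""
    stem = PurePath(path).stem  # filename without the final extension
    match = FILENAME_RE.match(stem)
    if match is None:
        return None
    return match.group("author").strip(), match.group("title").strip()
```

A file that doesn't follow the convention returns `None`, which is exactly the case where the filename has to be cleaned up by hand before adding.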

My personal workflow is:
(1) Use Duplicate File Finder (a free tool) to scan my input directories and Calibre before I do anything with the files. That lets me get rid of exact CRC duplicates without caring about any filename cleanups.
(2) Use a tool I hacked together in C# which lets me quickly slice and dice filenames in bulk to match exactly my Calibre add regular expression with various hotkeys.
(3) Using that same tool, do a "pre-add" to Calibre by querying the Calibre database and looking for matches on author and title (similar to the automerge algorithm). This moves the files into different import subfolders ready to add depending on whether they are a new book, a new format for an existing book or a duplicate format. A duplicate format requires a visual comparison to identify which version I want to keep. However, if its filesize is within a low % of the existing Calibre book filesize then I push it into a fourth folder for deletion (to allow for EPUBs with touched bookmarks etc).
(4) I also have special processing to take care of html folders of books, since they need to be added "one per folder" whereas the rest are "many files per folder"
(5) I go ahead and import the folders of "new books" and "new formats of an existing book", with automerge turned on. The "duplicates for deletion" folder gets tossed, and the "duplicate formats" folder has to be manually compared before I add one by one.
(6) I have additional screens in my tool which run various SQL queries against the database, which I do periodically to pick up duplicate authors or titles with various fuzzy logic algorithms.
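
The sorting in step (3) can be sketched roughly as follows, in Python rather than the C# of the tool described. Everything here is hypothetical: the `classify` function, the shape of the `library` dictionary, the bucket names and the 1% size tolerance are assumptions chosen for illustration, since no code or exact threshold is given above.

```python
def norm(text):
    """Lowercase and keep only letters and digits, for loose matching."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def classify(author, title, fmt, size, library, tolerance=0.01):
    """Decide which import bucket an incoming file belongs in.

    `library` is assumed to map (author_key, title_key) to a dict of
    {FORMAT: size_in_bytes} for books already in the library.
    """
    formats = library.get((norm(author), norm(title)))
    if formats is None:
        return "new_book"  # no author/title match at all
    existing = formats.get(fmt.upper())
    if existing is None:
        return "new_format"  # book exists, but not in this format
    # Same format already present: near-identical size is treated as a copy.
    if existing and abs(size - existing) / existing <= tolerance:
        return "delete"
    return "compare_manually"
```

In a real run the `library` dictionary would be built from a query against the Calibre database, and each result would decide which subfolder the file gets moved to.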

Long winded - sure. But you hope to only do it once. And in case you ask, no, the tool I wrote isn't available; it is too specific to how I work and the code is true hack filth. In theory it could be rewritten to be a Calibre plugin if enough people thought it would be useful, but that would be a load of work. I certainly hope parts of it, like some of the duplicate comparison stuff, will become deprecated thanks to the recent additions by Starson/Kovid for the next release and of course the duplicate finder plugin when it appears. However, they still won't solve some of the fundamental issues of getting your filenames 100% correct before you add to Calibre.