02-08-2011, 02:41 PM | #61 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I'm guilty of the same thing. I often find that posting helps organize (organise if you prefer) my thoughts. I hadn't really come to grips with the "best" way to handle the duplicate groups issue or the dialog vs. Library UI issue, but it looks like you're zeroing in on a good way to handle it all, without too much new code and too many new interfaces for the user to learn.
|
02-08-2011, 02:58 PM | #62 | |
Wizard
Posts: 3,447
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
|
|
|
02-09-2011, 04:58 AM | #63 |
Grand Sorcerer
Posts: 11,703
Karma: 6658935
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Kiwidude mentioned an email I sent him on this issue where I suggested using a custom column. Here it is, verbatim:
Spoiler:
In further conversations with kiwidude, the question of false positives came up. My suggestion would be to permit the user to say that two (or more?) given books are not duplicates. This information would be used by the duplicate detector to ensure that those books never appear together in a duplicate-book partition. The performance of this check would be very good if set arithmetic is used. Something like (in pseudo-code) Code:
for each partition_set created by the dup finder:
    for book_id in partition_set:
        result_set = partition_set - books_are_not_dups[book_id]
        if result_set:
            create a real partition label
            for b_id in result_set:
                add partition label to the cust col for b_id
|
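To make the set arithmetic concrete, here is a minimal Python sketch of the same idea. The names (partition_sets, books_are_not_dups) and the plain integer labels are assumptions for illustration, not code from chaley's email or from any plugin, and actually writing the labels into a custom column is left out.
Code:
# Sketch only (assumed names, not actual plugin code): prune user-declared
# false positives from duplicate partitions via set arithmetic.
def build_partition_labels(partition_sets, books_are_not_dups):
    """partition_sets: iterable of sets of book ids from the dup finder.
    books_are_not_dups: dict mapping a book id to the set of ids the user
    has declared are NOT duplicates of it."""
    labels = {}          # book_id -> list of partition labels
    next_label = 0
    for partition_set in partition_sets:
        for book_id in partition_set:
            # remove every book the user said is not a duplicate of this one
            result_set = partition_set - books_are_not_dups.get(book_id, set())
            if len(result_set) > 1:      # book_id plus at least one other book
                next_label += 1
                for b_id in result_set:
                    labels.setdefault(b_id, []).append(next_label)
    return labels

# Books 10, 11 and 12 were flagged together, but the user has marked 10 and 12
# as not being duplicates of each other.
print(build_partition_labels([{10, 11, 12}], {10: {12}, 12: {10}}))
Note that a book can end up carrying more than one label, which is one argument for storing them in a tags-like (multi-value) custom column.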
02-09-2011, 09:11 AM | #64 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
That's very similar to my idea, except that I had in mind assigning a number to each dupe set and then highlighting the odd-numbered sets. When sorted by the dupe set number, all members of every dupe set would sit together, and the alternating highlight would mark where each set begins and ends (the first set highlighted, the second not, the third highlighted, and so on).

I wasn't sure of the best way to avoid repeat false positives. Simply removing a book from any future dupe groups has problems: you might want to exclude a book from one test because it was truly a false positive there, yet later run a different type of dupe test that correctly finds it as a dupe of some other book. You might also add another book later that genuinely matches a book previously marked as a false positive.

As usual, those are just random thoughts. I guess Kiwidude has his work cut out for him. |
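A tiny Python sketch of how that numbering-plus-alternating-highlight could behave; the records, the dupe_set field and the shading rule are all made up for the example, not anything from Calibre:
Code:
# Sketch only: sort by dupe set number, shade odd-numbered sets so the eye can
# see where one set ends and the next begins.
books = [
    {'id': 7, 'title': 'Dune',  'dupe_set': 2},
    {'id': 3, 'title': 'Dune ', 'dupe_set': 2},
    {'id': 9, 'title': 'Emma',  'dupe_set': 1},
    {'id': 4, 'title': 'EMMA',  'dupe_set': 1},
]
for book in sorted(books, key=lambda b: b['dupe_set']):
    highlighted = book['dupe_set'] % 2 == 1      # odd-numbered sets get shaded
    marker = '*' if highlighted else ' '
    print(f"{marker} set {book['dupe_set']}: id {book['id']} {book['title']!r}")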
|
02-09-2011, 09:17 AM | #65 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I suspect that this would be a bit too aggressive. I've run many dupe tests on my library using various SQL queries. My experience is that you want to run many different tests, as each individual test may find some dupe books, but not others. I don't mind the time to run a second dupe test over my books, even if I'm pretty sure they aren't dupes. If they weren't found to be dupes the first time, they shouldn't show up as a false positive the second time. If they do, then I want to review and find out why it's been identified as a dupe this time, when it wasn't the first time.
Last edited by Starson17; 02-09-2011 at 09:51 AM. |
|
02-09-2011, 11:42 AM | #66 | ||||||||
Grand Sorcerer
Posts: 11,703
Karma: 6658935
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
|
||||||||
02-09-2011, 12:17 PM | #67 | |||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Quote:
If you show all books at once, some kind of divider is needed to show where each dupe set ends and the next starts. In working with my personal duplicate testing I found myself with lots of dupe sets and having to do lots of searches - one for each group. It was a pain. Perhaps a quick key could be assigned to "show members of the next duplicate set", but actually typing a new search for each new duplicate set would get repetitive. I suppose you're thinking of doing this via the tag browser, which I seldom use. Quote:
|
|||
02-09-2011, 12:26 PM | #68 | ||
Grand Sorcerer
Posts: 11,703
Karma: 6658935
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Quote:
Last edited by chaley; 02-09-2011 at 12:30 PM. Reason: typos |
||
02-09-2011, 03:06 PM | #69 | |
Wizard
Posts: 3,447
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
Especially when running a query doesn't require excessive time. |
|
02-09-2011, 09:38 PM | #70 |
Enthusiast
Posts: 39
Karma: 10
Join Date: Jan 2011
Device: Nook Color
|
I've seen this thread several times while looking to work out the duplicate issue.
I was then led here again by kiwidude and dwanthny a little while ago. I've read much of what everyone has posted here, and from what I gather there seems to be no real way of fighting duplicates once all of the books have been added - other than manually going through everything yourself, which in my case would be impossible. Am I understanding this right?

The other issue in my case is that I had the auto merge setting off for 80% or more of the time, and only turned it on towards the end of the adding process, as per the suggestions of the two guys above. So this complicates things even more.

Also, does anyone know if there is a way to have Calibre work with titles in a better way? Currently there is the option of gathering based on metadata or on file name. Some of my files have metadata and others don't, and some would be better handled by the file name, since they contain no metadata but the file name is clear. Right now it seems like in some cases a simple Windows search would do the trick better, but since I don't want two copies of the (large) library I am stuck. I hope this makes sense. Thanks everyone.

Last edited by vitalichka; 02-09-2011 at 09:40 PM. Reason: some minor changes to clear things up |
02-09-2011, 10:57 PM | #71 | ||
calibre/Sigil Developer
Posts: 4,601
Karma: 2092290
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
The recent posts on this thread have been about possible ideas for building a plugin tool for Calibre that *will* attempt to identify duplicates already in the library. What we haven't quite nailed down yet, before I start writing it, is exactly how it might work, though I think we are iterating closer to that with recent posts. I've got some other plugins I want to finish developing first over the next week or two, and then it will have more of my focus. In the meantime, your options are to (a) be patient until we get the development done, (b) do a manual inspection, which I agree isn't practical for a large library, or (c) use tools/scripts outside of Calibre to query the database, some of which you will find in old threads on these forums if you search. Quote:
My personal workflow is:

(1) Use Duplicate File Finder (a free tool) to scan my input directories and Calibre before I do anything with the files. That lets me get rid of exact CRC duplicates without caring about any filename cleanups.

(2) Use a tool I hacked together in C# which lets me quickly slice and dice filenames in bulk, with various hotkeys, to exactly match my Calibre add regular expression.

(3) Using that same tool, do a "pre-add" to Calibre by querying the Calibre database and looking for matches on author and title (similar to the Automerge algorithm). This moves the files into different import subfolders ready to add, depending on whether they are a new book, a new format for an existing book, or a duplicate format. A duplicate format requires a visual comparison to identify which version I want to keep - although if the filesize differs from the existing Calibre book's filesize by only a low %, I push it into a fourth folder for deletion (to allow for EPUBs with touched bookmarks etc.).

(4) I also have special processing to take care of folders of HTML books, since they need to be added "one per folder" whereas the rest are "many files per folder".

(5) I go ahead and import the folders of "new books" and "new formats of an existing book", with automerge turned on. The "duplicates for deletion" folder gets tossed, and the "duplicate formats" folder has to be manually compared before I add the files one by one.

(6) I have additional screens in my tool which run various SQL queries against the database, which I do periodically to pick up duplicate authors or titles with various fuzzy-logic algorithms (a sketch of that kind of query follows below).

Long winded - sure. But you hope to only do it once. And in case you ask: no, the tool I wrote isn't available; it is too specific to how I work and the code is true hack filth. In theory it could be rewritten to be a Calibre plugin if enough people thought it would be useful, but that would be a load of work. I certainly hope parts of it will become deprecated, like some of the duplicate comparison stuff, thanks to the recent additions by Starson/Kovid for the next release and of course the duplicate finder plugin when it appears. However, they still won't solve some of the fundamental issues of getting your filenames 100% correct before you add to Calibre. |
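As an illustration of the kind of query step (6) alludes to, here is a small Python/sqlite3 sketch that lists title/author_sort pairs occurring more than once in Calibre's metadata.db. The file path is a placeholder and the matching rules are the simplest possible exact match, not the fuzzy logic described above; always run this against a copy of the database, never the live library.
Code:
# Illustrative only: find exact title/author_sort duplicates in a COPY of
# Calibre's metadata.db (the books table has both columns).
import sqlite3

conn = sqlite3.connect('/path/to/copy/of/metadata.db')   # placeholder path
rows = conn.execute("""
    SELECT author_sort, title, COUNT(*) AS cnt
    FROM books
    GROUP BY lower(author_sort), lower(title)
    HAVING COUNT(*) > 1
    ORDER BY cnt DESC
""").fetchall()
for author_sort, title, cnt in rows:
    print(f"{cnt:3d}  {author_sort}  --  {title}")
conn.close()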
||
02-10-2011, 07:47 AM | #72 | |
Grand Sorcerer
Posts: 11,703
Karma: 6658935
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Eyes glazing over ...
Quote:
An example implementation is under the spoiler. Spoiler:
This implementation depends on the fact that the items in the graph are numbers and can be fully ordered, so it is easy to prune duplicate paths simply by never traversing an edge to a node less than the one in hand. As the list is sorted, such an edge would already have been traversed in the other direction, so the algorithm does not need to deal with discovering both (1,2,3) and (2,3,1). |
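Since the spoiler's contents are not reproduced here, the following is only a sketch of the general grouping idea being described - folding similarity pairs into transitive groups and emitting each group exactly once, in sorted order - and not the implementation from the spoiler itself:
Code:
# Sketch, not the spoiler's code: group similarity pairs transitively,
# starting each sweep from the smallest unvisited id so every group is
# discovered once and comes out sorted, never as both (1,2,3) and (2,3,1).
from collections import defaultdict

def similarity_groups(pairs):
    adjacent = defaultdict(set)
    for a, b in pairs:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen = set()
    for start in sorted(adjacent):        # smallest member of a group first
        if start in seen:
            continue                      # its group has already been emitted
        group, stack = set(), [start]
        while stack:                      # plain depth-first sweep
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacent[node] - group)
        seen |= group
        yield tuple(sorted(group))

print(list(similarity_groups([(1, 2), (2, 3), (5, 6)])))   # [(1, 2, 3), (5, 6)]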
|
02-10-2011, 08:08 AM | #73 |
calibre/Sigil Developer
Posts: 4,601
Karma: 2092290
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
|
02-10-2011, 08:26 AM | #74 | |
Grand Sorcerer
Posts: 11,703
Karma: 6658935
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
The interesting part for me is manipulating similarity data efficiently and accurately. That will be a small part of the plugin. The larger part, GUI and other 'look and feel' stuff, is far better left to someone else. |
|
02-10-2011, 09:08 AM | #75 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
It's far from perfect, and you might even lose some metadata where some records have good data and other matching records have bad data. It depends on the order of processing. You keep metadata only from the first record processed. The formats for matching records are added to that first record, but no metadata is added. If you've got garbage for author/title, it won't do much good. |
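A rough sketch of the order dependence being described; this only illustrates the behaviour as explained above, it is not Calibre's actual Automerge code, and the record layout is invented:
Code:
# Illustration only: the first record seen for an author/title key keeps its
# metadata; later matches contribute their formats but nothing else.
def automerge(incoming):
    merged = {}
    for rec in incoming:
        key = (rec['author'].lower(), rec['title'].lower())
        if key not in merged:
            merged[key] = {'metadata': dict(rec['metadata']),
                           'formats': list(rec['formats'])}
        else:
            merged[key]['formats'].extend(rec['formats'])   # metadata dropped
    return merged

good = {'author': 'Austen, Jane', 'title': 'Emma',
        'metadata': {'tags': ['classic']}, 'formats': ['EPUB']}
bad  = {'author': 'austen, jane', 'title': 'emma',
        'metadata': {}, 'formats': ['MOBI']}
# Whichever record is processed first wins the metadata:
print(automerge([bad, good]))    # empty metadata kept, both formats attached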
|
Tags |
duplicate |