02-10-2011, 09:38 AM | #76 | ||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I know. The reason I didn't consider the possibility that a book could be in multiple sets was because I assumed we'd use the same process used for automerge. It processes the author and/or title to produce a character string. Then identical character strings are grouped. You only get one character string per book, so a book can only be in one set of identical books. When you run a different function against the author/title, you get a different character string and different matching sets, but again, you get only one.
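As a rough sketch of the key-based grouping described above (not the actual automerge code): `match_key` is a hypothetical normalizer invented for illustration, and the sample books are made up. Each book yields exactly one key, so each book lands in at most one set.

```python
import re
from collections import defaultdict

def match_key(author, title):
    """Hypothetical normalizer: lowercase, drop punctuation, collapse spaces."""
    text = f"{author} {title}".lower()
    text = re.sub(r"[^a-z0-9 ]", "", text)
    return " ".join(text.split())

def group_by_key(books):
    """books maps book_id -> (author, title). Returns sets with 2+ members."""
    groups = defaultdict(set)
    for book_id, (author, title) in books.items():
        groups[match_key(author, title)].add(book_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Made-up sample data for illustration only.
books = {
    1: ("Stephen King", "The Stand"),
    2: ("stephen king", "The Stand!"),
    3: ("Stephen King", "It"),
}
duplicate_sets = group_by_key(books)
```

Swapping in a different `match_key` gives different sets, but still only one set per book.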
I'll admit, I haven't quite grokked how you will generate multiple sets for a single book. Are you thinking of different runs at different times and storing the results or multiple matching functions run during the matching process ... or what? Quote:
Quote:
The latter type of "next" set only occurs if the matching process permits books to be members of more than one set. I'm still not convinced that we need to allow that at a single point in time. Clearly we need it for different runs (Run 1 match author/title using the automerge function and show duplicate sets, Run 2 do soundex matching of title only, Run 3 do soundex matching of author and exact match title, etc.) but do we need to do all three runs and store the results at the same time? Would it not be sufficient to do the runs individually for each matching function? |
||
02-10-2011, 09:50 AM | #77 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
If the search functions for duplicates include soundex functionality (similar sounding names - fuzzy matching) that isn't implemented in the search bar, we may want to be able to disable the false positive removal, or implement the duplicate finding functions in the search bar. I know that at some point I'm going to find a group of near duplicates that I don't want to merge and do want to eliminate from further duplicate searches, but which I later want to find as a group simply because I remember I found that group once before and I want to see it again. |
|
02-10-2011, 09:53 AM | #78 | ||
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Quote:
Mine is, roughly speaking, that the user requests that some tests be run. These are all run together, producing sets of candidate duplicates. Depending on the fuzziness of the matches, a book can be in more than one set because fuzzy matching isn't transitive (if we have (a matches b) and (b matches c), there is nothing that says that (a matches c)). I don't think that we should force transitivity, so by extension I don't think we should disallow books in multiple sets.
The next step is to ensure that known/declared not-duplicates are removed from the sets. This removes known false positives. This will by necessity produce new sets. For example, assume that the test returns books (1,2,3). Assume further that books (2,3) are known to not be duplicates. To remove the false positive but keep the information the test produced, we must partition (1,2,3) into (1,2) and (1,3).
Thus, we have two ways to get the same book into different duplicate sets: non-transitive operations and known not-duplicate removal. You have introduced a third: the kind of test. I am not sure about the usefulness of this. Do I really care how the potential duplicate was found, again assuming I can remove false positives? If the answer is yes, then I suggest that the different tests use different custom columns, thereby separating the results. |
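The partitioning step could be sketched roughly as follows. This is one possible approach, not a proposed implementation: the function names are hypothetical, and the recursive split can grow quickly for large sets with many exemptions.

```python
def partition(candidates, not_duplicates):
    """Split a candidate set until no subset contains a known not-duplicate pair."""
    candidates = frozenset(candidates)
    for a, b in not_duplicates:
        if a in candidates and b in candidates:
            # a and b may not share a set: try the set without a, and without b.
            return (partition(candidates - {a}, not_duplicates)
                    | partition(candidates - {b}, not_duplicates))
    # A set of fewer than two books carries no duplicate information.
    return {candidates} if len(candidates) > 1 else set()

def remove_known_false_positives(candidates, not_duplicates):
    """Return the maximal surviving subsets as sorted lists."""
    parts = partition(candidates, not_duplicates)
    maximal = [p for p in parts if not any(p < q for q in parts)]
    return sorted(sorted(p) for p in maximal)

result = remove_known_false_positives({1, 2, 3}, [(2, 3)])
```

With the example from the post, `result` holds the two sets (1,2) and (1,3).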
||
02-10-2011, 10:01 AM | #79 | ||
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
I should point out that as it is, search is not capable of comparing a given book against all books in the library. Some serious work would be required to be able to ask the question "find all books that are like this one". If the fuzzy searches are invertible (can be determined from book data), then I can see generating a fuzzy-search expression that produces a list of matches. However, if the fuzzy searches are one way, where some algorithm is applied and some number of books 'win', then things are much more interesting. Quote:
|
||
02-10-2011, 12:48 PM | #80 | ||||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Quote:
The first was the current automerge matching model, which is transitive. An incoming title is processed by the matching function to produce a match pattern. A candidate matching title is processed by the same matching function. If the result for that title matches the match pattern exactly, they are duplicates. a=b and b=c implies all three produce the same match pattern, so a=c. Implementing this easily allows global simultaneous review of all sets with dividers or highlighting to separate groups. A book is only in one set - the set that matches the match pattern for that book. Quote:
Quote:
I was thinking I'd ask to see a first list of books to review for possible dupes:
First approach - book based: I'd see the set (1,2) and decide if 1 and 2 were dupes. I'd then press "next set," see (1,3) and decide if 1 and 3 were dupes. Now I'd press "next set" and see (4, 8). Note that I'm looking at match sets in the order of the books - 1, 2, 3, 4, etc., skipping books that have no matches and any sets previously considered. Note also the difference between stepping through the sets that include book 1 and stepping to another totally unrelated set (4, 8), which I thought might be useful to signal. These are the two types of "next set" I mentioned.
Second approach - still book based: When asking to see the first set of matching books, why not show me the set (1, 2, 3)? Yes, 2 and 3 are not dupes, but I'm not sure if that's useful when showing the books that match book 1. I still need to see if 1 matches 2 and 3. Is it better to do it in two stages or in one?
In the first (automerge) transitive model and the second of the two other approaches above, there is only one set per book. In all three, one is doing a book based review: "Show me books that match this book."
In a set based review, one must consider all the cross links. The number of decisions for set-based review is the number of combinations of two books selected from the match set. For example: in a set of 8 members, I need to make 28 decisions (is the third book the same as the last book, is the fifth book the same as the sixth, etc.). With a book based approach, I need to make only seven decisions (is each of the other books in the set a duplicate of the book under review).
In set based you have a large number of combinations to consider for each set and you have multiple sets for each book, but fewer total sets to analyze. In book based, you have fewer decisions for each set, and you can collapse the sets for that book, if you wish, but you have many more sets to review.
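The decision counts can be checked with a few lines. The function names are made up for illustration. The book-based count assumes one decision per other member of the set, which gives n - 1.

```python
from math import comb

def set_based_decisions(n):
    """Every pair in a set of n books needs its own decision."""
    return comb(n, 2)

def book_based_decisions(n):
    """The book under review is compared once with each other member."""
    return n - 1

for n in (3, 8, 20):
    print(n, set_based_decisions(n), book_based_decisions(n))
```

The gap grows quadratically: the set-based count is n(n-1)/2, while the per-book count stays linear.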
As usual, it's just random thoughts - I've got no certainty as to what would work best. |
||||
02-11-2011, 02:56 AM | #81 | |
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Regarding transitivity, consider the following. Assume:
- a test that matches if two books have one title word in common and one author in common.
- a book 'Ectoplasm' by Joe Blogs (book 1)
- a book 'Auras' by Patricia Posts (book 2)
- a book 'Ectoplasm and Auras' by Joe Blogs and Patricia Posts (book 3). This is an omnibus edition.
The test will identify books (1,3) and (2,3) as potential dupes. Transitivity would give us (1,2,3), which is clearly wrong, as 1 and 2 are definitely not dupes of each other. I am ignoring further levels of transitivity, which would expand the set even more.
The question then becomes which is better: showing all three, which might help identify the omnibus but requires some thought to ignore the (1,2) pair, or showing (1,3) (2,3), which shows the information the test actually found (and avoids the transitive closure problem). I don't have an answer. My guess is that this will come down to personal preference. Joy to the GUI man. Last edited by chaley; 02-11-2011 at 08:52 AM. |
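The omnibus example can be run as a small sketch. The data layout and helper names are assumptions made for illustration, and the word test is deliberately naive (a real test would presumably ignore words like "and").

```python
from itertools import combinations

books = {
    1: {"title": "Ectoplasm", "authors": {"Joe Blogs"}},
    2: {"title": "Auras", "authors": {"Patricia Posts"}},
    3: {"title": "Ectoplasm and Auras",
        "authors": {"Joe Blogs", "Patricia Posts"}},
}

def title_words(title):
    return set(title.lower().split())

def is_candidate(a, b):
    """Match if the books share a title word and share an author."""
    shared_word = title_words(a["title"]) & title_words(b["title"])
    shared_author = a["authors"] & b["authors"]
    return bool(shared_word) and bool(shared_author)

# What the test actually finds: pairs only.
pairs = [(i, j) for i, j in combinations(sorted(books), 2)
         if is_candidate(books[i], books[j])]

def transitive_closure(pairs):
    """Merge pairs that share a book into connected groups (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

closed = transitive_closure(pairs)
```

Here `pairs` contains (1,3) and (2,3), while `closed` collapses them into the single group (1,2,3), even though books 1 and 2 never matched each other.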
|
02-11-2011, 03:05 AM | #82 | |
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
So, at the moment I would be extremely happy if I got result (1,2,3). I would have to go through results anyway and this would be *much* quicker than going through the entire collection author after author (checking for the fuzziness in the author name (that is King Stephen; Stephen King; S. King; King, S.; S KING ...)) |
|
02-11-2011, 04:44 AM | #83 | |
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Also, my experience has shown me that, counter to the RAD religions, having an idea of where one wants to go simplifies life. It is usually hard to unwind a choice, especially one that has UI and architecture consequences, so thinking a bit about end points and trajectory at this point is good. Last edited by chaley; 02-11-2011 at 08:50 AM. Reason: Correct the 'natural result' |
|
02-11-2011, 09:12 AM | #84 | |||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
When examining Book 2, I would have to decide if 2 matches 3. There is no (1,2) match, so it doesn't show on Book 2. With luck, there are no other matches of Book 2 and nothing more appears for that book. We're also done for Book 3, since we've finished the (1, 3) and (2, 3) checks when doing Books 1 and 2 (assuming no other matches for Book 3). Quote:
Your model is to do this set by set, instead of book by book. In set by set, there is no "book under consideration" so we can't (shouldn't) show (1, 2, 3) for the reasons you elegantly explained. Quote:
Without having played with it, or actually used any code, I lean towards your set-by-set approach, but I was just throwing out what was in my mind from the transitive model (which is also book-by-book) based on the automerge code. Perhaps both could be tested or even added as options. Also, we've barely discussed what to do with multiple matching functions, which I suspect will need to be handled. If one matching function is author/title based and I mark (2, 3) as "Not Duplicates", then later use "Find all identical ISBN numbers" as a new matching function, should a (2, 3) match be ignored, even if they have identical ISBN numbers? I ran into some of these problems when writing my personal duplicate finder SQL code. There's a reason we haven't gotten a good duplicate finder yet. As you said - Joy to the GUI man! Last edited by Starson17; 02-11-2011 at 09:50 AM. Reason: Fixed some numbering errors |
|||
02-11-2011, 09:35 AM | #85 | ||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Quote:
In set by set, you have to go through the number of sets that have been found. In book by book you have to go through the number of books that appear in any set. The first has fewer items in the list to review (match sets), but more decisions for each item. The second has more items to review (books in at least one set), but fewer decisions for each item. In the first, (IMHO) you can't collapse items as effectively as in the second. My example was for a set of 8 books. In set by set, you have to make 28 decisions to decide if any book in that set is a match of any other book in the set. In book by book, you look at the first book, then decide if any of the remaining 7 books match. You don't have to decide if the second book is a match of the fifth book, but you can, if you wish. If you merge any books, that merge takes the merged book out of the book list to review, by removing that book from all remaining sets. Last edited by Starson17; 02-11-2011 at 09:48 AM. |
||
02-11-2011, 09:44 AM | #86 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
However, for book by book review, when reviewing Book 1, you only add in books that have been found to match Book 1. Every decision you make is a decision about a match that was actually found by the match function. You only collapse match sets that include Book 1 and only review Book 1 before moving on to Book 2. Of course, while reviewing Book 1, you are very likely to merge books (1, 3, 4) and when you do that, you automatically remove the need to ever review Books 3 and 4. They are removed from all match sets. |
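A minimal sketch of that pruning step, assuming match sets are held as plain Python sets of book ids. `apply_merge` and the sample sets are hypothetical, not existing calibre code.

```python
def apply_merge(match_sets, kept, merged):
    """Drop merged book ids from every set; discard sets left with < 2 books."""
    gone = set(merged) - {kept}
    remaining = []
    for s in match_sets:
        pruned = set(s) - gone
        # A set needs at least two books to be worth reviewing,
        # and identical leftover sets are only listed once.
        if len(pruned) > 1 and pruned not in remaining:
            remaining.append(pruned)
    return remaining

# Made-up match sets; books 3 and 4 are merged into book 1.
match_sets = [{1, 3, 4}, {3, 5}, {4, 6, 7}]
after_merge = apply_merge(match_sets, kept=1, merged=[1, 3, 4])
```

After the merge only (6, 7) is left to review: the sets that depended on books 3 and 4 disappear without any further decisions.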
|
02-11-2011, 09:50 AM | #87 | ||
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
To test my understanding, does the following make sense? Assume that I have done a set-oriented test, and now have a bunch of sets. If I was viewing by book, then when I ask 'show me matches for book X', I would show at one go all the sets containing X. This is a set union, not a transitive closure. In the example above, asking for matches of book 1, I would see book 3. Asking for matches of book 2, I would see book 3. Asking for matches of book 3, I would see books 1 and 2. This is probably a very useful alternate visualization of the data. I don't think it would be hard. All we would need would be a book -> set map. I also think that duplicate processing would happen when building the sets, but would not happen when merging the sets (doing the union). I think that this would give the answers very close to what you describe in your second post (the more detailed example). Quote:
My guess is that duplicate processing must be optional. Fortunately, this isn't hard. Just don't do the post-pass. |
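The book -> set map and the per-book union could look roughly like this. It is only a sketch: the function names are invented, and the match sets are the omnibus example from earlier in the thread.

```python
from collections import defaultdict

def build_book_to_sets(match_sets):
    """Map each book id to the list of match sets that contain it."""
    index = defaultdict(list)
    for s in match_sets:
        for book_id in s:
            index[book_id].append(s)
    return index

def matches_for(book_id, index):
    """Union of all sets containing book_id, minus the book itself.

    This is a plain set union, not a transitive closure: books that are
    only reachable through another book are not pulled in.
    """
    result = set()
    for s in index.get(book_id, []):
        result |= s
    result.discard(book_id)
    return result

match_sets = [frozenset({1, 3}), frozenset({2, 3})]
index = build_book_to_sets(match_sets)
```

Asking for book 1 or book 2 returns only book 3, while asking for book 3 returns books 1 and 2.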
||
02-11-2011, 09:55 AM | #88 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
It's worth commenting that none of this relates to what kiwidude said might be in version 1 - use the existing automerge matching function code.
Automerge produces results where you can't get any of the problems discussed here. Quote:
|
|
02-11-2011, 10:04 AM | #89 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
I was vaguely thinking of some sort of "Next Book" function with a display (Highlighted? Appears in dialog popup window? Is listed first in a list with other match members?) of the "Book under consideration." At some point I have to think about every book that's a possible match of every other book. It just seemed to me that doing them in some order was easier than trying to think about all possible cross matches within any given match set. Obviously, I am going to think about those cross matches, and if I do any merges (or we can think of some way to mark some cross matches I'm sure are not duplicates) then we can use that info to reduce the number of additional review steps by removing those books or known non-matches from the remaining books to be reviewed. Last edited by Starson17; 02-11-2011 at 10:08 AM. |
|
02-11-2011, 10:24 AM | #90 | |
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Hmmm... Perhaps making it a search restriction would be even better. That would facilitate subsearches and subsorts while permitting going back to the original order. I am suspicious of building a navigation dialog box. My feeling is that there would be a lot of pressure to make it as capable as the library view, including all the metadata edit features. On the other hand, if the dialog shows minimal information but is capable of scrolling the library view to the book (in effect, searching for id:nnnn with highlight turned on), then the pressure might be avoided. |
|