MobileRead Forums

Starson17 · 02-11-2011, 09:12 AM

Quote:

Originally Posted by chaley

Doesn't that mean that you must spend brain cycles deciding again that 2 and 3 are not dupes? I can see why you might want that, but then one must work through rather carefully the notion of 'false positive'.

Not if you think of this as "Show me all books that may be duplicates of Book 1." I don't have to think about anything except possible matches to Book 1. There's a possible duplicate set (1, 3) and another (2, 3), so if I work through the books in book order and I'm working on Book 1 matches, I only have to decide whether Book 1 matches Book 3, not whether Book 2 matches Book 3.

When examining Book 2, I would have to decide if 2 matches 3. There is no (1,2) match, so it doesn't show on Book 2. With luck, there are no other matches of Book 2 and nothing more appears for that book. We're also done for Book 3, since we've finished the (1, 3) and (2, 3) checks when doing Books 1 and 2 (assuming no other matches for Book 3).
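To make the book-by-book walk-through concrete, here is a minimal sketch. It assumes the matcher has already produced a list of candidate pairs; `review_queue` and its arguments are names I made up for illustration, not anything in calibre.

```python
from collections import defaultdict

def review_queue(pairs, order):
    """Yield (book, candidates) in review order, skipping pairs already decided."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    resolved = set()  # frozenset pairs the reviewer has already looked at
    for book in order:
        pending = sorted(
            other for other in neighbours[book]
            if frozenset((book, other)) not in resolved
        )
        if pending:
            # Showing the pairs to the user counts as resolving them
            # (either merged or marked "not a match").
            resolved.update(frozenset((book, other)) for other in pending)
            yield book, pending

pairs = [(1, 3), (2, 3)]
print(list(review_queue(pairs, [1, 2, 3])))  # [(1, [3]), (2, [3])]
```

Book 3 never appears in the output because both of its pairs were handled while reviewing Books 1 and 2.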

Quote:

Regarding transitivity, consider the following. Assume:
- a test that matches if two books contain one title word in common and 1 author in common.
- a book 'Ectoplasm' by Joe Blogs (book 1)
- a book 'Auras' by Patricia Posts (book 2)
- A book 'Ectoplasm and Auras' by Joe Blogs and Patricia Posts (book 3). This is an omnibus edition.

The test will identify books (1,3) and (2,3) as potential dupes. Transitivity would give us (1,2,3), which is clearly wrong, as 1 and 2 are definitely not dupes of each other. I am ignoring further levels of transitivity, which would expand the set even more.

I had in mind review in Book order (or selected order, but still by book) showing only (1, 3) for Book 1, and marking that as not a match or merging them. Book 2 is nowhere a match of Book 1, so it would not be shown when examining Book 1. Book 2 is a possible match of Book 3, so we see (2, 3) for Book 2. For Book 3, we see nothing, as the two possible matches (1, 3) and (2, 3) have already been resolved. If we started with Book 3 ('Ectoplasm and Auras' by Joe Blogs and Patricia Posts), however, then we would have seen (1, 2, 3) and resolved the matches for Book 3 only. That would have resolved the (1, 3) and the (2, 3) matches in a single shot. There are no other matches and we're done. Nothing shows up for Book 1 or Book 2. I agree that for this to work well, we need to know which book is under consideration.
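For reference, the quoted example test (one title word and one author in common) could be sketched roughly like this. The book records, the `STOPWORDS` set, and the function names are all my assumptions for illustration; in particular, ignoring filler words like "and" is not part of the test as stated.

```python
from itertools import combinations

# Assumption: filler words should not count as a shared title word.
STOPWORDS = {"and", "the", "a", "of"}

def title_words(title):
    return {w for w in title.lower().split() if w not in STOPWORDS}

def is_candidate(a, b):
    """True if the two books share at least one title word and one author."""
    shared_words = title_words(a["title"]) & title_words(b["title"])
    shared_authors = set(a["authors"]) & set(b["authors"])
    return bool(shared_words) and bool(shared_authors)

books = {
    1: {"title": "Ectoplasm", "authors": ["Joe Blogs"]},
    2: {"title": "Auras", "authors": ["Patricia Posts"]},
    3: {"title": "Ectoplasm and Auras",
        "authors": ["Joe Blogs", "Patricia Posts"]},
}

candidate_pairs = [
    (i, j) for i, j in combinations(sorted(books), 2)
    if is_candidate(books[i], books[j])
]
print(candidate_pairs)  # [(1, 3), (2, 3)]
```

The pair (1, 2) is never produced by the test itself; it only appears if the results are merged transitively.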

Your model is to do this set by set, instead of book by book. In set by set, there is no "book under consideration" so we can't (shouldn't) show (1, 2, 3) for the reasons you elegantly explained.

Quote:

The question then becomes which is better, showing all three which might help identifying the omnibus but requiring some thought to ignore the (1,2) pair, or showing (1,2) (1,3) which shows the information the test actually found (and avoids the transitive closure problem). I don't have an answer. My guess is that this will come to personal preference. Joy to the GUI man.

I see an advantage to showing what was actually found (you wrote "(1,2) (1,3)", but from the example it should have been "(1,3) (2,3)"). However, it only makes sense to me to show that if we're doing a book-by-book review and have a specified "book under consideration." If we're doing set by set, then it can explode if Book 2 matches 4 and 4 matches 7, 8, 9, and so on.
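To illustrate the explosion, here is a small sketch that merges matched pairs transitively using union-find. The pair list extends the example with the hypothetical chain through Books 4, 7, 8 and 9; `transitive_groups` is an illustrative name, not existing code.

```python
def transitive_groups(pairs):
    """Merge matched pairs into groups via union-find (transitive closure)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for book in list(parent):
        groups.setdefault(find(book), []).append(book)
    return sorted(sorted(g) for g in groups.values())

pairs = [(1, 3), (2, 3), (2, 4), (4, 7), (4, 8), (4, 9)]
print(transitive_groups(pairs))  # [[1, 2, 3, 4, 7, 8, 9]]
```

Six pairwise matches collapse into one seven-book group, even though most of those books were never matched against each other directly.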

Without having played with it, or actually used any code, I lean towards your set-by-set approach, but I was just throwing out what was in my mind from the transitive model (which is also book-by-book) based on the automerge code. Perhaps both could be tested or even added as options.

Also, we've barely discussed what to do with multiple matching functions, which I suspect will need to be handled. If one matching function is author/title based and I mark (2, 3) as "Not Duplicates", then later use a "Find all identical ISBN numbers" as a new matching function, should a (2, 3) match be ignored, even if they have identical ISBN numbers?
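I don't have an answer to that question, but one way to keep both options open would be to record which matching function a "Not Duplicates" decision was made under. The sketch below is only that, a sketch: `filter_pairs`, the `exemptions` mapping, and the matcher names are hypothetical.

```python
def filter_pairs(pairs, matcher, exemptions, respect_other_matchers=True):
    """Drop candidate pairs the user already marked "Not Duplicates".

    exemptions maps frozenset({a, b}) to the set of matcher names under
    which that pair was rejected.
    """
    kept = []
    for a, b in pairs:
        rejected_by = exemptions.get(frozenset((a, b)), set())
        if respect_other_matchers:
            skip = bool(rejected_by)          # any earlier rejection counts
        else:
            skip = matcher in rejected_by     # only this matcher's rejections count
        if not skip:
            kept.append((a, b))
    return kept

# (2, 3) was rejected during an author/title pass.
exemptions = {frozenset((2, 3)): {"title_author"}}
isbn_pairs = [(2, 3), (5, 6)]

print(filter_pairs(isbn_pairs, "isbn", exemptions, respect_other_matchers=True))
# [(5, 6)]
print(filter_pairs(isbn_pairs, "isbn", exemptions, respect_other_matchers=False))
# [(2, 3), (5, 6)]
```

Which default is right is exactly the open question; storing the matcher name at least means the decision doesn't have to be baked in up front.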

I ran into some of these problems when writing my personal duplicate finder SQL code. There's a reason we haven't gotten a good duplicate finder yet.

As you said - Joy to the GUI man!