Hey Charles, thanks for the replies.
I don't think we differ at all on how the data should be partitioned, or on the point that a set-based approach to viewing the results has a number of advantages over a book-based one in terms of simplicity of the UI etc.
Where to store the duplicate pair exemption is an interesting one. Sounds like we are agreed that the algorithm is not relevant to the pairing. The maintenance aspect is a point I touched on in a slightly different fashion. I don't believe you would need to store it on both books if you process in book id order, but there is certainly an issue when one of the books in the pair is merged or deleted. Now that we have the library uuid it is more feasible to store library-specific ids in a config file, so I'll add that to the mix.
I'm not sure I understand all your persistence reasons, as surely the same justification for not using a custom column to store exemptions applies to storing groups? Take for instance the example of (1,2) and (1,3) being found. We display (1,2). Now if the user goes "next group" without doing anything, we will want to display (1,3). If the user merges book 2 into book 1 before going next group, we will still want to display (1,3). However if they merge book 1 into book 2, then the second group (1,3) is invalidated. If that information has been persisted into a custom column, the plugin will potentially be doing a lot of repeated querying to find a next group: validating all the ids, clearing the persisted column if the group is not valid, moving to the next group etc. I would have thought that might be a little easier to do when not persisted in the database?
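To make the (1,2) / (1,3) example concrete, here is a rough sketch of how an in-memory approach might validate groups when moving to the next one. The class name `DuplicateGroups`, the `next_group` method and the idea of passing in the set of book ids still in the library are all made up for illustration, not plugin or Calibre code. It also assumes a merge simply deletes the source book, and that a group is only worth showing while at least two of its books survive.

```python
from collections import deque


class DuplicateGroups:
    """In-memory queue of duplicate groups (hypothetical helper, not plugin code)."""

    def __init__(self, groups):
        # Each group is a tuple of book ids, e.g. (1, 2).
        self._pending = deque(tuple(group) for group in groups)

    def next_group(self, existing_ids):
        """Return the next group that still has two or more surviving books.

        existing_ids is the set of book ids currently in the library.
        Groups invalidated by a merge or delete are skipped and dropped.
        """
        while self._pending:
            group = self._pending.popleft()
            survivors = tuple(b for b in group if b in existing_ids)
            if len(survivors) >= 2:
                return survivors
        return None


# Merging book 2 into book 1 removes book 2, so (1, 3) is still valid.
library = {1, 2, 3}
groups = DuplicateGroups([(1, 2), (1, 3)])
first = groups.next_group(library)
library.discard(2)
second = groups.next_group(library)

# Merging book 1 into book 2 removes book 1, so (1, 3) is invalidated.
library_b = {1, 2, 3}
groups_b = DuplicateGroups([(1, 2), (1, 3)])
first_b = groups_b.next_group(library_b)
library_b.discard(1)
second_b = groups_b.next_group(library_b)
```

The validation here is a single pass over an in-memory queue, with nothing to clear in the database when a group turns out to be stale.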
For point (3) of viewing groups a book is in, you can still do that. You would always have to rely on some kind of in-memory mapping the plugin kept of groups and their members. So I don't see that reason as a differentiator - in fact it could well be easier just using in-memory results because you don't have to re-read all the custom group ids and re-build the mappings as you would if relying on the persisted values?
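As a sketch of the kind of in-memory mapping I mean, something like the following would answer "which groups is this book in?" directly from the search results. The function name `build_book_to_groups` and the group numbering are just assumptions for the example.

```python
from collections import defaultdict


def build_book_to_groups(groups):
    """Map each book id to the set of group ids it belongs to.

    groups is a dict of group id -> iterable of book ids
    (a hypothetical shape for the duplicate search results).
    """
    book_to_groups = defaultdict(set)
    for group_id, members in groups.items():
        for book_id in members:
            book_to_groups[book_id].add(group_id)
    return book_to_groups


# Book 1 appears in both groups, books 2 and 3 in one each.
search_results = {1: (1, 2), 2: (1, 3)}
mapping = build_book_to_groups(search_results)
```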
I think point (2), doing the resolving across Calibre sessions, is absolutely valid, though I think it comes down to the workflow again and how long it takes to run the duplicate search. For me, I either resolve the pair in some way or else it will come up again the next time I run the search. Whether I do that now or the next time I open Calibre (and trigger another search) doesn't matter to me, that pair still needs to be resolved in some way. But of course that's just my opinion.
I think at this point I will still go for the "marked" memory based approach and see how well it works. Converting it to use persisted custom columns at a later point should only involve adding to the logic I would have thought, not having to bin everything. If I also store the false positives in the config file then that means I can avoid custom columns completely. I would like to avoid them if I can, as it just avoids the issues of column visibility, deletions, attempted sorting, naming conflicts, users fiddling with values etc.
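For what it's worth, here is a minimal sketch of how the false positives could sit in a config file keyed by library uuid. It uses plain `json` rather than any Calibre config API, and every function name and the file layout are assumptions for illustration only. Each pair is stored once with the lower book id first, which is the "process in book id order" idea, and `forget_book` is one possible way to handle a book in a pair being merged or deleted.

```python
import json
from pathlib import Path


def load_config(path):
    """Read the exemptions file, or return an empty config if it is missing."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_config(path, config):
    """Write the exemptions back out as JSON."""
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")


def _pair(book_a, book_b):
    # Lower id first, so (5, 2) and (2, 5) are stored as the same pair.
    return sorted((book_a, book_b))


def add_exemption(config, library_uuid, book_a, book_b):
    """Record that two books in this library are not duplicates."""
    pairs = config.setdefault(library_uuid, [])
    pair = _pair(book_a, book_b)
    if pair not in pairs:
        pairs.append(pair)


def is_exempt(config, library_uuid, book_a, book_b):
    """True if the pair was marked as a false positive in this library."""
    return _pair(book_a, book_b) in config.get(library_uuid, [])


def forget_book(config, library_uuid, book_id):
    """Drop every exemption involving a book that was merged or deleted."""
    pairs = config.get(library_uuid, [])
    config[library_uuid] = [p for p in pairs if book_id not in p]
```

Because the book ids are only meaningful within one library, keying on the uuid keeps exemptions from one library from leaking into another.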
It is a good point about using a custom column for transitive based sorting. Though unless you forced that column visible on the view, one stray click by the user and the sorting is lost, and they would likely have to do "next group, previous group" to get it restored. However, at least for the first cut, since transitive groups are not being displayed I think I can again avoid this issue for now anyway.