Playing around with this now - the ignore title rocks! Haven't played with the tag browser enough to comment on that. I agree with Chaley that this is more or less ready to release as is.
Soundex was really helpful too. I'm not sure letting users tweak the fuzziness would help much, unless you're talking about making it less fuzzy: it's quite useful for finding issues the other algorithms miss, but it does produce more false positives.
I noticed you mentioned an issue with non-ASCII in Soundex earlier - there is already a function in Calibre to convert a non-ASCII character to its ASCII equivalent - are you using this already? I noticed Soundex caught
China Miéville vs.
China Mieville while the other algorithms missed this. Thinking out loud, though: doing this ASCII downgrade for the purposes of comparison any time you detect non-ASCII could be useful.
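To make the idea concrete, here's a rough sketch of folding to ASCII before comparing. It uses Python's standard unicodedata module as a stand-in rather than Calibre's own helper, and the names ascii_fold and same_author are just made up for illustration, not anything in the plugin:

```python
import unicodedata


def ascii_fold(text):
    # Decompose accented characters (e.g. "é" becomes "e" plus a combining
    # accent), then drop everything that cannot be encoded as ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def same_author(a, b):
    # Hypothetical comparison: fold both names, then compare ignoring case.
    return ascii_fold(a).casefold() == ascii_fold(b).casefold()


print(same_author('China Miéville', 'China Mieville'))
```

This only handles characters that decompose into an ASCII base letter plus accents; anything else is simply dropped, so a real transliteration function would do a better job on non-Latin scripts.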
This might be an advanced/too specialized option, but I keep multiple versions of book records, and generally only one record is 'published' to OPDS/externally accessible library instances, etc. I do this by adding a tag 'Nopub' to the ones I don't want published. I'd rather do this than merge book records and risk having a faulty version overwrite a good version during conversion/merges etc. The faulty versions I keep around for conversion testing or just because I haven't gotten around to fully comparing the editions.
Anyway the thought behind the request is to automatically exempt sets of dupes where all but one in the set have some specific/configurable tag.
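Roughly what I have in mind, as a sketch only. I'm assuming each duplicate set can be represented as a mapping of book id to that book's tags; should_auto_exempt and marker_tag are hypothetical names, not anything from the plugin:

```python
def should_auto_exempt(tags_by_book, marker_tag='Nopub'):
    """Return True if all but one book in the duplicate set carry marker_tag.

    tags_by_book: dict mapping a book id to an iterable of its tags
    (an assumed data shape, for illustration only).
    """
    wanted = marker_tag.casefold()
    # Books in the set that do NOT carry the marker tag (case-insensitive).
    untagged = [
        book_id
        for book_id, tags in tags_by_book.items()
        if wanted not in {t.casefold() for t in tags}
    ]
    # Exempt only real sets (two or more books) with exactly one untagged book.
    return len(tags_by_book) > 1 and len(untagged) == 1


dupes = {101: ['Fiction'], 102: ['Fiction', 'Nopub'], 103: ['nopub']}
print(should_auto_exempt(dupes))
```

A set where two or more records lack the tag would still be shown as usual, since those are the ones that actually need a decision.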
Other feedback:
- Keyboard shortcut for exempting a group would be extremely helpful
- Things seem to go a bit wonky when you reach the last set, at least with ignore title searches. After finishing all/most of the original sets, the 'next set' function began jumping all over the place and highlighting things that weren't really sets. I didn't even realize I was done until it started acting strange and I initiated a fresh search, which returned no results.
edit: the non-ASCII to ASCII equivalent function is get_udc().decode(), with get_udc imported from calibre.utils.localization