Register Guidelines E-Books Search Today's Posts Mark Forums Read

 04-11-2011, 08:22 AM #3 chaley "chaley", not "charley"   Posts: 7,659 Karma: 1971516 Join Date: Jan 2010 Location: France Device: Many android devices not duplicates We must first decide what it means to say two books are not duplicates. My position is that books marked as not duplicates are not duplicates, independent of what algorithm is used. Note that the 'not duplicate' relationship is not transitive: if books 1 & 2 are not duplicates and books 2 & 3 are not duplicates, we can say nothing about whether books 1 and 3 are duplicates. I think that the natural implementation is to have pairs of books that are known to be 'not duplicates'. When sets of duplicates are built, they must be partitioned so that pairs of 'not duplicates' are never in the same set. Using the above example, if the algorithm produces a set [1, 2, 3], then partitioning will produce a set [1, 3]. If we have a duplicate set [1, 2, 3, 4] with not-duplicate [1, 2] and [3, 4], then the result will be four sets [1, 3], [1, 4], [2, 3], [2, 4]. The algorithm I posted before does this partitioning. I am not convinced that a custom column is the best place to keep the 'not duplicate' pairs. First, the data would be duplicated in both books, creating a maintenance problem (change one means the other must change). Second, the data must be 'multiple' because a book can participate in multiple pairs. I think I am saying that the plugin must accept a selection of books that are not dups and write that information to its own storage. As for whether or not to use a persistent column for the results: I tend toward persistence. I see three advantages of using a custom column. 1) I can process a group, then delete the group marker to indicate I am done. Running the algorithm again could find these books again, unless I am meticulous about marking not-dups. 2) I can do the work over time, knowing where I stopped. 3) I can select all the groups that a given book is in, giving a books view. For this last one, the plugin could help by writing the transitive group information into a custom column. This would solve the sorting problem because the values written could be sortable: the base book could be #1, with the rest of the books being some other number. The problem you raise regarding books 2 and 3 'not duplicates' is also real, but I think it is a consequence of using book mode. Book-based browsing says something about a relationship between the base book and any other book, but nothing about the relationship between books that are not the base. Have fun!
 04-11-2011, 09:42 AM #4 kiwidude calibre/Sigil Developer   Posts: 4,246 Karma: 1445802 Join Date: Oct 2010 Location: London, UK Device: Kindle Paperwhite 3G, iPad 3, iPad Air Hey Charles, thanks for the replies. I don't think we differ at all on thoughts on how the data should be partitioned, and that a set based approach to viewing the results has a number of advantages over book based in terms of simplicity of the UI etc. Where to store the duplicate pair exemption is an interesting one. Sounds like we are agreed that the algorithm is not relevant to the pairing. The maintenance aspect is a point I touched on in a slightly different fashion, I don't believe you would need to store it on both books if you process in book id order but there is certainly an issue when one of the books in the pair becomes merged or deleted. Now we have library uuid it is more feasible to store library specific ids in a config file so I'll add that to the mix. I'm not sure I understand all your persistence reasons, as surely similar justification for not using a custom column to store exemptions applies to storing groups? Take for instance the example of (1,2) and (1,3) being found. We display (1,2). Now if the user goes "next group" without doing anything, we will want to display (1,3). If the user merges book 2 into book 1 before going next group, we will still want to display (1,3). However if they merge book 1 into book 2, then the (1,3) 2nd group is invalidated. If that information has been persisted into a custom column, the plugin potentially will be doing a lot of repeated querying to find a next group, validate all the ids, clearing the persisted column if group is not valid, move to next group etc. I would have thought that might be a little easier to do when not persisted in the database? For point (3) of viewing groups a book is in, you can still do that. You would always have to rely on some kind of in-memory mapping the plugin kept of groups and their members. So I don't see that reason as a differentiator - in fact it could well be easier just using in-memory results because you don't have to re-read all the custom group ids and re-build the mappings as you would if relying on the persisted values? I think the point (2) of doing the resolving across Calibre sessions is absolutely valid, though I think it comes down to the workflow again and how long it takes to run the duplicate search. For me I either resolve the pair in some way or else it will come up again the next time I run the search. Whether I do that now or the next time I open Calibre (and trigger another search) to me doesn't matter, that pair still needs to be resolved in some way. But of course thats just my opinion . I think at this point I will still go for the "marked" memory based approach and see how well it works. Converting it to use persisted custom columns at a later point should only involve adding to the logic I would have thought, not having to bin everything. If I also store the false positives in the config file then that means I can avoid custom columns completely. I would like to avoid them if I can, as it just avoids the issues of column visibility, deletions, attempted sorting, naming conflicts, users fiddling with values etc. It is a good point about using a custom column for transitive based sorting. Though unless you forced that column visible on the view one stray click by the user and the sorting is lost, they would like have to do "next group, previous group" to get it restored. However at least for the first cut since transitive groups are not being displayed I think I can again avoid this issue for now anyways.
04-11-2011, 10:17 AM   #5
chaley
"chaley", not "charley"

Posts: 7,659
Karma: 1971516
Join Date: Jan 2010
Location: France
Device: Many android devices
Quote:
 Originally Posted by kiwidude I'm not sure I understand all your persistence reasons, as surely similar justification for not using a custom column to store exemptions applies to storing groups? Take for instance the example of (1,2) and (1,3) being found. We display (1,2). Now if the user goes "next group" without doing anything, we will want to display (1,3). If the user merges book 2 into book 1 before going next group, we will still want to display (1,3). However if they merge book 1 into book 2, then the (1,3) 2nd group is invalidated. If that information has been persisted into a custom column, the plugin potentially will be doing a lot of repeated querying to find a next group, validate all the ids, clearing the persisted column if group is not valid, move to next group etc. I would have thought that might be a little easier to do when not persisted in the database?
I assume that one stores the group ID in the custom column, not pair relationships. I also assume that if one book is merged into another, then one of those two books is deleted.

If both of these assumptions are true, then your examples don't create a problem. In the first case (2 -> 1), book 1 will still have all its groups, so next_group will show (1,3). If 1 is merged into 2, then 1 no longer exists and next_group will show only book 3.
Quote:
 For point (3) of viewing groups a book is in, you can still do that. You would always have to rely on some kind of in-memory mapping the plugin kept of groups and their members. So I don't see that reason as a differentiator - in fact it could well be easier just using in-memory results because you don't have to re-read all the custom group ids and re-build the mappings as you would if relying on the persisted values?
I think we must have a different mental model of what is written in the custom column. I am assuming that the CC is a multiple column that is filled in with the group ID of each group the book is a member of. I don't see why I need to rebuild anything. If I search for group 34 (which I can do from the tag browser), I get all the books in group 34. I can resolve duplicates manually by removing a book from a group. The plugin could help with that, if it knows what group I am working on.

I also assume that telling the plugin that two books are not duplicates does not rerun the partitioning algorithm. The new non-dup info would be used next time I run it.
Quote:
 I think the point (2) of doing the resolving across Calibre sessions is absolutely valid, though I think it comes down to the workflow again and how long it takes to run the duplicate search. For me I either resolve the pair in some way or else it will come up again the next time I run the search.
I agree
Quote:
 Whether I do that now or the next time I open Calibre (and trigger another search) to me doesn't matter, that pair still needs to be resolved in some way.
I can easily imagine running the check, scanning the results over time, and merging/fixing any problems without doing any other resolution. Next time I run the check (which could be in months), I see the same results. If I get annoyed enough, I would go to the trouble of indicating that books in a group are not in fact duplicates.
Quote:
 It is a good point about using a custom column for transitive based sorting. Though unless you forced that column visible on the view one stray click by the user and the sorting is lost, they would like have to do "next group, previous group" to get it restored. However at least for the first cut since transitive groups are not being displayed I think I can again avoid this issue for now anyways.
I assumed the column was visible.

 04-11-2011, 10:35 AM #6 kiwidude calibre/Sigil Developer   Posts: 4,246 Karma: 1445802 Join Date: Oct 2010 Location: London, UK Device: Kindle Paperwhite 3G, iPad 3, iPad Air Hey Charles, no we are on the same page when it comes to what is stored in the CC.. What I thought would not be useful as a user though is to get shown a duplicate group that had only one entry in it when I go "next result". So just displaying book 3 by itself wouldn't give me a high level of confidence the plugin was "working right", even though we know the reasons why there is just now one in the group . It is that extra processing behind the scenes to invalidate that group etc that was my concern of the CC versus in-memory approach. For instance it is going to be impossible to give the user an updated number of how many more "real" duplicate groups there are, unless you go through and invalidate them every time they hit next result, to remove all deleted books and reset the group counts. Something that could be done in memory less expensively/easily I would argue than via a CC. Because there are no hooks for this plugin to know that a user has deleted a book or merged a book, it must check each id in the duplicates space every time you hit next to give that count. You are of course correct that you wouldn't need to rebuild the mappings to display the book view so thanks for correcting me on that. Brain fart. And about making a column visible. Yeah that's where you get into the problem of a column being there that you only want while you are resolving duplicates. And the myriad of issues about ways to hide it etc. If you get sidetracked while resolving and want to do other queries/restart Calibre etc and have this column in your view that "annoys" when not being used. If I can avoid it then I would prefer to
04-11-2011, 10:43 AM   #7
chaley
"chaley", not "charley"

Posts: 7,659
Karma: 1971516
Join Date: Jan 2010
Location: France
Device: Many android devices
Quote:
 Originally Posted by kiwidude What I thought would not be useful as a user though is to get shown a duplicate group that had only one entry in it when I go "next result". So just displaying book 3 by itself wouldn't give me a high level of confidence the plugin was "working right", even though we know the reasons why there is just now one in the group .
If the user is using the plugin's next group function, then all you need to do is a search_getting_ids and display the group iff the count is greater than 1. If the user is using the tag browser, then they see the counts change as they manually resolve issues.

The above aside, as you are doing the work, your vote is worth more than mine. In addition, given the quality of all your other work, I am sure that whatever you do will do the job admirably. Have at it!

 04-11-2011, 10:56 AM #8 kiwidude calibre/Sigil Developer   Posts: 4,246 Karma: 1445802 Join Date: Oct 2010 Location: London, UK Device: Kindle Paperwhite 3G, iPad 3, iPad Air Haha, dunno about the quality part mate, I've had a couple of crappy bugs with all the plugin rewrites recently, trying to context switch too often and Eclipse/Pydev has decided to not be of any use to me any more at indicating missing imports etc. May as well be writing in Notepad again. Might be time to try Aptana which John says has less issues. I enjoy the debate to verify I'm not wandering too far down the wrong path. You guys inevitably suggest things I haven't considered. For instance just now you mentioned the tag browser. Using a custom column would of course let you "navigate" your groups that way, and as you say give you an ongoing count of the groups which is something I hadn't figured out how to do other than via the status bar. I think that the group count is invalid unless you go through and "prune" (clear) group ids for books that no longer have more than one member though, and there has to be some action by the user to trigger that unless it is hooked into delete events (which as I discovered with a previous plugin and discussed here before don't get fired by database2.py anyways with that notify stuff). But a starting point is better than nothing, and I reckon 90% of it should be reusable should we steer down another path once people start feeding back on usability etc.
 04-11-2011, 08:18 PM #9 kiwidude calibre/Sigil Developer   Posts: 4,246 Karma: 1445802 Join Date: Oct 2010 Location: London, UK Device: Kindle Paperwhite 3G, iPad 3, iPad Air v0.1.1 beta Ok, here is the progress so far. The big ticket missing at this point is handling false positives. However the rest of it is pretty much there. So far I've put in five different algorithms:ISBN exact match Ignore author, exact title (Calibre add default logic) Ignore author, similar title Exact author, similar title (Calibre automerge logic) Similar author, similar title (Author using Kovid's new metadata logic) It returns the matches, and you can navigate through them either by clicking on the toolbar button (ctrl+click on the button for previous) or using keyboard shortcuts (ctrl+\ for next result, ctrl+shift+\ for previous). Groups are dynamically removed every time you move to the next result. False positives are my next issue. The first part which perhaps Chaley has thoughts on is deciding how to store them. For instance the simple case is that a user marks (1,2) as not dups, then (1,3) as not dups. So I could store those as tuples in a list for the library like [(1,2),(1,3)] However in theory we could allow the user to select more than 2 rows. For instance they select (1,2,3). Now what does that actually "mean". Presumably the net effect is indicating [(1,2),(1,3),(2,3)]. Do we break it down and store it that way, or do we store it as [(1,2,3)]? Now what happens if the user then selects (1,2,3,4)? As obviously that is a superset. Or what if they select (1,2,5)? I haven't got my head around whether we should flatten this out yet etc. Presumably we also need to cater for the user screwing up and allowing them to "undo" a no duplicate set. And how could we show the user visually what duplicate exemptions they have in place? That might need a special screen being built for it I think (and let them delete exemptions from that screen). Feedback as always appreciated, be it on the code, usability or whatever. So far it is fairly simple I think. Last edited by kiwidude; 04-12-2011 at 11:25 AM. Reason: Removed attachment, see later posts
 04-11-2011, 09:11 PM #10 tomd365 Junior Member   Posts: 6 Karma: 10 Join Date: Mar 2010 Device: PRS 505 I used this tonight on my collection and it worked very well. If the program found more than two matching books, it would hang after merging. Other than that it worked great - what a timesaver.
 04-12-2011, 02:55 AM #11 kiwidude calibre/Sigil Developer   Posts: 4,246 Karma: 1445802 Join Date: Oct 2010 Location: London, UK Device: Kindle Paperwhite 3G, iPad 3, iPad Air @tomd365 - thanks for trying this and the feedback. The "hanging" sounds a concern though. I have made two changes - one was a fix to the logic doing the cleaning up when you move to the next book, and the second is an improvement to the "xxx author, similar title" algorithms to strip whitespace from the fuzzied result, so that titles of "Manichu" and "Manichu -" would end up in the same group. If you can give me some exact steps to replicate any hanging you experienced it would be much appreciated. It wasn't very clear from your post. Did you mean more than two matching groups, or more than two books in a group? If the latter - how many groups, just one or multiple? What merges did you do - all of them required to get down to one record or just a couple? When did it actually "hang" - when you did next record, while you were merging or when you ran the duplicate check again? Last edited by kiwidude; 04-12-2011 at 03:15 AM. Reason: Fix some shocking grammar.
04-12-2011, 07:18 AM   #12
chaley
"chaley", not "charley"

Posts: 7,659
Karma: 1971516
Join Date: Jan 2010
Location: France
Device: Many android devices
Quote:
 Originally Posted by kiwidude False positives are my next issue. The first part which perhaps Chaley has thoughts on is deciding how to store them. For instance the simple case is that a user marks (1,2) as not dups, then (1,3) as not dups. So I could store those as tuples in a list for the library like [(1,2),(1,3)] However in theory we could allow the user to select more than 2 rows. For instance they select (1,2,3). Now what does that actually "mean". Presumably the net effect is indicating [(1,2),(1,3),(2,3)]. Do we break it down and store it that way, or do we store it as [(1,2,3)]? Now what happens if the user then selects (1,2,3,4)? As obviously that is a superset. Or what if they select (1,2,5)? I haven't got my head around whether we should flatten this out yet etc.
The answer to the question "how do I store" will depend on how you process them and how you do partitioning. For example, the algorithm that I posted a while back wants a map of sets, mapping an ID to a set of IDs that are known not duplicates. This map would be a good way to save the information, in which case when given a list of not-dups, you would iterate over that list adding all the members to each set.
Code:
```  for bk in not_dups:
nd_map[bk] |= not_dups```
Whether or not you trim bk out of not_dups depends (again) on whether the algorithm is sensitive to the set containing itself.

Editing the not_dups list is mostly a presentation issue. The easiest way to do it (but perhaps not the easiest way to use it) would be to select a book and see the list of not_dups for that book (the value of nd_map[bk]). The user could check (or del or something) any book from that set, say with id bk2. The code would do:
Code:
```for bk2 in checked_books:
nd_map[bk] -= set([bk2])
nd_map[bk2] -= set([bk])```
The tricky thing is transitivity. If I say (1,2) are not_dups, then I say (1,3) are not_dups, the algorithm cannot generate (2,3) as a consequence. The map approach respects this restriction, as it never does a closure. To use your example and the above approach: (1,2,3,4) would generate
{ 1:(1,2,3,4), 2:(1,2,3,4), 3:(1,2,3,4), 4:(1,2,3,4) }
Then adding (1,2,5) would result in:
{ 1:(1,2,3,4,5), 2:(1,2,3,4,5), 3:(1,2,3,4), 4:(1,2,3,4), 5:(1,2,5) }

04-12-2011, 10:45 AM   #14
Starson17
Wizard

Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
 Originally Posted by kiwidude Feedback as always appreciated, be it on the code, usability or whatever. So far it is fairly simple I think.
FYI, all options in Find Duplicates give me a zero division error.
Spoiler:
Quote:
 calibre, version 0.7.53 ERROR: Unhandled exception: ZeroDivisionError:integer division or modulo by zero Traceback (most recent call last): File "calibre_plugins.find_duplicates.action", line 101, in find_duplicates File "calibre_plugins.find_duplicates.duplicates", line 215, in run_duplicate_check File "calibre_plugins.find_duplicates.duplicates", line 102, in run_duplicate_check ZeroDivisionError: integer division or modulo by zero

This is my small test library and has a lot of funky data from hundreds of odd tests.

 04-12-2011, 11:08 AM #15 kiwidude calibre/Sigil Developer   Posts: 4,246 Karma: 1445802 Join Date: Oct 2010 Location: London, UK Device: Kindle Paperwhite 3G, iPad 3, iPad Air Hi Starson17 - ahhh, it must be a very small library (less than 20 books perhaps?). I borrowed some code from my quality check plugin and it would suffer from the same issue. I will stick a new version up shortly.