Old 02-08-2011, 02:41 PM   #61
Starson17
Quote:
Originally Posted by kiwidude View Post
Sorry for the long ramble; it was the closest I had come to trying to organise my thoughts.
I'm guilty of the same thing. I often find that posting helps organize (organise, if you prefer) my thoughts. I hadn't really come to grips with the "best" way to handle the duplicate groups issue or the dialog vs. Library UI issue, but it looks like you're zeroing in on a good way to handle it all, without too much new code or too many new interfaces for users to learn.
Old 02-08-2011, 02:58 PM   #62
kacir
Quote:
Originally Posted by Starson17 View Post
Yes, and after he's run the duplicate finder, and merged or fixed dupes, he could assign "00" as the duplicate group number to flag any duplicates he doesn't want to find later. It would avoid repeatedly looking at false positives. A new run of the dupes finder would rewrite the dupe group column, but ignore any "00" flagged files.
Perhaps even better would be to mark all books that were checked with "00". After that, only new books would be checked against all other books in the library; "00"-marked books wouldn't be compared *against each other*.
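
For illustration, a minimal sketch of that pairing rule; the names here are made up for the sketch, not a real plugin API:

Code:
def pairs_to_check(book_ids, vetted):
    # vetted: set of book ids already marked "00" by an earlier run
    for i, a in enumerate(book_ids):
        for b in book_ids[i + 1:]:
            if a in vetted and b in vetted:
                continue  # both vetted earlier; never re-compare
            yield (a, b)
So list(pairs_to_check([1, 2, 3], vetted={1, 2})) yields only (1, 3) and (2, 3); the already-vetted pair (1, 2) is skipped.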
Old 02-09-2011, 04:58 AM   #63
chaley
Kiwidude mentioned an email I sent him on this issue where I suggested using a custom column. Here it is, verbatim:
Spoiler:
I saw some discussion of doing some sort of duplicate detection. I confess that I didn't read it carefully, so what I say below may be meaningless.

It seems to me that the goal is to partition the library into sets of books that might be duplicates of each other. The question that immediately arises is: how does one determine that possibility? I propose a set of functions that produce comparable values.

I see four functions:
- fuzzy title. The function returns the fuzzed title. Several people have ideas on what this computation might look like. Perhaps more than one fuzz algorithm might be offered.

- fuzzy author. Similar to fuzzy title.

- soundex title. Returns a soundex (sounds like) string. Not sure if it should operate on the straight title or the fuzzy title.

- soundex author. Similar to soundex title.

The user would pick which functions are to be used to determine if two books are duplicates. The more functions used, the more exact the match.
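
For illustration, here is one possible shape for two of these functions. The particular fuzz rules and the simplified soundex are assumptions for the sketch, not a settled design:

Code:
import re

def fuzzy_title(title):
    # one possible fuzz: lower-case, drop a leading article,
    # strip punctuation, collapse whitespace
    t = title.lower()
    t = re.sub(r'^(the|a|an)\s+', '', t)
    t = re.sub(r'[^a-z0-9 ]', '', t)
    return ' '.join(t.split())

def soundex(word):
    # simplified classic soundex: keep the first letter, code the
    # rest, collapse adjacent duplicate codes, pad to 4 characters
    codes = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
             'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
             's': '2', 'x': '2', 'z': '2',
             'd': '3', 't': '3', 'l': '4',
             'm': '5', 'n': '5', 'r': '6'}
    word = re.sub(r'[^a-z]', '', word.lower())
    if not word:
        return ''
    result = word[0].upper()
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        code = codes.get(ch, '')
        if code and code != prev:
            result += code
        prev = code
    return (result + '000')[:4]
With these, soundex('Tolkien') and soundex('Tolkein') both come out as 'T425', the sort of match a straight string comparison would miss.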


The algorithm:

Code:
from collections import defaultdict

# chosen_functions stands for whichever of the functions above
# the user selected; books is the library being scanned
somedict = defaultdict(set)
for book in books:
    # concatenate the function results in a known order
    r = '|'.join(f(book) for f in chosen_functions)
    somedict[r].add(book.id)
After running this algorithm, we have a dict of sets. If a set has more than one item in it, we have a potential duplicate. I propose that this partitioning be exposed to the user by populating an is_multiple custom column.

algorithm:

Code:
# first, empty the custom column of all information
group_number = 1
for ids in somedict.values():
    if len(ids) > 1:
        for book_id in ids:
            # append 'group_<n>' to the is_multiple custom column
            # (add_to_custcol stands in for the actual column write)
            add_to_custcol(book_id, 'group_' + str(group_number))
        group_number += 1  # each dup set gets its own group name
When done, each set of potential duplicates is represented by an item in the custom column, available in the tag browser and through other methods. The user would select a group, decide whether or not any of the members are duplicates, and take the appropriate action. When done, the tag is deleted (this might be something that a plugin could do conveniently).

By using the highlight checkbox in conjunction with the search, the user can also see books 'near' the various duplicates. Not sure if this is useful.

Performance should be acceptable. Computing the dict of sets is linear in the number of books in the library. Adding the tags is linear in the number of potential duplicates. My guess is that for a 10,000-book library, this computation shouldn't take more than 10 to 20 seconds, especially if commit is set to false when adding the tags. Once the last_changed date is available, the computations can be cached using the per-book storage mechanism, and recomputed only if the metadata has changed.


In further conversations with kiwidude, the question of false positives came up. My suggestion would be to permit the user to say that two (or more?) given books are not duplicates. This information would be used by the duplicate detector to ensure that those books never appear together in a duplicate-book partition. The performance of this check would be very good if set arithmetic is used. Something like (in pseudo-code)

Code:
# for each partition set created by the dup finder
for partition_set in partitions_from_dup_finder:
    for book_id in partition_set:
        result_set = partition_set - books_are_not_dups[book_id]
        if result_set:
            # create a real partition label and apply it
            label = new_partition_label()
            for b_id in result_set:
                add_label_to_custcol(b_id, label)
After running this test, we would have a custom tags-like column containing partition IDs where each ID marks a set of books that are potentially duplicates. Books that are declared not to be duplicates will never appear together in a partition.
Old 02-09-2011, 09:11 AM   #64
Starson17
Quote:
Originally Posted by chaley View Post
Kiwidude mentioned an email I sent him on this issue where I suggested using a custom column.
I wonder if I've understood it correctly. The custom column would be is_multiple and populated with the book IDs of potential duplicates? Presumably each book in the set of potential dupes would have the same content in that column, in the same order, and books with no potential dupes would have nothing in that column? Highlighting would/could be used to highlight all members of a single dupe set? Since the column would be populated in a known order, sorting by that column would put all dupes together?

That's very similar to my idea, except I had in mind assigning a number to each dupe set, then highlighting odd-numbered sets so that when sorted by the dupe set number, all members would be together for all dupe sets, and the highlighting would identify each dupe set (the first set would be highlighted, the second, even-numbered, set would not, the third, odd-numbered, set would be highlighted, etc.)
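
A rough sketch of that odd/even idea, assuming the dupe test has already produced numbered groups (the data shape here is made up for illustration):

Code:
def rows_for_display(groups):
    # groups: {group_number: set of book ids} from a dupe test
    rows = []
    for group_number in sorted(groups):
        highlight = (group_number % 2 == 1)  # alternate per dupe set
        for book_id in sorted(groups[group_number]):
            rows.append((group_number, book_id, highlight))
    # sorted by group number, each set's members sit together and
    # the highlight flag flips at every set boundary
    return rows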

I wasn't sure of the best way to avoid repeat false positives. Simply removing a book from any future dupe groups has problems. You might want to prevent a book from appearing in one test because it was truly a false positive, but later want to run another type of dupe test that might truly find it as a dupe of some other book in this different test mode. You might add another book later that does match a book that was previously marked as a false positive.

As usual - those are just random thoughts.

I guess Kiwidude has his work cut out for him.
Old 02-09-2011, 09:17 AM   #65
Starson17
Quote:
Originally Posted by kacir View Post
Perhaps even better would be to mark all books that were checked with "00". After that, only new books would be checked against all other books in the library; "00"-marked books wouldn't be compared *against each other*.
I suspect that this would be a bit too aggressive. I've run many dupe tests on my library using various SQL queries. My experience is that you want to run many different tests, as each individual test may find some dupe books, but not others. I don't mind the time to run a second dupe test over my books, even if I'm pretty sure they aren't dupes. If they weren't found to be dupes the first time, they shouldn't show up as a false positive the second time. If they do, then I want to review and find out why it's been identified as a dupe this time, when it wasn't the first time.

Old 02-09-2011, 11:42 AM   #66
chaley
Quote:
Originally Posted by Starson17 View Post
I wonder if I've understood it correctly. The custom column would be is_multiple and populated with the book IDs of potential duplicates?
No. It is populated with duplicate set names. I think these are the same as what you are calling group numbers.
Quote:
Presumably each book in the set of potential dupes would have the same content in that column, in the same order,
A book will have the set_ids it belongs to. Order is irrelevant. A book can be in multiple duplicate sets, especially when using the not-duplicate processing.
Quote:
and books with no potential dupes would have nothing in that column?
Yes, no entries.
Quote:
Highlighting would/could be used to highlight all members of a single dupe set?
Simply searching for a set id (group number) would find the potential duplicates. Use the highlight option if you want to see them in context.
Quote:
Since the column would be populated in known order, sorting by that column would put all dupes together?
No. The column is an is-multiple column (tags-like column). Sorting on the column is almost certainly not useful. Also, as a book can be in multiple duplicate sets, it isn't clear that sorting would say very much.
Quote:
That's very similar to my idea, except I had in mind assigning a number to each dupe set, then highlighting odd-numbered sets so that when sorted by the dupe set number, all members would be together for all dupe sets, and the highlighting would identify each dupe set (the first set would be highlighted, the second, even-numbered, set would not, the third, odd-numbered, set would be highlighted, etc.)
Yes, it is similar to your numbering. However, given that a book can be in multiple sets, I am not sure about the odd/even highlighting. I think that searching for a dup set and then sorting how you want is the way to go.
Quote:
I wasn't sure of the best way to avoid repeat false positives. Simply removing a book from any future dupe groups has problems. You might want to prevent a book from appearing in one test because it was truly a false positive, but later want to run another type of dupe test that might truly find it as a dupe of some other book in this different test mode. You might add another book later that does match a book that was previously marked as a false positive.
I am proposing that the user tell us that two (or more) books are not duplicates. This means that regardless of what any test says, these two books should never be in the same duplicate group. I don't see a case where a user would say that the books are not duplicates, but later decide on the results of some other test that they are.
Quote:
As usual - those are just random thoughts.

I guess Kiwidude has his work cut out for him
Old 02-09-2011, 12:17 PM   #67
Starson17
Quote:
Originally Posted by chaley View Post
No. It is populated with duplicate set names. I think these are the same as what you are calling group numbers.
Got it. So the "duplicate set names" would be auto-generated (they are numerical names?), and is_multiple is used because a book can be in multiple "duplicate set names."

Quote:
given that a book can be in multiple sets, I am not sure about the odd/even highlighting. I think that searching for a dup set and then sorting how you want is the way to go.
I hadn't thought much about single books in multiple duplicate sets. I had been thinking about how I can show only books that are a member of a duplicate set (that's easy, by excluding books that don't have an entry) and show all books at once that are members of any dupe set, with each book sitting side by side with all the other members of its duplicate set. That's harder, particularly where multiple dupe sets are considered.

If you show all books at once, some kind of divider is needed showing where each dupe set ends and the next starts. In working with my personal duplicate testing I found myself with lots of dupe sets, and having to do lots of searches - one for each group. It was a pain. Perhaps a quick key could be assigned to "show members of next duplicate set" but trying to actually type a new search for each new duplicate set would get repetitive. I suppose you're thinking of doing this via the tag browser, which I seldom use.

Quote:
I am proposing that the user tell us that two (or more) books are not duplicates. This means that regardless of what any test says, these two books should never be in the same duplicate group. I don't see a case where a user would say that the books are not duplicates, but later decide on the results of some other test that they are.
Right. They can later be in duplicate sets with other books, but never in a set with the same book previously considered.
Old 02-09-2011, 12:26 PM   #68
chaley
Quote:
Originally Posted by Starson17 View Post
I hadn't thought much about single books in multiple duplicate sets. I had been thinking about how I can show only books that are a member of a duplicate set (that's easy, by excluding books that don't have an entry) and show all books at once that are members of any dupe set, with each book sitting side by side with all the other members of its duplicate set. That's harder, particularly where multiple dupe sets are considered.
It is impossible for books in multiple sets.
Quote:
If you show all books at once, some kind of divider is needed showing where each dupe set ends and the next starts. In working with my personal duplicate testing I found myself with lots of dupe sets, and having to do lots of searches - one for each group. It was a pain. Perhaps a quick key could be assigned to "show members of next duplicate set" but trying to actually type a new search for each new duplicate set would get repetitive. I suppose you're thinking of doing this via the tag browser, which I seldom use.
There clearly has to be a plugin involved here. I am assuming that books are viewed on the library view, rather than reinventing another view in the plugin. I see no reason why the plugin cannot handle the 'show members of next duplicate set' through a context menu entry or through a keyboard shortcut. It would remember the last one it looked at, get the next one (easy if they are numbers), and do the search for you. You can use the highlight option if you want to see them in a larger context, or turn highlight off to see only that set.
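
Something like the following could sit behind that menu entry or shortcut. This is only a sketch: it assumes the plugin has created a tags-like custom column named #dups holding group_1 ... group_N items, and that do_search is whatever hook the GUI provides for running a library search:

Code:
class DupeNavigator(object):
    def __init__(self, do_search, group_count):
        self.do_search = do_search      # callback that runs a library search
        self.group_count = group_count  # groups created by the last dupe run
        self.current = 0

    def show_next_group(self):
        # remember the last group shown, wrapping after the last one
        self.current = (self.current % self.group_count) + 1
        # exact match on the custom column, e.g. #dups:"=group_5"
        self.do_search('#dups:"=group_%d"' % self.current)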

Old 02-09-2011, 03:06 PM   #69
kacir
Quote:
Originally Posted by Starson17 View Post
I suspect that this would be a bit too aggressive. I've run many dupe tests on my library using various SQL queries. My experience is that you want to run many different tests, as each individual test may find some dupe books, but not others. I don't mind the time to run a second dupe test over my books, even if I'm pretty sure they aren't dupes. If they weren't found to be dupes the first time, they shouldn't show up as a false positive the second time. If they do, then I want to review and find out why it's been identified as a dupe this time, when it wasn't the first time.
Right, I hadn't realised that.
Especially since running a query doesn't take excessive time.
Old 02-09-2011, 09:38 PM   #70
vitalichka
I've seen this thread several times while looking to work out the duplicate issue.
Then I was led here again by kiwidude and dwanthny a little while ago.
I've read much of what everyone has posted here, and from what I gather there seems to be no real way of fighting duplicates once all of the books have been added.
Am I understanding this right? Outside of manually going through everything yourself, which in my case would be impossible.
The other issue in my case is that I had the auto merge setting off for 80% or more of the time, and towards the end of the adding process turned it on, as per the suggestions of the two guys above.
So this sort of complicates things even more.

Also, does anyone know if there is a way to have Calibre work with titles in a better way? As in, currently there is the option of gathering based on metadata and file name.

Some of my files have metadata and others don't, and some would be better handled by file name, since they contain no metadata but the file name is clear. So right now it seems like, in some cases, a simple Windows search would do the trick better, but since I don't want two copies of the (large) library, I am stuck. If this makes sense.

Thanks everyone.

Old 02-09-2011, 10:57 PM   #71
kiwidude
Quote:
Originally Posted by vitalichka View Post
I've seen this thread several times while looking to work out the duplicate issue.
Then I was led here again by kiwidude and dwanthny a little while ago.
I've read much of what everyone has posted here, and from what I gather there seems to be no real way of fighting duplicates once all of the books have been added.
Am I understanding this right? Outside of manually going through everything yourself, which in my case would be impossible.
The other issue in my case is that I had the auto merge setting off for 80% or more of the time, and towards the end of the adding process turned it on, as per the suggestions of the two guys above.
So this sort of complicates things even more.
As of today, you are correct: Calibre has no built-in functionality to help you identify duplicates that you have already imported into your library. All you can do is manual visual inspection.

The recent posts in this thread have been about possible ideas for building a plugin tool for Calibre that *will* attempt to identify duplicates already in the library. What we haven't quite nailed down yet, before I start writing it, is exactly how it should work, though I think we are iterating closer to that with recent posts. I've got some other plugins I want to finish developing first over the next week or two, and then it will have more of my focus so I can get stuck into it.

In the meantime, your options are to (a) be patient until we get the development done, (b) manual inspection, which I agree isn't practical for a large library, or (c) use tools/scripts outside of Calibre to query against the database, some of which you will find in old threads on these forums if you search.
Quote:
Also, does anyone know if there is a way to have Calibre work with titles in a better way? As in, currently there is the option of gathering based on metadata and file name.

Some of my files have metadata and others don't, and some would be better handled by file name, since they contain no metadata but the file name is clear. So right now it seems like, in some cases, a simple Windows search would do the trick better, but since I don't want two copies of the (large) library, I am stuck. If this makes sense.
I'd love to hear a better answer too, but afaik the answer is no, there is no magic fairy dust option. Garbage in, garbage out, as far as Calibre is concerned. If you can't rely on the metadata in the file (which you can't if you import formats like TXT, which have none, or if you use the LN, FN author format and the file metadata doesn't match it) then you have to switch that option off, as I do, and then it is all down to the filename.

My personal workflow is:
(1) Use Duplicate File Finder (a free tool) to scan my input directories and Calibre before I do anything with the files. That lets me get rid of exact CRC duplicates without caring about any filename cleanups.
(2) Use a tool I hacked together in C# which lets me quickly slice and dice filenames in bulk, with various hotkeys, to exactly match my Calibre add regular expression.
(3) Using that same tool, do a "pre-add" to Calibre by querying the Calibre database and looking for matches on author and title (similar to the Automerge algorithm). This moves the files into different import subfolders ready to add, depending on whether they are a new book, a new format for an existing book, or a duplicate format. If it is a duplicate format, it requires a visual comparison to identify which version I want to keep, although if the filesize is within a low % of the existing Calibre book's filesize, I push it into a fourth folder for deletion (to allow for EPUBs with touched bookmarks etc.).
(4) I also have special processing to take care of folders of HTML books, since they need to be added "one per folder", whereas the rest are "many files per folder".
(5) I go ahead and import the folders of "new books" and "new formats of an existing book", with automerge turned on. The "duplicates for deletion" folder gets tossed, and the "duplicate formats" folder has to be manually compared before I add one by one.
(6) I have additional screens in my tool which run various SQL queries against the database; I run these periodically to pick up duplicate authors or titles with various fuzzy-logic algorithms (a sketch of one such query is below).
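
For anyone curious, a minimal sketch of the kind of query step (6) runs, written here as Python against calibre's metadata.db. It only catches case-insensitive exact title matches; real fuzzy matching would normalise further, the database path is a placeholder, and it's safest to run this against a copy of the database rather than the live library:

Code:
import sqlite3

con = sqlite3.connect('/path/to/Calibre Library/metadata.db')
# pair up distinct books whose titles match case-insensitively
rows = con.execute("""
    SELECT b1.id, b2.id, b1.title
    FROM books b1
    JOIN books b2 ON b1.id < b2.id
                 AND lower(b1.title) = lower(b2.title)
""").fetchall()
for id1, id2, title in rows:
    print('possible dupes: %d, %d -> %s' % (id1, id2, title))
con.close()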

Long-winded, sure. But you hope to only do it once. And in case you ask: no, the tool I wrote isn't available; it is too specific to how I work and the code is true hack filth. In theory it could be rewritten as a Calibre plugin if enough people thought it would be useful, but that would be a load of work. I certainly hope parts of it will become deprecated, like some of the duplicate comparison stuff, thanks to the recent additions by Starson/Kovid for the next release, and of course the duplicate finder plugin when it appears. However, they still won't solve some of the fundamental issues of getting your filenames 100% correct before you add to Calibre.
Old 02-10-2011, 07:47 AM   #72
chaley
Eyes glazing over ...

Quote:
Originally Posted by chaley View Post
In further conversations with kiwidude, the question of false positives came up. My suggestion would be to permit the user to say that two (or more?) given books are not duplicates. This information would be used by the duplicate detector to ensure that those books never appear together in a duplicate-book partition. The performance of this check would be very good if set arithmetic is used. Something like (in pseudo-code)

snip
After a bit of research and thought, I realized that a) the above algorithm doesn't work, and b) this is really a graph theory problem. The duplicate sets correspond to the distinct paths through a graph whose nodes are the books in the test result, after removing edges between nodes that are known not to be duplicates.

An example implementation is under the spoiler.
Spoiler:
Code:
from collections import defaultdict

# Construct map of books that are not duplicates
dups = [(3, 7), (10, 66, 11)]
print('books known not to be duplicates: %s' % (dups,))
not_duplicate_of_map = defaultdict(set)
for t in dups:
    s = set(t)
    for b in t:
        not_duplicate_of_map[b] = s

# Simulate a test
initial_dups = [2, 3, 66, 7, 10, 11, 12]
initial_dups.sort()
print('candidate duplicates: %s' % (initial_dups,))

# Walk the nodes as a directed graph, refusing to visit nodes that
# are declared to be not duplicates of a node already visited. This
# algorithm depends on the lists being sorted.
found_paths = []

def walk(node, path, remaining_nodes):
    path.append(node)
    rn = [n for n in remaining_nodes
          if n != node and n not in not_duplicate_of_map[node]]
    if len(rn) == 0:
        found_paths.append(sorted(path))
    for n in rn:
        if n > node:
            walk(n, path, rn)
    path.pop()

for k in initial_dups:
    walk(k, [], initial_dups)
print('After partitioning: %s' % (found_paths,))

This implementation depends on the fact that the items in the graph are numbers and can be fully ordered, so it is easy to prune duplicate paths simply by never traversing an edge to a node less than the one in hand. As the list is sorted, such an edge would already have been traversed in the other direction, so the algorithm does not need to deal with discovering both (1,2,3) and (2,3,1).
Old 02-10-2011, 08:08 AM   #73
kiwidude
Quote:
Originally Posted by chaley View Post
...An example implementation is under the spoiler...
Dude, go ahead and write the plugin, you know you want to... we've clearly piqued your interest in it now...

Loving the posts, it's an interesting little challenge...
Old 02-10-2011, 08:26 AM   #74
chaley
Quote:
Originally Posted by kiwidude View Post
Dude, go ahead and write the plugin, you know you want to... we've clearly piqued your interest in it now...
Nah...

The interesting part for me is manipulating similarity data efficiently and accurately. That will be a small part of the plugin. The larger part, GUI and other 'look and feel' stuff, is far better left to someone else.
Old 02-10-2011, 09:08 AM   #75
Starson17
Quote:
Originally Posted by kiwidude View Post
As of today, you are correct: Calibre has no built-in functionality to help you identify duplicates that you have already imported into your library.

<snip>

In the meantime, your options are to (a) be patient until we get the development done, (b) manual inspection, which I agree isn't practical for a large library, or (c) use tools/scripts outside of Calibre to query against the database, some of which you will find in old threads on these forums if you search.
The one option that currently does exist for global duplicate finding is to use Copy to Library, with Automerge turned on, to export your entire library to another library. During that process, Automerge will try to put identical formats into a single book record. It's going to give results similar to what would have happened if that option had been turned on from the beginning of importing. It's mainly useful where you had good author/title data but had Automerge off and accepted lots of duplicate records, one for each format.

It's far from perfect, and you might even lose some metadata where some records have good data and other matching records have bad data. It depends on the order of processing. You keep metadata only from the first record processed. The formats for matching records are added to that first record, but no metadata is added.

If you've got garbage for author/title, it won't do much good.