Old 10-20-2012, 05:55 AM   #331
kiwidude
calibre/Sigil Developer
@rotanal - changing a binary comparison to not actually be a binary comparison will not *ever* happen with this plugin.

I really don't see a use case for some kind of fuzzy comparison. You already have all the fuzzy comparisons required based on *metadata* to identify whether two books are duplicates. The only reason for doing a binary comparison is to be able to delete the corresponding duplicate as quickly as possible, with 100% assurance that you are not accidentally losing something.
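To illustrate what "binary comparison" means here, the following is a minimal sketch in Python. It is not the plugin's actual code, and the function name `are_binary_identical` is hypothetical; it only shows why an exact match gives full assurance while anything short of it gives none.

```python
import filecmp
import os


def are_binary_identical(path_a, path_b):
    """Return True only if the two files match byte for byte.

    Hypothetical helper for illustration; not taken from the plugin.
    """
    # Files of different sizes can never be identical, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # shallow=False forces a comparison of the actual file contents
    # rather than trusting size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

A single differing byte makes this return False, which is the point: there is no "almost identical" result to act on automatically.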

Scanning, say, two epubs and then deciding that based on their content they are "mostly similar" tells you nothing useful (and bear in mind that epub is just one of the easier formats to compare, let alone all the others). The one exception is if you have screwed up your library metadata and given the book a title/author that is not its own. But that is such a niche case it isn't remotely worth the enormous effort to cater for.

You can find duplicates using the existing metadata-based functions. Having found those duplicates, deciding which ones to merge is another matter. Again, if you think that "a few bytes" difference means one can safely and automatically be deleted as being "almost binary", you are mistaken. As ilovejedd points out, those "minor" differences could be the difference between a corrupt and a non-corrupt book. Or one that is formatted to your liking versus one that is not. Or one that has been proofed for errors versus one that has not. Or a later edition. Or a different cover. Or one which has its encoding set correctly to make it readable, etc, etc. You can't reliably automate those evaluations to determine what is the best version to keep. You have to open them up side by side and decide based on your own personal criteria.

As I have said repeatedly from the time I created this plugin ages ago, there is a space for someone to write a separate smart merge plugin, one that allows a user who has been given the duplicate results from this plugin to make better-informed decisions about which to keep. For instance, if two epubs are being merged, it could examine the zip files and compare the files they contain to tell you which differ, etc. But such a plugin does not exist, and I have no personal interest in writing it as I have no need for it.
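As a rough idea of what the zip comparison in such a hypothetical merge plugin might look like, here is a minimal Python sketch. No such plugin exists, so every name here (`compare_epub_contents` and the result keys) is an assumption made for illustration. It relies only on the standard `zipfile` module and compares the CRC and uncompressed size that each zip entry already records.

```python
import zipfile


def compare_epub_contents(path_a, path_b):
    """Report which files inside two epubs (zip archives) differ.

    Hypothetical helper: it only reports differences and leaves the
    decision about which book to keep to the user.
    """
    with zipfile.ZipFile(path_a) as zip_a, zipfile.ZipFile(path_b) as zip_b:
        info_a = {i.filename: i for i in zip_a.infolist()}
        info_b = {i.filename: i for i in zip_b.infolist()}

    names_a, names_b = set(info_a), set(info_b)

    # An entry counts as differing if its stored CRC-32 or its
    # uncompressed size does not match the entry of the same name.
    differing = sorted(
        name
        for name in names_a & names_b
        if (info_a[name].CRC, info_a[name].file_size)
        != (info_b[name].CRC, info_b[name].file_size)
    )

    return {
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
        "differing": differing,
    }
```

The output is a list of things to go and look at (a changed chapter file, a different cover image, an extra stylesheet), not a verdict. Working out which version is the better one still means opening them side by side.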