Old 05-30-2011, 06:56 PM   #66
kiwidude
calibre/Sigil Developer
Posts: 4,228
Karma: 1334002
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
There are a number of issues with this.

Firstly, when duplicate books come back with the same title and author, suddenly saying they are not duplicates of each other because of something about their book formats doesn't make sense. There is no consistency in which formats you may have attached to each book record, or in whether those formats overlap.

Secondly, "pages" is not an available property of a book. It is something that can only be approximated with a computation. And for formats other than ePub or Mobi, that computation requires a conversion. So your Find Duplicates check will now take forever to run.

Thirdly, I am sceptical about using things like number of pages or file size to dictate whether two books are the "same". Certainly you can tell they are "different", but you can never say with any certainty they are the "same", particularly given the widely differing approximations you get from page calculations. And just having a higher-res image significantly skews file sizes. So I really don't see how it can do anything other than tell you they differ. The only thing that tells you books are definitely the same is a binary comparison.
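To illustrate that asymmetry, here is a minimal sketch in plain Python (standard library only, not the plugin's actual code; the function names are made up for this example). A size mismatch is enough to say two format files differ, but only reading every byte can say they are the same.

```python
import filecmp
import os


def formats_differ_for_sure(path_a, path_b):
    """A size mismatch proves the two files are different.

    Equal sizes prove nothing, so this can never report "same".
    """
    return os.path.getsize(path_a) != os.path.getsize(path_b)


def formats_identical(path_a, path_b):
    """Return True only when both files hold exactly the same bytes."""
    if formats_differ_for_sure(path_a, path_b):
        return False  # cheap early exit, no need to read the contents
    # shallow=False forces a byte-by-byte read instead of trusting os.stat() data.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Note this still works at the level of individual format files, so it only answers "are these two files the same file", not "are these two book records the same book".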

Finally, doing anything at "format" level is problematic with the Calibre UI. There is no way for the UI to show rows of book formats; you can only see books. This has already been discussed and highlighted with the binary duplicate check, which works at format level.

This plugin is primarily about finding duplicate book records. If it is bringing back books which you don't think are duplicates of each other (as happens increasingly the fuzzier the algorithm), then the appropriate solution to that is to create the exclusions for those authors or titles.

So then the only other issue is, having got book records which you know *are* duplicate books, how do you resolve the formats they contain? And again I keep coming back to that being a merge issue, not a find duplicate book issue. Though I would never use #pages or file size to tell me which format to keep when merging. You always have to open the formats side by side to decide that. You could have a crappy PDF conversion with completely screwed paragraphs and blank lines in between, which totally affects the page count. And as I said above, images alone can dramatically affect file size.

You either have some scenario in mind where you think pages/file size would be useful, or you are just proposing some random thoughts. I don't mind random thoughts as sometimes they spark better ones, but in this case I don't see where you are going with it.