MobileRead Forums - View Single Post

kiwidude · 04-27-2011, 03:26 AM

Quote:

Originally Posted by collin8579

So out of curiosity, why couldn't this be a content based search instead of title/author
calibre can read the contents and display them
I know it would take longer
but if you have a book with 95% of the same words, its probably a dupe regardless

It wouldn't be slow. Slow is far too generous. Glacial would be a better choice of words.

For a start, every format of every book has to be converted to a single format. If you have ever seen the posts on this forum about how it took one particular conversion x hours to run, well, multiply that out for users with large libraries and you can see it would have a running time of days if not weeks.

What about all those books that calibre can't convert, like image based PDF files, CBZ files etc? Or people who have empty book entries for wish list items or representing their paperback editions which have no electronic versions to compare? Don't those deserve duplicate consideration too?

Then to round it all off, every time you add even just a single book format to your library, you would have to incur the whole penalty all over again, as it must compare that book's content with every other book. Well, unless you kept that whole temp directory structure of hundreds of thousands if not millions of files around, but even then you must still incur the very expensive cost of reading all the file contents and applying a fuzzy heuristic to compare the text.
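To put a rough shape on that cost, here is a minimal sketch of what a content comparison might look like once every book has already been converted to plain text (the expensive part, which is not shown). Everything in it is an assumption made for illustration: the function names are hypothetical, and word-set overlap (Jaccard similarity) is just one possible stand-in for "95% of the same words". It is not code from calibre or from this plugin.

```python
from itertools import combinations


def word_set(text):
    """Reduce a book's text to its set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words common to both sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def content_duplicates(texts, threshold=0.95):
    """texts: hypothetical dict of book id -> already extracted plain text.

    Every book is compared with every other book, so the work grows
    with the square of the library size.
    """
    words = {book_id: word_set(text) for book_id, text in texts.items()}
    pairs = []
    for (id_a, words_a), (id_b, words_b) in combinations(words.items(), 2):
        if jaccard(words_a, words_b) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


def pair_count(n):
    """Number of pairwise comparisons needed for n books."""
    return n * (n - 1) // 2
```

For a 40000 book library, pair_count(40000) comes to 799,980,000 comparisons, each one working on the full text of two books, and that is before counting any conversion time.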

By comparison, with this plugin I can test 40000 books in under a second and once my exemptions are in place any future comparisons will take negligible time to perform and maintain.
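A metadata comparison can be that fast because it never needs to look at one book against another. As a rough sketch only, and assuming a simple exact match on a normalised title and author (the plugin's real matching rules are not described in this post, and the names below are hypothetical), each book is reduced to a key and dropped into a bucket in a single pass:

```python
import re
from collections import defaultdict


def normalise(value):
    """Lowercase, strip punctuation and collapse whitespace."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9 ]", "", value)
    return " ".join(value.split())


def metadata_duplicates(books):
    """books: hypothetical iterable of (book_id, title, author) tuples.

    One dictionary insert per book, so the work grows linearly with
    the library size rather than with the number of pairs.
    """
    groups = defaultdict(list)
    for book_id, title, author in books:
        groups[(normalise(title), normalise(author))].append(book_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Only the metadata already held in the library is touched here, with no file reads and no conversions, and adding one more book costs one more key lookup.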

That is not to say a content based search would not have some advantages of course. One problem this plugin cannot help you with is books that had the wrong filename or metadata when imported. So you think you have book 5 in a series but in actual fact it is just a copy of book 3 or whatever. However a visual inspection will reveal that, which you should do before you merge identical formats anyways. That was one of the reasons I asked starson to enhance automerge so that identical formats do not have to be discarded, giving you a chance to compare them first.

So, there are some of the reasons why I didn't take that approach. It just isn't workable in my opinion, or certainly not for many users.