05-31-2011, 08:18 AM   #68
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
Quote:
Originally Posted by kiwidude View Post
You either have some scenario in mind where you think pages/file size would be useful, or you are just proposing some random thoughts. I don't mind random thoughts as sometimes they spark better ones, but in this case I don't see where you are going with this one?
Thank you for the information.
Well, I do have a scenario in mind.
As I said, this function is a second pass, filtering duplicate files.
So it does not unmark files as possible duplicates; it just filters the displayed results one way.

For example, when I run some duplicate checks, I get back a list of 1200 books, all possible duplicates.
Let's say I have these duplicates inside the list:

marked:duplicate_group_0001:
Book A EPUB
Book C PDF

marked:duplicate_group_0002:
Book A EPUB
Book B EPUB (a binary duplicate of A)

If I filtered on differing formats, only group 1 would be shown, giving me the option to merge that group easily. That way I can eliminate some of the dups a lot faster.
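Just to make the idea concrete, here is a rough Python sketch of the kind of filter I mean. The group layout and function name are made up for illustration; it is not the plugin's real API.

Code:
# Rough sketch of the "different formats" filter (made-up data layout,
# not the plugin's real API).
def groups_with_mixed_formats(duplicate_groups):
    """Keep only groups whose members do not all share the same format."""
    filtered = {}
    for group_id, books in duplicate_groups.items():
        formats = {fmt.upper() for _title, fmt in books}
        if len(formats) > 1:  # e.g. EPUB + PDF -> easy merge candidate
            filtered[group_id] = books
    return filtered

groups = {
    "duplicate_group_0001": [("Book A", "EPUB"), ("Book C", "PDF")],
    "duplicate_group_0002": [("Book A", "EPUB"), ("Book B", "EPUB")],
}
print(groups_with_mixed_formats(groups))  # only group 0001 remains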

For book size (of course you can't tell dups by size alone, but since this is a filter applied after the dup test...) it is a little different.

When I have 1200 possible duplicate books, I would be happy to see all books with a small size difference. When I see one book of 0.7 MB and one of 12.3 MB, I can imagine the content is not the same (the same technical information, but presented differently to the user, e.g. BMP vs. JPG images).
But if I could see only the books with, say, less than 1 kB difference, I would have a list of books that are far more likely to be duplicates, for example where one book has downloaded comments and the other has not. I could just open the books, take a quick look and see if they are the same before I remove them.
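Again a rough sketch of what that size filter could look like (sizes in bytes; the data layout is made up for illustration):

Code:
# Keep only groups whose members differ in file size by less than a
# threshold (sizes in bytes; made-up data layout).
def groups_with_similar_sizes(duplicate_groups, max_diff=1024):
    filtered = {}
    for group_id, books in duplicate_groups.items():
        sizes = [size for _title, size in books]
        if max(sizes) - min(sizes) <= max_diff:  # e.g. less than 1 kB apart
            filtered[group_id] = books
    return filtered

groups = {
    "duplicate_group_0001": [("Book A", 734003), ("Book C", 12897484)],
    "duplicate_group_0002": [("Book A", 734003), ("Book B", 734512)],
}
print(groups_with_similar_sizes(groups))  # only group 0002 remains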

The page function could be used with your page-count plugin. If I see a possible duplicate book with the same number of pages (or +/- 1), the chance that it is a duplicate increases. Books of 100 and 326 pages are more likely to be different.
So instead of pages, you could make it a custom-field compare, comparing two integer or floating-point fields.
That would then directly add the option to hide books with the same name but a different series index (a field that could be custom-set by the user). See the sketch below.
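The same idea generalised to any numeric column (pages from the page-count plugin, a series index, a custom column, ...); again just an illustrative sketch under made-up names, not real plugin code:

Code:
# Keep only groups whose members agree on some numeric field (pages,
# series index, ...) within a tolerance. Made-up data layout.
def groups_matching_on_field(duplicate_groups, field, tolerance=1):
    filtered = {}
    for group_id, books in duplicate_groups.items():
        values = [book[field] for book in books]
        if max(values) - min(values) <= tolerance:  # e.g. pages +/- 1
            filtered[group_id] = books
    return filtered

groups = {
    "duplicate_group_0001": [{"title": "Book A", "pages": 100},
                             {"title": "Book C", "pages": 326}],
    "duplicate_group_0002": [{"title": "Book A", "pages": 245},
                             {"title": "Book B", "pages": 246}],
}
print(groups_matching_on_field(groups, "pages"))  # only group 0002 remains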

***EDIT***
One (manually filtered) example is in the screenshot below. As you can see, I added an [other version] suffix to some titles to remove them from the title check.
You can also see the difference in book size / page numbers. They are all non-duplicates; filtering on pages would exclude these books from view.
Attached Thumbnails
duplicate_baantjer.jpg (157.1 KB)

Last edited by drMerry; 05-31-2011 at 08:26 AM.