I ended up writing my own external tool to query the database and find duplicates based on quite a range of criteria. It is fairly "fuzzy": it strips off leading articles like "A " and "The ", rips out characters such as colons and apostrophes, and pumps out various sets of results. It also does "starts with" checks, since the same book often turns up again with a longer version of the title. Similar "starts with" checks are done on authors (taking the first initial into account). And as all my authors are supposed to be stored LN, FN, I also look for names stored as FN LN (no comma), or authors whose names were imported the wrong way around and so stored as FN, LN.
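To make the idea concrete, here is a minimal sketch of the kind of normalization and "starts with" checks I mean. All the function names are just illustrative (this is not my actual tool, which does a lot more):

```python
import re

def normalize_title(title: str) -> str:
    """Fuzzy-normalize a title: strip leading articles and punctuation."""
    t = title.strip().lower()
    # drop a leading "A " or "The "
    t = re.sub(r"^(a|the)\s+", "", t)
    # rip out colons, apostrophes, and other punctuation
    t = re.sub(r"[^\w\s]", "", t)
    # collapse runs of whitespace
    return re.sub(r"\s+", " ", t).strip()

def normalize_author(author: str) -> str:
    """Flip a 'FN LN' author (no comma) into the expected 'LN, FN' form."""
    a = author.strip()
    if "," in a:
        return a  # already comma-separated; FN, LN mix-ups need a separate check
    parts = a.rsplit(" ", 1)
    if len(parts) == 2:
        return f"{parts[1]}, {parts[0]}"
    return a

def startswith_dup(t1: str, t2: str) -> bool:
    """Treat two titles as candidate duplicates if one is a prefix of the other."""
    n1, n2 = normalize_title(t1), normalize_title(t2)
    return n1.startswith(n2) or n2.startswith(n1)
```

So, for example, "A Game of Thrones" and "Game of Thrones: Special Edition" both normalize down to a common prefix and get flagged as candidates.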
It "does a job" and helps me eliminate many duplicates I would otherwise miss. However, as Starson17 says, one of the issues is that for certain types of checks there are genuine exceptions to a rule, and you can end up wasting time re-verifying those exceptions every time you run the duplicate check, particularly when you have a lot of books. The fuzzier the search, the better the chance of finding duplicates, but the more false positives you have to keep wading through.
My current solution to this is to:
- Run the duplicate check and process the results until I am happy with them.
- Run the check again. The output this time should just be the stuff I am happy to treat as exceptions. I save that output as a text file; this becomes my baseline.
- The next time I run the check, I open the previous output in Notepad++, paste the new output into another tab, then use its diff functionality to highlight just the "new" duplicates it has detected.
- Once that is done, I go back to the second step, thereby overwriting my baseline.
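The baseline/diff steps above could be sketched in a few lines if someone wanted to build them into a tool. A hedged sketch, assuming the check's output is one duplicate per line (the `dup_baseline.txt` filename is just made up for the example):

```python
from pathlib import Path

BASELINE = Path("dup_baseline.txt")  # hypothetical baseline file

def new_duplicates(current_output: list[str]) -> list[str]:
    """Return only the duplicate lines not already in the saved baseline."""
    known: set[str] = set()
    if BASELINE.exists():
        known = set(BASELINE.read_text(encoding="utf-8").splitlines())
    return [line for line in current_output if line not in known]

def save_baseline(current_output: list[str]) -> None:
    """Overwrite the baseline once the new hits have been dealt with."""
    BASELINE.write_text("\n".join(current_output), encoding="utf-8")
```

This drops the Notepad++ step entirely: run the check, show only `new_duplicates(...)`, deal with those, then call `save_baseline(...)` to make the full output the new baseline.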
All that baseline/persistence/identify-only-new-stuff logic could be built into a tool, but the above was just a quick and dirty "get it done" approach I use. I will be interested to see what evolves from the ideas other people have. Just food for thought.