04-26-2011, 05:08 PM   #1
kiwidude
calibre/Sigil Developer
[GUI Plugin] Find Duplicates

This plugin will help you to identify duplicate authors, titles, formats, series, publishers, tags and identifiers in your Calibre libraries.
  • Duplicate authors are where you have multiple variants of an author due to spacing, punctuation, spelling differences or word order. e.g. Kevin Anderson / Kevin J. Anderson / Keven Anderson / Anderson, Kevin / Anderson Kevin / Bloggs, Joe & Anderson, Kevin
  • Duplicate titles are where you have multiple book entries with either the same or varying titles. e.g. Martian Way / The Martian Way / The Martian Way (2010) / The Martian Way and Other Stories
  • Duplicate formats are where the contents of a particular format, such as ePub, are byte-for-byte identical to another copy in your library.
The plugin offers a variety of matching algorithms for finding groups of possible duplicates. Each combination of algorithms gives a different trade-off between the number of genuine duplicates found and the number of false positives (near duplicates).
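
As a rough illustration of what a "similar" style match can look like, here is a minimal sketch. It is not the plugin's actual code: the function names, the list of leading articles and the normalization rules are all assumptions made for the example. The idea is simply to reduce each title and author to a key and group the books that share one.

```python
import re
from collections import defaultdict

# Hypothetical list for this sketch; the plugin's own ignored words are not shown in the post.
LEADING_ARTICLES = ("the ", "a ", "an ")

def similar_title_key(title):
    """Lowercase, drop one leading article, then keep only ASCII letters and digits."""
    key = title.lower().strip()
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return re.sub(r"[^a-z0-9]", "", key)

def similar_author_key(author):
    """Lowercase, replace punctuation with spaces, and sort the name parts so word order is ignored."""
    parts = re.sub(r"[^a-z0-9\s]", " ", author.lower()).split()
    return " ".join(sorted(parts))

def group_candidates(books):
    """Group (book_id, title, author) tuples by key and keep only groups of two or more."""
    groups = defaultdict(list)
    for book_id, title, author in books:
        groups[(similar_title_key(title), similar_author_key(author))].append(book_id)
    return [ids for ids in groups.values() if len(ids) > 1]

books = [
    (1, "Martian Way", "Kevin Anderson"),
    (2, "The Martian Way", "Anderson, Kevin"),
    (3, "The Martian Way and Other Stories", "Kevin Anderson"),
]
print(group_candidates(books))  # [[1, 2]]
```

Book 3 is not grouped here because its key differs. Catching that kind of variation is where a looser algorithm would come in, at the cost of more false positives.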

When the search is complete the results of each group are presented to you to navigate through. You can then do one of three things:
  • If the group contains genuine duplicates, use the existing Merge feature in the Edit metadata menu to resolve the duplicate book entries.
  • If the group contains non-duplicates, you can mark the group as exempt to prevent those books or authors from appearing together in future searches.
  • Skip the group for now and move on to the next one, either to defer your decision or so that you can mark all remaining groups as exempt when you have finished.
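
One way to picture exemptions is as a set of pairs that should never be reported together again. The sketch below assumes exactly that; the function names and the pair-based storage are invented for illustration and may not match how the plugin stores its exemptions.

```python
from itertools import combinations

def mark_exempt(exemptions, group):
    """Record every pair of ids in the group as 'not duplicates of each other'."""
    for a, b in combinations(group, 2):
        exemptions.add(frozenset((a, b)))

def remaining_pairs(exemptions, group):
    """Return the pairs in a candidate group that have not been marked exempt."""
    return [
        (a, b)
        for a, b in combinations(sorted(group), 2)
        if frozenset((a, b)) not in exemptions
    ]

exemptions = set()
mark_exempt(exemptions, [10, 11])
print(remaining_pairs(exemptions, [10, 11]))      # []
print(remaining_pairs(exemptions, [10, 11, 12]))  # [(10, 12), (11, 12)]
```

Using `frozenset` for each pair means the order of the two ids does not matter when checking for an exemption.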

New in version 1.4 is a "Find metadata variations" menu, which lets you find variations of author, publisher, series and tag names and rename them directly in that dialog. Again, a number of different matching algorithms are available.

Version 1.5 added the ability to perform duplicate comparisons across multiple libraries. For instance, if you have a "working" library and a "main" library, you can search for duplicates between those libraries with the same range of algorithms and produce a report for later resolution.
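
Conceptually, a cross-library comparison can be thought of as indexing one library by a matching key and then looking up each book of the other library in that index. The sketch below assumes each library is just a list of (book_id, title, author) tuples and uses a simple case-insensitive key chosen for illustration; the names and the report format are hypothetical, not the plugin's.

```python
def title_author_key(title, author):
    """Illustrative key: case-insensitive, surrounding whitespace ignored."""
    return (title.strip().lower(), author.strip().lower())

def library_duplicates(working, main):
    """Return report lines for books in `working` that also appear in `main`."""
    index = {}
    for book_id, title, author in main:
        index.setdefault(title_author_key(title, author), []).append(book_id)
    report = []
    for book_id, title, author in working:
        matches = index.get(title_author_key(title, author))
        if matches:
            report.append(
                f"{title} / {author}: working id {book_id} matches main ids {matches}"
            )
    return report

working = [(1, "Dune", "Frank Herbert"), (2, "Emma", "Jane Austen")]
main = [(7, "dune ", "frank herbert")]
for line in library_duplicates(working, main):
    print(line)
```

Swapping in a looser key function would give the same report with a fuzzier notion of what counts as a match.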

Main Features of v1.6.2:
  • Search either your entire library or only the books within any search restriction that is set at the time you run Find Duplicates.
  • Choose your desired combination of title and author matching from any of "identical", "similar", "soundex", "fuzzy" or "ignore" algorithms.
  • Choose alternative algorithms such as matching identifiers or binary comparison.
  • View the results either one group at a time, or showing all duplicate candidates at once using highlighting to show the groups.
  • When doing author duplicate searches (ignore title), optionally highlight the authors under consideration in the tag browser for ease of renaming
  • Sort the result groups either by title/author (default) or by the size of the group
  • Fine-tune the soundex algorithm options to make the matching "fuzzier" or more exact.
  • Optionally include the languages field when comparing titles, so that intentionally keeping the same book title in different languages does not show up as duplicates.
  • Optionally have binary duplicate formats automatically removed from your library when doing a binary comparison.
  • Mark the current group as exempt or all groups as exempt from appearing as duplicates again
  • Review your duplicate exemptions, with the opportunity to reverse an exemption so that the items are considered as duplicates again
  • Exempt either individual books (title searches) or authors (author searches)
  • Clicking the clear search button, setting a different restriction or choosing an explicit Clear duplicate results menu option will exit duplicate search mode.
  • Switching libraries or restarting Calibre will also clear any duplicate search results. Your exemptions are remembered and are stored per library.
  • Customize the keyboard shortcuts for a number of the menu options.
  • Find metadata variations for authors, publishers, series and tags to eradicate unwanted duplicates with an alternative simplified UI to rename them.
  • Find duplicates across multiple libraries, producing a report.
  • When placed on the toolbar, clicking the toolbar button without duplicate groups displayed will display the Find Duplicates options dialog. When results are displayed, clicking on the button will move to the next result. Ctrl+click or shift+click to navigate to the previous result.
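
For the binary comparison feature, a common way to find files with exactly the same content is to group them by size first and only hash the files whose sizes collide. The following is a minimal sketch of that general technique, not the plugin's implementation; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large books are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def binary_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    by_digest = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have an identical twin
        for path in same_size:
            by_digest[(size, file_digest(path))].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

Calling `binary_duplicates` with the paths of all ePub files in a library would return one list per set of identical files. Any automatic removal would then need a rule for which copy to keep.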

Special Notes:
  • Requires Calibre 0.8.59 or later.

Installation Notes:
Suggested Workflow:
Here are some tips to help get you started:
  • Finding duplicates is an iterative, multi-step process. The order is entirely up to you; however, a little planning can help reduce the number of possible book combinations you have to consider.
  • If your library is small you can do fewer "passes" if you choose, as the number of results returned by the fuzzy/soundex matches may be less intimidating.
  • I like to start with resolving duplicate authors first (set title match to "ignore"). Managing your authors first means that you will later be able to use an "identical" author match and have a higher likelihood of genuine duplicate titles. You may also find the new Find metadata variations dialog ideal for this purpose.
  • Start with the most likely duplicates first - such as an ignore title/similar author search. Then progress the author matching to fuzzy/soundex matches to uncover your other author variations and misspellings.
  • I like to use the Search the Internet plugin to view the authors on FantasticFiction.co.uk etc. to verify that variations of a name are not genuinely different authors.
  • Now you can repeat the process for your titles. Set the author search to "identical", and progress your title searches as you prefer.
  • Throughout the process make use of the exemptions feature. This will prevent the need to reconsider those particular combinations of authors or titles again in future.
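
To give a feel for why a soundex pass uncovers misspellings that a "similar" pass misses, here is a sketch of the classic American Soundex code with an adjustable length. Treating the code length as the tunable option is an assumption for this example; the plugin's own soundex settings may work differently.

```python
# Letter-to-digit table for classic American Soundex.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(name, length=4):
    """Return the Soundex code of `name`, truncated or zero-padded to `length`."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the run, so a repeated code is kept
    return result[:length].ljust(length, "0")

print(soundex("Kevin"), soundex("Keven"))                # K150 K150
print(soundex("Anderson"), soundex("Andrews"))           # A536 A536
print(soundex("Anderson", 6), soundex("Andrews", 6))     # A53625 A53620
```

A shorter code is "fuzzier": at length 4 Anderson and Andrews collide, while at length 6 they are told apart. That is the kind of trade-off to keep in mind when moving from similar to soundex matching.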

Paypal Donations:
  • If you find this or any of my other plugins useful please feel free to show your appreciation. I have spent many hundreds of unpaid hours in their development and support so any encouragement for me to continue is appreciated!

Version History:

Version 1.6.2 - 28 Jul 2014
Support for upcoming calibre 2.0

Version 1.6.1 - 03 Jan 2013
Fix for when comparing library duplicates to ensure saved searches are not corrupted.

Version 1.6.0 - 29 Oct 2012
Change "ISBN Compare" to "Identifier" with a dropdown allowing comparison of any identifier field.
Add a context menu to the metadata variations list to allow choosing the selected name on the right side.

Version 1.5.3 - 14 Aug 2012
When using "Find library duplicates" display all duplicate matches for the current library as marked:duplicate (except for author duplicates)

Version 1.5.2 - 21 Jul 2012
When using "Find library duplicates" clear the current search in order to compare the entire restricted library
When using "Find metadata variations" and showing books, fire the search again to ensure results reflect the search

Version 1.5.1 - 21 Jul 2012
Add a "Save log" button for the "Find library duplicates" result screen.

Version 1.5.0 - 20 Jul 2012
Add a "Find library duplicates" option for cross-library duplicate comparisons into a log report
If a duplicate book search is active when a metadata variation search is executed, clear the search first

Version 1.4.0 - 17 Jul 2012
Now requires calibre 0.8.59
Add a Find metadata variations option to search for author, series, publisher and tag variations, and allow renaming them from the dialog.
Fix bug in fuzzy author comparisons: a reverse hash is no longer computed, which reduces the false positives it generated

Version 1.3.0 - 22 Jun 2012
Now requires calibre 0.8.57
Store configuration in the calibre database rather than a json file, to allow reuse from different computers (not simultaneously!)
Add a support option to the configuration dialog allowing viewing the plugin data stored in the database
Add an option to allow automatic removal of binary duplicates (does not delete book records, only the newest copies of that format).

Version 1.2.3 - 02 Dec 2011
Make the languages comparison optional (default false) via a checkbox on the Find Duplicates dialog

Version 1.2.2 - 25 Nov 2011
Take the languages field into account when doing title based duplicate comparisons

Version 1.2.1 - 12 Nov 2011
When selecting ISBN or Binary compare, hide the Title/Author groupbox options
Some cosmetic additions to the text for ISBN/Binary options

Version 1.2.0 - 11 Sep 2011
Fix bug for when switching to an ignore title search where author search was previously set to ignore.
Remove customisation of shortcuts on tab, to use Calibre's centrally managed shortcuts instead.

Version 1.1.4 - 04 Jul 2011
Additional fix for stuff broken by Calibre 0.8.8 in the tag view
Fix for removing an author exemption

Version 1.1.3 - 03 Jul 2011
Preparation for deprecation of db.format_abspath() for networked backend

Version 1.1.2 - 03 Jul 2011
Fix for issue with Calibre 0.8.8 tag browser search_restriction refactoring

Version 1.1.1 - 12 Jun 2011
Add van to list of ignored author words
Fix bug of error dialog not referenced correctly

Version 1.1 - 3 May 2011
Add support for binary comparison searches to find book formats with exactly the same content
Replace how exemptions are stored in the config file to make more scalable
No longer calculate exemption preview detailed messages for the confirmation dialog for performance
Compare multiple authors for most author algorithms to increase duplicate coverage.
Change Manage exemptions dialog to have a tab for each author with exemptions, and show a section only if it has exemptions
Include swapping author name order in all but identical author checks. So A B / B A or A,B / B,A will match.
Disable the Ignore title, identical author combination as it is not a valid one (never duplicates)
Allow the remove, mark current and mark all group exemption dialogs to be hidden from showing again.
Allow the various result count and no result information dialogs to be hidden from showing again.
Allow user to reset confirmation dialogs related to find duplicates from the configuration dialog

Version 1.0 - 26 Apr 2011
Initial release of Find Duplicates plugin

Attached Thumbnails:
  • Screenshot_1_Toolbar.png
  • Screenshot_2_Configuration.png
  • Screenshot_2_Options.png
  • Screenshot_3_ManageExemptions.png
  • Screenshot_4_Metadata_Variations.png
  • Screenshot_5_Library_Duplicates.png
Attached Files
File Type: zip Find Duplicates-qt5.zip (57.5 KB, 4868 views)

Last edited by kovidgoyal; 07-28-2014 at 03:27 AM. Reason: v1.6.2 Released