07-03-2011, 07:44 PM | #106 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
perfectly working
|
07-04-2011, 02:58 AM | #107 |
Member
Posts: 18
Karma: 10
Join Date: Dec 2010
Device: none
|
Great plugin!
|
07-15-2011, 07:58 PM | #108 |
Junior Member
Posts: 1
Karma: 10
Join Date: Jul 2011
Device: Kindle
|
@kiwidude: I love your plugin! Before another user tipped me off to it, I was using a lot of Perl scripts to manage my collection. It looks like you were way ahead of my efforts, and your plugin found hundreds of duplicates I missed.
If I may offer a suggestion: I previously had lots of books with the series in the title ("Doctor Who: Something or Other" / "Star Trek: Something"). In order to detect duplicates, I used this technique:
- fuzzy author match (same as yours: lastname + 1st initial)
- Split up the title on these characters: "-:;,&" and the word "and".
- Alert for a possible match if any of those pieces matched any other books
- Allow for a piece to be 'whitelisted', so that it won't trip on 'Doctor Who' all the time
That allows me to detect "Doctor Who: Something or Other" and the book "Something or Other" by the same author. Additionally, it can detect combos like:
"Nightfall's Sequel"
"Nightfall; Nightfall's Sequel; The third Nightfall Book" (an e-book that includes the text of 3 other books, a somewhat rare occurrence) |
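The title-splitting technique described in the post above could be sketched roughly as follows. This is only an illustration, not code from the Find Duplicates plugin or from the original Perl scripts; the function names (`title_pieces`, `author_key`, `possible_duplicates`) are hypothetical, and the author key assumes names are written "First Last".

```python
import re

# Separators named in the post: the characters - : ; , & and the word "and".
_SPLIT_RE = re.compile(r"[-:;,&]|\band\b", re.IGNORECASE)


def title_pieces(title, whitelist=()):
    """Split a title into normalised pieces, dropping whitelisted ones."""
    ignored = {w.strip().lower() for w in whitelist}
    pieces = set()
    for part in _SPLIT_RE.split(title):
        piece = " ".join(part.lower().split())  # collapse whitespace
        if piece and piece not in ignored:
            pieces.add(piece)
    return pieces


def author_key(author):
    """Fuzzy author key: last name plus first initial (assumes 'First Last')."""
    names = author.lower().split()
    if not names:
        return ""
    return names[-1] + " " + names[0][0]


def possible_duplicates(books, whitelist=()):
    """Return (title, title) pairs sharing an author key and a title piece.

    `books` is a list of (title, author) tuples.
    """
    keyed = [(t, author_key(a), title_pieces(t, whitelist)) for t, a in books]
    matches = []
    for i, (t1, a1, p1) in enumerate(keyed):
        for t2, a2, p2 in keyed[i + 1:]:
            if a1 == a2 and p1 & p2:
                matches.append((t1, t2))
    return matches
```

With "Doctor Who" on the whitelist, "Doctor Who: Something or Other" and "Something or Other" by the same author are reported as a pair, while two unrelated "Doctor Who: ..." titles are not. Note that the pairwise loop is quadratic in the number of books, which hints at why slotting this into a large library scan is not trivial.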
07-16-2011, 07:27 AM | #109 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Thx @domee/@saintly, and welcome to MobileRead.
It is an interesting suggestion, and I can see your use case for it. Whether there is enough reason to justify the effort is always the question, as slotting in another algorithm would require a non-trivial amount of work. I don't have the time to seriously investigate it myself at the moment, but we have your suggestion documented here so that it may be revisited in the future, which is great. |
07-22-2011, 11:24 AM | #110 |
Member
Posts: 20
Karma: 10
Join Date: Jan 2009
Device: pocketbook PB741
|
Catching Dutch duplicates
Hello,
I see that your plugin doesn't catch duplicates with titles like "De Verlossing" and "Verlossing, De". Is there anything I can modify in the plugin to catch these variants as well, e.g. ignore words like "De", "Het" and "Een", or is this something you have to program? Kind regards, whitespirit |
07-22-2011, 01:00 PM | #111 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
07-22-2011, 02:05 PM | #112 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@starson17 - the Find Duplicates plugin has this line that would have been blatantly stolen from Automerge in the title find/replace patterns...
Code:
(tweaks.get('title_sort_articles', r'^(a|the|an)\s+'), ''), |
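To illustrate how an article pattern like the one quoted above might apply to the Dutch titles from the question, here is a minimal standalone sketch. The quoted pattern is anchored at the start of the title, so it only covers the "De Verlossing" form; the trailing ", De" handling below is an assumption added for illustration, as are the Dutch article list and the `normalise_title` name. It does not read calibre's tweaks and is not the plugin's code.

```python
import re

# Assumed Dutch article list, taken from the question ("De", "Het", "Een").
# In calibre the leading-article pattern comes from the title_sort_articles
# tweak; it is hard-coded here so that the sketch runs on its own.
ARTICLES = r"(?:de|het|een)"
LEADING = re.compile(r"^" + ARTICLES + r"\s+", re.IGNORECASE)
# Hypothetical extra rule: also strip an article moved to the end ("Title, De").
TRAILING = re.compile(r",\s*" + ARTICLES + r"$", re.IGNORECASE)


def normalise_title(title):
    """Lower-case a title and strip a leading or trailing article."""
    title = title.strip()
    title = LEADING.sub("", title)
    title = TRAILING.sub("", title)
    return title.lower()
```

Under these assumptions, "De Verlossing" and "Verlossing, De" both normalise to the same string, while a title such as "Deken" is left alone because the article must be followed by whitespace.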
07-22-2011, 02:52 PM | #113 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
When I first put together Automerge, the "articles" were hard coded in English. I copied the hard coded stuff. Later, I think it was Charles who pushed it into the Tweaks. He didn't find my little theft of the original hard coding, so he didn't replace my code. When someone complained that Automerge didn't respect the tweak, I tracked down his work and stole that too (or maybe it was Kovid's?).
Edit: @Whitespirit, you want to look in Preferences under Tweaks for this option. It refers to "articles" in quotes - I forget the exact name for it.
Last edited by Starson17; 07-22-2011 at 02:56 PM. |
|
07-31-2011, 08:22 AM | #114 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2011
Device: prs-600
|
compare contents of epub files
I have been using your plugin to clean up the library that was pieced together here by our kids. It contained a lot of duplicates that I could find using the binary comparison option.
However, I am also finding duplicate epub files that are not binary equal. Looking at the files shows that they are the same size, but within the "epub zip" there are some differences in the opf file. Here is an example:
Het loterijbriefje - Jules Verne.epub
This is an epub from the Gutenberg project (ebook #30929). In the metadata section of the opf there is a small change:
<dc:creator role="aut" file-as="Verne, Jules">Jules Verne</dc:creator>
<dc:identifier scheme="ISBN"></dc:identifier>
These 2 lines have been switched, making it (from a binary standpoint) a different epub, but content-wise it is 100% identical. Is there a way to also find these kinds of duplicates? Just looking at the metadata alone will not guarantee that the actual contents are the same. For now I used a trial version of Altova DiffDog to compare the contents of the two epub files, but it must be possible to do this automatically from within the plugin. When doing the metadata compare, do you use the opf from the calibre library, or the opf contained inside the epub? |
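For the specific case described above (two EPUBs that differ only in their opf), one possible standalone approach is to hash every member of the zip archive except the opf. This is a rough sketch outside the plugin, not a feature of Find Duplicates; the function name `epub_content_digest` and the choice to skip files ending in ".opf" are assumptions made for illustration.

```python
import hashlib
import zipfile


def epub_content_digest(path_or_file, skip_suffixes=(".opf",)):
    """Hash every member of an EPUB (a zip archive) except skipped ones.

    Members are processed in sorted name order so that the order in which
    files were added to the archive does not affect the result.
    """
    digest = hashlib.sha256()
    with zipfile.ZipFile(path_or_file) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith(skip_suffixes):
                continue  # ignore metadata-only differences in the opf
            digest.update(name.encode("utf-8"))
            digest.update(b"\0")
            # Hash the decompressed bytes, so compression settings don't matter.
            digest.update(hashlib.sha256(archive.read(name)).digest())
    return digest.hexdigest()
```

Two files with equal digests have identical non-opf contents. Any other difference, such as a changed cover image or stylesheet, still produces a different digest, and every archive has to be read and decompressed in full, so this would be slow on a large library.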
07-31-2011, 08:40 AM | #115 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@rigolo - welcome to MobileRead.
In answer to your question - none of the Find Duplicate comparisons ever look inside the format (e.g. inside the EPUB). Nor do they directly look at the opf files sitting in the directory. For all but the binary comparison they use the data stored inside the metadata.db database that Calibre uses to manage your library - in theory this should match what those metadata.opf files contain within each book's folder but as I said above they are not directly compared. The binary comparison is exactly that - comparing effectively byte for byte that two files match.
Trying to compare the internal contents of a book format using this plugin is not possible, and I have no desire to extend it to do so. It was discussed a little IIRC on the duplicates thread in the development forums. For a start it would be intolerably slow. Secondly it wouldn't work with all formats (you have mentioned EPUB only - this plugin looks for duplicates across all formats). And thirdly, where do you draw the line - what about a slightly different cover image, a tweak to the stylesheet, etc etc.
All this plugin can do is put you in the ballpark of telling you that two formats appear to be duplicates based on their title, authors etc that you have associated with them in Calibre. Whether in fact you decide their text contents are "near identical" as part of your resolution process to decide which to keep is a whole different kettle of fish, and not something I see it ever attempting to address.
As I have mentioned several times before I see it as potentially something that an enhanced "SmartMerge" plugin could attempt to do. However I personally don't have a need for it any more (I have changed how I add my books to my library to negate the likelihood of duplicates in the first place) so I leave it to someone else to develop such a plugin... |
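As a rough illustration of what a byte-for-byte check amounts to, the sketch below groups files whose contents are identical by comparing SHA-256 digests. It is an assumption-laden stand-in, not the plugin's actual implementation; the helper names `file_digest` and `binary_duplicate_groups` are hypothetical, and a hash is swapped in here for a literal byte comparison.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binary_duplicate_groups(paths):
    """Group paths whose file contents hash to the same digest."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    # Only groups with more than one file are duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Because any single changed byte gives a different digest, two EPUBs that differ only in the order of two lines in their opf land in different groups, which matches the behaviour described in the post.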
07-31-2011, 09:54 AM | #116 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2011
Device: prs-600
|
@kiwidude
OK, clear answer, and I understand where it comes from. I would also like to change the way books are added in order to prevent duplicate entries, but when you are starting with a "messy library" these tools can help you up to a certain point. I was hoping this point was a bit further on, but from the "it should work for all books" point of view I can see why this plugin does not do that. |
07-31-2011, 01:02 PM | #117 |
Groupie
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony
|
@rigolo
I found kiwidude's Count Pages plugin helped me identify a bunch of "identical in content" duplicates. I suppose two versions of a book with identical word counts might not actually be duplicates, but that's a risk I'm willing to ignore |
08-05-2011, 11:32 AM | #118 | |
Addict
Posts: 352
Karma: 103850
Join Date: Apr 2011
Device: Kindle NT
|
For some reason plug in ignores the
Quote:
|
|
08-05-2011, 06:16 PM | #119 |
Groupie
Posts: 156
Karma: 10001
Join Date: Feb 2011
Device: sony
|
@Noughty
It's working for me (and has been) with Find Duplicates 1.1.4 and Calibre 0.8.13. Do you have more detailed information about what you're trying? |
08-06-2011, 06:19 AM | #120 |
Addict
Posts: 352
Karma: 103850
Join Date: Apr 2011
Device: Kindle NT
|
I found the problem. Before, I didn't need to choose "restrict to current search" (it was probably always chosen).
After finally finding all dupes I decided to fix them and only accidentally saw that I was about to delete different books. They have the same author and title. Apparently it is the same book divided into 3 parts. They even have the same ISBN. I was wondering if it is possible for the plugin to check the series field (it can show whether they are different books). I was wondering the same about formats. Could it search for duplicates with different formats (the ones I would like to merge)? As it is now, you need to check formats manually. Also, maybe it could search more by size? Calibre shows only 0.X MB. If it showed the size in more detail, in KB, it would be easier to see whether it is a duplicate format. Just throwing ideas out there; maybe some can be explored and implemented. Last edited by Noughty; 08-06-2011 at 06:57 AM. |