Old 07-03-2011, 07:44 PM   #106
drMerry
Addict
drMerry has become one with the cosmos
 
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
perfectly working
Old 07-04-2011, 02:58 AM   #107
domee
Member
domee began at the beginning.
 
Posts: 18
Karma: 10
Join Date: Dec 2010
Device: none
Great plugin!
Old 07-15-2011, 07:58 PM   #108
saintly
Junior Member
saintly began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Jul 2011
Device: Kindle
@kiwidude: I love your plugin! Before another user tipped me off to it, I was using a lot of Perl scripts to manage my collection. It looks like you were way ahead of my efforts, and your plugin found hundreds of duplicates I missed.

If I may offer a suggestion:
I previously had lots of books with the series in the title ("Doctor Who: Something or Other" / "Star Trek: Something"). To detect duplicates, I used this technique:
- Fuzzy author match (same as yours: last name + first initial)
- Split the title on these characters: "-:;,&" and on the word "and"
- Alert for a possible match if any of those pieces matches a piece of another book's title
- Allow a piece to be 'whitelisted', so that it won't trip on 'Doctor Who' all the time

That allows me to detect "Doctor Who: Something or Other" and the book "Something or Other" by the same author. Additionally, it can detect combos like:
"Nightfall's Sequel"
"Nightfall; Nightfall's Sequel; The third Nightfall Book" (an e-book that includes the text of 3 other books, a somewhat rare occurrence)
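
The title-splitting technique described in this post could be sketched roughly as follows. This is only an illustration in Python, not code from the plugin or from the original Perl scripts; the function names `title_pieces` and `possible_duplicates` are hypothetical, the fuzzy author match is assumed to have been done separately, and whitelist entries are assumed to be lower-case.

```python
import re

# Split on the characters listed in the post ("-:;,&") and on the word "and".
SPLIT_PATTERN = re.compile(r"[-:;,&]|\band\b", re.IGNORECASE)


def title_pieces(title, whitelist=frozenset()):
    """Split a title into lower-cased pieces, dropping whitelisted ones."""
    pieces = set()
    for raw in SPLIT_PATTERN.split(title):
        piece = " ".join(raw.lower().split())  # collapse whitespace
        if piece and piece not in whitelist:
            pieces.add(piece)
    return pieces


def possible_duplicates(title_a, title_b, whitelist=frozenset()):
    """Return the pieces both titles share; an empty set means no alert."""
    return title_pieces(title_a, whitelist) & title_pieces(title_b, whitelist)


if __name__ == "__main__":
    whitelist = frozenset({"doctor who"})
    print(possible_duplicates("Doctor Who: Something or Other",
                              "Something or Other", whitelist))
```

One caveat of splitting on "-" is that hyphenated words are broken into separate pieces, which may produce extra alerts.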
Old 07-16-2011, 07:27 AM   #109
kiwidude
calibre/Sigil Developer
kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,224
Karma: 1334002
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
Thx @domee/@saintly, and welcome to MobileRead.

It is an interesting suggestion, and I can see your use case for it. Whether there is enough reason to justify the effort is always the question, as slotting in another algorithm would take a non-trivial amount of work. I don't have the time to seriously investigate it myself at the moment, but your suggestion is documented here so that it can be revisited in the future, which is great.
Old 07-22-2011, 11:24 AM   #110
whitespirit
Junior Member
whitespirit began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Jan 2009
Device: sony prs505
Catching Dutch duplicates

Hello,

I see that your plugin doesn't catch duplicates with titles like "De Verlossing" and "Verlossing, De".

Is there anything I can modify in the plugin to also catch these variants, e.g. ignoring words like "De", "Het" and "Een", or is this something you would have to program?

Kind regards,

whitespirit
Old 07-22-2011, 01:00 PM   #111
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by whitespirit View Post
Hello,

I see that your plugin doesn't catch duplicates with titles like "De Verlossing" and "Verlossing, De"

Is there anything I can modify to the plugin to get also these variants ? E.g. ignore words like "De", "Het" and "Een" or is this something you have to program ?

Kind regards,

whitespirit
Have you set the tweaks for your articles and language? I don't know if kiwidude is using them in this plugin, but the "articles" tweak controls sorting order for titles and Automerge's title matching algorithm.
Old 07-22-2011, 02:05 PM   #112
kiwidude
calibre/Sigil Developer
kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,224
Karma: 1334002
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
@starson17 - the Find Duplicates plugin has this line that would have been blatantly stolen from Automerge in the title find/replace patterns...
Code:
(tweaks.get('title_sort_articles', r'^(a|the|an)\s+'), ''),
I never did look into what it did
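
For readers wondering what a pattern like the one quoted above does, here is a minimal standalone sketch. It does not import calibre and is not the plugin's actual code: the helper names `article_patterns` and `normalise_title` are hypothetical, the Dutch article list is only an assumed example value, and the handling of a trailing ", De" is an extra assumption that goes beyond what the quoted leading-article pattern covers.

```python
import re


def article_patterns(articles):
    """Build leading and trailing article patterns from a list of words."""
    words = "|".join(re.escape(a) for a in articles)
    # Same shape as the quoted default r'^(a|the|an)\s+'.
    leading = re.compile(r"^(%s)\s+" % words, re.IGNORECASE)
    # Assumption: also catch the library-style form "Title, Article".
    trailing = re.compile(r",\s*(%s)$" % words, re.IGNORECASE)
    return leading, trailing


def normalise_title(title, articles=("a", "the", "an")):
    """Strip one leading or trailing article and lower-case the result."""
    leading, trailing = article_patterns(articles)
    title = title.strip()
    title = trailing.sub("", title)
    title = leading.sub("", title)
    return title.lower()


if __name__ == "__main__":
    dutch = ("de", "het", "een")  # assumed example list, not a calibre default
    print(normalise_title("De Verlossing", dutch))
    print(normalise_title("Verlossing, De", dutch))
```

With the English default, "De Verlossing" is left with its article, which matches the behaviour reported earlier in the thread; whether the plugin would match "Verlossing, De" once the tweak is changed is not confirmed here.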
Old 07-22-2011, 02:52 PM   #113
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kiwidude View Post
@starson17 - the Find Duplicates plugin has this line that would have been blatantly stolen from Automerge in the title find/replace patterns...
Code:
(tweaks.get('title_sort_articles', r'^(a|the|an)\s+'), ''),
I never did look into what it did
I blatantly stole it from whoever wrote the tweak.

When I first put together Automerge, the "articles" were hard-coded in English. I copied the hard-coded stuff. Later, I think it was Charles who pushed it into the Tweaks. He didn't find my little theft of the original hard coding, so he didn't replace my code. When someone complained that Automerge didn't respect the tweak, I tracked down his work and stole that too (or maybe it was Kovid's?).

Edit: @Whitespirit, you want to look in Preferences under Tweaks for this option. It refers to "articles" in quotes - I forget the exact name for it.

Last edited by Starson17; 07-22-2011 at 02:56 PM.
Old 07-31-2011, 08:22 AM   #114
rigolo
Junior Member
rigolo began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Jul 2011
Device: prs-600
Compare contents of EPUB files

I have been using your plugin to clean up the library that our kids pieced together here. It contained a lot of duplicates that I could find using the binary comparison option.

However, I am also finding duplicate EPUB files that are not binary equal.

Looking at the files shows that they are the same size, but within the EPUB zip there are some differences in the OPF file.

Here is an example:

Het loterijbriefje - Jules Verne.epub

This is an EPUB from Project Gutenberg (ebook #30929).

In the metadata section of the OPF there is a small change:
<dc:creator role="aut" file-as="Verne, Jules">Jules Verne</dc:creator>
<dc:identifier scheme="ISBN"></dc:identifier>

These two lines have been switched, making it (from a binary standpoint) a different EPUB, but content-wise it is 100% identical.

Is there a way to also find these kinds of duplicates? Just looking at the metadata alone will not guarantee that the actual contents are the same.

For now I used a trial version of Altova DiffDog to compare the contents of the two EPUB files, but it must be possible to do this automatically from within the plugin.

When doing the metadata compare, do you use the OPF from the calibre library, or the OPF contained inside the EPUB?
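
Outside the plugin, a comparison like the one described in this post could be approximated with a small standalone script. The sketch below is an assumption-laden illustration, not a plugin feature: it treats an EPUB as a plain zip archive, hashes every member except files ending in `.opf`, and calls two books "the same" when the remaining members match. The names `content_fingerprint` and `same_content` are hypothetical.

```python
import hashlib
import zipfile


def content_fingerprint(epub_path, ignore_suffixes=(".opf",)):
    """Map each archive member (except ignored ones) to a SHA-256 digest."""
    digests = {}
    with zipfile.ZipFile(epub_path) as archive:
        for info in archive.infolist():
            name = info.filename
            if info.is_dir() or name.lower().endswith(ignore_suffixes):
                continue  # skip folders and the metadata file(s)
            digests[name] = hashlib.sha256(archive.read(info)).hexdigest()
    return digests


def same_content(epub_a, epub_b):
    """True when both archives hold identical non-OPF members."""
    return content_fingerprint(epub_a) == content_fingerprint(epub_b)
```

This only works for zip-based formats such as EPUB, and any difference in a cover image or stylesheet still counts as a mismatch, so where to draw the line remains a judgment call.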
Old 07-31-2011, 08:40 AM   #115
kiwidude
calibre/Sigil Developer
kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,224
Karma: 1334002
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
@rigolo - welcome to MobileRead.

In answer to your question - none of the Find Duplicate comparisons ever look inside the format (e.g. inside the EPUB). Nor do they directly look at the opf files sitting in the directory. For all but the binary comparison they use the data stored inside the metadata.db database that Calibre uses to manage your library - in theory this should match what those metadata.opf files contain within each book's folder but as I said above they are not directly compared.

The binary comparison is exactly that - comparing effectively byte for byte that two files match.
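
As a rough illustration of what a binary comparison amounts to (a sketch under stated assumptions, not the plugin's actual implementation), files can be grouped by size and then by a hash of their bytes. The helper names `file_digest` and `binary_duplicates` are hypothetical, and a SHA-256 match is used here as a practical stand-in for a strict byte-for-byte check.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Hash a file in chunks so large books need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binary_duplicates(paths):
    """Return groups of paths whose contents are identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue  # a unique size cannot have a binary duplicate
        for path in same_size:
            groups[(size, file_digest(path))].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Two files of the same size that differ in a single byte, as in the OPF example above, end up in different groups.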

Trying to compare the internal contents of a book format using this plugin is not possible, and I have no desire to extend it to do so. It was discussed a little IIRC on the duplicates thread in the development forums. For a start it would be intolerably slow. Secondly it wouldn't work with all formats (you have mentioned EPUB only - this plugin looks for duplicates across all formats). And thirdly, where do you draw the line - what about a slightly different cover image, a tweak to the stylesheet, etc etc.

All this plugin can do is put you in the ballpark of telling you that two formats appear to be duplicates based on their title, authors etc that you have associated with them in Calibre. Whether in fact you decide their text contents are "near identical" as part of your resolution process to decide which to keep is a whole different kettle of fish, and not something I see it ever attempting to address. As I have mentioned several times before I see it as potentially something that an enhanced "SmartMerge" plugin could attempt to do. However I personally don't have a need for it any more (I have changed how I add my books to my library to negate the likelihood of duplicates in the first place) so I leave it to someone else to develop such a plugin...
Old 07-31-2011, 09:54 AM   #116
rigolo
Junior Member
rigolo began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Jul 2011
Device: prs-600
@kiwidude

Okay, clear answer, and I understand where it comes from. I would also like to change the way books are added in order to prevent duplicate entries, but when you are starting with a "messy library" these tools can help you up to a certain point. I was hoping this point was a bit further on, but from the "it should work for all books" point of view I can see why this plugin does not do that.
Old 07-31-2011, 01:02 PM   #117
capnm
Groupie
capnm knows the difference between 'who' and 'whom'
 
Posts: 150
Karma: 10001
Join Date: Feb 2011
Device: sony
@rigolo

I found kiwidude's Count Pages plugin helped me identify a bunch of "identical in content" duplicates.
I suppose two versions of a book with identical word counts might not actually be duplicates, but that's a risk I'm willing to ignore.
Old 08-05-2011, 11:32 AM   #118
Noughty
Addict
Noughty is cognizant of many things which escape those who dream only by night.
 
Posts: 352
Karma: 103850
Join Date: Apr 2011
Device: Kindle NT
For some reason the plugin ignores this:

Quote:
Searches either your entire library or respecting any search restriction set at the time you Find Duplicates.
It always searches all books, although limiting it to a search restriction worked before. Everything is up to date (plugin and calibre).
Old 08-05-2011, 06:16 PM   #119
capnm
Groupie
capnm knows the difference between 'who' and 'whom'
 
Posts: 150
Karma: 10001
Join Date: Feb 2011
Device: sony
@Noughty

It's working for me (and has been) --
Find Duplicates 1.1.4
Calibre 0.8.13

Do you have more detailed information about what you're trying?
Old 08-06-2011, 06:19 AM   #120
Noughty
Addict
Noughty is cognizant of many things which escape those who dream only by night.
 
Posts: 352
Karma: 103850
Join Date: Apr 2011
Device: Kindle NT
I found the problem. Before, I didn't need to choose "restrict to current search" (it was probably always selected).

After finally finding all the dupes I decided to fix them, and only by accident saw that I was about to delete different books. They have the same author and title. Apparently it is the same book divided into 3 parts. They even have the same ISBN. I was wondering if it is possible for the plugin to check the series field (it can show whether they are different books).

I was wondering the same about formats. Could it search for duplicates with different formats (the ones I would like to merge)? Right now you need to check formats manually.

Also, maybe it could search more by size? Calibre shows only 0,X MB. If it showed the size in more detail, in KB, it would be easier to see whether it is a dupe format.

Just throwing out ideas; maybe some can be explored and implemented.

Last edited by Noughty; 08-06-2011 at 06:57 AM.
Tags
cross library duplicates, in library duplicates
