.... and again duplicates ....

jekkii · 02-09-2011, 03:55 AM

I have been organizing and sorting my library, which until now lived on CDs, with Calibre. I keep finding new duplicates.
I don't know how Calibre (written in Python) finds duplicates (by comparing titles and authors?).
Earlier I used a program written in Visual Basic 6 and based on MS Access (I already mentioned it: http://depositfiles.com/de/files/xmgx3g3nr). It is fairly rudimentary, but one good thing was that duplicates were filtered very well. MD5 hashes of the book files were calculated during the scan, and files with the same hash were identified as duplicates. I have only a vague idea of what a hash is and how complicated it would be to integrate this process into Calibre, but the result was very good.
So if I have, e.g., a book with the title "Nice world" and the same book with the title "World nice" (because the scanner somehow didn't do its job properly), Calibre treats them as two different books although they are the same. The other way (by hash), they would have been identified as duplicates.
Best regards

kacir · 02-09-2011, 04:44 AM

Quote:

Originally Posted by jekkii

I have been organizing and sorting my library, which until now lived on CDs, with Calibre. I keep finding new duplicates.
I don't know how Calibre (written in Python) finds duplicates (by comparing titles and authors?).
Earlier I used a program written in Visual Basic 6 and based on MS Access (I already mentioned it: http://depositfiles.com/de/files/xmgx3g3nr). It is fairly rudimentary, but one good thing was that duplicates were filtered very well. MD5 hashes of the book files were calculated during the scan, and files with the same hash were identified as duplicates. I have only a vague idea of what a hash is and how complicated it would be to integrate this process into Calibre, but the result was very good.
So if I have, e.g., a book with the title "Nice world" and the same book with the title "World nice" (because the scanner somehow didn't do its job properly), Calibre treats them as two different books although they are the same. The other way (by hash), they would have been identified as duplicates.
Best regards

See this thread for a better discussion.
https://www.mobileread.com/forums/sho...d.php?t=118013
A system for duplicate detection is in the making.

There is a problem with your hash theory.
You can have a duplicate book where the version in Calibre is in txt format and the version you are adding is in epub. The two files will never have the same hash, and in that case you want the program to add the epub next to the txt.
I have many, many texts in Calibre that are in several formats.

itimpi · 02-09-2011, 06:08 AM

There is also the fact that a book can be a duplicate even though it is not byte-identical to an existing file. For instance, it might just have different metadata stored inside it.

The key point is that Calibre is working at the 'book' level and not the 'file' level when considering duplicates.
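
To illustrate the 'book' level versus 'file' level point, here is a minimal Python sketch. The byte strings are made-up stand-ins for two files that hold the same text with different embedded titles; real ebook files are structured differently, and this is not how Calibre compares books.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 digest of a byte string as hex text."""
    return hashlib.md5(data).hexdigest()


# Hypothetical file contents: identical book text, different embedded title.
book_text = b"Chapter 1. It was a nice world."
copy_a = b"title=Nice world\n" + book_text
copy_b = b"title=World nice\n" + book_text

# A byte-identical copy always hashes the same.
same_file = md5_of_bytes(copy_a) == md5_of_bytes(bytes(copy_a))

# The same book with different metadata bytes hashes differently.
same_book_different_metadata = md5_of_bytes(copy_a) == md5_of_bytes(copy_b)

print(same_file)                     # True
print(same_book_different_metadata)  # False
```

So a hash catches exact file copies, but it says nothing about two files that are the same book with different bytes inside.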

DoctorOhh · 02-09-2011, 07:07 AM

Quote:

Originally Posted by itimpi

Quote:

Originally Posted by jekkii

So if I have, e.g., a book with the title "Nice world" and the same book with the title "World nice" (because the scanner somehow didn't do its job properly), Calibre treats them as two different books although they are the same. The other way (by hash), they would have been identified as duplicates.

There is also the fact that a book can be a duplicate even though it is not byte-identical to an existing file. For instance, it might just have different metadata stored inside it.

The key point is that Calibre is working at the 'book' level and not the 'file' level when considering duplicates.

Excellent point. Anyone wishing to use MD5 hashes to find dupes should do so prior to adding the books. Just viewing an epub in calibre changes the book file, and thus the hash, because a bookmark is added or changed.

kiwidude · 02-09-2011, 08:20 AM

Quote:

Originally Posted by dwanthny

Excellent point. Anyone wishing to use MD5 hashes to find dupes should do so prior to adding the books.

Couldn't agree more - there are reasons why we haven't bothered to mention any kind of hash-based duplicate comparison in the proposals for duplicate functionality in Calibre, and you have covered most of them.

Personally I am very much in the "before you add to Calibre" camp. Why waste your time cleaning up filenames (or fixing up metadata inside Calibre)? Just run a hash comparison first, using any one of the free utilities available on the internet, across your source folder and your Calibre library, then delete the duplicates from the source folder. Don't delete directly from Calibre's folders though - or if you do, you will need to run one of the repair database options so that Calibre's internal database reflects the fact that a book format is no longer present.
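
A hash comparison of this kind can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not one of the utilities mentioned above and not part of Calibre; the function names `file_md5` and `find_duplicates` and the folder names in the usage note are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(*folders):
    """Group files under the given folders by MD5 and keep groups of two or more."""
    by_hash = defaultdict(list)
    for folder in folders:
        # rglob("*") walks the folder recursively; directories are skipped.
        for path in sorted(Path(folder).rglob("*")):
            if path.is_file():
                by_hash[file_md5(path)].append(path)
    return {md5: paths for md5, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_duplicates("incoming_books", "Calibre Library")` (both folder names are placeholders) returns a dict that maps each MD5 digest to the list of files sharing it. The sketch only reports matches; which copy to delete is left to you, and as noted above it is safer to delete from the source folder.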

Quote:

Just viewing an epub in calibre changes the book file, and thus the hash, because a bookmark is added or changed.

Ok, so this used to "always" be the case. Then a number of us campaigned for an option such that the EPUB would *not* be touched. Modifying the file screws with incremental backups, and obviously with hash comparisons, for those of us who have no interest in "touching" an EPUB just by opening it in the viewer. I wrote a hacky patch, then chaley (from memory) did the job "properly" in a Calibre release a few months ago.

In the ebook viewer preferences, if you disable "Remember the current page when quitting" and don't add bookmarks, then your EPUB should remain untouched - or at least that was the hope.
