Old 02-09-2011, 03:55 AM   #1
jekkii
Member
.... and again duplicates ....

Until now my library has lived on CDs, and I am systematizing and sorting it with Calibre. I keep finding new duplicates.
I don't know how Calibre's Python code goes about finding duplicates (comparing titles and authors?).
Earlier I used a program written in Visual Basic 6 and based on MS Access (I already mentioned it, this one: http://depositfiles.com/de/files/xmgx3g3nr). It is fairly rudimentary, but one good thing was that duplicates were filtered really well. MD5 hashes of the book files were calculated during the scan, and files with the same hash were identified as duplicates. I have only a vague idea of what a hash is and how complicated it would be to integrate this process into Calibre, but the result was very good.
So if I have, e.g., a book with the title "Nice world" and the same book with the title "World nice" (because the scanner somehow didn't do its job right), Calibre treats them as two different books although they are the same. The other way (per hash), they would have been identified as duplicates.
Best regards
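
For readers unsure what the hash approach amounts to, here is a minimal Python sketch of the idea described above: compute an MD5 digest of every file and treat files with the same digest as byte-identical duplicates. This is only an illustration, not the Visual Basic tool and not anything Calibre does; the function names `md5_of_file` and `find_byte_identical` are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_byte_identical(folder):
    """Group files under `folder` by MD5 and keep groups with more than one file."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            groups[md5_of_file(path)].append(path)
    # Only digests shared by two or more files indicate duplicates.
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Because the digest depends only on the file's bytes, "Nice world" and "World nice" would land in the same group as long as the two files are byte-for-byte identical, whatever they are named.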
Old 02-09-2011, 04:44 AM   #2
kacir
Wizard
See this thread for a better discussion:
https://www.mobileread.com/forums/sho...d.php?t=118013
A system for duplicate detection is in the making.

There is a problem with your hash theory.
You can have a duplicate book where the version in Calibre is in txt format and the version you are adding is in epub. Their hashes will differ, yet you want the program to add the epub next to the txt.
I have many, many texts in Calibre that are in several formats.
Old 02-09-2011, 06:08 AM   #3
itimpi
Wizard
There is also the fact that a book can be a duplicate even though it is not byte-identical to an existing file. For instance, it might just have different metadata stored inside it.

The key point is that Calibre is working at the 'book' level and not the 'file' level when considering duplicates.
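
To make the 'book' versus 'file' distinction concrete, here is a rough Python sketch of a book-level comparison. It is purely illustrative and is not how Calibre actually matches books: `book_key` is a hypothetical helper that lowercases the title and author, drops punctuation, and ignores word order, so the result does not depend on the file format or on the bytes inside the file.

```python
import re


def book_key(title, author):
    """Build a crude book-level key: lowercase words, punctuation dropped, word order ignored."""
    def words(text):
        return tuple(sorted(re.findall(r"\w+", text.lower())))
    return words(title), words(author)


# Hypothetical entries: the same book once as a txt file and once as an epub.
txt_entry = book_key("Nice world", "Jane Doe")
epub_entry = book_key("World nice", "Doe, Jane")
print(txt_entry == epub_entry)  # prints True
```

A key this loose will also produce false matches (two different books whose titles use the same words), which is one reason book-level duplicate detection needs more thought than a plain hash comparison.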
Old 02-09-2011, 07:07 AM   #4
DoctorOhh
US Navy, Retired
Quote:
Originally Posted by itimpi
Quote:
Originally Posted by jekkii
So if I have, e.g., a book with the title "Nice world" and the same book with the title "World nice" (because the scanner somehow didn't do its job right), Calibre treats them as two different books although they are the same. The other way (per hash), they would have been identified as duplicates.
There is also the fact that a book can be a duplicate even though it is not byte-identical to an existing file. For instance, it might just have different metadata stored inside it.

The key point is that Calibre is working at the 'book' level and not the 'file' level when considering duplicates.
Excellent point. Anyone wishing to use MD5 hashes to find dupes should do so prior to adding the books. Just viewing an epub in calibre changes the book file, and thus the hash, because a bookmark is added or changed.
Old 02-09-2011, 08:20 AM   #5
kiwidude
Calibre Plugins Developer
Quote:
Originally Posted by dwanthny
Excellent point. Anyone wishing to use MD5 hashes to find dupes should do so prior to adding the books.
Couldn't agree more - there are reasons why we haven't bothered to mention any kind of hash duplicate comparison in the proposals for duplicate functionality in Calibre, and you have covered most of them.

Personally I am very much in the "before you add to Calibre" camp. Why waste your time cleaning up filenames of files (or fixing up metadata inside Calibre)? Just run a hash comparison first, using any one of the free utilities out there on the internet, across your source folder and your Calibre library, then delete the matches from the source folder. Don't delete directly from Calibre's folders though - or if you do, you will need to run one of the repair database options so that Calibre's internal database reflects the fact that a book format is no longer present.
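
If you would rather script that check than use a ready-made utility, a minimal Python sketch could look like the following. It assumes the source folder and the Calibre library are two separate directories, it only reports matches and never deletes anything, and the names `file_md5` and `already_in_library` are invented for this example.

```python
import hashlib
from pathlib import Path


def file_md5(path):
    """Hex MD5 of a file's bytes, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def already_in_library(source_dir, library_dir):
    """Return source files whose bytes match some file under library_dir."""
    # Hash every library file once, then look up each source file in that set.
    library_hashes = {
        file_md5(p) for p in Path(library_dir).rglob("*") if p.is_file()
    }
    return [
        p for p in sorted(Path(source_dir).rglob("*"))
        if p.is_file() and file_md5(p) in library_hashes
    ]
```

The returned paths are candidates for deletion from the source folder only. Nothing under the library directory is touched, so Calibre's database stays consistent.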
Quote:
Just viewing an epub in calibre changes the book file and thus the hash due to adding or changing the bookmark.
Ok, so this used to "always" be the case. Then a number of us campaigned to have an option available such that the EPUB would *not* be touched. Touching it screws with incremental backups, and obviously with hash comparisons, for those of us who have no interest in "touching" an EPUB just by opening it in the viewer. I wrote a hacky patch, then chaley (from memory) did the job "properly" in a Calibre release a few months ago.

In the ebook viewer preferences, if you disable "Remember the current page when quitting" and don't add bookmarks, then your EPUB should remain untouched - or at least that was the hope.