04-29-2011, 12:36 PM | #211 | |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Quote:
Even well-designed databases can become slow(er), but then they are TBs in size. So you're right that well-designed databases are not slow. But there is no entrance test for plugin builders to prove they can build good databases. Your db may be good, but if a plugin builder adds their own info to the db, it can slow down the whole process. And if it is a plugin that runs all the time, calibre is the one that gets blamed (slow program). This combination is why it would be nice to have a second db that can be used by plugin builders and can be queried through the calibre API. |
|
04-29-2011, 01:18 PM | #212 |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
@kiwidude: couldn't get to this until now.
The problem is as Kovid alluded to -- SQLite sucks at database writes, at least on Windows. For each transaction, it creates a new journal file, then deletes it. This process is *slow*. I built a version of the code you gave me that reads all the data at startup and then writes it all back, using 2 new API methods. Running it on my production DB, I get the following times. Note that I set the size to 1 to force a collision group containing all formats. Code:
DO THE HASHING
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.25
Analysed 100 after 0.438000202179
Analysed 150 after 0.648000001907
...
Analysed 2000 after 9.30000019073
Analysed 2050 after 9.51600003242
Analysed 2100 after 9.96700000763
Completed duplicate analysis in: 10.7790000439
Found 0 duplicate groups covering 0 books

RUN AGAIN, USING MAP
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.0350000858307
Analysed 100 after 0.0360000133514
Analysed 150 after 0.0380001068115
...
Analysed 2000 after 0.0520000457764
Analysed 2050 after 0.0520000457764
Analysed 2100 after 0.0529999732971
Completed duplicate analysis in: 1.18799996376

QUIT CALIBRE AND RUN AGAIN WITH MAP TO REDUCE CACHE EFFECTS
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.0409998893738
Analysed 100 after 0.0429999828339
Analysed 150 after 0.0439999103546
...
Analysed 2000 after 0.0569999217987
Analysed 2050 after 0.0569999217987
Analysed 2100 after 0.0579998493195
Completed duplicate analysis in: 1.10799980164
Found 0 duplicate groups covering 0 books
I have submitted the calibre API changes.
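Roughly, the pattern those changes enable looks like the sketch below. This is a minimal illustration only: it assumes db is the calibre database object, that get_all_custom_book_data(name) returns a dict mapping book id to the stored value, and that add_multiple_custom_book_data accepts such a dict; the namespace string, helper names, and use of sha256 are made up for illustration, and a real scan would track per-format data rather than one hash per book. Code:
import hashlib

NAMESPACE = 'find_duplicates:binary_hashes'  # made-up name for illustration

def hash_file(path, chunk_size=1024 * 1024):
    # Hash the format file in chunks so big files need not fit in memory.
    sha = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            sha.update(block)
    return sha.hexdigest()

def hashes_for_candidates(db, candidates):
    # candidates: dict mapping book_id -> path of the format to check.
    # One read for the whole cache instead of one SQLite read per book.
    cache = db.get_all_custom_book_data(NAMESPACE) or {}
    results = {}
    for book_id, path in candidates.items():
        if book_id not in cache:
            cache[book_id] = hash_file(path)
        results[book_id] = cache[book_id]
    # One batched write-back instead of one transaction (and one journal
    # file created and deleted) per record.
    db.add_multiple_custom_book_data(NAMESPACE, cache, delete_first=True)
    return results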
|
04-29-2011, 01:39 PM | #213 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Thx very much for the API changes Charles, I'll give it another attempt when Kovid merges them - looks like you missed the 0.7.58 cutoff. So depending on how long it takes me to sort out all the exemption stuff, I may save it for another release in a week's time.
Obviously I want to give it a thrash on the large library to see the impact. As per my last post on this, the biggest hit in my results is by far the os.stat pass, which this change can't help with. Out of curiosity - did you try restarting your PC and running the check again? Do the numbers come back a lot higher than the 1.1 seconds (i.e. with the os.stat cache effect removed)?
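(For context: the os.stat pass referred to here is the first pass of the scan, which stats every format file to group them by size; only groups where two or more files share a size go on to be hashed, which is why the hash-caching change can't help that pass. A minimal sketch of the idea, not the plugin's actual code - the find_size_collisions helper and its inputs are invented for illustration.) Code:
import os
from collections import defaultdict

def find_size_collisions(format_paths):
    # format_paths: iterable of (book_id, path) pairs, one per format to scan.
    # This is the os.stat pass: every file is stat'ed once per run.
    by_size = defaultdict(list)
    for book_id, path in format_paths:
        try:
            size = os.stat(path).st_size
        except OSError:
            continue  # format file missing on disk; skip it
        by_size[size].append((book_id, path))
    # Only size groups containing more than one file can possibly hold
    # binary duplicates, so only those groups need to be hashed.
    return [group for group in by_size.values() if len(group) > 1]
|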
04-29-2011, 01:56 PM | #214 | ||
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Quote:
I will rerun everything with booting before each run. |
||
04-29-2011, 02:27 PM | #215 |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Results of test including booting.
Test sequence: empty the plugin data table, then do the following twice: boot and wait for quiescence, start calibre, push the dup check button, stop calibre. The test is three times faster with the cache than without it. Code:
First run. Requires hashing all books.

Starting up...
Started up in 2.7619998455
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 1.2009999752
Analysed 100 after 2.23099994659
Analysed 150 after 3.25999999046
...
Analysed 2000 after 44.8190000057
Analysed 2050 after 45.9270000458
Analysed 2100 after 47.4549999237
Completed duplicate analysis in: 70.2000000477
Found 0 duplicate groups covering 0 books

Second run. No books are hashed.

Started up in 2.77799987793
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.202000141144
Analysed 100 after 0.202000141144
Analysed 150 after 0.202000141144
...
Analysed 2000 after 0.249000072479
Analysed 2050 after 0.249000072479
Analysed 2100 after 0.249000072479
Completed duplicate analysis in: 23.368999958
Found 0 duplicate groups covering 0 books
|
04-29-2011, 02:55 PM | #216 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Thx for that Charles, those numbers look much more representative, which is great. That os.stat time is a killer, isn't it?
Multiplying your numbers out over the 7500 formats I was scanning, roughly 3 minutes of hashing time for the first run is what I should expect, plus the os.stat scan time. Hopefully much better than the 24 minutes total from before... I see Kovid has merged the changes now, so I shall have an experiment later. |
04-29-2011, 03:21 PM | #217 | |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Code:
add_multiple_custom_book_data(self, name, vals, delete_first=False)

The get_all_custom_book_data method does what you would expect. For completeness I also added delete_all_custom_book_data(self, name); it removes all records with 'name', regardless of book_id.
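Put together, a plugin might use the three calls along these lines. This is just a sketch: the namespace string is made up, and the assumption that vals is a dict mapping book_id to a stored value is illustrative. Code:
# db is the calibre database object available to a plugin.
name = 'my_plugin:cache'  # made-up namespace for illustration

# Store values for several books in one call (a single transaction).
db.add_multiple_custom_book_data(name, {12: 'abc123', 34: 'def456'},
                                 delete_first=True)

# Read back everything stored under that name as one dict.
cached = db.get_all_custom_book_data(name)  # e.g. {12: 'abc123', 34: 'def456'}

# Remove every record stored under that name, regardless of book_id.
db.delete_all_custom_book_data(name)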
|
04-29-2011, 08:29 PM | #218 |
e-Bibliophile
Posts: 60
Karma: 10
Join Date: Jun 2009
Location: California
Device: Paperwhite 1-3, Kobo AuraHD, Boox Afterglow2
|
I had another thought concerning finding duplicates, something I just ran across while doing some normal cleanup. Some books, when imported, have the series and title information merged together.
So a book such as Dragon Champion by EE Knight, Age of Fire Book [1], can show up as "Age of Fire 1: Dragon Champion" or "Dragon Champion Book 1 in the Age of Fire saga". I know for a fact that the first variant does not show up in the duplicate detector when it exists.

Now, as with the earlier thought about the title/author switch, this is partly a matter of metadata not being clean. However, here the title may not look wrong in a general scan, and it would be difficult for the QC plugin to find something like this. The point is that at some point in the future it may be a good addition to somehow match using both title and series in a scan for duplicates. While not even close to a top priority, it is a way for duplicates to go undetected, and I figured I'd mention it. Of course, anyone can use 'ignore title' to scan through this sort of stuff, but when I personally do that, I come up with over 700 matches.

I'm probably just going overboard with thoughts, considering all the plugins you work on, kiwi, but I'm more than willing to throw out stupid ideas. I am a little OCD about things, and I want my database to be so damn clean that it's impossible without spending extreme time checking the 50,000 books line by line. Up until a few days ago I had managed not to talk on the boards, even though I've been a member of MobileRead for years; now... I can't keep my mouth shut, it seems.

Last edited by mehetabelo; 04-29-2011 at 10:57 PM. Reason: Fixing note
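(To make the suggestion concrete, a series-aware duplicate pass would need some way to strip series decoration out of titles before comparing them. A rough, hypothetical sketch follows - the helper name and the regular expressions are invented and would certainly need tuning.) Code:
import re

def strip_series_from_title(title, series=None):
    # Remove common 'series in the title' decorations so that
    # 'Age of Fire 1: Dragon Champion' and
    # 'Dragon Champion Book 1 in the Age of Fire saga'
    # both reduce to 'dragon champion' before comparison.
    t = title.lower().strip()
    if series:
        s = re.escape(series.lower())
        # Leading 'Series N:' style prefix
        t = re.sub(r'^%s\s*\d*\s*[:\-]\s*' % s, '', t)
        # Trailing 'Book N in the Series saga' style suffix
        t = re.sub(r'\s*(book\s*\d+\s*)?(in|of)\s+the\s+%s(\s+saga)?$' % s, '', t)
    return t.strip()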
04-30-2011, 05:38 AM | #219 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@Mehetabelo - the duplicate finder is not the right plugin for finding books with series in the title. Remember, it is about finding duplicates. What if you only had one book called "Age of Fire 1: Dragon Champion"? None of the duplicate check algorithms could find that.
Quality Check or direct searches are your best bet. Quality Check already has a "Check titles with series" option which would find both of the books above (you can customise it further in the QC options dialog). Yes, it will also bring back other false positives, and it has no exclusion capability (though I might consider adding that in the future). Once you have your titles cleaned up, then run your duplicate check...
04-30-2011, 04:28 PM | #220 |
Junior Member
Posts: 1
Karma: 10
Join Date: Nov 2010
Device: iPad
|
Technically, you're way ahead of me, but I'd like to applaud your effort. Duplicates are a big problem for me and eliminating them is time-consuming. Your work will be very much appreciated here!
|
04-30-2011, 04:54 PM | #221 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Hi Paul, welcome to MobileRead. Your thanks are appreciated. There will be another new version released on the "official" thread here very soon so keep an eye out...
|
04-30-2011, 05:11 PM | #222 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
v1.0.3 Beta
This is a preview before an official 1.1 release. I would greatly appreciate it if a few people could give it a sanity check before I release it in the plugins forum thread. There have been a significant number of internal changes, so I would rather get any problems found here first.
You will need Calibre 0.7.59 as I have incorporated Chaley's additions to the API to speed up subsequent binary comparison scans. Changes in this release compared to 1.0:
Note that you will lose any exemptions you had set previously - I decided it wasn't worth the effort to try to migrate them. If you know from previous runs that a particular algorithm produced no real duplicates in your library, you can just run it again and mark all groups as exempt to store the new exemptions.

Particular thanks to chaley for his code snippets and ideas behind many of the changes. Any feedback is much appreciated.

Last edited by kiwidude; 05-02-2011 at 05:26 AM. Reason: Removed attachment as later version
04-30-2011, 06:25 PM | #223 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
When I did a binary search on my books,
I had set a filter for 218 books (the books that, as you know, are my non-duplicates with the same name). Surprise! Instead of scanning just those books, I got the message: Scanning 8000 books for duplicates.

The script is fast, but it runs against my (heavily loaded) NAS, so this is not a nice bug to have if you hit it often. I just let it run on and will retest with other options and report any new errors if I find them.

By the way, nice GUI change!! |
04-30-2011, 06:30 PM | #224 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Ahh, thx drMerry. The ISBN compare will have that same bug. I'll put a new version up when it's fixed - I know what's causing it.
|
04-30-2011, 06:37 PM | #225 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
On the other hand, I'm going to sleep now; tomorrow I will (hopefully) have a list of all binary dups. So that is one thing I do not have to do again.
(Surprise 2: this bug occurs with 2 options, both of them new. Hmmm, could that be a coincidence?) Status: fixed?
|