@kiwidude: couldn't get to this until now.
The problem is as Kovid alluded to -- SQLite sucks at database writes, at least on Windows. For each transaction it creates a new journal file, then deletes it, and that process is *slow*.
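Just to illustrate the scale of the difference (this is a standalone sqlite3 sketch with a made-up table, nothing to do with the calibre schema): committing each row separately pays the journal create/delete cost on every write, while a single transaction pays it once.
Code:
import sqlite3
import time

conn = sqlite3.connect('journal_demo.db')  # throwaway database file
conn.execute('CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, val TEXT)')

# One commit (and one journal file create/delete) per row
start = time.time()
for i in range(1000):
    conn.execute('INSERT INTO t (val) VALUES (?)', (str(i),))
    conn.commit()
print('per-row commits: %.3f' % (time.time() - start))

# All rows in a single transaction -- journal overhead paid once
start = time.time()
for i in range(1000):
    conn.execute('INSERT INTO t (val) VALUES (?)', (str(i),))
conn.commit()
print('single commit:   %.3f' % (time.time() - start))

conn.close()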
I built a version of the code you gave me that reads all the cached data once at startup and then writes it all back in one go, using two new API methods (get_all_custom_book_data and add_multiple_custom_book_data). Running it on my production DB, I get the following times. Note that I set the size to 1 to force a single collision group containing all formats, so every file gets hashed.
Code:
DO THE HASHING
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.25
Analysed 100 after 0.438000202179
Analysed 150 after 0.648000001907
...
Analysed 2000 after 9.30000019073
Analysed 2050 after 9.51600003242
Analysed 2100 after 9.96700000763
Completed duplicate analysis in: 10.7790000439
Found 0 duplicate groups covering 0 books
RUN AGAIN, USING MAP
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.0350000858307
Analysed 100 after 0.0360000133514
Analysed 150 after 0.0380001068115
...
Analysed 2000 after 0.0520000457764
Analysed 2050 after 0.0520000457764
Analysed 2100 after 0.0529999732971
Completed duplicate analysis in: 1.18799996376
QUIT CALIBRE AND RUN AGAIN WITH MAP TO REDUCE CACHE EFFECTS
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.0409998893738
Analysed 100 after 0.0429999828339
Analysed 150 after 0.0439999103546
...
Analysed 2000 after 0.0569999217987
Analysed 2050 after 0.0569999217987
Analysed 2100 after 0.0579998493195
Completed duplicate analysis in: 1.10799980164
Found 0 duplicate groups covering 0 books
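The reason the map runs are so much faster is that each format only has to be hashed once: the cached record keeps the file's mtime alongside its SHA-256, and if the mtime still matches, the file is never re-read. Each entry the code below stores under the 'find_duplicates' key looks like this (illustrative values):
Code:
# book_id -> format -> cached hash record (values below are made up)
hash_map = {
    123: {
        'EPUB': {
            'mtime': 1299876543.0,          # modification time when the hash was taken
            'sha': '9f86d081884c7d65...',   # sha256 hex digest of the format file
            'size': 524288,                 # file size in bytes
        },
    },
}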
The changed code is under the spoiler.
Spoiler:
Code:
def find_candidates(self, book_ids):
    '''
    Override the default implementation so we can do multiple passes as a more
    efficient approach to finding binary duplicates.
    '''
    # Our first pass will be to find all books that have an identical file size
    candidates_size_map = defaultdict(set)
    formats_count = 0
    for book_id in book_ids:
        formats_count += self.find_candidate_by_file_size(book_id, candidates_size_map)
    # Perform a quick pass through removing all groups with < 2 members
    self.shrink_candidates_map(candidates_size_map)
    if DEBUG:
        prints('Pass 1: %d formats created %d size collisions' % (formats_count, len(candidates_size_map)))
    # Our final pass is to build our result set for this function
    candidates_map = defaultdict(set)
    hash_count = 0
    start = time.time()
    hash_map = self.db.get_all_custom_book_data('find_duplicates', default={})
    result_hash_map = {}
    for size, size_group in candidates_size_map.iteritems():
        for book_id, fmt, fmt_path, mtime in size_group:
            self.find_candidate_by_hash(book_id, fmt, fmt_path, mtime, size,
                                         candidates_map, hash_map, result_hash_map)
            hash_count += 1
            if hash_count % 50 == 0:
                prints('Analysed %d after '%hash_count, time.time() - start)
    self.db.add_multiple_custom_book_data('find_duplicates', result_hash_map)
    return candidates_map

def find_candidate_by_file_size(self, book_id, candidates_map):
    formats = self.db.formats(book_id, index_is_id=True, verify_formats=False)
    count = 0
    for fmt in formats.split(','):
        fmt_path = self.db.format_abspath(book_id, fmt, index_is_id=True)
        if fmt_path:
            try:
                stats = os.stat(fmt_path)
                mtime = stats.st_mtime
                size = stats.st_size
                candidates_map[size].add((book_id, fmt, fmt_path, mtime))
                count += 1
            except:
                traceback.print_exc()
    return count

def add_to_hash_map(self, hash_map, book_id, fmt, book_data):
    if book_id not in hash_map:
        hash_map[book_id] = {}
    hash_map[book_id][fmt] = book_data

def find_candidate_by_hash(self, book_id, fmt, fmt_path, mtime, size,
                           candidates_map, hash_map, result_hash_map):
    # Work out whether we can skip hashing this file by reusing the
    # book plugin data cached from a previous run
    book_data = hash_map.get(book_id, {}).get(fmt, {})
    if book_data.get('mtime', None) == mtime:
        sha = book_data.get('sha', None)
        size = book_data.get('size', None)
        if sha and size:
            candidates_map[(sha, size)].add(book_id)
            self.add_to_hash_map(result_hash_map, book_id, fmt, book_data)
            return
    try:
        with open(fmt_path, 'rb') as f:
            content = f.read()
        sha = hashlib.sha256()
        sha.update(content)
        hash = (sha.hexdigest(), size)
        candidates_map[hash].add(book_id)
        # Store our plugin book data for future repeat scanning
        book_data['mtime'] = mtime
        book_data['sha'] = sha.hexdigest()
        book_data['size'] = size
        self.add_to_hash_map(result_hash_map, book_id, fmt, book_data)
    except:
        traceback.print_exc()
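One possible tweak, since find_candidate_by_hash currently reads each format into memory in one go: hashing in chunks would keep memory flat for very large files. A minimal sketch (the chunk size is arbitrary):
Code:
import hashlib

def hash_file(fmt_path, chunk_size=1024 * 1024):
    # Stream the file through sha256 rather than slurping it all at once
    sha = hashlib.sha256()
    with open(fmt_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha.update(chunk)
    return sha.hexdigest()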
I have submitted the calibre API changes.