@kiwidude: couldn't get to this until now.
The problem is as Kovid alluded to -- SQLite sucks at database writes, at least on Windows. For each transaction it creates a new journal file, then deletes it, and that process is *slow*.
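Just to illustrate the scale of the difference (this is a standalone sqlite3 sketch with a made-up table, nothing to do with the calibre schema): committing each row separately pays the journal create/delete cost on every write, while a single transaction pays it once.
Code:
import sqlite3
import time

conn = sqlite3.connect('journal_demo.db')  # throwaway database file
conn.execute('CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, val TEXT)')

# One commit (and one journal file create/delete) per row
start = time.time()
for i in range(1000):
    conn.execute('INSERT INTO t (val) VALUES (?)', (str(i),))
    conn.commit()
print('per-row commits: %.3f' % (time.time() - start))

# All rows in a single transaction -- journal overhead paid once
start = time.time()
for i in range(1000):
    conn.execute('INSERT INTO t (val) VALUES (?)', (str(i),))
conn.commit()
print('single commit:   %.3f' % (time.time() - start))

conn.close()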
I built a version of the code you gave me that reads all the cached data once at startup and then writes it all back in one go, using two new API methods (get_all_custom_book_data and add_multiple_custom_book_data). Running it on my production DB, I get the following times. Note that I set the size to 1 to force a single collision group containing all formats, so every file gets hashed.
Code:
DO THE HASHING
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.25
Analysed 100 after 0.438000202179
Analysed 150 after 0.648000001907
...
Analysed 2000 after 9.30000019073
Analysed 2050 after 9.51600003242
Analysed 2100 after 9.96700000763
Completed duplicate analysis in: 10.7790000439
Found 0 duplicate groups covering 0 books
RUN AGAIN, USING MAP
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.0350000858307
Analysed 100 after 0.0360000133514
Analysed 150 after 0.0380001068115
...
Analysed 2000 after 0.0520000457764
Analysed 2050 after 0.0520000457764
Analysed 2100 after 0.0529999732971
Completed duplicate analysis in: 1.18799996376
QUIT CALIBRE AND RUN AGAIN WITH MAP TO REDUCE CACHE EFFECTS
Pass 1: 2100 formats created 1 size collisions
Analysed 50 after 0.0409998893738
Analysed 100 after 0.0429999828339
Analysed 150 after 0.0439999103546
...
Analysed 2000 after 0.0569999217987
Analysed 2050 after 0.0569999217987
Analysed 2100 after 0.0579998493195
Completed duplicate analysis in: 1.10799980164
Found 0 duplicate groups covering 0 books
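The reason the map runs are so much faster is that each format only has to be hashed once: the cached record keeps the file's mtime alongside its SHA-256, and if the mtime still matches, the file is never re-read. Each entry the code below stores under the 'find_duplicates' key looks like this (illustrative values):
Code:
# book_id -> format -> cached hash record (values below are made up)
hash_map = {
    123: {
        'EPUB': {
            'mtime': 1299876543.0,          # modification time when the hash was taken
            'sha': '9f86d081884c7d65...',   # sha256 hex digest of the format file
            'size': 524288,                 # file size in bytes
        },
    },
}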
The changed code is under the spoiler.
Spoiler:
Code:
def find_candidates(self, book_ids):
    '''
    Override the default implementation so we can do multiple passes as a more
    efficient approach to finding binary duplicates.
    '''
    # Our first pass will be to find all books that have an identical file size
    candidates_size_map = defaultdict(set)
    formats_count = 0
    for book_id in book_ids:
        formats_count += self.find_candidate_by_file_size(book_id, candidates_size_map)
    # Perform a quick pass through removing all groups with < 2 members
    self.shrink_candidates_map(candidates_size_map)
    if DEBUG:
        prints('Pass 1: %d formats created %d size collisions' % (formats_count, len(candidates_size_map)))
    # Our final pass is to build our result set for this function
    candidates_map = defaultdict(set)
    hash_count = 0
    start = time.time()
    hash_map = self.db.get_all_custom_book_data('find_duplicates', default={})
    result_hash_map = {}
    for size, size_group in candidates_size_map.iteritems():
        for book_id, fmt, fmt_path, mtime in size_group:
            self.find_candidate_by_hash(book_id, fmt, fmt_path, mtime, size,
                                         candidates_map, hash_map, result_hash_map)
            hash_count += 1
            if hash_count % 50 == 0:
                prints('Analysed %d after '%hash_count, time.time() - start)
    self.db.add_multiple_custom_book_data('find_duplicates', result_hash_map)
    return candidates_map

def find_candidate_by_file_size(self, book_id, candidates_map):
    formats = self.db.formats(book_id, index_is_id=True, verify_formats=False)
    count = 0
    for fmt in formats.split(','):
        fmt_path = self.db.format_abspath(book_id, fmt, index_is_id=True)
        if fmt_path:
            try:
                stats = os.stat(fmt_path)
                mtime = stats.st_mtime
                size = stats.st_size
                candidates_map[size].add((book_id, fmt, fmt_path, mtime))
                count += 1
            except:
                traceback.print_exc()
    return count

def add_to_hash_map(self, hash_map, book_id, fmt, book_data):
    if book_id not in hash_map:
        hash_map[book_id] = {}
    hash_map[book_id][fmt] = book_data

def find_candidate_by_hash(self, book_id, fmt, fmt_path, mtime, size,
                           candidates_map, hash_map, result_hash_map):
    # Work out whether we can skip hashing this file by reusing the
    # book plugin data cached from a previous run
    book_data = hash_map.get(book_id, {}).get(fmt, {})
    if book_data.get('mtime', None) == mtime:
        sha = book_data.get('sha', None)
        size = book_data.get('size', None)
        if sha and size:
            candidates_map[(sha, size)].add(book_id)
            self.add_to_hash_map(result_hash_map, book_id, fmt, book_data)
            return
    try:
        with open(fmt_path, 'rb') as f:
            content = f.read()
        sha = hashlib.sha256()
        sha.update(content)
        hash = (sha.hexdigest(), size)
        candidates_map[hash].add(book_id)
        # Store our plugin book data for future repeat scanning
        book_data['mtime'] = mtime
        book_data['sha'] = sha.hexdigest()
        book_data['size'] = size
        self.add_to_hash_map(result_hash_map, book_id, fmt, book_data)
    except:
        traceback.print_exc()
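One possible tweak, since find_candidate_by_hash currently reads each format into memory in one go: hashing in chunks would keep memory flat for very large files. A minimal sketch (the chunk size is arbitrary):
Code:
import hashlib

def hash_file(fmt_path, chunk_size=1024 * 1024):
    # Stream the file through sha256 rather than slurping it all at once
    sha = hashlib.sha256()
    with open(fmt_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha.update(chunk)
    return sha.hexdigest()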
I have submitted the calibre API changes.