08-17-2020, 04:49 AM   #38
JurriaanK
Junior Member
Posts: 8
Karma: 10
Join Date: Jul 2019
Device: boox note pro, auro h2o
I recently started de-duplicating my library with a newly discovered tool that compares two text files, A and B, against each other and gives the percentage of text A occurring in text B.

The Debian package is called similarity-tester. The software was written in 1989 and lives here: https://dickgrune.com/Programs/similarity_tester/

Since I don't regularly use Calibre and only nibble at this forum very occasionally, I write this here, because your plugin seems to me to have nearly all aspects available to use this:

- convert books to text
- run external program
- do something with the result

and I've found no other mention of sim_text in combination with Calibre.

A couple of points I found when using it:

- it takes time to run. 3000 files on an Intel J1900 take about 60 seconds, and there is no provision for a progress indicator. One could be added relatively simply, of course. There are three main loops: reading files, hashing files & comparing hashes.
- it runs on a single core. If you split the file list and run the combinations of the split sections on multiple cores, it runs faster - if you have enough memory.
- it also takes memory to run. 3000 files use about 2 GiB of memory.
- some patches to make it compile cleanly exist in Debian's bug tracker.
- the best way to run it is to feed it a list of files (-i parameter), then parse the output and, if something is found, run the comparison for those two files in reverse (since if 80% of A occurs in B, maybe B is the 'extended edition' with a short story added, or something like that).
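
To make that last point a bit more concrete, here is a rough Python sketch of the "parse the output, then check the reverse direction" step. It is only an illustration under assumptions: I assume the percentage output (-p) prints lines of the form "A consists for 80 % of B material" and that -i reads the file names from standard input, so check both against your installed version. The function names and the threshold of 80 are made up for the example.

```python
import re
import subprocess

# ASSUMPTION: one line of `sim_text -p` output looks like
#   "a.txt consists for 80 % of b.txt material"
# Verify this against your installed version before relying on it.
LINE_RE = re.compile(
    r"^(?P<a>.+?) consists for (?P<pct>\d+) % of (?P<b>.+) material$"
)


def run_sim_text(files):
    """Run sim_text over a list of files and return its raw output.

    ASSUMPTION: -p selects percentage output and -i makes sim_text
    read the file names from standard input, one per line.
    """
    proc = subprocess.run(
        ["sim_text", "-p", "-i"],
        input="\n".join(files) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout


def parse_percentages(output):
    """Return {(a, b): pct}, meaning pct % of file a occurs in file b."""
    pairs = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # silently skip headers and anything unrecognised
            pairs[(match["a"], match["b"])] = int(match["pct"])
    return pairs


def reverse_checks_needed(pairs, threshold=80):
    """List the (b, a) comparisons still to be run in reverse.

    A pair qualifies when a occurs in b for at least `threshold` percent
    and the opposite direction was not already reported.
    """
    todo = []
    for (a, b), pct in pairs.items():
        if pct >= threshold and (b, a) not in pairs:
            todo.append((b, a))
    return todo
```

Each (b, a) pair returned by reverse_checks_needed can then be handed to a second sim_text run on just those two files. If both directions come back high, the books are probably true duplicates; if only one does, the larger one is likely the 'extended edition'.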

So maybe someone can use this. I notice that detecting similar books is a regularly occurring question in Calibre, and this is a foolproof method.