MobileRead Forums - View Single Post - [GUI Plugin] Power (Full-text) Search

JurriaanK · 08-17-2020, 04:49 AM

I recently started de-duplicating my library with a newly discovered tool that compares two text files, A and B, against each other and gives the percentage of text A occurring in text B.

The Debian package is called similarity-tester, the software was written in 1989 and lives here: https://dickgrune.com/Programs/similarity_tester/

Since I don't regularly use Calibre and only nibble at this forum very occasionally, I'm writing this here because your plugin seems to me to have nearly all the pieces needed to make use of it:

- convert books to text
- run external program
- do something with the result

and I've found no other mention of sim_text in combination with Calibre.
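
The three steps above can be sketched roughly as follows. This is only an illustration, not how the plugin works: it assumes calibre's ebook-convert command-line tool and sim_text are both on the PATH, the helper names (convert_command, sim_text_command, run_pipeline) are made up for this post, and the -p (percentage output) flag is taken from my reading of the sim man page, so check it against your installed version.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def convert_command(book: Path, text_dir: Path, index: int) -> List[str]:
    """Build the ebook-convert call that writes a plain-text copy of one book.

    The index prefix keeps output names unique, since duplicate books
    often share the same file name.
    """
    target = text_dir / f"{index:05d}_{book.stem}.txt"
    return ["ebook-convert", str(book), str(target)]


def sim_text_command() -> List[str]:
    """Build the sim_text call.

    -i: read the file names from standard input (the parameter named in the post)
    -p: report similarity as percentages (assumption, per the sim man page)
    """
    return ["sim_text", "-p", "-i"]


def run_pipeline(books: Iterable[Path], text_dir: Path) -> str:
    """Convert every book to text, run sim_text once, return its raw output."""
    text_dir.mkdir(parents=True, exist_ok=True)
    text_files = []
    for index, book in enumerate(books):
        cmd = convert_command(book, text_dir, index)
        subprocess.run(cmd, check=True)  # step 1: convert book to text
        text_files.append(cmd[-1])
    # step 2: run the external program, one file name per line on stdin
    result = subprocess.run(
        sim_text_command(),
        input="\n".join(text_files) + "\n",
        capture_output=True,
        text=True,
    )
    # step 3: the caller does something with the result
    return result.stdout
```

Something like run_pipeline(sorted(Path("library").rglob("*.epub")), Path("/tmp/sim_texts")) would then hand back the raw sim_text output for further processing; both paths are placeholders.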

A couple of points I found when using it:

- it takes time to run. 3000 files on an Intel J1900 take about 60 seconds, and there is no provision for a progress indicator. One can be added relatively simply, of course. There are three main loops: reading files, hashing files & comparing hashes.
- it runs on a single core. If you split the filelist and run permutations of the split sections on multiple cores, it runs faster - if you have enough memory.
- it also takes memory to run. 3000 files use about 2 GiB of memory.
- some patches to make it compile cleanly exist in Debian's bug tracker.
- the best way to run it is to feed it a list of files (-i parameter), then parse the output and if something is found, run the comparison for those single files in reverse (since if A occurs for 80% in B, maybe B is the 'extended edition' with a short story added, or something like that).
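
Two of the points above lend themselves to a short sketch: splitting the file list into sections so that section pairs can be run on separate cores, and parsing the output to decide which pairs deserve a reverse check. Everything here is an assumption for illustration: the function names are made up, the 80% threshold is just the figure from the example, and the regular expression assumes percentage lines of the form "A consists for 80 % of B material", which is how I read the sim documentation; adjust it to whatever your version prints.

```python
import itertools
import re
from typing import Dict, List, Tuple

# Assumed shape of one line of percentage output:
#   "<file A> consists for <N> % of <file B> material"
LINE_RE = re.compile(r"^(?P<a>.+) consists for (?P<pct>\d+) % of (?P<b>.+) material$")


def split_sections(files: List[str], n: int) -> List[List[str]]:
    """Deal the file list out into n sections of roughly equal size."""
    return [files[i::n] for i in range(n)]


def section_pairs(n: int) -> List[Tuple[int, int]]:
    """Every unordered pair of sections, including a section with itself.

    Any two files end up together in at least one of these runs,
    and each run can go to its own core.
    """
    return list(itertools.combinations_with_replacement(range(n), 2))


def parse_percentages(output: str) -> Dict[Tuple[str, str], int]:
    """Map (A, B) to the reported percentage of A that occurs in B."""
    found = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the assumed shape are skipped
            found[(match["a"], match["b"])] = int(match["pct"])
    return found


def reverse_checks(
    found: Dict[Tuple[str, str], int], threshold: int = 80
) -> List[Tuple[str, str]]:
    """Pairs (B, A) still to be compared because A scored high against B
    and the opposite direction has not been reported yet."""
    todo = []
    for (a, b), pct in found.items():
        if pct >= threshold and (b, a) not in found:
            todo.append((b, a))
    return todo
```

If A scores high against B but B scores low against A, B is probably the longer edition; if both directions score high, the two files are most likely plain duplicates.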

So, maybe someone can use this. I notice that detecting similar books is a regularly occurring question about Calibre, and this is a foolproof method.
