Old 01-19-2012, 03:02 PM   #9
Ian_Stott
Junior Member
Posts: 8
Karma: 10
Join Date: Jan 2012
Device: Kindle
Quote:
Originally Posted by davidfor View Post
My problem was that I was comparing three copies of Ender's Game and the scores were almost zero. These were epubs that had come from different sources. I think one was converted from a LIT, I'm not sure about the others. I did a quick scan through them, and the text appeared to be the same.

Just now, I took a copy of one of these epubs, changed the name of the file and the title using Sigil, and added it to calibre. The similarity score for this new book was 0.0000354521. But, apart from one extra word in the metadata, the books are the same.
I have submitted a minor update that contains 2 additional similarity metrics. When comparing 2 or 3 books to see if they are the same, you could try the Tanimoto (binary) method. It only looks at the presence or absence of words in the pair of documents, rather than a weighted word count as the TF-IDF method does, so it should have the following features:
  • The similarity score depends only on the 2 books being compared, not on the whole library under comparison.
  • If two books contain the same set of words, the similarity score will be 1.
  • As word counts are not used, only the presence or absence of each word, this will be a far cruder measure.
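
For anyone curious what a binary Tanimoto score looks like in practice, here is a minimal Python sketch. It is not the plugin's actual code: the function names (word_set, tanimoto_binary), the tokenisation rule and the sample strings are all assumptions made for illustration. The score is the number of distinct words the two texts share, divided by the number of distinct words found in either.

```python
import re


def word_set(text):
    """Return the set of distinct lower-cased words in a text.

    The tokenisation rule here is an assumption for illustration only.
    """
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def tanimoto_binary(text_a, text_b):
    """Binary Tanimoto similarity: shared words / words in either text."""
    a, b = word_set(text_a), word_set(text_b)
    union = a | b
    if not union:
        # Two empty texts: treated as identical here (an assumption).
        return 1.0
    return len(a & b) / len(union)


# Hypothetical sample texts, not taken from any real book file.
book_a = "the enemy's gate is down"
book_b = "the gate is down down down the enemy's gate"  # same words, different counts
book_c = "the enemy's gate is up"                       # one word differs

print(tanimoto_binary(book_a, book_b))
print(tanimoto_binary(book_a, book_c))
```

The first pair scores 1 because repeated words are ignored, which is also why the measure is crude. The second pair shares 4 of 6 distinct words. No other books in the library enter into either calculation.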

Let me know if this helps with determining whether your 3 copies of Ender's Game are the same.