Old 01-18-2012, 10:16 PM   #8
davidfor
Grand Sorcerer
davidfor ought to be getting tired of karma fortunes by now.
 
Posts: 24,907
Karma: 47303748
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo: Touch, Glo, Aura H2O, Glo HD, Aura ONE, Clara HD, Libra H2O; tolino epos
Quote:
Originally Posted by Ian_Stott
When extracting the text from a document, all of the HTML formatting is removed so that only the plain text is used.
I found this when I actually read the manual.

Quote:
One of the implications of using the TF-IDF method for describing the text of a document is that it focuses its importance upon the unusual words in a set of documents. The good side of this is that common words (e.g. for, of, him, her) become irrelevant, as they occur in all (English) documents. The downside is that if you are comparing only two documents, only the words that differ between them will count, so the similarity score is likely to be low (especially for the Tanimoto score).
However, if you select the books in the Ender series as well as a lot of other sci-fi books (e.g. all the other books by Orson Scott Card), then you will find that the Ender books score far higher.

When I did this with the Orson Scott Card books in my library, using Ender's Game as the target, Ender in Exile came out on top with a Tanimoto score of 0.51, Ender's Shadow at 0.23 and Speaker at 0.21.
My problem was that I was comparing three copies of Ender's Game and the scores were almost zero. These were epubs that had come from different sources. I think one was converted from a LIT; I'm not sure about the others. I did a quick scan through them, and the text appeared to be the same.

Just now, I took a copy of one of these epubs, changed the name of the file and the title using Sigil, and added it to calibre. The similarity score for this new book was 0.0000354521. But apart from one extra word in the metadata, the books should be the same.

Some of the other tests are OK. Comparing "Speaker for the Dead" with "Xenocide" and "Children of the Mind" gave 0.24976 and 0.335613 respectively. Those scores make sense. But my comparison of Ender's Game with Speaker and Shadow gives zero for both of them.

And now I am a little bit more baffled. I get different results if I compare two books than if I compare more. Comparing all the above books individually to Game gave a zero score for each. But comparing them all at the same time gave scores between 0.001 and 0.094. I thought the comparison was always to the first selected book.
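Having thought about it, I suspect the IDF term explains both oddities: with N documents in the comparison set, a word that occurs in all N gets weight log(N/N) = 0, so with only two books every shared word is zeroed out. A toy sketch (my own guess at the maths, not the plugin's actual code) reproduces the effect:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for a list of tokenised documents.
    IDF = log(N / df): a word present in every document gets weight 0."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each word once per doc
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def tanimoto(a, b):
    """Tanimoto (extended Jaccard) similarity of two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sum(v * v for v in a.values())
    nb = sum(v * v for v in b.values())
    denom = na + nb - dot
    return dot / denom if denom else 0.0

text = "the speaker for the dead".split()

# Two identical documents: every word occurs in both, so IDF = log(2/2) = 0
pair = tfidf_vectors([text, text])
print(tanimoto(pair[0], pair[1]))  # 0.0

# Add a third, different document and the shared words regain weight
v = tfidf_vectors([text, text, "children of the mind".split()])
print(tanimoto(v[0], v[1]))  # 1.0
```

That would match what I saw: two identical copies score near zero on their own, but scores climb as soon as more books are added to the selection.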

Quote:
A more satisfactory approach may be to replace the TF-IDF method with some form of word count where the common words have been removed. However, this would require a dictionary that is language-based. I have been pointed towards some Python-based text informatics libraries that would help with this - but I didn't want to launch into these for v1.
That is the problem with this sort of thing. There are lots of different ways to do it and you have to decide on one. Taking the simple approach at this point makes a lot of sense.