01-18-2012, 12:38 PM   #7
Ian_Stott
Quote:
Originally Posted by davidfor
I thought I would compare Orson Scott Card's "Ender's Game" and "Ender's Shadow". When I went to do this, I realised I had three different EPUB versions of Ender's Game, so I tried those first. They weren't very similar: one comparison scored 0.333909 and another 5.43363e-05. I did check the files, and they do contain the same text, but the formatting is very different. Does this mean the comparison includes the HTML code as well as the actual text of the book?
When extracting the text from a document, all of the HTML formatting is removed so that only the plain text is used.
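For illustration, the extraction step works roughly along these lines; this is a minimal Python sketch, not the plugin's actual code:

Code:
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the character data, discarding every tag and attribute."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        # Collapse whitespace so markup-only differences disappear entirely.
        return " ".join(" ".join(self.chunks).split())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

# Two files with identical text but different markup reduce to the same string:
a = extract_text('<p class="x"><b>Ender</b> wins.</p>')
b = extract_text('<div>Ender wins.</div>')
assert a == b == 'Ender wins.'

So two epubs that differ only in markup should compare as identical once the text has been extracted.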

Quote:
Originally Posted by davidfor
As you mentioned the Harry Dresden series, I did a test comparing the later books to the first. The results are similar to the above: with "Euclid" they are all better than 0.9, while with "Tanimoto" the closest is book 11 at 0.0295. I'm a little confused by this, but it probably means I don't understand how to interpret the scores properly.
One implication of using the TF-IDF method to describe the text of a document is that it concentrates importance on the unusual words in a set of documents. The good side of this is that common words (e.g. for, of, him, her) become irrelevant, as they occur in all (English) documents. The downside is that if you are comparing only two documents, any word that appears in both has an IDF of log(2/2) = 0, so only the words that differ between the two carry any weight, and the similarity score is likely to be low (especially the Tanimoto score).
However, if you select the books in the Ender series along with a lot of other sci-fi books (e.g. all of the other books by Orson Scott Card), you will find that the Ender books score far higher against each other.
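To make the scoring concrete, here is a rough sketch of TF-IDF weighting and the Tanimoto (extended Jaccard) coefficient in Python; the plugin's real implementation may differ in its exact weighting and normalisation:

Code:
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {word: weight} dict per doc."""
    n = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within this document
        # idf = log(n / df): a word present in every document gets weight 0,
        # which is why a two-document comparison keeps only the differences.
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def tanimoto(a, b):
    """Extended Jaccard similarity between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values())
    nb = sum(w * w for w in b.values())
    denom = na + nb - dot
    return dot / denom if denom else 0.0

# With only two documents, every shared word zeroes out:
docs = ["the speaker hegemon ender".split(), "the shadow hegemon ender".split()]
va, vb = tfidf_vectors(docs)
print(tanimoto(va, vb))  # 0.0 -- only "speaker" and "shadow" carry weight

Adding more books to the comparison set gives the shared Ender vocabulary a non-zero IDF, which is why the scores improve.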

When I did this with the Orson Scott Card books in my library, using Ender's Game as the target, Ender in Exile came out on top with a Tanimoto score of 0.51, followed by Ender's Shadow at 0.23 and Speaker for the Dead at 0.21.

A more satisfactory approach may be to replace the TF-IDF method with some form of word count from which the common words have been removed. However, this would require a stop-word dictionary for each language. I have been pointed towards some Python-based text-informatics libraries that would help with this, but I didn't want to launch into those for v1.
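Something like the following sketch is what I have in mind; the short hand-picked stop list here is only a stand-in for a proper per-language dictionary (a real one might come from a library such as NLTK):

Code:
from collections import Counter

# Hypothetical stand-in for a language-specific stop-word dictionary.
STOP_WORDS = {"a", "an", "and", "for", "her", "him", "in", "is", "it",
              "of", "on", "that", "the", "to", "was", "with"}

def word_counts(tokens):
    """Count the words of a book with the common words removed."""
    return Counter(t for t in tokens if t not in STOP_WORDS)

def count_similarity(a, b):
    """Jaccard-style overlap between two filtered word-count vectors."""
    shared = sum(min(a[w], b[w]) for w in a if w in b)
    total = sum(a.values()) + sum(b.values()) - shared
    return shared / total if total else 0.0

a = word_counts("the speaker for the dead".split())
b = word_counts("speaker of the dead".split())
print(count_similarity(a, b))  # 1.0 -- both reduce to {speaker: 1, dead: 1}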