Old 01-17-2012, 11:31 PM   #3
davidfor
An interesting idea, and I couldn't resist playing with it. The first few I tried were some short stories in a series downloaded from the same site. These gave scores between 0.2 and 0.8, which seemed reasonable. But then I thought I would compare Orson Scott Card's "Ender's Game" and "Ender's Shadow". When I went to do this, I realised I had three different epub versions of Ender's Game, so I tried them first. They weren't very similar: compared with the first version, one scored 0.333909 and the other 5.43363e-05. I did check the files, and they do contain the same text, but the formatting is very different. Does this mean the comparison includes the HTML code as well as the actual text of the book?
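To make the question concrete, here is a minimal sketch of how markup could drag down a set-based score if it is not stripped first. This is only an illustration under assumptions: the tokenizer, the `strip_tags` helper, the sample sentence and the set-based form of the Tanimoto coefficient are all made up for the example, and nothing here is taken from the tool's actual code.

```python
import re

def tokens(text):
    """Lowercase word tokens. Tag and attribute names count as words too."""
    return set(re.findall(r"[a-z']+", text.lower()))

def strip_tags(html):
    """Crude tag removal (hypothetical helper, good enough for a demo)."""
    return re.sub(r"<[^>]+>", " ", html)

def tanimoto(a, b):
    """Set-based Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Same invented sentence, wrapped in two different styles of markup.
version_a = '<p class="calibre1">The boy looked at the monitor and said nothing.</p>'
version_b = ('<div><span style="font-size: medium">'
             'The boy looked at the monitor and said nothing.</span></div>')

# Comparing the raw HTML: markup words inflate the union of the two sets.
raw_score = tanimoto(tokens(version_a), tokens(version_b))

# Comparing only the visible text: the two versions are identical.
text_score = tanimoto(tokens(strip_tags(version_a)),
                      tokens(strip_tags(version_b)))

print(f"raw HTML:  {raw_score:.3f}")
print(f"text only: {text_score:.3f}")
```

In this toy case the text-only score is 1.0 while the raw-HTML score is well below it, which is the kind of gap that different formatting of the same book could produce if markup is included in the comparison.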

And for completeness, the score for "Ender's Shadow" was zero when compared to "Ender's Game". As the two books are the same story from a different viewpoint, I expected something a little closer.

After writing the above, I remembered there was a choice for the algorithm. The above was using "Tanimoto". I tried them again with "Euclid":

Game with Shadow: 0.997362
The three versions of Game: 0.999999 and 0.999498.
The short stories: between 0.94 and 0.99

Those scores look better, but I would almost think they are too close (except for the versions of Game). Do you have a reference for what the algorithms do?

Added a bit later:

OK, I opened the help and found the info on the algorithms. I can see the definitions, but I'll have to think about them a bit. As you mentioned the Harry Dresden series, I did a test comparing the other books to the first one. The results are similar to those above. With "Euclid", they are all better than 0.9. With "Tanimoto", the closest is book 11 at 0.0295. I'm a little confused by this, but it probably means that I don't understand how to interpret the scores properly.
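One possible reason for the gap between the two measures, sketched under assumptions: if each book is reduced to a vector of relative word frequencies, and the Euclidean distance is turned into a similarity with something like 1 / (1 + distance), then even unrelated books end up close to 1, because every individual frequency is tiny. The vector form of the Tanimoto coefficient on the same data has no such floor. The `relative_freqs` and `euclid_similarity` helpers and the 1 / (1 + distance) conversion are guesses for illustration, not the definitions from the help file.

```python
import math
from collections import Counter

def relative_freqs(words):
    """Map each word to its share of the total word count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def tanimoto(a, b):
    """Vector Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = sum(v * v for v in a.values())
    norm_b = sum(v * v for v in b.values())
    denom = norm_a + norm_b - dot
    return dot / denom if denom else 1.0

def euclid_similarity(a, b):
    """Assumed conversion of Euclidean distance to a 0..1 similarity."""
    keys = set(a) | set(b)
    dist = math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))
    return 1.0 / (1.0 + dist)

# Two made-up "books" of 100 distinct words each, sharing no words at all.
book_a = relative_freqs([f"a{i}" for i in range(100)])
book_b = relative_freqs([f"b{i}" for i in range(100)])

tanimoto_score = tanimoto(book_a, book_b)
euclid_score = euclid_similarity(book_a, book_b)

print(f"Tanimoto: {tanimoto_score:.4f}")
print(f"Euclid:   {euclid_score:.4f}")
```

With completely disjoint vocabularies the Tanimoto score is 0, yet the Euclid-style score is still above 0.85, and it creeps closer to 1 as the vocabulary grows. If the tool does something along these lines, Euclid scores above 0.9 would say very little, and only the differences in the later decimal places would carry information.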

Last edited by davidfor; 01-18-2012 at 12:05 AM. Reason: A little more experimenting and reading