Old 01-17-2012, 03:05 PM   #1
Ian_Stott
Junior Member
Ian_Stott began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Jan 2012
Device: Kindle
[GUI Plugin] Find Similar Stories

This plug-in helps you to find other books within your Calibre library that are similar to your target book. It does this by examining the full text of the books in your library, as opposed to using tags or other metadata.

Main Features:
  • Indexes all selected books and compares their word "fingerprint" with the target book.
  • Adds the similarity score to a user-selected "Custom Column".
  • Full HTML help available through the "Help about plugin" option.

Version History:
  • v1.0.0
    • Initial Release
  • v1.0.5
    • Updated to add two new similarity measures:
      • Cosine : calculates the cosine of the angle between the word vectors for a pair of books.
      • Tanimoto (binary) : uses the Tanimoto similarity metric on a binary fingerprint representation of a book, i.e. noting just the presence or absence of a word within a book, as opposed to its frequency of occurrence.
  • v1.0.53
  • v1.0.57
    • Updated to fix a bug that caused an incompatibility with the find_duplicates plugin.


Special Notes:
  • Requires Calibre v0.8.17 or later.
  • Currently only uses books in MOBI or EPUB formats. You can select which format is the preferred choice.
  • There is full documentation within the plug-in of each of the methods used to assess the similarity of two books. The methods used are described in detail or references provided, should you wish to examine them further.

    To view the documentation, select the "Help about plugin" menu option once the plug-in is installed.
  • Methods currently implemented are:
    • Tanimoto
    • Euclid
    • Cosine
    • Tanimoto (binary)
    • PMRA (PubMed related articles)
    See the Help for details on each method.

    Let me know if you would like a particular method implemented (along with the suitable source material).
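
As a rough illustration of two of the measures listed above, here is a minimal sketch of cosine and Tanimoto similarity on plain word-count vectors. It is not the plug-in's code: the function names and the whitespace tokenisation are assumptions made for the example, and the plug-in's Help remains the reference for how each method is actually implemented (Euclid and PMRA are left out because this thread does not describe them).

```python
import math
from collections import Counter


def word_counts(text):
    """Hypothetical helper: lower-case the text and count whitespace-separated words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tanimoto(a, b):
    """Tanimoto similarity for real-valued vectors: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    denom = sum(v * v for v in a.values()) + sum(v * v for v in b.values()) - dot
    return dot / denom if denom else 0.0


first = word_counts("the cat sat")
second = word_counts("the cat ran")
print(cosine(first, second))    # about 0.667
print(tanimoto(first, second))  # 0.5
```

Both measures give 1 for identical vectors and 0 for books with no words in common, which matches the score interpretation given in the usage notes.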


Installation & Usage:
  • Download the attached zip file and install the plugin/add to context menu or toolbar/restart Calibre as described in the Introduction to plugins thread.
  • The first time you use the plug-in, you need to identify a "Custom Column" that will hold the results.
    • If you have not already created a suitable Custom Column, do this in the usual way by using the "Add your own columns" dialogue box (found by right-clicking on a column heading in the main view).
    • Select a column type of "floating point numbers" and leave the "Format for numbers" section empty.
    • The first time you run the plug-in, or when you configure it, you will be able to select this column to hold your results.
  • Select the target book. All other selected books will be compared to this one.
  • Add to your selection all of the other books that you wish to compare to your target book. If you are having trouble doing this while ensuring that the target book is the first selected, you can edit the Similarity score of your target book (in your selected custom column) so that it is set to 1, then sort by the Similarity score.
  • Run the plug-in
  • Once it has run, it will ask if you want the results loaded up into your selected custom column. You will have to respond with "Yes" to see the results.
  • The results will appear in your selected Custom Column. The higher the score, the better the match that book is to the target. A similarity score of 1 means that the book is a perfect match (and probably the same book). This means that the target book will have a score of 1.
  • I find it easiest to sort the books by Similarity score, with the highest (most similar) at the top.
  • Full help is available through the "Help about plugin" option from the plugin menu.
Attached Files
File Type: zip similar_stories_plugin - v158.zip (55.7 KB, 41337 views)

Last edited by Ian_Stott; 03-08-2012 at 02:45 PM. Reason: V1.0.47, fixed bug leading to incompatibility with find_duplicates plug-in
Old 01-17-2012, 09:05 PM   #2
desideria
Member
desideria began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Mar 2011
Device: none
I tried to install it and I got this message:

calibre, version 0.8.35
ERROR: Unhandled exception: <b>AttributeError</b>:'module' object has no attribute 'CalculateSimilarityAction'

Traceback (most recent call last):
File "site-packages\calibre\gui2\preferences\plugins.py", line 299, in add_plugin
File "site-packages\calibre\gui2\preferences\plugins.py", line 387, in check_for_add_to_toolbars
File "site-packages\calibre\customize\__init__.py", line 543, in load_actual_plugin
AttributeError: 'module' object has no attribute 'CalculateSimilarityAction'
Old 01-18-2012, 12:31 AM   #3
davidfor
Grand Sorcerer
davidfor ought to be getting tired of karma fortunes by now.
 
Posts: 6,089
Karma: 6238033
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo Touch, Kobo Glo
An interesting idea and I couldn't resist playing with it. The first few I tried were some short stories in a series downloaded from the same site. These gave scores between 0.2 and 0.8, which seemed reasonable. But then I thought I would compare Orson Scott Card's "Ender's Game" and "Ender's Shadow". When I went to do this, I realised I had three different epub versions of Ender's Game, so I tried them first. They weren't very similar. One scored 0.333909 and the other "5.43363e-05". I did check the files and they do contain the same text, but the formatting is very different. Does this mean the comparison includes the HTML code as well as the actual text of the book?

And for completeness, the score for "Ender's Shadow" was zero when compared to "Ender's Game". As the two books are the same story from a different viewpoint, I expected something a little closer.

After writing the above, I remembered there was a choice for the algorithm. The above was using "Tanimoto". I tried them again with "Euclid":

Game with Shadow: 0.997362
The three versions of Game: 0.999999 and 0.999498.
The short stories: between 0.94 and 0.99

Those scores look better but I would almost think they are too close (except the versions of Game). Do you have a reference for what the algorithms do?

Added a bit later:

Ok, I opened the help and found the info on the algorithms. I can see the definitions but I'll have to think about them a bit. As you mentioned the Harry Dresden series, I did a test comparing them to the first book. The results are similar to above. With "Euclid", they are all better than 0.9. With "Tanimoto" the closest is book 11 at 0.0295. I'm a little confused on this but it probably means that I don't understand how to interpret the scores properly.

Last edited by davidfor; 01-18-2012 at 01:05 AM. Reason: A little more experimenting and reading
Old 01-18-2012, 12:39 AM   #4
davidfor
Grand Sorcerer
 
Posts: 6,089
Karma: 6238033
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo Touch, Kobo Glo
And putting on my application developer hat and being a bit nit-picky:

When I worked out what the plugin did (I didn't actually read the full post before trying it), the menu item "Find Similar Books..." started to bug me. It doesn't actually do this. What it does is "Calculate Similarity Scores". To do what your first paragraph states, and the menu item implies, the plugin would need to do the calculation on every book in the library and display a list of those with scores greater than a defined amount.

A "Reset Scores" option would be a good idea. If you calculate the similarity for one set of books and then calculate it for a different set, the two sets of numbers don't have any relationship with each other. This could be a menu option or a setting to clear all the scores before doing a calculation.

Last edited by davidfor; 01-18-2012 at 01:16 AM. Reason: an extra suggestion
Old 01-18-2012, 01:08 PM   #5
Ian_Stott
Junior Member
Ian_Stott began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Jan 2012
Device: Kindle
Quote:
Originally Posted by desideria View Post
I tried to install it and I got this message:

calibre, version 0.8.35
ERROR: Unhandled exception: <b>AttributeError</b>:'module' object has no attribute 'CalculateSimilarityAction'

Traceback (most recent call last):
File "site-packages\calibre\gui2\preferences\plugins.py", line 299, in add_plugin
File "site-packages\calibre\gui2\preferences\plugins.py", line 387, in check_for_add_to_toolbars
File "site-packages\calibre\customize\__init__.py", line 543, in load_actual_plugin
AttributeError: 'module' object has no attribute 'CalculateSimilarityAction'
The reason for your error is not immediately obvious. However, I'll ask the usual software question: "Did you restart Calibre after installing the plug-in?"

More experienced Calibre users may have a more informative view.
Old 01-18-2012, 01:15 PM   #6
Ian_Stott
Junior Member
Ian_Stott began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Jan 2012
Device: Kindle
Quote:
Originally Posted by davidfor View Post
A "Reset Scores" option would be a good idea. If you calculate the similarity for one set of books and then calculate it for a different set, the two sets of numbers don't have any relationship with each other. This could be a menu option or a setting to clear all the scores before doing a calculation.
A Reset Scores option is a good idea. I'll look into implementing it for the next release.

Quote:
Originally Posted by davidfor View Post
When I worked out what the plugin did (I didn't actually read the full post before trying it), the menu item "Find Similar Books..." started to bug me. It doesn't actually do this. What it does is "Calculate Similarity Scores". To do what your first paragraph states, and the menu item implies, the plugin would need to do the calculation on every book in the library and display a list of those with scores greater than a defined amount.
I agree that the plugin only calculates similarity scores for the set of books you select against your target. However, as the purpose of the tool is to help you find similar books, I felt it was preferable to choose a title that contained the desired goal, as opposed to one that contained more caveats than a financial advert on the TV. I hope you will bear with me as the tool heads towards that goal.
Old 01-18-2012, 01:38 PM   #7
Ian_Stott
Junior Member
Ian_Stott began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Jan 2012
Device: Kindle
Quote:
Originally Posted by davidfor View Post
I thought I would compare Orson Scott Card's "Ender's Game" and "Ender's Shadow". When I went to do this, I realised I had three different epub versions of Ender's Game, so I tried them first. They weren't very similar. One scored 0.333909 and the other "5.43363e-05". I did check the files and they do contain the same text, but the formatting is very different. Does this mean the comparison includes the HTML code as well as the actual text of the book?
When extracting the text from a document, all of the HTML formatting is removed so that only the plain text is used.

Quote:
Originally Posted by davidfor View Post
As you mentioned the Harry Dresden series, I did a test comparing them to the first book. The results are similar to above. With "Euclid", they are all better than 0.9. With "Tanimoto" the closest is book 11 at 0.0295. I'm a little confused on this but it probably means that I don't understand how to interpret the scores properly.
One of the implications of using the TF-IDF method to describe the text of a document is that it places the most weight on the unusual words in a set of documents. The good side of this is that common words (e.g. for, of, him, her) become irrelevant, as they occur in all (English) documents. The downside is that if you are comparing only 2 documents, only the words that differ between them will count, so the similarity score is likely to be low (especially for the Tanimoto score).
However, if you select the books in the Ender series as well as a lot of other sci-fi books (e.g. all of the other books by Orson Scott Card), then you will find that the Ender books score far higher.

When I did this with the Orson Scott Card books in my library, using Ender's Game as the target, Ender in Exile came out top with a Tanimoto score of 0.51, followed by Ender's Shadow at 0.23 and Speaker at 0.21.

A more satisfactory approach may be to replace the TF-IDF method with some form of word count where the common words have been removed. However, this would require a language-specific dictionary. I have been pointed towards some Python-based text informatics libraries that would help with this, but I didn't want to launch into these for v1.
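
To make the two-document effect concrete, here is a small sketch. It assumes the textbook weighting idf = log(N / df), where N is the number of books selected and df is the number of those books containing the word. The plug-in's exact weighting is not shown in this thread, so treat the formula, the function names and the toy "books" as assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {word: weight} dict per doc.

    Assumed weighting: term count * log(N / document frequency).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vectors


def tanimoto(a, b):
    """Tanimoto similarity for real-valued vectors: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    denom = sum(v * v for v in a.values()) + sum(v * v for v in b.values()) - dot
    return dot / denom if denom else 0.0


# Toy stand-ins for real books (hypothetical data).
game = "ender battle school ender".split()
copy = list(game)                      # an identical copy of the same book
other1 = "dragon wizard castle".split()
other2 = "robot planet ship".split()

# Only the two copies selected: every word is in both, so idf = log(2/2) = 0.
pair = tfidf_vectors([game, copy])
print(tanimoto(pair[0], pair[1]))      # 0.0, all weights vanish

# Same two copies, but with unrelated books in the selection as well.
four = tfidf_vectors([game, copy, other1, other2])
print(tanimoto(four[0], four[1]))      # close to 1
```

Under this assumed weighting, two identical books compared on their own score zero because every shared word is weighted out, while the same pair scores close to 1 once unrelated books are part of the selection.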
Old 01-18-2012, 11:16 PM   #8
davidfor
Grand Sorcerer
 
Posts: 6,089
Karma: 6238033
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo Touch, Kobo Glo
Quote:
Originally Posted by Ian_Stott View Post
When extracting the text from a document, all of the HTML formatting is removed so that only the plain text is used.
I found this when I actually read the manual.

Quote:
One of the implications of using the TF-IDF method to describe the text of a document is that it places the most weight on the unusual words in a set of documents. The good side of this is that common words (e.g. for, of, him, her) become irrelevant, as they occur in all (English) documents. The downside is that if you are comparing only 2 documents, only the words that differ between them will count, so the similarity score is likely to be low (especially for the Tanimoto score).
However, if you select the books in the Ender series as well as a lot of other sci-fi books (e.g. all of the other books by Orson Scott Card), then you will find that the Ender books score far higher.

When I did this with the Orson Scott Card books in my library, using Ender's Game as the target, Ender in Exile came out top with a Tanimoto score of 0.51, followed by Ender's Shadow at 0.23 and Speaker at 0.21.
My problem was that I was comparing three copies of Ender's Game and the scores were almost zero. These were epubs that had come from different sources. I think one was converted from a LIT; I'm not sure about the others. I did a quick scan through them, and the text appeared to be the same.

Just now, I took a copy of one of these epubs, changed the name of the file and the title using Sigil, and added it to calibre. The similarity score for this new book was 0.0000354521. But, apart from one extra word in the metadata, the books are the same.

Some of the other tests are OK. Comparing "Speaker for the Dead" and "Xenocide" with "Children of the Mind" gave 0.24976 and 0.335613 respectively. Those scores make sense. But my comparison of Ender's Game with Speaker and Shadow gives zero for both of them.

And now I am a little bit more baffled. I get different results if I compare two books than if I compare more. Comparing each of the above books individually to Game gave a zero score every time. But comparing them all at the same time gave scores between 0.001 and 0.094. I thought the comparison was always to the first selected book.

Quote:
A more satisfactory approach may be to replace the TF-IDF method with some form of word count where the common words have been removed. However, this would require a language-specific dictionary. I have been pointed towards some Python-based text informatics libraries that would help with this, but I didn't want to launch into these for v1.
That is the problem with this sort of thing. There are lots of different ways to do it and you have to decide on one. Taking the simple approach at this point makes a lot of sense.
Old 01-19-2012, 04:02 PM   #9
Ian_Stott
Junior Member
Ian_Stott began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Jan 2012
Device: Kindle
Quote:
Originally Posted by davidfor View Post
My problem was that I was comparing three copies of Ender's Game and the scores were almost zero. These were epubs that had come from different sources. I think one was converted from a LIT; I'm not sure about the others. I did a quick scan through them, and the text appeared to be the same.

Just now, I took a copy of one of these epubs, changed the name of the file and the title using Sigil, and added it to calibre. The similarity score for this new book was 0.0000354521. But, apart from one extra word in the metadata, the books are the same.
I have submitted a minor update that contains 2 additional similarity metrics. When comparing 2 or 3 books to see if they are the same, you could try the Tanimoto (binary) method. Because it only looks at the presence or absence of words in the pair of documents, rather than a weighted word count as in the TF-IDF method, it should have the following features:
  • The similarity score depends only upon the 2 books being compared, as opposed to the whole set of books under comparison.
  • If two books contain the same words, the similarity score will be 1.
  • As the word count is not used, purely the presence or absence of a word, this will be a far cruder measure.

Let me know if this helps with determining whether your 3 copies of Ender's Game are the same.
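
For anyone curious, the three features above can be sketched in a few lines. This is only an illustration of the binary Tanimoto idea (shared words divided by all distinct words across the pair), not the plug-in's code; the function name and the whitespace tokenisation are assumptions.

```python
def binary_tanimoto(text_a, text_b):
    """Binary Tanimoto: words present in both books / words present in either."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# Same vocabulary, different order and frequencies: still a perfect match.
print(binary_tanimoto("the cat sat on the mat", "mat the on sat cat cat cat"))  # 1.0

# Two of four distinct words shared.
print(binary_tanimoto("a b c", "a b d"))  # 0.5
```

Because only the two word sets enter the calculation, the score cannot change when other books are added to the selection, and because frequencies are discarded it is a cruder measure.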
Old 01-20-2012, 02:50 AM   #10
desideria
Member
desideria began at the beginning.
 
Posts: 10
Karma: 10
Join Date: Mar 2011
Device: none
Quote:
Originally Posted by Ian_Stott View Post
The reason for your error is not immediately obvious. However, I'll ask the usual software question: "Did you restart Calibre after installing the plug-in?"

More experienced Calibre users may have a more informative view.
I couldn't install it. When I tried, I got this message and it said that it would uninstall the plugin. However, after I reinstalled Windows (I changed to Windows 7), I had no problems installing the plugin.
Old 01-23-2012, 05:54 PM   #11
Ian_Stott
Junior Member
Ian_Stott began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Jan 2012
Device: Kindle
Quote:
Originally Posted by desideria View Post
I changed to windows 7, I have no problems in installing the plugin.
I am pleased to learn that you have now managed to install the plugin. Could you let me know what operating system you were initially using, in case there is something about the plugin that is specific to Windows 7?
Old 03-02-2012, 10:23 AM   #12
silentguy
Connoisseur
silentguy doesn't litter
 
Posts: 86
Karma: 200
Join Date: Nov 2010
Location: Dortmund, Germany
Device: Kindle PW2, Sony PRS-T1
For me this plugin creates a slight conflict with find_duplicates. When both are installed, find_duplicates displays "ERROR: Restart required: You must restart Calibre before using this plugin!".
I did some research, and it seems that your plugin manages to overwrite some variables that find_duplicates uses, leading to find_duplicates searching for some of its images in the Find Similar Stories zip archive.

It seems you are using his common_utils and this causes the conflict... maybe you could just talk with kiwidude about avoiding the conflicts.
Old 03-02-2012, 10:35 AM   #13
kiwidude
calibre/Sigil Developer
kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,230
Karma: 1345754
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
@silentguy/Ian - I just downloaded your Similar Stories plugin to have a look at the source. The cause is undoubtedly just a copy/paste error in action.py. It should be referencing your *own* common_utils file, not the one in the find duplicates plugin.

i.e. Change this:
from calibre_plugins.find_duplicates.common_utils import set_plugin_icon_resources, get_icon, \

to this:
from calibre_plugins.similar_stories.common_utils import set_plugin_icon_resources, get_icon, \
Old 03-08-2012, 02:32 PM   #14
Ian_Stott
Junior Member
Ian_Stott began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Jan 2012
Device: Kindle
Hi Folks,
Thanks both for letting me know of the issue and for identifying its cause. I'll fix it tonight and repost.
Old 03-08-2012, 02:46 PM   #15
Ian_Stott
Junior Member
Ian_Stott began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Jan 2012
Device: Kindle
I have now released a new version with the fix identified by kiwidude. Hopefully this will solve the issue.