Old 04-23-2011, 05:07 AM   #121
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Thx ldolse. Yes, similar titles does call that function, so I can look at passing that parameter. The discussion of tweaking the various algorithms to catch various cases is topical given my post with the last beta. A few people have downloaded it over the last few days, but I have had no further feedback as yet, which I would like before I start working on the next iteration of it.

How are people finding the ignore title search feature and the way it and exemptions are now working in the plugin?

Are there any other bugs or unexpected behaviour you have noticed?

Any further thoughts on the algorithms? I am convinced that similar author should be made more conservative as I specified above. What about the fuzzy author algorithms? How many would you like to see and what should they look like?

Last edited by kiwidude; 04-23-2011 at 05:09 AM.
Old 04-23-2011, 06:42 AM   #122
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I actually missed the earlier post with the initial implementation of the ignore title option. I'll get it installed now and give it a shot.
Old 04-23-2011, 11:08 AM   #123
[Ahmed]
Member
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: Kindle³
In addition to my earlier report, I've found duplicates (manually) that I was unable to detect with the plugin.

Not surprisingly, parentheses:
Title
Title (Remark, Year, etc)

But there's also

Title
Title, or, alternative Title


And

Title
Series N - Title

Example:
Foundation 5 - Foundation and Earth
Foundation and Earth

And, which is very weird:

The Martian Way
The Martian Way and other stories

wasn't detected either.


It also failed to find some titles with typos in them, e.g.
Angle
Angel

My database currently holds 8128 books, I've been able to get rid of quite a few duplicates using all the available options so far.
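
To make the first three shapes above concrete, here is a rough sketch of a title normalizer that would map each pair to the same key. The function name `normalize_title` and the three regular expressions are assumptions made for this example, not the plugin's actual logic, and they would certainly produce false positives on some real titles. The "and other stories" case is deliberately not handled.

```python
import re


def normalize_title(title):
    """Hypothetical normalizer for the title shapes listed in the post above."""
    t = title.strip()
    # "Title (Remark, Year, etc)" -> "Title"
    t = re.sub(r"\s*\([^)]*\)\s*$", "", t)
    # "Series N - Title" -> "Title"
    t = re.sub(r"^.*?\s\d+\s+-\s+", "", t)
    # "Title, or, alternative Title" -> "Title"
    t = re.sub(r",\s*or,\s.*$", "", t, flags=re.IGNORECASE)
    return t.lower()


pairs = [
    ("Title", "Title (Remark, Year, etc)"),
    ("Title", "Title, or, alternative Title"),
    ("Foundation and Earth", "Foundation 5 - Foundation and Earth"),
]
for a, b in pairs:
    print(normalize_title(a) == normalize_title(b), "|", a, "|", b)
```

Two books would then be candidate duplicates when their normalized titles are equal.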
Old 04-23-2011, 11:23 AM   #124
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
@Ahmed - thanks for the feedback and various examples. I am not at all surprised to see so many variations not covered by the existing logic. These betas have just plumbed into another existing piece of code in Calibre as a placeholder while the focus has been on the whole UI around searching, highlighting, navigating, exempting and management of groups of duplicates.

The tuning and refining of the "similar" and "fuzzy" algorithms is the "fun" bit I get to work on next. Adding algorithms and tweaking them is all isolated code that is trivial to change without having to touch any of the more fragile core. Provided, of course, that everyone is happy with the way the rest of it is working.

Just one further point - your last example of angle / angel would only be found (possibly) by a soundex based algorithm as chaley has mentioned a few times. Is this something that is easily done from Python?
Old 04-23-2011, 12:22 PM   #125
chaley
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by kiwidude View Post
Just one further point - your last example of angle / angel would only be found (possibly) by a soundex based algorithm as chaley has mentioned a few times. Is this something that is easily done from Python?
Soundex is easy to compute. See http://code.activestate.com/recipes/...dex-algorithm/. My approach would be to split the item into words, strip all punctuation, compute the soundex of each word, and add each resulting code to a set. A 'conservative' comparison would compare for set equality. A less conservative one would check for N matches out of M words.

Note that soundex is by nature not conservative. For example, 'holly' and 'healey' generate equal soundex strings, also equaling the string generated by 'hilly' and 'hayley'.

Note**2: Knuth's algorithm works with any accuracy only on words that use English pronunciation rules. I think that is a large enough 'market' to make it useful.
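
The approach described above can be sketched as follows. This is a minimal version of standard American Soundex written for illustration; it is not the linked recipe, and the helper names `soundex_key`, `conservative_match` and `loose_match` are made up for this example. It only looks at ASCII letters, so accented characters are simply dropped.

```python
import re

# Standard Soundex digit groups; vowels, H, W and Y have no digit.
_CODES = {}
for _digit, _letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no ASCII letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same digit
            prev = digit
    return (code + "000")[:4]


def soundex_key(text):
    """Set of Soundex codes for the words in text, ignoring punctuation."""
    return frozenset(soundex(w) for w in re.findall(r"[A-Za-z]+", text))


def conservative_match(a, b):
    """Conservative comparison: the two sets of codes must be equal."""
    return soundex_key(a) == soundex_key(b)


def loose_match(a, b, n):
    """Less conservative comparison: at least n codes in common."""
    return len(soundex_key(a) & soundex_key(b)) >= n


print(soundex("holly"), soundex("healey"), soundex("hilly"), soundex("hayley"))
print(conservative_match("Angle", "Angel"))
print(loose_match("The Martian Way", "The Martian Way and other stories", 3))
```

With this sketch, 'angle' and 'angel' produce the same code, so the conservative comparison would group them.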
Old 04-23-2011, 01:05 PM   #126
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Wow, a simple bit of magic like that for soundex? Very cool, thx. I guess I could use the same approach as "similar title" as the starting point (stripping subtitles, punctuation etc) and then applying the soundex to that.

The question once again becomes the permutations... currently we have this:
1. Matching ISBN only
2. Identical title, ignore author
3. Similar title, ignore author
4. Similar title, identical author
5. Similar title, similar author*
6. Ignore title, similar author*

for 5 & 6, as mentioned previously "similar author" is going to change to be more conservative to not ignore initials. We will add at least one more fuzzier author option (which for example looks at a surname plus first initial only)
7. Ignore title, fuzzy author

Now we have soundex. Does it make sense to only apply it to titles rather than author names? As presumably you have the same problems of author initials etc causing problems with the results? So maybe we add:
8. Soundex title, similar author

How does that sound?
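
For illustration only, a "surname plus first initial" key for the fuzzy author option might look something like this. The function name `fuzzy_author_key` and the way it guesses between "Forenames Surname" and "Surname, Forenames" are assumptions for this sketch, not code from the plugin.

```python
def fuzzy_author_key(author):
    """Hypothetical fuzzy key: lowercased surname plus first initial."""
    name = author.strip()
    if "," in name:
        # Assume "Surname, Forenames"
        surname, _, forenames = name.partition(",")
    else:
        # Assume "Forenames Surname"; a single-word name is treated as a surname
        forenames, _, surname = name.rpartition(" ")
    surname = surname.strip().lower()
    initial = forenames.strip()[:1].lower()
    return (surname + " " + initial).strip()


for name in ("Isaac Asimov", "I. Asimov", "Asimov, Isaac", "Arthur C. Clarke", "A. C. Clarke"):
    print(name, "->", fuzzy_author_key(name))
```

A key like this treats "Isaac Asimov" and "I. Asimov" as the same author, but it also merges any two authors who share a surname and first initial, which is why it belongs in a "fuzzy" option rather than the more conservative "similar" one.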
Old 04-23-2011, 01:08 PM   #127
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Holy crap, we just had the loudest thunderclap I have ever heard in my life here in London; my eardrums are still ringing five minutes afterwards. Everything shook and car alarms are going for miles around. But I digress.
Old 04-24-2011, 09:15 AM   #128
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
Quote:
Originally Posted by kiwidude View Post
Wow, a simple bit of magic like that for soundex? Very cool, thx. I guess I could use the same approach as "similar title" as the starting point (stripping subtitles, punctuation etc) and then applying the soundex to that.

The question once again becomes the permutations... currently we have this:
1. Matching ISBN only
2. Identical title, ignore author
3. Similar title, ignore author
4. Similar title, identical author
5. Similar title, similar author*
6. Ignore title, similar author*

for 5 & 6, as mentioned previously "similar author" is going to change to be more conservative to not ignore initials. We will add at least one more fuzzier author option (which for example looks at a surname plus first initial only)
7. Ignore title, fuzzy author

Now we have soundex. Does it make sense to only apply it to titles rather than author names? As presumably you have the same problems of author initials etc causing problems with the results? So maybe we add:
8. Soundex title, similar author

How does that sound?
Sounds good.
I have one question. It seems "same title, ignore author" does not completely ignore the author.

2 cases:
1. 1500 Donald Duck comic books, 5 authors. Scan time: less than half a minute. Result: 20 duplicates.
2. 500 books, 212 authors. Scan time: infinity? (The processor stays at 100% on one core for longer than 12 hours.)
So maybe the author is not completely ignored?
Old 04-24-2011, 09:23 AM   #129
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Quote:
Originally Posted by drMerry View Post
Sounds good.
I have one question. It seems "same title, ignore author" does not completely ignore the author.

2 cases:
1. 1500 Donald Duck comic books, 5 authors. Scan time: less than half a minute. Result: 20 duplicates.
2. 500 books, 212 authors. Scan time: infinity? (The processor stays at 100% on one core for longer than 12 hours.)
So maybe the author is not completely ignored?
Hmmm. It definitely ignores the author, so there must be something else going on. Any chance you could zip up your metadata.db and send me a PM with a link to it somewhere? (I don't need/want the books, just the database).

It would let me run some debugging to see why you are getting this behaviour. When you say 500 books - is this a specific library, or have you applied a search restriction to limit the duplicates search?
Old 04-24-2011, 09:39 AM   #130
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
I will do this.
My db is 12 MB+.

I ran a check of the db. It cleaned up 100 KB, but that did not help in my case.
I will send you a PM with a link to my db.

I was just looking at the code.
Not an expert in your code, but if it IS in the code, maybe it is on this point

class TitleAuthorAlgorithm(AlgorithmBase):
    '''
    This algorithm is used for all the permutations requiring
    some evaluation of book titles and an optional author evaluation
    '''
    def __init__(self, gui, book_exemptions_map, title_eval, author_eval=None):
        AlgorithmBase.__init__(self, gui, exemptions_map=book_exemptions_map)
        self._title_eval = title_eval
        self._author_eval = author_eval

    def find_candidate(self, book_id, candidates_map):
        title_key = self._title_eval(self.db.title(book_id, index_is_id=True))
        author_key = ''
        if self._author_eval:
            author_key = self._author_eval(authors_to_list(self.db, book_id))
        candidates_map[title_key+author_key].add(book_id)
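
To look at what this method does in isolation, here is a standalone sketch of the same key-based grouping with no Calibre dependencies. The function `group_candidates`, the shape of the `books` dict and the two lambda evaluators are assumptions made up for this example; only the idea of a map from `title_key + author_key` to a set of book ids comes from the code above.

```python
from collections import defaultdict


def group_candidates(books, title_eval, author_eval=None):
    """Group book ids by an evaluated title key plus an optional author key.

    books is a hypothetical dict of book_id -> (title, list_of_authors).
    Returns only the groups with more than one book.
    """
    candidates_map = defaultdict(set)
    for book_id, (title, authors) in books.items():
        key = title_eval(title)
        if author_eval:
            key += author_eval(authors)
        candidates_map[key].add(book_id)
    return [ids for ids in candidates_map.values() if len(ids) > 1]


books = {
    1: ("Foundation", ["Isaac Asimov"]),
    2: ("foundation ", ["I. Asimov"]),
    3: ("Dune", ["Frank Herbert"]),
}

# Title only: the author never enters the key.
print(group_candidates(books, lambda t: t.strip().lower()))
# Title plus exact author: the two Foundation entries no longer share a key.
print(group_candidates(books, lambda t: t.strip().lower(),
                       lambda a: "|".join(sorted(a))))
```

In this sketch the grouping is a single pass over the books with one hashed lookup each, and when no author evaluator is given the key is the title key alone.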
Old 04-24-2011, 09:41 AM   #131
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
I should add, I have a 40,000 book library I use for testing - and it takes less than 2 seconds to run that search on it (returning 2000 duplicate groups). So your numbers are way out of line with what I am experiencing. Perhaps it is something to do with the titles or something. A copy of your .db to try to replicate the issue is the only way I can help at this point.
Old 04-24-2011, 10:12 AM   #132
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
hope it will
Old 04-24-2011, 11:24 AM   #133
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Ok, I found the problem thanks to the loan of drMerry's database. It is caused when you have a very large number of duplicates in the group. The problem isn't the search algorithms themselves but the code afterwards which repartitions the groups taking into account exemptions. The performance of this is diabolical in the situation of having, say, 70 duplicate titles in the group, so something is badly wrong in there. I will post a new version when I figure out the exact cause and either fix it or dump the approach in favour of another.
Old 04-24-2011, 01:50 PM   #134
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
I have two other options I would like to see:
Identical Title, Similar Author
Identical Title, Identical Author

These would be nice for some quick searches.

Edit: maybe a checkbox same format? (to quickly find books that will be removed by merging (in case you want to free some space by merging))

Last edited by drMerry; 04-24-2011 at 01:54 PM. Reason: added an option
Old 04-24-2011, 01:50 PM   #135
chaley
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by kiwidude View Post
Ok, I found the problem thanks to the loan of drMerry's database. It is caused when you have a very large number of duplicates in the group. The problem isn't the search algorithms themselves but the code afterwards which repartitions the groups taking into account exemptions. The performance of this is diabolical in the situation of having, say, 70 duplicate titles in the group, so something is badly wrong in there. I will post a new version when I figure out the exact cause and either fix it or dump the approach in favour of another.
One thing that has bitten me is using the 'in' operator on lists. The operator does a linear search! One piece of code I wrote improved in performance by two orders of magnitude when I changed the list to a set, which does hashed lookups. Sometimes I use a dict with a fixed value (e.g. True) for the same thing, because they are hashed as well.
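
A quick way to see the effect for yourself, as a rough sketch: the collection size and repeat count below are arbitrary choices, and the actual timings will vary by machine and Python version.

```python
import timeit

ids_list = list(range(100_000))
ids_set = set(ids_list)
ids_dict = dict.fromkeys(ids_list, True)  # dict with a fixed value, used like a set
probe = 99_999  # worst case for the list: the last element

# 'in' on a list scans element by element; set and dict use hashed lookups.
list_time = timeit.timeit(lambda: probe in ids_list, number=200)
set_time = timeit.timeit(lambda: probe in ids_set, number=200)
dict_time = timeit.timeit(lambda: probe in ids_dict, number=200)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s  dict: {dict_time:.6f}s")
```

If membership tests sit inside a loop over the same collection, the list version does work proportional to the square of the group size, which is the kind of cost that only shows up with large groups.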