04-23-2011, 05:07 AM | #121 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Thx Idolse, yes similar titles does call that function so I can look to invoke that parameter. The discussion of tweaking the various algorithms to catch various cases is kind of topical given my post with the last beta. A few people have downloaded it over the last few days but I have no further feedback as yet which I would like before I start working on the next iteration of it.
How are people finding the ignore title search feature and the way it and exemptions are now working in the plugin? Are there any other bugs or unexpected behaviour you have noticed? Any further thoughts on the algorithms? I am convinced that similar author should be made more conservative as I specified above. What about the fuzzy author algorithms? How many would you like to see and what should they look like?
Last edited by kiwidude; 04-23-2011 at 05:09 AM. |
04-23-2011, 06:42 AM | #122 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I actually missed the earlier post with the initial implementation of the ignore title option. I'll get it installed now and give it a shot.
|
04-23-2011, 11:08 AM | #123 |
Member
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: Kindle³
|
Additionally to the earlier report, I've found duplicates (manually) that I was unable to detect with the plugin.
Not surprisingly, parentheses:
Title
Title (Remark, Year, etc)
But there's also
Title
Title, or, alternative Title
And
Title
Series N - Title
Example:
Foundation 5 - Foundation and Earth
Foundation and Earth
And, which is very weird:
The Martian Way
The Martian Way and other stories
wasn't detected either.
It also failed to find some titles with typos in them, e.g.
Angle
Angel
My database currently holds 8128 books. I've been able to get rid of quite a few duplicates using all the available options so far. |
04-23-2011, 11:23 AM | #124 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@Ahmed - thanks for the feedback and various examples. I am not at all surprised to see so many variations not covered by the existing logic. These betas have just plumbed into another existing piece of code in Calibre as a placeholder while the focus has been on the whole UI around searching, highlighting, navigating, exempting and management of groups of duplicates.
The tuning and refining of the "similar" and "fuzzy" algorithms is the "fun" bit I get to work on next. The adding of algorithms and tweaking them is all isolated code that is trivial to change without having to touch any of the more fragile core. Provided of course everyone is happy that the rest of it is working in a way they are happy with. Just one further point - your last example of angle / angel would only be found (possibly) by a soundex based algorithm as chaley has mentioned a few times. Is this something that is easily done from Python? |
04-23-2011, 12:22 PM | #125 | |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Note that soundex is by nature not conservative. For example, 'holly' and 'healey' generate equal soundex strings, also equaling the string generated by 'hilly' and 'hayley'.
Note**2: Knuth's algorithm works with any accuracy only on words that use English pronunciation rules. I think that is a large enough 'market' to make it useful. |
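For anyone following along, here is a minimal sketch of the classic four-character Soundex code in Python. It is only an illustration and not necessarily the code referred to in this thread; the function name soundex is an assumption. It shows why the names above collide, and why Angle / Angel would be caught.

```python
def soundex(word):
    """Classic four-character Soundex code (illustrative sketch)."""
    # Consonant classes; vowels, h, w and y carry no digit.
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    # Keep ASCII letters only; anything else is ignored.
    chars = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not chars:
        return ""

    result = chars[0].upper()
    prev = codes.get(chars[0], "")
    for ch in chars[1:]:
        digit = codes.get(ch, "")
        # Emit a digit only when it differs from the previous class.
        if digit and digit != prev:
            result += digit
        # h and w do not separate two letters of the same class; vowels do.
        if ch not in "hw":
            prev = digit
    # Pad with zeros and cut to a letter plus three digits.
    return (result + "000")[:4]


for name in ("holly", "healey", "hilly", "hayley", "Angle", "Angel"):
    print(name, soundex(name))
```

All four of the h-names come out with the same code, which is the "not conservative" behaviour described above.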
|
04-23-2011, 01:05 PM | #126 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Wow, a simple bit of magic like that for soundex? Very cool, thx. I guess I could use the same approach as "similar title" as the starting point (stripping subtitles, punctuation etc) and then applying the soundex to that.
The question once again becomes the permutations... currently we have this:
1. Matching ISBN only
2. Identical title, ignore author
3. Similar title, ignore author
4. Similar title, identical author
5. Similar title, similar author*
6. Ignore title, similar author*
* For 5 & 6, as mentioned previously, "similar author" is going to change to be more conservative to not ignore initials. We will add at least one more fuzzier author option (which for example looks at a surname plus first initial only):
7. Ignore title, fuzzy author
Now we have soundex. Does it make sense to only apply it to titles rather than author names? As presumably you have the same problems of author initials etc causing problems with the results? So maybe we add:
8. Soundex title, similar author
How does that sound? |
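One way to picture the permutations listed in this post is as pairs of key functions: one for the title, one for the author, with books grouped when both keys match. The sketch below is a rough illustration under that assumption and is not the plugin's code. All names (similar_title, fuzzy_author, ignore, find_duplicates) are hypothetical, and the normalisation rules are guesses at what "similar" and "fuzzy" might mean. The author key assumes "First Surname" name order.

```python
import re
from collections import defaultdict


def ignore(text):
    # "Ignore" option: every book gets the same key.
    return ""


def similar_title(title):
    # Hypothetical rules: lowercase, drop "(...)" remarks, drop a subtitle
    # after ":", strip punctuation, collapse whitespace.
    title = re.sub(r"\(.*?\)", " ", title.lower())
    title = title.split(":")[0]
    title = re.sub(r"[^a-z0-9 ]", " ", title)
    return " ".join(title.split())


def fuzzy_author(author):
    # Surname plus first initial only, assuming "First [Middle] Surname".
    parts = re.sub(r"[^a-z ]", " ", author.lower()).split()
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    return parts[-1] + " " + parts[0][0]


def find_duplicates(books, title_key, author_key):
    """books: list of (book_id, title, author). Returns lists of ids."""
    groups = defaultdict(list)
    for book_id, title, author in books:
        groups[(title_key(title), author_key(author))].append(book_id)
    # Only groups with more than one book are duplicate candidates.
    return [ids for ids in groups.values() if len(ids) > 1]


books = [
    (1, "Foundation and Earth", "Isaac Asimov"),
    (2, "Foundation and Earth (1986)", "I. Asimov"),
    (3, "The Martian Way", "Isaac Asimov"),
]
print(find_duplicates(books, similar_title, fuzzy_author))  # similar title, fuzzy author
print(find_duplicates(books, ignore, fuzzy_author))         # option 7 above
```

With this shape, adding a new permutation such as "soundex title, similar author" would only mean writing one more key function, which fits the point about the algorithms being isolated code.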
04-23-2011, 01:08 PM | #127 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Holy crap, we just had the loudest thunderclap I have ever heard in my life here in London; my eardrums are still ringing five minutes afterwards. Everything shook and car alarms are going for miles around. But I digress.
|
04-24-2011, 09:15 AM | #128 | |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Quote:
I have one question. It seems same title, ignore author does not completely ignore the author. 2 cases:
1. 1500 books, comics of Donald Duck. 5 authors. Scan time: less than half a minute. Result: 20 duplicates.
2. 500 books. 212 authors. Scan time: infinity? (The processor stays at 100% on one core for longer than 12 hours.)
So maybe it does not completely ignore the author? |
|
04-24-2011, 09:23 AM | #129 | |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
It would let me run some debugging to see why you are getting this behaviour. When you say 500 books - is this a specific library, or have you applied a search restriction to limit the duplicates search? |
|
04-24-2011, 09:39 AM | #130 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
I will do this.
I have a db of 12 MB+. I did a check db; it cleaned 100 KB, but this did not help in my case. I will send a PM with my db.
|
04-24-2011, 09:41 AM | #131 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
I should add, I have a 40,000 book library I use for testing - and it takes less than 2 seconds to run that search on it (returning 2000 duplicate groups). So your numbers are way out of line with what I am experiencing. Perhaps it is something to do with the titles or something. A copy of your .db to try to replicate the issue is the only way I can help at this point.
|
04-24-2011, 10:12 AM | #132 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
hope it will
|
04-24-2011, 11:24 AM | #133 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Ok, I found the problem thanks to the loan of drMerry's database. It is caused when you have a very large number of duplicates in the group. The problem isn't the search algorithms themselves but the code afterwards, which repartitions the groups taking into account exemptions. The performance of this is diabolical when there are, say, 70 duplicate titles in the group, so something is badly wrong in there. I will post a new version when I figure out the exact cause and fix it, or dump the approach in favour of another.
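To illustrate the kind of step being described, here is a minimal sketch of one cheap way to split a duplicate group so that no subgroup contains an exempted pair. It is not the plugin's code and not necessarily the behaviour wanted; the name split_group and the data shapes are assumptions. It is a greedy pass that costs at most about n*n pair checks for a group of n books, so a group of 70 stays trivial.

```python
def split_group(group, exemptions):
    """Greedy split of one duplicate group (illustrative sketch).

    group: list of book ids found as duplicates of each other.
    exemptions: iterable of (id, id) pairs marked as "not duplicates".
    """
    exempt = {frozenset(pair) for pair in exemptions}
    subgroups = []
    for book in group:
        for sub in subgroups:
            # Join the first subgroup with no exempted partner in it.
            if all(frozenset((book, other)) not in exempt for other in sub):
                sub.append(book)
                break
        else:
            # Every existing subgroup holds an exempted partner.
            subgroups.append([book])
    # A book left on its own is no longer a duplicate of anything.
    return [sub for sub in subgroups if len(sub) > 1]


print(split_group([1, 2, 3, 4], [(1, 2)]))
```

The trade-off is that each book lands in exactly one subgroup, so in the example book 2 drops out even though it is still a candidate duplicate of 3 and 4. Showing every valid combination instead can grow very quickly with the number of exemptions, so that choice matters for large groups.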
|
04-24-2011, 01:50 PM | #134 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
I got two other options I would like to have.
Identical Title, Similar Author
Identical Title, Identical Author
This would be a nice one for some quick searches.
Edit: maybe a checkbox for same format? (to quickly find books that will be removed by merging, in case you want to free some space by merging)
Last edited by drMerry; 04-24-2011 at 01:54 PM. Reason: added an option |
04-24-2011, 01:50 PM | #135 | |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
|