Quote:
Originally Posted by kiwidude
Wow, a simple bit of magic like that for soundex? Very cool, thx. I guess I could use the same approach as "similar title" as the starting point (stripping subtitles, punctuation etc) and then apply the soundex to that.
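A minimal sketch of that idea, assuming standard American Soundex applied word by word to a cleaned-up title. The helper name title_soundex, the rule of cutting the subtitle at the first colon, and the per-word encoding are illustrative assumptions, not the plugin's actual code:

```python
import re

# Standard American Soundex digit groups.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word, length=4):
    """Return the American Soundex code of a single word ('' if no A-Z letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (result + "0" * length)[:length]


def title_soundex(title):
    """Hypothetical helper: strip subtitle and punctuation, then encode each word."""
    main = title.split(":", 1)[0]        # assumption: subtitle follows a colon
    main = re.sub(r"[^\w\s]", "", main)  # drop punctuation
    codes = (soundex(w) for w in main.split())
    return " ".join(c for c in codes if c)


print(title_soundex("The Hobbit: There and Back Again"))
print(title_soundex("The Hobit"))
```

Two titles would then count as a soundex match when their encoded strings are equal.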
The question once again becomes the permutations... currently we have this:
1. Matching ISBN only
2. Identical title, ignore author
3. Similar title, ignore author
4. Similar title, identical author
5. Similar title, similar author*
6. Ignore title, similar author*
For 5 & 6, as mentioned previously, "similar author" is going to become more conservative so that it does not ignore initials. We will add at least one more, fuzzier author option (which, for example, looks at a surname plus first initial only):
7. Ignore title, fuzzy author
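One way the "surname plus first initial" reduction could be sketched is shown below. The function name fuzzy_author_key and the two accepted name layouts are assumptions made for illustration, and multi-word surnames are not handled:

```python
import re


def fuzzy_author_key(author):
    """Reduce an author name to 'surname initial' (hypothetical helper)."""
    if "," in author:
        # "Surname, Forenames" layout
        surname, _, forenames = author.partition(",")
    else:
        # "Forenames Surname" layout: assume the last word is the surname
        parts = author.split()
        surname = parts[-1] if parts else ""
        forenames = " ".join(parts[:-1])
    surname = re.sub(r"[^\w]", "", surname).lower()
    forenames = re.sub(r"[^\w]", "", forenames)
    initial = forenames[:1].lower()
    return (surname + " " + initial).strip()


for name in ("J. R. R. Tolkien", "Tolkien, J.R.R.", "Christopher Tolkien"):
    print(name, "->", fuzzy_author_key(name))
```

Authors whose keys are equal would be treated as the same person under the fuzzy option, so the first two names above collapse together while the third stays separate.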
Now we have soundex. Does it make sense to apply it only to titles rather than author names? Presumably author names would have the same problems, with initials etc. distorting the results. So maybe we add:
8. Soundex title, similar author
How does that sound?
|
Sounds good.
I have one question. It seems that "identical title, ignore author" does not completely ignore the author.
2 cases:
1. 1500 Donald Duck comic books, 5 authors. Scan time: less than half a minute. Result: 20 duplicates.
2. 500 books, 212 authors. Scan time: infinity? (The processor stays at 100% on one core for longer than 12 hours.)
So maybe the author is not completely ignored?
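As a point of comparison, here is a minimal sketch of a title-only grouping in which the author field is never read, so the running time should depend only on the number of books and not on the number of authors. This is not the plugin's actual code; the function name and the (id, title, authors) tuple layout are assumptions:

```python
from collections import defaultdict


def duplicate_groups_by_title(books):
    """Group book ids by identical (case-insensitive) title, ignoring authors."""
    groups = defaultdict(list)
    for book_id, title, _authors in books:  # authors are deliberately unused
        groups[title.strip().lower()].append(book_id)
    # Only titles that occur more than once form a duplicate group.
    return [ids for ids in groups.values() if len(ids) > 1]


books = [
    (1, "Donald Duck 12", ["Carl Barks"]),
    (2, "Donald Duck 13", ["Carl Barks"]),
    (3, "donald duck 12", ["Don Rosa"]),
]
print(duplicate_groups_by_title(books))
```

A single pass like this over 500 books would finish almost instantly, which is why a scan that runs for hours suggests the author data is still involved somewhere.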