03-13-2012, 02:31 PM   #20
Backi
Connoisseur
Posts: 99
Karma: 15776
Join Date: Dec 2011
Device: PB912 Matt White
Quote:
Originally Posted by jorm
OK, that makes sense for songs. So if we want to handle anthologies, we have to go with approach 2 (proper names) and take the count for the entire book. That seems like it would solve it. In my case, most of the books it is not picking up are not anthologies, just ones with really poor tagging.

Does approach 2 sound workable? Are there any obvious logic failures that we can foresee? The next question is whether we can get people to run it on their books and populate our database.
I think one has to choose data from a book that is free of formatting and layout. I am thinking of a single string of characters, with everything else, such as whitespace, stripped out.
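To make the single-string idea concrete, here is a rough Python sketch. The function name normalize_text is made up for illustration, and lowercasing and dropping accents and punctuation are my own assumptions on top of stripping whitespace; nothing here is settled.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Reduce book text to one lowercase string of letters and digits.

    Hypothetical helper: lowercasing and accent removal are assumptions,
    not something agreed on in the thread.
    """
    # Decompose characters so that "é" becomes "e" plus a combining mark,
    # and compatibility forms such as ligatures are expanded.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only letters and digits; this drops whitespace, punctuation
    # and the combining marks produced by the decomposition.
    return "".join(ch.lower() for ch in decomposed if ch.isalnum())


if __name__ == "__main__":
    print(normalize_text("Call me\n   Ishmael."))
```

Two copies of the same book with different line breaks, indentation or page layout should then give the same string, which is the point of the exercise.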

...

OK, I browsed for a while. The single-string idea led me to DNA sequence analysis, then to string searching algorithms such as the Rabin–Karp algorithm, and then to what seems to me to be the solution, because it represents the same use case (it might be too complex to implement ourselves, but there are perhaps some open source frameworks):

Plagiarism detection

What we are talking about here, such as finding the first sentence, taking the nouns, etc., is the document model.
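For reference, the Rabin–Karp search mentioned above can be sketched in a few lines of Python. This is only a minimal illustration of the rolling-hash idea, not a plagiarism detector; the function name and the base and mod values are arbitrary choices of mine.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the leading character in a window of length m.
    high = pow(base, m - 1, mod)
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    matches = []
    for start in range(n - m + 1):
        # Compare the characters only when the hashes agree,
        # which rules out hash collisions.
        if w_hash == p_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m].
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return matches


if __name__ == "__main__":
    print(rabin_karp("abracadabra", "abra"))
```

The window hash is updated in constant time per step, so a long stripped-down book string can be scanned for a passage without rehashing every window from scratch.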