Quote:
Originally Posted by jorm
OK, that makes sense for songs. So if we want to handle anthologies, we have to go with approach 2 (proper names) and take the count for the entire book. That seems like it would solve it. In my case, most of the books it is not picking up are not anthologies, just ones with really poor tagging.
Does approach 2 sound workable? Are there any obvious logic failures that we can foresee? The next question is whether we can get people to run it on their books and populate our database.
I think one must choose data from a book that is free of formatting and layout. I am thinking of a single string of characters, with everything else, such as white space, stripped out.
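To make the single-string idea concrete, here is a rough sketch in Python. The function name normalize_text is made up for illustration, and the choices to lowercase the text and to drop punctuation along with white space are my own assumptions, not something we have agreed on:

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Collapse a book's text into one layout-free string (hypothetical helper)."""
    # NFKC folds compatibility forms such as ligatures into plain characters.
    # Lowercasing is an assumption, so that case differences do not matter.
    text = unicodedata.normalize("NFKC", text).lower()
    # Keep only letters and digits; white space and punctuation are dropped.
    # Dropping punctuation goes beyond "strip white space" and is an assumption.
    return "".join(ch for ch in text if ch.isalnum())
```

Two copies of the same text with different line breaks, indentation or capitalisation should then come out as the same string, for example normalize_text("Call me\n  Ishmael.") and normalize_text("CALL ME ISHMAEL").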
...
OK, I browsed for a while. The single-string idea led me to DNA sequence analysis, then to string-searching algorithms such as the Rabin–Karp algorithm, and finally to what seems to me to be the solution, because it addresses the same use case (it may be too complex to implement oneself, but there are perhaps some open-source frameworks):
Plagiarism detection
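As a rough illustration of how a Rabin–Karp style rolling hash could be used to compare two stripped strings, here is a minimal sketch. It is not how any particular plagiarism-detection framework works; the function names kgram_hashes and similarity, the window length k = 5, and the use of Jaccard overlap as the score are all assumptions made for the example:

```python
def kgram_hashes(text: str, k: int = 5, base: int = 257,
                 mod: int = (1 << 61) - 1) -> set:
    """Rolling hashes of every k-character window (Rabin-Karp style)."""
    if len(text) < k:
        return set()
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:           # hash of the first window, computed directly
        h = (h * base + ord(ch)) % mod
    hashes = {h}
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % mod  # drop the oldest character
        h = (h * base + ord(text[i])) % mod      # append the newest character
        hashes.add(h)
    return hashes


def similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard overlap of the two hash sets (an assumed scoring choice)."""
    ha, hb = kgram_hashes(a, k), kgram_hashes(b, k)
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / len(ha | hb)
```

The idea would be to run this on the stripped single strings of two books: a score near 1 suggests the same text, a score near 0 suggests different texts. For whole books one would probably want to keep only a sample of the hashes rather than all of them, but that is beyond this sketch.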
What we are talking about here, such as finding the first sentence, taking the nouns, etc., is the document model.