Quote:
Originally Posted by jorm
You mentioned the set of proper nouns. Are you thinking of doing a lookup with a list of proper nouns found in the beginning of the book? I think using that to identify a sentence might work well, but if we just used the nouns without the sentence we would probably get confused by multiple books in a series about the same characters (nouns).
|
The problem with sentences is: how can you be sure you're past the front matter? Most ebooks do not mark their front matter; it's just part of the body text.
I doubt we could come up with any reasonable scheme that would work across all books 100% of the time. The idea would be to come up with something that is 1) computationally cheap, 2) much smaller than the book itself, and 3) fairly robust.
I vaguely recall reading about fingerprinting for audio tracks, which suggests some kind of statistical analysis of the text: set of proper nouns, histogram of word frequencies (keep only the bottom 20 or so), average sentence length, number of punctuation marks, that kind of thing.
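To make that a bit more concrete, here is a rough Python sketch of such a statistical fingerprint. Everything in it is an assumption for illustration: the function name `fingerprint`, reading "bottom 20" as the 20 least frequent words, and the crude proper-noun guess (capitalized words that never show up in lowercase). It is only meant to show how cheap and small the result could be, not a tested scheme.

```python
import re
import string
from collections import Counter


def fingerprint(text, rare_words=20):
    """Build a small statistical summary of a text (hypothetical scheme)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    counts = Counter(w.lower() for w in words)

    # Crude proper-noun guess: capitalized and never seen in lowercase.
    lower_seen = {w for w in words if w.islower()}
    proper = sorted(
        {w for w in words if w[0].isupper() and w.lower() not in lower_seen}
    )

    # "Bottom" of the histogram: least frequent words, ties broken alphabetically.
    rare = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:rare_words]

    return {
        "proper_nouns": proper,
        "rare_words": rare,
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "punctuation": sum(1 for c in text if c in string.punctuation),
    }


if __name__ == "__main__":
    sample = "Frodo met Sam. The ring was heavy, and the road was long!"
    print(fingerprint(sample))
```

Two fingerprints could then be compared field by field with some tolerance, which is where the robustness would have to come from. Front matter would still skew the numbers a little, but much less than it would break an exact sentence match.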