Old 03-13-2012, 12:24 PM   #4
kovidgoyal
creator of calibre
Posts: 45,450
Karma: 27757438
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Quote:
Originally Posted by jorm
You mentioned the set of proper nouns. Are you thinking of doing a lookup with a list of proper nouns found in the beginning of the book? I think using that to identify a sentence might work well, but if we just used the nouns without the sentence, we would probably get confused by multiple books in a series about the same characters (nouns).
The problem with sentences is: how can you be sure you're past the front matter? Most ebooks do not mark their front matter; it's just part of the body text.

I doubt we could come up with any reasonable scheme that would work across all books 100% of the time. The idea would be to come up with something that is 1) computationally cheap, 2) much smaller than the book itself, and 3) fairly robust.

I vaguely recall reading about fingerprinting for audio tracks, which suggests some kind of statistical analysis of the text: set of proper nouns, histogram of word frequencies (keep only the bottom 20 or so), average sentence length, number of punctuation marks, that kind of thing.
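
To make that concrete, here is a rough Python sketch of such a statistical fingerprint. It is only an illustration, not anything that exists in calibre: the function name `fingerprint`, the `rare_words` parameter, and the returned keys are all made up for this example. It also assumes that "bottom 20" means the 20 least frequent words, and it guesses at proper nouns by taking capitalized words that do not start a sentence, which is crude (it will pick up "I", for instance).

```python
import re
import string
from collections import Counter


def fingerprint(text, rare_words=20):
    """Return a small dict of cheap statistics for `text` (hypothetical helper)."""
    # Naive sentence split: break on whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    proper_nouns = set()
    word_counts = Counter()
    total_words = 0
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        total_words += len(words)
        word_counts.update(w.lower() for w in words)
        # Assumption: a capitalized word that does not open the sentence
        # is treated as a proper noun.
        proper_nouns.update(w for w in words[1:] if w[0].isupper())

    # Assumption: "bottom 20" is read as the least frequent words.
    # Ties are broken alphabetically so the result is deterministic.
    rarest = sorted(word_counts.items(), key=lambda kv: (kv[1], kv[0]))[:rare_words]

    return {
        "proper_nouns": sorted(proper_nouns),
        "rarest_words": rarest,
        "avg_sentence_length": total_words / len(sentences) if sentences else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
    }


if __name__ == "__main__":
    sample = "Frodo left the Shire. Sam followed Frodo, and the road went on!"
    for key, value in fingerprint(sample, rare_words=3).items():
        print(key, value)
```

Everything here is a single pass over the text plus one sort of the vocabulary, and the result is a handful of numbers and short lists, so it is at least in the spirit of "cheap" and "much smaller than the book". How robust it would be across different editions of the same book is untested.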