I agree that finding a consistent sentence or paragraph is the hard part. If we could do that consistently, or at least 90% of the time, we could probably complete this in a few days. Sweeping through the entire book counting proper names and tracking those counts may work, but how fuzzy would we have to make the matching? What if there was an error in conversion and we lost the space between FirstName and "said", producing FirstNamesaid? The count could be off.
So we have two approaches.
1. Try to get the first paragraph or sentence.
Challenge: We might have a difficult time finding the first paragraph since we have the TOC, headers, copyright pages, etc., so it might not be 100% accurate. (See the first sketch after this list.)
2. Count proper nouns and maybe a couple of keyword frequencies.
Challenge: How much leeway do we allow here? If a conversion was not perfect, would the count be off so that we never find a match? Do we only scan the first x pages to limit the processing needed? (See the second sketch after this list.)
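To make approach 1 concrete, here is a minimal sketch of the first-paragraph heuristic, assuming the book has already been converted to plain text. The front-matter keyword list and the 30-word minimum are my guesses and would need tuning against real conversions:

```python
import re

# Front-matter markers we might skip. This list is an assumption and
# would need tuning against real books.
FRONT_MATTER = re.compile(
    r"(table of contents|copyright|all rights reserved|isbn|"
    r"acknowledg|dedication)", re.IGNORECASE)

def first_real_paragraph(text, min_words=30):
    """Return the first paragraph that looks like body text."""
    for para in re.split(r"\n\s*\n", text):
        para = " ".join(para.split())       # collapse whitespace
        if len(para.split()) < min_words:   # too short: heading, page number, TOC line
            continue
        if FRONT_MATTER.search(para):       # looks like front matter
            continue
        return para
    return None
```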
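And a rough sketch of approach 2: a proper-noun fingerprint taken from only the first chunk of the book, plus a fuzzy comparison. The capitalized-word heuristic, the 100k-character cap, and the 15% tolerance are all assumptions to be tuned, not tested values:

```python
from collections import Counter

def name_fingerprint(text, max_chars=100_000):
    # Scan only the first max_chars to bound processing time (the
    # "first x pages" idea). Count capitalized words that do not start
    # a sentence, as a rough proxy for proper nouns.
    counts = Counter()
    prev = "."
    for tok in text[:max_chars].split():
        word = tok.strip('.,!?";:()')
        if (word.isalpha() and word[:1].isupper()
                and not prev.endswith((".", "!", "?", '"'))):
            counts[word] += 1
        prev = tok
    return counts

def fingerprints_match(a, b, top=20, tolerance=0.15):
    # Compare relative frequencies of the most common names. The 15%
    # slack is an arbitrary starting point meant to absorb conversion
    # damage like the lost space in "FirstNamesaid".
    shared = {n for n, _ in a.most_common(top)} & {n for n, _ in b.most_common(top)}
    if len(shared) < top // 2:
        return False
    ta, tb = sum(a.values()), sum(b.values())
    return all(
        abs(a[n] / ta - b[n] / tb) <= tolerance * max(a[n] / ta, b[n] / tb)
        for n in shared
    )
```

Using relative frequencies rather than raw counts means a copy that lost a few occurrences to bad conversion can still land inside the tolerance band.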
I can see the benefits of both approaches. The second is cleaner in that we can process the header as well. However, if one copy has a header and the other does not, we might not match.
However, in that case someone else might have tagged it, and we can find a match using that pattern as well. So if we capture the consistent magic sentence or paragraph 80% of the time, then for the remaining 20%, whenever someone has already tagged that book, we will have its sentence in our database and can still match.
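If we store each tagged magic sentence under an aggressively normalized hash, a damaged copy can still hit the database even when conversion mangled spacing or punctuation. A sketch of that lookup, with a hypothetical tagged table:

```python
import hashlib
import sqlite3

def sentence_key(sentence):
    # Lowercase and strip everything that is not a letter, so a lost
    # space (FirstNamesaid) or stray punctuation hashes the same.
    return hashlib.sha256(
        "".join(c for c in sentence.lower() if c.isalpha()).encode()
    ).hexdigest()

# Hypothetical schema for community-tagged magic sentences:
#   CREATE TABLE tagged (key TEXT PRIMARY KEY, book_id TEXT);
def lookup(conn: sqlite3.Connection, sentence: str):
    row = conn.execute(
        "SELECT book_id FROM tagged WHERE key = ?",
        (sentence_key(sentence),)).fetchone()
    return row[0] if row else None
```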
I am open to both approaches; I just want feedback on the best one so we can move forward.