MobileRead Forums - View Single Post

Fanas · 03-15-2012, 12:15 PM

A small dictionary of words is the way to go. Every language has words that are used often enough to appear many times in any book. Obviously 1-3 letter words would appear too often, so those would be off limits. But 4-5 letter words would be abundant without being so frequent that hashing becomes too resource-consuming. Someone with more know-how should decide on the algorithms used for identification, though. I say: take the hash and scan through the book until you find 2 consecutive occurrences of the word with the same number of letters in between. When you find them, check whether the next occurrence matches as well, and repeat until there's a mismatch. After you finish scanning the book you've got more or less the right idea of how identical the two texts are.
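To make the idea concrete, here is a rough Python sketch of one way it could work. It is only an interpretation, not a worked-out design: it assumes the "hash" is simply the list of letter counts between consecutive occurrences of one dictionary word, and the names `gap_fingerprint` and `similarity` are made up for illustration.

```python
import re

def gap_fingerprint(text, word):
    """Return the number of letters between consecutive occurrences of `word`.

    Assumption: only alphabetic characters count as letters, and matching
    is case-insensitive.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    gaps, count, seen = [], 0, False
    for tok in tokens:
        if tok == word:
            if seen:
                gaps.append(count)  # close the gap since the previous hit
            seen, count = True, 0
        elif seen:
            count += len(tok)
    return gaps

def similarity(gaps_a, gaps_b):
    """Rough fraction of gaps in `gaps_a` that line up with runs in `gaps_b`.

    Scan `gaps_b` for the current gap, extend the match until the first
    mismatch, then carry on scanning from where the run ended.
    """
    if not gaps_a:
        return 0.0
    matched, i, j = 0, 0, 0
    while i < len(gaps_a):
        try:
            k = gaps_b.index(gaps_a[i], j)  # next place this gap appears
        except ValueError:
            i += 1  # this gap never shows up again, skip it
            continue
        run = 0
        while (i + run < len(gaps_a) and k + run < len(gaps_b)
               and gaps_a[i + run] == gaps_b[k + run]):
            run += 1
        matched += run
        i += run
        j = k + run
    return matched / len(gaps_a)

if __name__ == "__main__":
    book_a = ("I went with them and came back with nothing "
              "but stayed with friends who agreed with me")
    book_b = "Preface. " + book_a  # same text with extra front matter
    score = similarity(gap_fingerprint(book_a, "with"),
                       gap_fingerprint(book_b, "with"))
    print(score)
```

A single word gives a fairly weak signal, since a common gap length can match by coincidence. Running the same comparison for several words from the dictionary and combining the scores would presumably be more reliable.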
