MobileRead Forums - View Single Post

compurandom · 12-15-2022, 08:01 PM

> b) Databases of word frequency tables can become very large, very quickly.

I wouldn't think you need a complete dictionary to do this.

I would expect that having a dictionary of, say, the top 400 words in a language would be plenty to characterize it.

If you were selective, you could probably even pick fewer than 50 "keystone" words that are either not shared with other languages, or at least very frequent in one language and very infrequent in the others, and still come up with a correct weighted answer.
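To make the "keystone" idea concrete, here is a rough sketch of weighted scoring. The word lists and weights below are made up for illustration (they are not measured frequencies), and the names `KEYSTONES`, `score_languages` and `guess_language` are just placeholders I picked.

```python
import re
from collections import Counter

# Hypothetical keystone tables: word -> weight.
# Words and weights are illustrative guesses, not measured data.
KEYSTONES = {
    "english": {"the": 3.0, "and": 2.0, "of": 2.0, "with": 1.0},
    "german": {"der": 3.0, "und": 2.0, "nicht": 2.0, "mit": 1.0},
    "french": {"les": 3.0, "et": 2.0, "dans": 2.0, "avec": 1.0},
}


def tokenize(text):
    # Lowercase the text and keep runs of letters (Unicode-aware).
    return re.findall(r"[^\W\d_]+", text.lower())


def score_languages(text, keystones=KEYSTONES):
    # Weighted keystone hits per language, normalized by text length.
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return {
        lang: sum(weight * counts[word] for word, weight in table.items()) / total
        for lang, table in keystones.items()
    }


def guess_language(text, keystones=KEYSTONES):
    # Return the best-scoring language, or None if no keystone word appears.
    scores = score_languages(text, keystones)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Something like `guess_language(page_text)` would then give a first answer. A real table would need weights derived from actual frequency counts, which is exactly the part I haven't researched.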

I'd even guess (i.e., without research or evidence) that given two languages, you could pick 10 words in each that would distinguish a text between the two, using a weighted frequency sample of a few pages randomly selected from the book (e.g., page 10, not page 1, and a page full of words, not pictures).
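The page sampling part could look something like this. It assumes the book has already been split into a list of page strings; `sample_pages` and its default thresholds (`skip_front`, `min_words`) are hypothetical choices of mine, not tested values.

```python
import random


def sample_pages(pages, k=3, skip_front=5, min_words=150, seed=None):
    # Skip front matter, then keep only pages with enough words
    # (this drops picture pages and mostly empty pages).
    candidates = [p for p in pages[skip_front:] if len(p.split()) >= min_words]
    rng = random.Random(seed)
    # Pick up to k of the remaining pages at random.
    return rng.sample(candidates, min(k, len(candidates)))
```

The sampled pages would then be joined together and fed to whatever word scoring is used.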

Across the hundreds to thousands of potential languages, I'd think you could come up with a small number of words that would assign a book to a language family, and then go down a decision tree to narrow down which language in that family it is.
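A two-level version of that decision tree might be sketched like this. The marker words in `FAMILY_MARKERS` and `LANGUAGES_BY_FAMILY` are my own guesses for illustration only, and a real tree would likely need more levels and weighted words.

```python
import re
from collections import Counter

# Hypothetical marker words per family (illustrative, not researched).
FAMILY_MARKERS = {
    "germanic": {"the", "and", "der", "und", "het", "een"},
    "romance": {"que", "les", "los", "el", "il", "et"},
}

# Hypothetical marker words per language within each family.
LANGUAGES_BY_FAMILY = {
    "germanic": {
        "english": {"the", "and", "of"},
        "german": {"der", "und", "nicht"},
        "dutch": {"het", "een", "niet"},
    },
    "romance": {
        "french": {"les", "et", "dans"},
        "spanish": {"los", "el", "y"},
        "italian": {"il", "che", "di"},
    },
}


def best_match(counts, tables):
    # Count marker hits for each candidate and return the top one, or None.
    scores = {name: sum(counts[w] for w in words) for name, words in tables.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def classify(text):
    # First level: pick the family. Second level: pick a language within it.
    counts = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    family = best_match(counts, FAMILY_MARKERS)
    if family is None:
        return None, None
    return family, best_match(counts, LANGUAGES_BY_FAMILY[family])
```

The point of the split is that each level only needs a short word list, instead of one big table covering every language at once.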

Even without having a database, it should be possible to analyze a book, generate a frequency table of the top ~1000 words, have the user supply the language, and build a database that way. After adding a handful of languages like this, you could start characterizing books, and for the ones it gets wrong, the tool could generate a differential between the two languages. A user-guided selection of words might be useful and improve accuracy, but is likely not strictly necessary.
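Here is a minimal sketch of the bootstrapping idea, assuming the "database" is just a dict in memory. The function names (`top_words`, `add_book`, `differential`) are hypothetical, and merging several books for the same language is left out to keep it short.

```python
import re
from collections import Counter


def top_words(text, n=1000):
    # Relative frequency of the n most common words in the text.
    counts = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    total = sum(counts.values()) or 1
    return {word: count / total for word, count in counts.most_common(n)}


def add_book(database, language, text, n=1000):
    # Store the table under the user-supplied language name.
    # This overwrites any earlier table for that language.
    database[language] = top_words(text, n)


def differential(database, lang_a, lang_b):
    # Per-word frequency difference (a minus b), largest gaps first.
    a, b = database[lang_a], database[lang_b]
    diff = {word: a.get(word, 0.0) - b.get(word, 0.0) for word in set(a) | set(b)}
    return sorted(diff.items(), key=lambda item: abs(item[1]), reverse=True)
```

After a misclassification, the top entries of `differential(db, guessed, actual)` would be the natural candidates for keystone words separating those two languages.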

Last edited by compurandom; 12-15-2022 at 08:06 PM.