Old 11-04-2012, 12:00 PM   #24
tshering
Wizard
Posts: 1,428
Karma: 365536
Join Date: Jun 2012
Device: kobo touch
As I reported in my last post on this thread, I was able to build a marisa dictionary but was unable to retrieve anything from one. "Dictionary" here means a highly compressed list of key-value pairs. This might not pass as a real definition, but it should be good enough for our purposes. I will call this kind of dictionary a key-dictionary.

In the Kobo dictionaries (to prevent confusion I will call them language-dictionaries) the key-dictionary is the file named "words". This "words" file is used to determine whether a looked-up expression can be found in the respective language-dictionary, and maybe to provide some other information.

If we knew what the values of the key-value pairs consist of, we could build our own "words" file. This in turn would enable us to insert new entries into the language-dictionaries, or to build a new dictionary from scratch. What the values look like should be easy to ascertain with the marisa tools. However, my attempts failed, so I can only speculate.

1) In order to find out whether a certain word is in the language-dictionary it should be enough that the respective key is found in the key-dictionary. So we don't need any specific value.
2) In which html file is the looked-up expression located? Generally, it is located in an html file named after the first two letters of the expression. The word "body", for instance, is in bo.html. In this case no further information is needed. No need for any specific value.
3) How are plural words, different verb forms, and so on handled? They are listed as variants under the main heading. We find, for instance, "bodies" listed as a variant of "body" in bo.html. Still no need for any specific value.
3a) But what if the variant differs in the first two letters? We find, for instance, "went" as a variant of "go" in go.html. This could call for a specific value. One could think of key="went" and value="go". This information would be sufficient to point the search engine to go.html. Is it done this way? Let us open the English dictionary screen of the KT, type "went", and select it from the list. Surprise! It does not show the entry for "go"; "went" has its own dictionary entry in we.html. Therefore, still no need for a specific value. Two bytes saved. In English, there are probably not many variants that differ from their headword in the first two letters, so this approach might pay off. But how about other languages, for instance German with its ablaut derivations? In ha.html of the German dictionary we find, for instance, "hieb", "hiebest", "hiebet", "hiebe", "hiebst", "gehauen", "hieben" as variants of "hauen". Are they all treated as individual entries? Let us open the German dictionary screen, type "hieb", and select any of the listed words. The first word, "hieb", gets us to the wrong entry "Hieb"; in all other cases we read "No dictionary entry found for...". Evidently, the search engine searches in hi.html, whereas it should search in ha.html.
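
To make the speculation in 1) to 3a) concrete, here is a toy model in Python. It is only a sketch of the behaviour I think I am observing, not the actual Kobo code: the names WORDS, HTML_FILES, html_file_for and look_up are made up, and the tiny word lists stand in for the real "words" file and html files.

```python
# Toy model of the lookup behaviour speculated about above.
# All names and data here are hypothetical stand-ins.

# Keys only, as if every value in the "words" file were empty.
WORDS = {"body", "bodies", "go", "went", "hauen", "hiebst"}

# Which html file lists each expression (as headword or variant).
# "hiebst" is a variant of "hauen" and therefore sits in ha.html.
HTML_FILES = {
    "bo.html": {"body", "bodies"},
    "go.html": {"go"},
    "we.html": {"went"},
    "ha.html": {"hauen", "hiebst"},
}


def html_file_for(expression):
    """File name derived from the first two letters, as in point 2)."""
    return expression[:2].lower() + ".html"


def look_up(expression):
    """Return the html file where the expression is found, or None."""
    if expression not in WORDS:
        return None  # point 1): key missing, word unknown
    name = html_file_for(expression)
    if expression in HTML_FILES.get(name, set()):
        return name
    return None  # key exists, but the derived file does not contain it


for word in ("body", "bodies", "went", "hiebst"):
    print(word, "->", look_up(word))
```

In this model "went" is found only because it has its own entry in we.html, while "hiebst" fails because the file name derived from its first two letters is hi.html, not ha.html. That matches what the device shows, but it does not prove that the real search works this way.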

From these observations it seems to me likely that - at least in some of the language dictionaries - all values in the key-dictionary are empty or irrelevant.
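
If the values really are empty, then preparing the input for our own "words" file would come down to collecting every headword and every variant as a plain key. The following Python sketch shows that step under this assumption. The function build_key_list and the sample entries are hypothetical, and whether the trie-building tool expects exactly one key per line is also an assumption on my part.

```python
# Sketch: collect the keys for a "words" file, assuming values are empty.
# build_key_list and the sample data are hypothetical.

def build_key_list(entries):
    """Return all headwords and variants as a sorted list without duplicates."""
    keys = set()
    for headword, variants in entries.items():
        keys.add(headword)
        keys.update(variants)
    return sorted(keys)


# Headword -> variants, as they would be listed in the html files.
entries = {
    "body": ["bodies"],
    "go": ["goes", "going"],
    "went": [],
}

key_list = build_key_list(entries)

# One key per line, as plain text that could be fed to a trie builder.
key_text = "\n".join(key_list) + "\n"
print(key_text, end="")
```

Note that a variant such as "hiebst" would end up as a key here as well, so this alone would not cure the hi.html/ha.html problem described in 3a).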

Last edited by tshering; 11-06-2012 at 09:40 AM. Reason: Some corrections in: "In ha.html of the German dictionary,..."; replaced "from scrap" by "from scratch"