New dictionary format of firmware 2.14 - Page 2

tshering · 11-01-2012, 12:44 PM

On my home computer, I have the same situation as you. There are the old dictionary folders, with the gzipped html files, and the new zip-files, containing encrypted html-files (I did not check all, but only the E-E dictionary, dicthtml.zip).

I did the last synchronization of my reader via my office computer, which I cannot access at the moment. On my Kobo, the dicthtml.zip contains gzipped html-files.

Interesting point is: The encrypted html-files in the dicthtml.zip of the desktop application (home computer) are dated 07.08.2012. The gzipped html-files of the dicthtml.zip of the KT are dated 13.10.2011 !!!!.

I just checked:
in 1.9.17 they are dated 16.03.2012
in 2.0.0 they are dated 09.07.2012

clsdclsd · 11-01-2012, 01:36 PM

Quote:

Originally Posted by tshering

The gzipped html-files of the dicthtml.zip of the KT are dated 13.10.2011 !!!!.

That's the 1.9.12's date - is that your "factory reset" firmware? Did you update from that to the recent version?

It seems the new firmware version (which omits the dictionaries) does not get the new dictionaries automatically but creates them from the dictionaries already installed on the device. (Or does it get them from the recovery partition for some reason?)
At one point I removed all dictionaries from the reader via "Manage Dictionaries", and re-added them, so all my dictionaries are "brand new", encrypted.

Not closely related: I have tried to add my eng->hun dictionary to the device as an addition, not a replacement, but did not succeed. Adding a new row to the Dictionary table in the kobo.sqlite database resulted in the eng-hun dictionary showing up in "Manage Dictionaries" but not as an actual choice at the dictionary selection for a word to translate/define :-(.

tshering · 11-01-2012, 03:25 PM

Quote:

Originally Posted by clsdclsd

That's the 1.9.12's date - is that your "factory reset" firmware? Did you update from that to the recent version?

That's true.

Quote:

Originally Posted by clsdclsd

It seems the new firmware version (which omits the dictionaries) does not get the new dictionaries automatically but creates them from the dictionaries already installed on the device. (Or does it get them from the recovery partition for some reason?)
At one point I removed all dictionaries from the reader via "Manage Dictionaries", and re-added them, so all my dictionaries are "brand new", encrypted.

I removed and added again only some of them. That's why I have now a mixture of old and new dictionaries.

Quote:

Originally Posted by clsdclsd

Not closely related: I have tried to add my eng->hun dictionary to the device as an addition, not a replacement, but did not succeed. Adding a new row to the Dictionary table in the kobo.sqlite database resulted in the eng-hun dictionary showing up in "Manage Dictionaries" but not as an actual choice at the dictionary selection for a word to translate/define :-(.

Some time ago I tested whether the desktop application accepts a further dictionary if a new folder is added. As was to be expected, it did not. I wonder how the application knows which dictionaries to use. I was unable to identify a related entry in the registry. By the way, the application does not use the J-J dictionary. I even installed the desktop application from the Japanese site, but it is exactly the same.

Maybe also unrelated: when I went from 2.1.1 to 2.1.4, the J-J dictionary was definitely there. When I pointed at a Japanese word, the Japanese dictionary entry popped up, but it did not appear as a choice at the dictionary selection for a word to define. As far as I remember, I modified the relevant information in the database, but the situation did not change. Only after checking it at the "Manage Dictionaries" screen and several synchronizations was the J-J given as a choice.

tshering · 11-02-2012, 11:24 AM

I don't know what I did differently this time, but finally I got the J>E dictionary working. Thanks to mnjkl and clsdclsd for support.
Did anybody already have a look at MARISA (cf. post)? It would be nice if we could add new dictionary entries.

AlPe · 11-02-2012, 06:22 PM

Quote:

Originally Posted by tshering

Did anybody already have a look at MARISA (cf. post)? It would be nice if we could add new dictionary entries.

I confirm that the "words" file is in MARISA 0.2 format.

You can enumerate the current keys using the executable named marisa-reverse-lookup, requesting IDs 0, 1, 2, ... (there is no better way, i.e., there is no method to dump the entire set of keys in the dictionary at once)

As far as I understand, there is no way of augmenting an existing dictionary with new keys. You have to store the previous set of keys, append the new keys, and create a new dictionary from the latter "augmented" set.
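
A minimal sketch of that "store, append, rebuild" step, assuming the old keys have already been dumped to a list (for instance via the ID-by-ID enumeration above). The helper names `merge_keysets` and `write_keyset` are made up for illustration; the resulting text file is only meant as input for a builder such as marisa-build, which is not called here.

```python
from pathlib import Path


def merge_keysets(existing_keys, new_keys):
    """Return the union of both key collections, sorted, without blanks."""
    merged = {k.strip() for k in existing_keys} | {k.strip() for k in new_keys}
    merged.discard("")  # ignore empty lines
    return sorted(merged)


def write_keyset(keys, path):
    """Write one key per line (UTF-8) as input for a trie builder."""
    Path(path).write_text("\n".join(keys) + "\n", encoding="utf-8")


# Hypothetical sample data: keys dumped from the old dictionary plus additions.
old_keys = ["body", "bodies", "go"]
new_keys = ["went", "go"]
augmented = merge_keysets(old_keys, new_keys)
print(augmented)
```

The old dictionary file is never modified; a new one has to be built from the augmented list.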

tshering · 11-02-2012, 08:25 PM

@AlPe
Thank you very much for the information. Right now, I am not sure whether to invest more time in the Japanese dictionary. As it is now, selecting text in a Japanese book is so cumbersome that using the dictionary is rather a pain. One can only hope that this improves with a future update.
I was hoping somebody else would go this way, so that I could easily follow his steps. Anyway, if I were to create a new dictionary (and at the moment, I don't have the necessary knowledge to do it) I would possibly also wish to replace the content (I mean the html files) completely. Thank you again.

murg · 11-02-2012, 09:26 PM

I've posted the direct links to the dictionaries in the Direct Links to Kobo Firmware thread.

tshering · 11-03-2012, 08:55 PM

I would like to say thanks to murg for maintaining the link list. This is really helpful.

Today I installed MinGW in order to compile MARISA 0.2.0 under Windows and try my hand at her tools. Would have been nice to fall in love with her. She didn't compile at first. I found a related bug report and applied the proposed solution (link). After that she compiled. I wrote some lines of random text into a file "keyset.txt" and ran marisa-benchmark and marisa-build against it. Both seemed to do their job, whatever their job exactly might be. Then I ran all other tools, marisa-lookup and so on, against the dictionary "keyset.dic", which was produced by marisa-build. All of them reported the same error:

Quote:

marisa/grimoire/io/mapper.cc:99: MARISA_STATE_ERROR: !is_open(): failed to mmap a dictionary file: keyset.dic

I thought maybe the culprit was my "keyset.txt", so I ran the same tools against the "words" file of the English dictionary. The result was the same. Of course MARISA 0.2.0 is a rather young lady and evidently not much tested under Windows. Therefore, I downloaded 0.1.5. The compilation failed with the same error message as 0.2.0. For this time, I gave up courting her.

tshering · 11-04-2012, 11:00 AM

As I reported in my last post on this thread, I was able to build a marisa dictionary but was unable to retrieve anything from a dictionary. "Dictionary" means here a highly compressed list of key-value pairs. This might not pass as a real definition, but might be good enough for our purposes. This kind of dictionary I will call here key-dictionary.

In the Kobo dictionaries (in order to prevent confusion I will call them language-dictionaries) the key-dictionaries have the name "words". This "words" file is used to get the information whether an expression that is looked up can be found in the respective language-dictionary or not, and maybe some other information.

If we knew what the values of the key-value pairs consist of, we could build our own "words" file. This again would enable us to insert new entries into the language-dictionaries, or to build up a new dictionary from scratch. What the values look like should be easily ascertained with the marisa tools. However, I failed in my attempts. Therefore, I can only speculate about it.

1) In order to find out whether a certain word is in the language-dictionary it should be enough that the respective key is found in the key-dictionary. So we don't need any specific value.
2) In which html file is the looked-up expression located? Generally, it is located in a html file named after the first two letters of the expression. The word "body", for instance, is in the bo.html. In this case no further information is needed. No need for any specific value.
3) How are plural words, different verb forms, and so on handled? They are listed as variants under the main heading. We find for instance "bodies" listed as variant of "body" in bo.html. Still no need for any specific value.
3a) But what if the variant differs in the first two letters? We find, for instance, "went" as a variant of "go" in go.html. This could call for a specific value. One could think of key="went" and value="go". This information would be sufficient to point the search engine to go.html. Is it done this way? Let us open the English dictionary screen of the KT, type "went", and select it from the list. Surprise! It does not show the entry for "go"; "went" has its own dictionary entry in we.html. Therefore, still no need for a specific value. Two bytes spared. In English, there are maybe not many variants of words that differ in the first two letters, and so this handling might pay off. But how is this in other languages, for instance German with its ablaut derivations? In ha.html of the German dictionary, we find, for instance, "hieb", "hiebest", "hiebet", "hiebe", "hiebst", "gehauen", "hieben" as variants of "hauen". Are they all treated as individual entries? Let us open the German dictionary screen, type "hieb", and select any of the listed words. The first word, "hieb", gets us to the wrong entry "Hieb"; in all other cases we read "No dictionary entry found for..." Evidently, the search engine searches in hi.html, whereas it should search in ha.html.

From these observations it seems to me likely that - at least in some of the language dictionaries - all values in the key-dictionary are empty or irrelevant.
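
The two-letter rule observed above can be sketched as follows. This is only a guess at the behaviour, not the actual Kobo code: the function name `chunk_name` is hypothetical, lowercasing is an assumption, and the handling of one-letter words or non-letter characters is unknown.

```python
def chunk_name(word, prefix_len=2):
    """Guess the chunk file for a word from its first letters (assumed rule)."""
    prefix = word.strip().lower()[:prefix_len]
    return prefix + ".html"


for w in ["body", "bodies", "went", "hieb"]:
    print(w, "->", chunk_name(w))
```

Under this rule "hieb" points to hi.html, although it is only listed as a variant of "hauen" in ha.html, which would explain the failed lookups in the German dictionary.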

AlPe · 11-04-2012, 12:56 PM

Quote:

Originally Posted by tshering

This "words" file is used to get the information whether an expression that is looked-up can be found in the respective language dictionary or not, and maybe some other information.

All you need to get from a query W to file "words" is: 1) which chunk (file .html(.gz)) contains the word W and its definition; 2) which is the "position" of word W in that chunk.

For 1), usually one assigns an ID to each chunk, like this: 11.html is 0, aa.html is 1, etc. in lexicographical order.
For 2), an easy way is to store, for word W, the offset, in bytes from the beginning of the chunk, where the definition of W starts.

(The dictionary is split into several chunks to allow faster fetch-decompress-find operations)

See my analysis of the Cybook Odyssey dictionaries at: http://www.albertopettarin.it/penelope.html
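
The two-part index described above (chunk ID plus byte offset) could look like this in a few lines. It is a sketch of the general technique only, with made-up sample entries and a hypothetical function name `build_offset_index`; nothing here is taken from the Kobo files.

```python
def build_offset_index(chunks):
    """chunks maps a chunk filename to a list of (word, definition_bytes).

    Returns (index, blobs): index maps word -> (chunk_id, byte_offset),
    blobs maps chunk_id -> the concatenated chunk content.
    """
    index = {}
    blobs = {}
    # Chunk IDs are assigned in lexicographical order of the filenames.
    for chunk_id, name in enumerate(sorted(chunks)):
        buf = bytearray()
        for word, definition in chunks[name]:
            index[word] = (chunk_id, len(buf))  # offset where the entry starts
            buf += definition
        blobs[chunk_id] = bytes(buf)
    return index, blobs


# Made-up sample data for illustration.
sample = {
    "aa.html": [("aardvark", b"<w>aardvark: ...</w>"), ("aarrgh", b"<w>aarrgh: ...</w>")],
    "11.html": [("11", b"<w>11: ...</w>")],
}
index, blobs = build_offset_index(sample)
chunk_id, offset = index["aarrgh"]
print(chunk_id, offset, blobs[chunk_id][offset:offset + 9])
```

With such an index, changing the length of any definition shifts the offsets of all later entries in the same chunk, so the index has to be rebuilt after every edit.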

Quote:

Originally Posted by tshering

If we knew what the values of the key-value pairs consist of, we could build our own "words" file. This again would enable us to insert new entries into the language-dictionaries, or to build up a new dictionary from scratch. What the values look like should be easily ascertained with the marisa tools. However, I failed in my attempts. Therefore, I can only speculate about it.

That's the point. I haven't had the chance of playing with the marisa-lib. Understanding the content of the (decompressed) index (file "words") is the key point there.

tshering · 11-04-2012, 03:47 PM

@AlPe
Thank you very much for your comments. I very much enjoyed reading your article "Dictionaries for Bookeen Cybook Odyssey". Maybe I will study your script in order to start learning Python.

Quote:

Originally Posted by AlPe

For 2), an easy way is to store, for word W, the offset, in bytes from the beginning of the chunk, where the definition of W starts.

This seems not to be the way the Kobo engine goes. If it were the case, mnjkl, clsdclsd and I could not have successfully manipulated the content of the dictionaries. I do not know how mnjkl and clsdclsd did it; I, for one, did not replace the definitions by definitions of the exact same length. Rather, I added English text at the end of the Japanese definitions, thereby increasing each time the offset of all subsequent entries. As further information, I can say that the position is not indicated by the node position (3rd child of the html-node or so). I inserted new siblings (<w>...</w>) and the subsequent siblings were still correctly accessed.

Therefore, my guess is that the position of a dictionary entry in the .html is determined by a simple text search for name="W". In that way both cases are covered, the main head entry (<a name="go">) and the variant (<variant name="goes"/>).
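
As an illustration of this guess (not of the actual engine), a plain substring search for name="W" behaves exactly as observed: it still finds entries after text has been added in front of them. The sample markup only reuses the `<w>`, `<a name=...>` and `<variant name=.../>` elements mentioned above; everything else, including the function name, is made up.

```python
def find_entry_offset(chunk_html, word):
    """Return the index of name="word" in the chunk, or -1 if absent."""
    return chunk_html.find('name="%s"' % word)


chunk = (
    '<w><a name="go">go</a> definition<variant name="goes"/></w>'
    '<w><a name="goal">goal</a> another definition</w>'
)
# Lengthen the first definition, as when English text is appended to an entry.
edited = chunk.replace("definition<", "definition plus added English text<")

before = find_entry_offset(chunk, "goal")
after = find_entry_offset(edited, "goal")
print(before, after)  # the entry moved, but the search still finds it
```

Because the closing quote is part of the search string, looking for "go" does not accidentally match "goes" or "goal".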

From the behaviour of the Japanese dictionary I got the impression that there things are handled a little different. I still have to think it through. Most important of course is to get the marisa tools working.

AlPe · 11-04-2012, 03:55 PM

Oops, I missed what you previously did.

Your explanation makes sense: after loading the right chunk, they perform a search to locate the beginning of the definition.

mnjkl · 11-05-2012, 08:31 AM

It seems the new firmware puts the dictionaries into .kobo\dict.

AlPe · 11-05-2012, 11:50 AM

From what I see, the file "words" contains only the words stored in the dictionary (the "keys"), in several variants --- e.g., singular/plural for the Italian one.

Hence, file "words" can be used only to know whether a query word is present in the dictionary or not.

I think that the Kobo software checks whether a word is present, then it matches the word with the chunk, and then it performs a full text search in the chunk to locate the beginning of the definition for the query word.

(quite inefficient process, in my opinion)
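
The three steps can be put together in a small end-to-end sketch. It assumes the older, unencrypted layout (a zip holding gzipped html chunks), uses a plain Python set as a stand-in for the MARISA "words" file, and works on made-up sample data; the function names and the file names inside the zip are assumptions.

```python
import gzip
import io
import zipfile


def make_demo_zip(chunks):
    """Build an in-memory zip of gzipped chunks (sample data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, html in chunks.items():
            zf.writestr(name, gzip.compress(html.encode("utf-8")))
    return buf.getvalue()


def lookup(word, keys, zip_bytes):
    """Return (chunk_name, position) for word, or None if it cannot be found."""
    if word not in keys:                      # step 1: is the key known?
        return None
    name = word[:2].lower() + ".html"         # step 2: pick chunk by prefix
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        if name not in zf.namelist():
            return None
        html = gzip.decompress(zf.read(name)).decode("utf-8")
    pos = html.find('name="%s"' % word)       # step 3: text search in chunk
    return None if pos < 0 else (name, pos)


demo_zip = make_demo_zip({
    "bo.html": '<w><a name="body">body</a> ...<variant name="bodies"/></w>',
    "ha.html": '<w><a name="hauen">hauen</a> ...<variant name="hieb"/></w>',
})
keys = {"body", "bodies", "hauen", "hieb"}
print(lookup("bodies", keys, demo_zip))
print(lookup("hieb", keys, demo_zip))  # None: the sample has no hi.html
```

The second call reproduces the German "hieb" problem: the key exists, but the prefix rule sends the search to the wrong chunk.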

tshering · 11-05-2012, 01:03 PM

Quote:

Originally Posted by AlPe

From what I see, the file "words" contains only the words stored in the dictionary (the "keys"), in several variants --- e.g., singular/plural for the Italian one.

Thank you for sharing. Actually, I just found the same (had to install Ubuntu first). So I was rather close to the truth by saying "all values in the key-dictionary are empty or irrelevant."

Quote:

Originally Posted by AlPe

Hence, file "words" can be used only to know whether a query word is present in the dictionary or not.

They will use it also for populating the list of choices if you start typing in the search field of the dictionary screen.

As for the Japanese dictionary, there seem to be more steps involved for attesting whether a word is present in the dictionary. In Japanese, there are (at least) two ways of writing a word, in Kanji (logographic characters) and in Kana (phonological characters). In the file "words", both kinds of writing an expression are put one after the other (Kana[Kanji]). As I understand marisa, it can find strings that match the search string exactly and strings that start with the search string. So in order to search for a Kanji in "words" it would be necessary to pair the Kanji with the Kana reading first.
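
A small sketch of why the Kana[Kanji] key layout makes Kanji lookups harder. The keys below are made-up samples in the format described above, a Python list stands in for the trie, and the helper names are hypothetical.

```python
def prefix_matches(keys, query):
    """Keys that start with the query, as a prefix search would return them."""
    return [k for k in keys if k.startswith(query)]


def kanji_to_keys(keys):
    """Map each bracketed Kanji spelling to the full Kana[Kanji] keys."""
    table = {}
    for key in keys:
        if key.endswith("]") and "[" in key:
            kana, kanji = key[:-1].split("[", 1)
            table.setdefault(kanji, []).append(key)
    return table


keys = ["かんじ[漢字]", "かんじ[感じ]", "ことば[言葉]"]

print(prefix_matches(keys, "かんじ"))  # Kana query: both readings are found
print(prefix_matches(keys, "漢字"))    # Kanji query: nothing starts with it
print(kanji_to_keys(keys)["漢字"])     # needs a Kanji -> key table first
```

How the Kobo software actually pairs a selected Kanji with its reading is not known from this thread; the reverse table is just one conceivable way.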

Note: tshering's post of 11-04-2012, 11:00 AM was last edited on 11-06-2012 at 08:40 AM. Reason: Some corrections in: "In ha.html of the German dictionary,..."; replaced "from scrap" by "from scratch"
