New dictionary format of firmware 2.14 - Page 3

tshering · 11-05-2012, 01:08 PM

Quote:

Originally Posted by AlPe

From what I see, the file "words" contains only the words stored in the dictionary (the "keys"), in several variants --- e.g., singular/plural for the Italian one.

Thank you for sharing. Actually, I just found the same (had to install Ubuntu first). So I was rather close to the truth by saying "all values in the key-dictionary are empty or irrelevant."

Quote:

Originally Posted by AlPe

Hence, file "words" can be used only to know whether a query word is present in the dictionary or not.

They possibly use it also for populating the list of choices if you start typing in the search field of the dictionary screen.

As for the Japanese dictionary, there seem to be more steps involved in checking whether a word is present in the dictionary. In Japanese, there are (at least) two ways of writing a word: in Kanji (logographic characters) and in Kana (phonological characters). In the file "words", both kinds of writing are put one after the other (Kana[Kanji]). As I understand marisa, it can find strings that match the search string exactly and also strings that start with the search string. So in order to search for a Kanji in "words", it would be necessary to pair the Kanji with the Kana reading first.
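
To illustrate the point about exact versus prefix matching, here is a minimal Python sketch. It does not use marisa: a sorted list searched with `bisect` stands in for the trie, and the three keys are made-up examples in the Kana[Kanji] layout, not entries taken from the real "words" file.

```python
from bisect import bisect_left

# Hypothetical keys in the Kana[Kanji] layout; not copied from a real dictionary.
KEYS = sorted(["た[他]", "た[多]", "た[田]"])


def exact(keys, query):
    """Return True if query is stored as a complete key."""
    i = bisect_left(keys, query)
    return i < len(keys) and keys[i] == query


def with_prefix(keys, prefix):
    """Return every key that starts with prefix (stand-in for a trie prefix search)."""
    out = []
    i = bisect_left(keys, prefix)
    # In a sorted list, all keys sharing a prefix sit next to each other.
    while i < len(keys) and keys[i].startswith(prefix):
        out.append(keys[i])
        i += 1
    return out
```

With these keys, a bare Kanji such as 田 matches nothing (neither exactly nor as a prefix), whereas た[田 does. That is why the Kana reading would have to be known before the lookup.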

AlPe · 11-05-2012, 01:10 PM

Here you go.

I found out strange things in the process, I will elaborate on them after a couple of additional tests.

BTW, it is important that you compress the files as follows:

Code:

$ zip dicthtml-XX.zip *html words

and NOT

Code:

$ zip dicthtml-XX.zip *

because it seems important that words is added as the last file to the ZIP.
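
For those who prefer a script over the command line, the same ordering can be sketched in Python with the standard `zipfile` module. The function name `build_dict_zip` is made up for this example, and using the deflate method is an assumption; the only thing taken from the observation above is that "words" goes in last.

```python
import zipfile
from pathlib import Path


def build_dict_zip(src_dir, zip_path):
    """Zip every *.html file in src_dir, then add "words" as the last member."""
    src = Path(src_dir)
    html_files = sorted(src.glob("*.html"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in html_files:
            zf.write(p, arcname=p.name)
        # The index is written after all the .html chunks.
        zf.write(src / "words", arcname="words")
    return zip_path
```

Listing the archive afterwards with `zipfile.ZipFile(path).namelist()` should show "words" as the final entry.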

It also seems irrelevant whether the "Size" field in the "Dictionary" table of KoboReader.sqlite matches the actual size of your file or not. Similarly, the "LastUpdate" field seems to be ignored. (I guess they use it only for update purposes.)

tshering · 11-05-2012, 03:40 PM

Quote:

Originally Posted by AlPe

I found out strange things in the process, I will elaborate on them after a couple of additional tests.

I'm very curious to hear about those strange things.

tshering · 11-06-2012, 07:23 AM

Quote:

Originally Posted by mnjkl

It seems the new firmware puts the dictionary into .kobo\dict.

Right! If you went from 1.9.17 directly to 2.1.5 (as I understand from your description) it should have been so from that moment. Or did you upgrade in several steps?

tshering · 11-06-2012, 08:35 AM

I think I am now able to build a Japanese dictionary for the KT from scratch. The major advantage over manipulating the original dictionary (e.g., by adding English definitions) is that one can add new entries. This would increase the usability drastically for me. However, the way dictionary queries are handled limits the usefulness of any Japanese dictionary. The developers of the search function concentrated heavily on the Kana version. The problem with this is that Japanese has a lot of homonyms. Two words with different meanings but the same pronunciation can in almost all cases be distinguished by their different Kanji writings, but not if they are written in Kana.
I did a short test. Take for instance the Kanji 他, 多 and 田. They can all have the pronunciation "ta" and therefore the same Kana writing. So what does the KT show if one searches for these Kanji?
1) For 他 (meaning "other") the KT displays the entry for ほかほか, which is pronounced hokahoka and has the meaning "very hot (food)".
2) For both 多 ("many") and 田 ("rice paddy") the KT displays the entry for "た", which is a sort of auxiliary.
In all three cases the wrong identification is caused by the decision to first replace the Kanji with one (of several possible) Kana writings and then search for the Kana.

AlPe · 11-06-2012, 02:58 PM

EDIT: please ignore this post. As explained below, the "strange thing" was due to the timestamping mechanism of gzip.

One strange thing that I noticed is that some of the gzipped chunks in an original Kobo dictionary (the files ending in ".html") seem not to have been compressed with gzip, or they have been altered after the compression.

I tried to uncompress a couple of them and re-compress them with gzip and dictzip. The latter generates a file completely different from the original, so I rule out that Kobo used it for their original files.

On the other hand, re-compressing with gzip leads in 2 cases out of 7 to a file which has some bytes different than the original file. Example:

Original file, taken from the Kobo dictionary:

Code:

$ hexdump -Cv original.si.html | head -n10
00000000  1f 8b 08 08 1d 12 67 50  00 03 73 69 2e 68 74 6d  |......gP..si.htm|
00000010  6c 00 c4 fd 4d af 24 d9  95 25 8a fd 15 2f 0e 14  |l...M.$..%.../..|
00000020  37 d0 9e 41 66 91 d5 5d  95 c1 ce 42 30 f2 43 f1  |7..Af..]...B0.C.|
00000030  94 99 cc ce 48 12 85 16  34 38 d7 dc dc ef 69 9a  |....H...48....i.|
00000040  db 71 9a b9 79 57 dd 81  40 bc 9e 14 a0 1a 74 43  |.q..yW..@.....tC|
00000050  52 a3 89 d7 0d 54 e3 01  7a 64 23 05 09 59 83 04  |R....T..zd#..Y..|
00000060  1f 34 ca 90 06 6c 02 ef  fd 06 f1 97 68 af b5 f7  |.4...l......h...|
00000070  39 76 cc dc cc af f9 cd  a8 16 50 c5 0c f7 eb ee  |9v........P.....|
00000080  66 76 3e f6 d9 1f 6b af  f5 e3 bf fc eb 7d b5 3a  |fv>...k......}.:|
00000090  95 4d eb 43 fd cf bf f7  ee b3 1f 7c 6f 55 d6 45  |.M.C.......|oU.E|

The same file, decompressed and recompressed:

Code:

$ hexdump -Cv si.html | head -n10
00000000  1f 8b 08 08 6e 6a 99 50  00 03 73 69 2e 68 74 6d  |....nj.P..si.htm|
00000010  6c 00 c4 fd 4d af 24 d9  95 25 8a fd 15 2f 0e 14  |l...M.$..%.../..|
00000020  37 d0 9e 41 66 91 d5 5d  95 c1 ce 42 30 f2 43 f1  |7..Af..]...B0.C.|
00000030  94 99 cc ce 48 12 85 16  34 38 d7 dc dc ef 69 9a  |....H...48....i.|
00000040  db 71 9a b9 79 57 dd 81  40 bc 9e 14 a0 1a 74 43  |.q..yW..@.....tC|
00000050  52 a3 89 d7 0d 54 e3 01  7a 64 23 05 09 59 83 04  |R....T..zd#..Y..|
00000060  1f 34 ca 90 06 6c 02 ef  fd 06 f1 97 68 af b5 f7  |.4...l......h...|
00000070  39 76 cc dc cc af f9 cd  a8 16 50 c5 0c f7 eb ee  |9v........P.....|
00000080  66 76 3e f6 d9 1f 6b af  f5 e3 bf fc eb 7d b5 3a  |fv>...k......}.:|
00000090  95 4d eb 43 fd cf bf f7  ee b3 1f 7c 6f 55 d6 45  |.M.C.......|oU.E|

While...

Code:

$ ls -l si.html original.si.html 
-rw-r--r-- 1 xyz xyz 80555 Nov  6 20:52 original.si.html
-rw-r--r-- 1 xyz xyz 80555 Nov  6 20:52 si.html

As you can see, the bytes that differ are in the "header" of the gzip file. Perhaps some particular option of gzip must be invoked when compressing. I am not sure whether this is actually a problem for the functionality of the resulting dictionary, as I have not tested it yet.

PeterT · 11-06-2012, 03:04 PM

When I look at Wikipedia on gzip it mentions the header containing a timestamp.....

Quote:

a 10-byte header, containing a magic number, a version number and a timestamp

Presumably that in itself would explain the difference.

AlPe · 11-06-2012, 03:16 PM

Yeah, I was looking exactly at it...

http://tools.ietf.org/html/rfc1952#section-2.2

The different bytes are exactly the timestamp, which is obviously different --- since I "modified" the file while decompressing-adding .gz-recompressing.

Still, in the process, I learnt that the dictionary was created under *nix, if the OS field was properly set.

Now I wonder what I did to get 5 out of 7 to recompress EXACTLY with the same timestamp... mmm... the MTIME field description says:

Quote:

MTIME (Modification TIME)
This gives the most recent modification time of the original
file being compressed. The time is in Unix format, i.e.,
seconds since 00:00:00 GMT, Jan. 1, 1970. (Note that this
may cause problems for MS-DOS and other systems that use
local rather than Universal time.) If the compressed data
did not come from a file, MTIME is set to the time at which
compression started. MTIME = 0 means no time stamp is
available.

Ok, perhaps I forced gunzip to ignore the fact that si.html did not have .gz extension, and recompress it "as it was", hence retaining the original MTIME.
Edit: CONFIRMED, this was the reason! Meh!
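
The header comparison can also be done programmatically. Below is a small Python sketch that decodes the fixed 10-byte gzip header described in RFC 1952, plus the optional original file name. The function name `parse_gzip_header` is made up for this example; the two byte strings are the first 18 bytes of the two hexdumps shown earlier in the thread.

```python
import struct


def parse_gzip_header(data):
    """Decode the fixed gzip header (RFC 1952) and the optional FNAME field."""
    # ID1+ID2, CM, FLG, MTIME, XFL, OS: 10 bytes, little-endian.
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<HBBIBB", data[:10])
    if magic != 0x8B1F:
        raise ValueError("not a gzip stream")
    pos = 10
    if flags & 0x04:  # FEXTRA: 2-byte length followed by that many bytes
        (xlen,) = struct.unpack("<H", data[pos:pos + 2])
        pos += 2 + xlen
    name = None
    if flags & 0x08:  # FNAME: zero-terminated original file name
        end = data.index(b"\x00", pos)
        name = data[pos:end].decode("latin-1")
    return {"mtime": mtime, "os": os_id, "name": name}


# First 18 bytes of original.si.html and of the recompressed si.html.
ORIGINAL = bytes.fromhex("1f8b08081d126750 0003 73692e68746d6c00")
RECOMPRESSED = bytes.fromhex("1f8b08086e6a9950 0003 73692e68746d6c00")
```

Calling `parse_gzip_header` on both shows that only the `mtime` value differs; the stored name is "si.html" in both cases, and the OS byte is 3, which RFC 1952 assigns to Unix.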

AlPe · 11-07-2012, 04:40 PM

Ok, here is the recipe for creating your own dictionary.

1) Create a working directory, say /tmp/mydict/

2) In /tmp/mydict/, create as many XX.html files as are needed, where XX are the first two letters of each word being defined in XX.html. Use 11.html for all words that do not start with a letter. The syntax for each such .html file is as follows:

Code:

<?xml version="1.0" encoding="utf-8"?>
<html>
<w>
<p><a name="WORD"/>DEFINITION OF WORD, you <b>can</b> use HTML tags. </p>
</w>
</html>

with as many <w> elements as needed. You can also use variants, see one of the original files for that.

3) gzip all these .html files individually, removing the ".gz" extension after compressing them

4) build a text file with one word per line, say "index.txt", and create the index with:

Code:

$ ./marisa-build index.txt > words

5) compress the whole thing with:

Code:

$ zip ../dicthtml-LL.zip *html words

where LL is the dictionary language (LL="en" or "it" or "fr" etc.)

6) copy the resulting zip file to .kobo/dict/ and you are done!

(You might want to change the dictionary to another one and then back to the newly created, so that the index "words" is reloaded.)
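
Steps 2 and 3 of the recipe can be sketched in Python as follows. This is only an illustration under several assumptions: the function names `prefix_for` and `write_chunks` are made up, the prefix is lowercased, and nothing special is done for one-letter words or for words whose second character is not a letter, since the recipe does not say how those are handled. The "words" index still has to be built with marisa-build as in step 4.

```python
import gzip
from collections import defaultdict
from html import escape
from pathlib import Path


def prefix_for(word):
    """First two letters of the word, or "11" if it does not start with a letter."""
    if not word[:1].isalpha():
        return "11"
    return word[:2].lower()  # lowercasing is an assumption


def write_chunks(entries, out_dir):
    """entries maps word -> definition (HTML allowed). Writes gzipped XX.html files."""
    groups = defaultdict(list)
    for word, definition in sorted(entries.items()):
        groups[prefix_for(word)].append((word, definition))
    out = Path(out_dir)
    for prefix, items in groups.items():
        body = "".join(
            '<w>\n<p><a name="%s"/>%s</p>\n</w>\n' % (escape(w, quote=True), d)
            for w, d in items
        )
        doc = '<?xml version="1.0" encoding="utf-8"?>\n<html>\n' + body + "</html>\n"
        # Write the gzip data directly under the .html name, so there is no ".gz" to strip.
        (out / (prefix + ".html")).write_bytes(gzip.compress(doc.encode("utf-8")))
    return sorted(groups)
```

Note that `gzip.compress` does not store the original file name in the header, unlike the original Kobo chunks; whether the reader cares about that is untested here.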

ShellShock · 11-13-2012, 02:29 PM

AlPe, I have tried following your instructions, but purely within Windows. So I have built marisa 0.2.0 using Visual Studio 2008. I have created an aa.html file like this:

Code:

<?xml version="1.0" encoding="utf-8"?>
<html>
<w>
<p><a name="Aardvark"/>A South American animal.</p>
</w>
<w>
<p><a name="Aardman"/>Animators.</p>
</w>
</html>

And my index.txt file looks like this:

Code:

Aardvark
Aardman

I used Notepad++ to create these files, in Dos\Windows ANSI format.

I used 7zip to gzip the aa.html file:

7z a -tgzip aa.html src\aa.html

I used marisa-build to create my words file:

marisa-build src\index.txt > words

And then I used 7Zip to zip the aa.html (gzip format) and words:

7z a -tzip dicttest.zip aa.html words

I put the dicttest.zip into .kobo\dict, but when I go into the Settings, Language, Dictionary, Edit, my dictionary is not listed. Any thoughts?

AlPe · 11-13-2012, 02:38 PM

Either you replace (create a backup copy first!) an existing dictionary (and then it suffices to call it dicthtml-LL.zip, with a suitable LL string, as explained in Step 5) or you need to edit the Dictionary table in .kobo/KoboReader.sqlite indicating the new name, I guess.

BTW, the default monolingual English dictionary is named "dicthtml.zip".

tshering · 11-13-2012, 02:45 PM

Quote:

Originally Posted by ShellShock

7z a -tzip dicttest.zip aa.html words
I put the dicttest.zip into .kobo\dict, but when I go into the Settings, Language, Dictionary, Edit, my dictionary is not listed. Any thoughts?

The name of the zip file must be one of those that Kobo already uses, for instance dicthtml.zip in the case of an English-English dictionary. Using a different name seems not to work (link).

ShellShock · 11-13-2012, 02:52 PM

Ah I see...well it half worked! I don't think my index is working, because it seems to still be using the index of the original dictionary, but when I look up Aardvark, I get my definition, and not the original.

Thinks....but I see I need to switch dictionaries for the new index to kick in...

AlPe · 11-13-2012, 02:58 PM

Quote:

Originally Posted by ShellShock

Ah I see...well it half worked! I don't think my index is working, because it seems to still be using the index of the original dictionary, but when I look up Aardvark, I get my definition, and not the original.

Thinks....but I see I need to switch dictionaries for the new index to kick in...

Yeah, that was what I warned about:

Quote:

(You might want to change the dictionary to another one and then back to the newly created, so that the index "words" is reloaded.)

AlPe · 11-13-2012, 03:00 PM

Quote:

Originally Posted by tshering

The name of the zip file must be one of those that Kobo already uses, for instance dicthtml.zip in the case of an English-English dictionary. Using a different name seems not to work (link).

I guess that editing the KoboReader.sqlite file will make them work with arbitrary names. I do not have my Kobo with me to confirm, though.


