#31 | ||
Wizard
Posts: 3,489
Karma: 2914715
Join Date: Jun 2012
Device: kobo touch
|
Quote:
Quote:
As for the Japanese dictionary, there seem to be more steps involved in checking whether a word is present. In Japanese, there are (at least) two ways of writing a word: in Kanji (logographic characters) and in Kana (phonetic characters). In the file "words", both writings are stored one after the other, as Kana[Kanji]. As I understand marisa, it can find strings that match the search string exactly and also strings that start with the search string. So in order to search for a Kanji in "words", it would be necessary to pair the Kanji with its Kana reading first. |
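To illustrate why a bare Kanji cannot be found in an index keyed as Kana[Kanji], here is a minimal Python sketch. It does not use marisa itself: it emulates a trie's prefix ("predictive") search with a sorted list, and the four keys are hypothetical entries made up for the demonstration, not taken from the real "words" file.

```python
from bisect import bisect_left


def predictive_search(sorted_keys: list[str], prefix: str) -> list[str]:
    """Return every key that equals `prefix` or starts with it.

    In a sorted list, all keys sharing a prefix are contiguous and begin
    at the insertion point of the prefix itself.
    """
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


# Hypothetical index entries in the Kana[Kanji] layout (assumption).
keys = sorted(["た[他]", "た[多]", "た[田]", "たい[鯛]"])

by_kana = predictive_search(keys, "た")       # kana prefix: every key matches
by_kanji = predictive_search(keys, "田")      # bare kanji: no key starts with it
by_pair = predictive_search(keys, "た[田]")   # kana paired with kanji: one key
```

Because every key begins with the Kana reading, only a query that already contains that reading can reach the entry for a specific Kanji.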
||
#32 |
Digital Amanuensis
Posts: 727
Karma: 1446357
Join Date: Dec 2011
Location: Turin, Italy
Device: Several eReaders and tablets
|
Here you go.
I found some strange things in the process; I will elaborate on them after a couple of additional tests.
BTW, it is important that you compress the files as follows:
Code:
$ zip dicthtml-XX.zip *html words
Code:
$ zip dicthtml-XX.zip *
It also seems irrelevant whether the "Size" field in the "Dictionary" table of KoboReader.sqlite matches the actual size of your file. Similarly, the "LastUpdate" field seems to be ignored (I guess they use it only for update purposes).
Last edited by AlPe; 11-05-2012 at 01:14 PM. |
#33 |
Wizard
Posts: 3,489
Karma: 2914715
Join Date: Jun 2012
Device: kobo touch
|
|
#34 |
Wizard
Posts: 3,489
Karma: 2914715
Join Date: Jun 2012
Device: kobo touch
|
Right! If you went from 1.9.17 directly to 2.1.5 (as I understand from your description) it should have been so from that moment. Or did you upgrade in several steps?
|
#35 |
Wizard
Posts: 3,489
Karma: 2914715
Join Date: Jun 2012
Device: kobo touch
|
I think I am now able to build a Japanese dictionary for the KT from scratch. The major advantage over manipulating the original dictionary (e.g., by adding English definitions) is that one can add new entries. For me, this would increase usability drastically. However, the way dictionary queries are handled limits the usefulness of any Japanese dictionary. The developers of the search function concentrated heavily on the Kana version. The problem is that Japanese has a lot of homonyms. Two words with different meanings but the same pronunciation can in almost all cases be distinguished by their Kanji writings, but not if they are written in Kana.
I did a short test. Take for instance the Kanji 他, 多 and 田. They can all be pronounced "ta" and therefore share the same Kana writing. So what does the KT show if one searches for these Kanji?
1) For 他 (meaning "other") the KT displays the entry for ほかほか, which is pronounced hokahoka and means "very hot (food)".
2) For both 多 ("many") and 田 ("rice paddy") the KT displays the entry for "た", which is a sort of auxiliary.
In all three cases the wrong identification is caused by the decision to first replace the Kanji with one (of several possible) Kana writings and then search for the Kana. |
#36 |
Digital Amanuensis
Posts: 727
Karma: 1446357
Join Date: Dec 2011
Location: Turin, Italy
Device: Several eReaders and tablets
|
EDIT: please ignore this post. As explained below, the "strange thing" was due to the timestamping mechanism of gzip.
One strange thing that I noticed is that some of the gzipped chunks in an original Kobo dictionary (the files ending in ".html") seem not to have been compressed with gzip, or to have been altered after compression. I tried to uncompress a couple of them and re-compress them with gzip and with dictzip. The latter generates a file completely different from the original, so I rule out that Kobo used it for their original files. On the other hand, re-compressing with gzip leads in 2 cases out of 7 to a file in which some bytes differ from the original.
Example:
Original file, taken from the Kobo dictionary:
Code:
$ hexdump -Cv original.si.html | head -n10
00000000  1f 8b 08 08 1d 12 67 50  00 03 73 69 2e 68 74 6d  |......gP..si.htm|
00000010  6c 00 c4 fd 4d af 24 d9  95 25 8a fd 15 2f 0e 14  |l...M.$..%.../..|
00000020  37 d0 9e 41 66 91 d5 5d  95 c1 ce 42 30 f2 43 f1  |7..Af..]...B0.C.|
00000030  94 99 cc ce 48 12 85 16  34 38 d7 dc dc ef 69 9a  |....H...48....i.|
00000040  db 71 9a b9 79 57 dd 81  40 bc 9e 14 a0 1a 74 43  |.q..yW..@.....tC|
00000050  52 a3 89 d7 0d 54 e3 01  7a 64 23 05 09 59 83 04  |R....T..zd#..Y..|
00000060  1f 34 ca 90 06 6c 02 ef  fd 06 f1 97 68 af b5 f7  |.4...l......h...|
00000070  39 76 cc dc cc af f9 cd  a8 16 50 c5 0c f7 eb ee  |9v........P.....|
00000080  66 76 3e f6 d9 1f 6b af  f5 e3 bf fc eb 7d b5 3a  |fv>...k......}.:|
00000090  95 4d eb 43 fd cf bf f7  ee b3 1f 7c 6f 55 d6 45  |.M.C.......|oU.E|
Re-compressed file:
Code:
$ hexdump -Cv si.html | head -n10
00000000  1f 8b 08 08 6e 6a 99 50  00 03 73 69 2e 68 74 6d  |....nj.P..si.htm|
00000010  6c 00 c4 fd 4d af 24 d9  95 25 8a fd 15 2f 0e 14  |l...M.$..%.../..|
00000020  37 d0 9e 41 66 91 d5 5d  95 c1 ce 42 30 f2 43 f1  |7..Af..]...B0.C.|
00000030  94 99 cc ce 48 12 85 16  34 38 d7 dc dc ef 69 9a  |....H...48....i.|
00000040  db 71 9a b9 79 57 dd 81  40 bc 9e 14 a0 1a 74 43  |.q..yW..@.....tC|
00000050  52 a3 89 d7 0d 54 e3 01  7a 64 23 05 09 59 83 04  |R....T..zd#..Y..|
00000060  1f 34 ca 90 06 6c 02 ef  fd 06 f1 97 68 af b5 f7  |.4...l......h...|
00000070  39 76 cc dc cc af f9 cd  a8 16 50 c5 0c f7 eb ee  |9v........P.....|
00000080  66 76 3e f6 d9 1f 6b af  f5 e3 bf fc eb 7d b5 3a  |fv>...k......}.:|
00000090  95 4d eb 43 fd cf bf f7  ee b3 1f 7c 6f 55 d6 45  |.M.C.......|oU.E|
Both files have the same size:
Code:
$ ls -l si.html original.si.html
-rw-r--r-- 1 xyz xyz 80555 Nov 6 20:52 original.si.html
-rw-r--r-- 1 xyz xyz 80555 Nov 6 20:52 si.html
Last edited by AlPe; 11-06-2012 at 03:30 PM. |
#37 | |
Grand Sorcerer
Posts: 13,344
Karma: 78876004
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
When I look at Wikipedia on gzip it mentions the header containing a timestamp.....
Quote:
|
|
#38 | |
Digital Amanuensis
Posts: 727
Karma: 1446357
Join Date: Dec 2011
Location: Turin, Italy
Device: Several eReaders and tablets
|
Yeah, I was looking exactly at it...
http://tools.ietf.org/html/rfc1952#section-2.2
The differing bytes are exactly the timestamp, which is obviously different, since I "modified" the file while decompressing, adding .gz, and recompressing. Still, in the process I learnt that the dictionary was created under *nix, if the OS field was properly set.
Now I wonder what I did to get 5 out of 7 to recompress EXACTLY with the same timestamp... mmm... the MTIME field description says:
Quote:
Edit: CONFIRMED, this was the reason! Meh!
Last edited by AlPe; 11-06-2012 at 03:29 PM. |
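For anyone who wants to check this without reading hex by eye, here is a small Python sketch that decodes the fixed gzip header fields defined in RFC 1952. The two byte strings are the first 18 bytes of the hexdumps posted above; the function name and the dictionary keys are just illustrative choices.

```python
import struct


def parse_gzip_header(data: bytes) -> dict:
    """Decode the fixed 10-byte gzip member header (RFC 1952)."""
    # ID1, ID2, CM, FLG are single bytes; MTIME is a little-endian uint32;
    # XFL and OS are single bytes again.
    id1, id2, method, flags, mtime, xfl, os_id = struct.unpack("<BBBBIBB", data[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    info = {"method": method, "flags": flags, "mtime": mtime, "os": os_id}
    # FNAME (bit 3): a zero-terminated original file name follows the header,
    # as long as no FEXTRA field (bit 2) comes first.
    if flags & 0x08 and not flags & 0x04:
        end = data.index(b"\x00", 10)
        info["name"] = data[10:end].decode("latin-1")
    return info


# First 18 bytes of original.si.html and of the re-compressed si.html.
original = parse_gzip_header(bytes.fromhex("1f8b08081d126750000373692e68746d6c00"))
recompressed = parse_gzip_header(bytes.fromhex("1f8b08086e6a9950000373692e68746d6c00"))

# Names of the fields that differ between the two headers.
differing = [key for key in original if original[key] != recompressed[key]]
```

Only `mtime` ends up in `differing`, and the OS byte is 3 in both headers, which RFC 1952 assigns to Unix.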
|
#39 |
Digital Amanuensis
Posts: 727
Karma: 1446357
Join Date: Dec 2011
Location: Turin, Italy
Device: Several eReaders and tablets
|
Ok, here is the recipe for creating your own dictionary.
1) Create a working directory, say /tmp/mydict/
2) In /tmp/mydict/, create as many XX.html files as are needed, where XX are the first two letters of the words defined in XX.html. Use 11.html for all words that do not start with a letter. The syntax for each such .html file is as follows:
Code:
<?xml version="1.0" encoding="utf-8"?>
<html>
<w>
<p><a name="WORD"/>DEFINITION OF WORD, you <b>can</b> use HTML tags. </p>
</w>
</html>
3) gzip all these .html files individually, removing the ".gz" extension after compressing them
4) build a text file with one word per line, say "index.txt", and create the index with:
Code:
$ ./marisa-build index.txt > words
5) zip the .html files and the "words" index into dicthtml-LL.zip, with a suitable LL string:
Code:
$ zip ../dicthtml-LL.zip *html words
6) copy the resulting zip file to .kobo/dict/ and you are done! (You might want to switch to another dictionary and then back to the newly created one, so that the index "words" is reloaded.)
Last edited by AlPe; 11-07-2012 at 04:42 PM. |
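The recipe above can also be sketched in Python, which may help on systems without the gzip and zip command-line tools. This is only a sketch under assumptions: the function names are made up, prefixes are lowercased, a word is sent to 11.html unless its first two characters are both letters, words are written into the name attribute unescaped, and the archive uses deflate compression like the default of the zip command. The "words" file still has to be produced by marisa-build from index.txt.

```python
import gzip
import zipfile
from collections import defaultdict
from pathlib import Path


def prefix_for(word: str) -> str:
    """File prefix for a word: its first two letters, or "11" otherwise.

    Assumption: lowercased, and both leading characters must be letters.
    """
    head = word[:2].lower()
    return head if len(head) == 2 and head.isalpha() else "11"


def render_chunk(entries: dict[str, str]) -> str:
    """Render one XX.html file in the layout shown in step 2."""
    blocks = [
        f'<w>\n<p><a name="{word}"/>{definition}</p>\n</w>'
        for word, definition in entries.items()
    ]
    header = '<?xml version="1.0" encoding="utf-8"?>\n<html>\n'
    return header + "\n".join(blocks) + "\n</html>\n"


def write_chunks(entries: dict[str, str], workdir: Path) -> list[str]:
    """Steps 1 to 4, minus marisa-build: gzipped chunks plus index.txt."""
    workdir.mkdir(parents=True, exist_ok=True)
    groups: dict[str, dict[str, str]] = defaultdict(dict)
    for word, definition in entries.items():
        groups[prefix_for(word)][word] = definition
    for prefix, group in groups.items():
        data = render_chunk(group).encode("utf-8")
        # gzip-compressed content stored under a plain ".html" name (step 3).
        (workdir / f"{prefix}.html").write_bytes(gzip.compress(data))
    index = "\n".join(entries) + "\n"
    (workdir / "index.txt").write_text(index, encoding="utf-8")
    return sorted(groups)


def package(workdir: Path, zip_path: Path) -> None:
    """Step 5: zip the chunks together with the marisa-built "words" file."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for chunk in sorted(workdir.glob("*.html")):
            archive.write(chunk, chunk.name)
        archive.write(workdir / "words", "words")
```

Typical use would be to call `write_chunks(...)`, run `marisa-build index.txt > words` inside the working directory, and then call `package(...)` with a target such as dicthtml-LL.zip.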
#40 |
Wizard
Posts: 1,178
Karma: 2431850
Join Date: Sep 2008
Device: IPad Mini 2 Retina
|
AlPe, I have tried following your instructions, but purely within Windows. So I have built marisa 0.2.0 using Visual Studio 2008. I have created an aa.html file like this:
Code:
<?xml version="1.0" encoding="utf-8"?>
<html>
<w>
<p><a name="Aardvark"/>A South American animal.</p>
</w>
<w>
<p><a name="Aardman"/>Animators.</p>
</w>
</html>
And an index.txt file like this:
Code:
Aardvark
Aardman
I used 7zip to gzip the aa.html file:
7z a -tgzip aa.html src\aa.html
I used marisa-build to create my words file:
marisa-build src\index.txt > words
And then I used 7zip to zip the aa.html (gzip format) and words:
7z a -tzip dicttest.zip aa.html words
I put the dicttest.zip into .kobo\dict, but when I go into Settings, Language, Dictionary, Edit, my dictionary is not listed. Any thoughts? |
#41 |
Digital Amanuensis
Posts: 727
Karma: 1446357
Join Date: Dec 2011
Location: Turin, Italy
Device: Several eReaders and tablets
|
Either you replace (create a backup copy first!) an existing dictionary (and then it suffices to call it dicthtml-LL.zip, with a suitable LL string, as explained in Step 5) or you need to edit the Dictionary table in .kobo/KoboReader.sqlite indicating the new name, I guess.
BTW, the default monolingual English dictionary is named "dicthtml.zip". |
#42 | |
Wizard
Posts: 3,489
Karma: 2914715
Join Date: Jun 2012
Device: kobo touch
|
Quote:
Last edited by tshering; 11-13-2012 at 02:48 PM. |
|
#43 |
Wizard
Posts: 1,178
Karma: 2431850
Join Date: Sep 2008
Device: IPad Mini 2 Retina
|
Ah I see...well it half worked! I don't think my index is working, because it seems to still be using the index of the original dictionary, but when I look up Aardvark, I get my definition, and not the original.
Thinks....but I see I need to switch dictionaries for the new index to kick in... |
#44 | ||
Digital Amanuensis
Posts: 727
Karma: 1446357
Join Date: Dec 2011
Location: Turin, Italy
Device: Several eReaders and tablets
|
Quote:
Quote:
|
||
#45 | |
Digital Amanuensis
Posts: 727
Karma: 1446357
Join Date: Dec 2011
Location: Turin, Italy
Device: Several eReaders and tablets
|
Quote:
|
|