EDIT: I wrote this at the same time as Tshering, essentially saying the same thing.
Quote:
Originally Posted by Simpetus
Anyways, thanks to another user (tshering) I found out the reason Kobo's system keeps saying that there is no dictionary installed - the number of html files in the zip archive of the final version of the custom Chinese dictionary is too high - almost 120 000 html files.
That's because the reverse engineering effort started from the Kobo dictionaries for Latin scripts. In those, each .html entry inside the ZIP container (actually an HTML file that has been GZIPed and renamed to drop the .gz extension) contains all the words starting with the same 2-character prefix. Plus, there is a special rule for 1-character words. See:
https://github.com/pettarin/penelope...t_kobo.py#L132
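To make the scheme concrete, here is a minimal sketch of that packing step (not Penelope's actual code; the 1-character rule and the per-word markup are simplified guesses, see the linked install_kobo.py for the real logic):

import gzip
import zipfile
from collections import defaultdict

def pack_dictionary(entries, output_path, prefix_length=2):
    """entries: dict mapping each headword to its HTML definition fragment."""
    groups = defaultdict(list)
    for word, definition in entries.items():
        # Group by the first `prefix_length` characters. Here 1-character
        # words simply fall into the group of that single character; the
        # real special rule may differ.
        groups[word[:prefix_length]].append((word, definition))

    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_STORED) as container:
        for prefix, words in sorted(groups.items()):
            # The exact per-word markup Kobo expects is simplified here.
            body = "".join(
                '<w><a name="%s" />%s</w>' % (word, definition)
                for word, definition in sorted(words)
            )
            html = "<html><body>%s</body></html>" % body
            # Each entry is a GZIPed HTML file stored *without* the .gz
            # extension, e.g. "ab.html".
            container.writestr("%s.html" % prefix,
                               gzip.compress(html.encode("utf-8")))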
Now, when I made Penelope I assumed that was the Kobo convention. Apparently, as Tshering's dictionary shows, the Kobo firmware seems to handle an arbitrary partitioning of the index into prefixes (.html files).
Of course grouping by the first character only leads to fewer .html files than grouping by the first two characters. And this is especially true for languages like Chinese or Japanese where there are thousands of "characters".
When you look up a word using a Kobo dictionary, there are three steps (sketched in code after the list):
1. lookup in the MARISA trie ("words"), which essentially decides whether the word is present in the dictionary; assuming the word is present:
2. locating the .html file inside the ZIP container which contains the definition;
3. gunzipping the .html file, parsing the uncompressed HTML contents, and locating the actual definition to show to the user.
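Here is a rough sketch of those three steps, assuming the layout described above (a "words" MARISA trie plus GZIPed .html entries in the same ZIP container); the function name and the prefix rule are hypothetical:

import gzip
import os
import tempfile
import zipfile

import marisa_trie  # pip install marisa-trie

def lookup(dictzip_path, word, prefix_length=2):
    with zipfile.ZipFile(dictzip_path) as container:
        # 1. check the "words" MARISA trie: is the word indexed at all?
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(container.read("words"))
            trie_path = tmp.name
        trie = marisa_trie.Trie()
        trie.load(trie_path)
        os.unlink(trie_path)
        if word not in trie:
            return None

        # 2. locate the .html entry that should hold the definition, using
        #    the same prefix rule that was used when the dictionary was built
        entry_name = "%s.html" % word[:prefix_length]

        # 3. gunzip that entry; real code would then parse the HTML and pull
        #    out only this word's block to show to the user
        return gzip.decompress(container.read(entry_name)).decode("utf-8")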
The splitting policy essentially decides a tradeoff between 2. and 3. If you group by 1-character prefixes, you will have a faster 2., but a slower 3., than grouping by 2-character prefixes. I guess the 2-character prefix rule followed by Kobo for Latin-script languages is the result of some experimentation, and they found it to be optimal for those scripts. Since grouping by a k-character prefix leads to on the order of n^k .html files (where n is the number of base characters), it works for Latin-script languages, as n ~ 30, k = 2 => ~900 pairs (usually much fewer, since not all possible prefixes occur in the language). As mentioned above, k = 2 leads to an explosion for languages like Chinese or Japanese.
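If you want to check this on a concrete word list, a quick way (hypothetical helper, not part of Penelope) is to count the distinct prefixes directly:

def count_prefix_groups(words, prefix_length):
    # number of .html files the grouping would produce for this word list
    return len({w[:prefix_length] for w in words})

For a typical English headword list this gives a few hundred groups with prefix_length=2, while for a Chinese headword list even prefix_length=1 can already yield thousands of groups.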
Quote:
Originally Posted by Simpetus
So, even though you mentioned that there seem to be no restrictions on file size, there are some on the number of html files in the archive.
It might be related to the fact that a ZIP archive can have a maximum number of entries (for instance, a plain non-ZIP64 archive is limited to 65,535 entries, which ~120 000 files would exceed) and/or it might be due to a limitation in the ZIP library that Kobo is using. Again, without documentation it is not really possible to know, other than by building increasingly large dictionaries (with an increasing number of .html files) and finding out the number of .html entries that makes the dictionary stop working.
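As a rough sketch of that experiment, something like the following could generate throwaway archives of increasing size (hypothetical helper; note that a real Kobo dictionary also needs the "words" MARISA trie and proper per-word markup, which are omitted here, so this only probes the raw entry-count side):

import gzip
import zipfile

def make_test_archive(path, num_entries):
    # write num_entries dummy GZIPed .html files into a single ZIP container
    with zipfile.ZipFile(path, "w", allowZip64=True) as container:
        for i in range(num_entries):
            html = "<html><body>dummy entry %d</body></html>" % i
            container.writestr("%06d.html" % i,
                               gzip.compress(html.encode("utf-8")))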
Quote:
Originally Posted by Simpetus
Later I extracted all files from the archive, deleted almost all of them, packed it again, and put it back as the dictionary - it worked.
Any workarounds? tshering suggested putting all dictionary articles starting with similar characters into one html file, combining them, but:
1. I do not know how to automate the process;
2. and if I do it manually, with 120 000 html files in the archive it will take an ungodly amount of time.
Currently Penelope uses the "two characters" rule described above. But I can easily add a command line option to let the user specify the length of the prefix used for grouping. I'll do it during the weekend.