11-22-2012, 06:38 AM   #79
tshering
Wizard
 
Posts: 3,489
Karma: 2914715
Join Date: Jun 2012
Device: kobo touch
Quote:
Originally Posted by shutramp
So if I may ask about a Chinese version again: if I understood tshering's explanations correctly and did not misread the earlier posts about dict-crafting, a dict of the size of CC-CEDICT, with its more than 100,000 entries, would simply be overkill, at least with the current Kobo dict system.
The problem is not the 100,000 entries. The problem is the great number of different characters. According to the current rules, the name of an html file must consist of exactly two characters (the first two characters of the definiendum/definienda contained in the file, or the first character plus "a" if the definiendum consists of only one character). I skip the finer details here. The CJK block of Unicode contains more than 20,000 characters. Even granting that not every combination of two characters actually occurs in the language, the resulting number of files needed is still rather large.
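A minimal Python sketch of that naming rule (html_filename is just an illustrative name, not part of any Kobo tool, and the finer details are skipped as above):

Code:
# Sketch of the two-character file-naming rule described above.
def html_filename(headword):
    if len(headword) >= 2:
        prefix = headword[:2]    # first two characters of the definiendum
    else:
        prefix = headword + "a"  # one-character definiendum: append "a"
    return prefix + ".html"

print(html_filename("字典"))  # -> 字典.html
print(html_filename("字"))    # -> 字a.html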


Quote:
Originally Posted by shutramp
But I wonder if one could try to compile a very, very basic dict for Chinese with only a few thousand html entries/files. Let's say single words/characters only, like "字", which would then be a "字.html" file using the actual character as the file name. And there would be a file for every single character, like "子.html", "字.html", "自.html", and so on. And every single file would contain only that one character's definitions, right? So every html file itself would be very small too.

This would allow one to at least look up the spelling and meaning of the few thousand basic single Chinese characters. And it would probably have a reasonable size. Or am I mistaken?
This should be possible. The files have to be named "字a.html" and so on. 20,000 files should not be a problem: the maximum number of files in a zip file is about 65,500 (65,535 in the classic zip format), and I made a test with 14,000 files, which worked perfectly.
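A hedged sketch of how such a single-character dictionary could be packed. The definitions dict here is a made-up two-entry sample, the output file name only follows the dicthtml-fr renaming trick discussed below, and any extra packaging the Kobo format expects beyond a plain zip is omitted:

Code:
# Pack one html file per character into the dictionary zip.
import zipfile

definitions = {
    "字": "<p>zi4: letter, character, word</p>",
    "子": "<p>zi3: child, son</p>",
}

assert len(definitions) <= 65535  # classic zip file-count limit

with zipfile.ZipFile("dicthtml-fr.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for char, html in definitions.items():
        zf.writestr(char + "a.html", html)  # single character plus "a"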
I plan to split the Japanese dictionary into two parts. The first dictionary will contain words that start with one of the 2,136 most commonly used kanji (resulting in 64,032 files); the second will contain words that start with kana and, depending on the number of resulting files, maybe some more kanji. For Chinese you might do a similar split. The sizes of the individual files are not a big problem in this case: the maximum size of a zip file is 4 gigabytes.
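A minimal sketch of that split (COMMON_CHARS is only a two-entry placeholder; the real list of the 2,136 common kanji, or a hanzi frequency list for Chinese, would be loaded from a file):

Code:
# Partition headwords by whether their first character is "common".
COMMON_CHARS = {"字", "子"}  # placeholder for the real character list

def split_headwords(headwords):
    part1, part2 = [], []
    for word in headwords:
        if word[0] in COMMON_CHARS:
            part1.append(word)  # goes into the first dictionary
        else:
            part2.append(word)  # kana-initial or rarer characters
    return part1, part2

part1, part2 = split_headwords(["字典", "子供", "かな"])
print(len(part1), len(part2))  # -> 2 1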

Quote:
Originally Posted by shutramp
And hopefully the matching/definition pop-up would work just fine by means of the underlying Unicode/UTF-8 encoding of every character? I know that the selecting and looking up works, because I have tried it, but right now it yields no definition. Oh, and I would probably need to somehow give the ebook, let's say, French as its default language if the Chinese dict were named dicthtml-fr, to get the correct pop-up result, wouldn't I?
I guess you are right. One more thing: I am not sure whether all Chinese characters are supported by the Japanese fonts that the Touch/Glow uses; I would rather doubt it. However, this does not seem to be a major problem as long as one can read the English definition.