View Full Version : dictionaries and languages


Jellby
11-21-2008, 02:27 PM
Hi all,

I have two French dictionaries in mobipocket format, and some French books in my Cybook. The problem is when I look up a word in a French book, only one of the dictionaries is searched, while the other seems to be used for English books (I've had some matches from this other dictionary when looking up a word in English).

I understand this is probably a matter of language settings, however both dictionaries seem to have the same language, according to mobi2mobi:

MOBIHEADER language: 1036 - 12 - 1 - FRENCH -

while the French books are created with html2mobi, and have:

MOBIHEADER language: 12 - 12 - 0 - FRENCH -

I didn't find any significant difference between the two dictionaries. The "right" one seems to be utf8-encoded, larger, and I have it on the SD card. The "wrong" one is latin1-encoded, smaller, and it's in the main Cybook memory. (I still haven't tried moving them.) Any ideas on how to fix the "wrong" one?

tompe
11-21-2008, 04:06 PM
The language code setting consists of two parts. The first (12) is the main language. The second part is the "sub language"; for French the values are:

1 => "FRENCH",
2 => "FRENCH_BELGIAN",
3 => "FRENCH_CANADIAN",
4 => "FRENCH_SWISS",
5 => "FRENCH_LUXEMBOURG",
6 => "FRENCH_MONACO",

The printout from mobi2mobi is "code - main language - sub language".
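As a rough illustration of that printout, here is a minimal Perl sketch that splits the two values reported earlier in the thread (1036 and 12). The sub name split_language_code and the hash %french_sublang are hypothetical; the bit layout (low byte for the main language, value shifted right by 10 for the sub language) is assumed from the mobi2mobi parser quoted later in this thread.

```perl
use strict;
use warnings;

# Sub-language table for French, copied from the post above.
my %french_sublang = (
    1 => "FRENCH",
    2 => "FRENCH_BELGIAN",
    3 => "FRENCH_CANADIAN",
    4 => "FRENCH_SWISS",
    5 => "FRENCH_LUXEMBOURG",
    6 => "FRENCH_MONACO",
);

# Split a MOBI header language value into (main language, sub language).
# Assumption: same masks and shifts as the mobi2mobi snippet in this thread.
sub split_language_code {
    my ($code) = @_;
    my $main = $code & 0xFF;            # low byte: main language (12 = French)
    my $sub  = ($code >> 10) & 0xFF;    # bits 10 and up: sub language
    return ($main, $sub);
}

for my $code (1036, 12) {
    my ($main, $sub) = split_language_code($code);
    my $name = $french_sublang{$sub} // "(no entry in table)";
    print "$code - $main - $sub - $name\n";
}
```

With these assumptions, 1036 splits into main language 12 and sub language 1, while 12 splits into main language 12 and sub language 0, which is the same pair of triples mobi2mobi reported for the dictionaries and the html2mobi books.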

DaleDe
11-21-2008, 06:24 PM
A problem related to formats or the Cybook would probably be better handled in the Cybook group, but as I recall the Cybook needs the dictionary to be in the main memory, not on a card.

Dale

AZed
11-23-2008, 12:43 PM
The language code setting consists of two parts. The first (12) is the main language. The second part is the "sub language"; for French the values are:

1 => "FRENCH",
2 => "FRENCH_BELGIAN",
3 => "FRENCH_CANADIAN",
4 => "FRENCH_SWISS",
5 => "FRENCH_LUXEMBOURG",
6 => "FRENCH_MONACO",

The printout from mobi2mobi is "code - main language - sub language".

Huh, you're stripping the bottom two bits from the region code in that example. I'd noticed that they always seemed to be zero, but I hadn't found any particular reason to separate them. I see that the mobi2mobi output is still using the full byte, however.

I also note that the language code of '1036' isn't even valid, and the number makes me think that mobi2mobi has a bad language parser -- 1036 breaks down into 1024+12, meaning that the parser is pulling more than one byte for the language code, and not correctly separating the unknown value. Language code 12, region code 12 is "French (Canada)", however.

I'm more interested by the fact that there is a nonzero unknown value at all, though. Where did you obtain this e-book, and is it freely redistributable (or at least cheap)? I'd be interested in seeing what the EBook::Tools (http://search.cpan.org/dist/EBook-Tools/) parser makes of it. My offhanded guess is that while the main language is set correctly on one of the dictionaries, the dictionary language values are wrong. (There are actually three language codes embedded -- one for the main language, one for the dictionary input language, and one for the dictionary output language.)

Jellby
11-23-2008, 01:16 PM
Where did you obtain this e-book, and is it freely redistributable (or at least cheap)?

Well, I did get them from eMule, but I would think they are copyright free. They are the Littré and the Académie Française (I don't know which edition). I see now that they are available from http://ebooksgratuits.com/. I'll download the versions there and try them.

tompe
11-23-2008, 01:51 PM
Huh, you're stripping the bottom two bits from the region code in that example. I'd noticed that they always seemed to be zero, but I hadn't found any particular reason to separate them. I see that the mobi2mobi output is still using the full byte, however.

I also note that the language code of '1036' isn't even valid, and the number makes me think that mobi2mobi has a bad language parser -- 1036 breaks down into 1024+12, meaning that the parser is pulling more than one byte for the language code, and not correctly separating the unknown value. Language code 12, region code 12 is "French (Canada)", however.

I do not understand what you mean. My code is based on the Kindle Java code. The table I gave is taken directly from the Java code.

Where do you get your information from?

tompe
11-23-2008, 01:56 PM
Here is the language parsing:

sub get_language_desc {
    my $code = shift;
    my $lid = $code & 0xFF;
    my $lang = $mainlanguage{$lid};
    my $sublid = ($code >> 10) & 0xFF;
    my $sublang = $langmap->{$lang}->{$sublid};
    my $res = "";
    $res .= "$lang";
    $res .= " - $sublang";
    return $res;
}


Yes, the two lowest bits are ignored. Maybe that is to leave open the possibility of using 10 bits for the main language identifier?

Jellby
11-23-2008, 02:30 PM
Well, I did get them from eMule, but I would think they are copyright free. They are the Littré and the Académie Française (I don't know which edition). I see now that they are available from http://ebooksgratuits.com/. I'll download the versions there and try them.

OK, I got the files from http://ebooksgratuits.com/. The Académie Française dictionary was the one not working properly (it didn't give results for searches from a French book), but the file I've downloaded now works fine.

The differences I noticed:

The "wrong" version was latin1-encoded, and gives with mobi2mobi:

EXTH doctype: EXTH
EXTH length: 96
EXTH n_items: 2
EXTH item: 100 - Author - 18 - Académie Française
EXTH item: 300 - 300 - 48 - 0x3000000000000008002000000000000000f9beefe41c91e91c21e8409340a6

The "right" version is utf8-encoded, and gives with mobi2mobi:

EXTH doctype: EXTH
EXTH length: 144
EXTH n_items: 6
EXTH item: 100 - Author - 20 - Académie Française
EXTH item: 300 - 300 - 48 - 0x3000000000000008002000000000000000e4f9beef1e91c21e81c9409340a6
EXTH item: 204 - 204 - 4 - 0x0002
EXTH item: 205 - 205 - 4 - 0x0004
EXTH item: 206 - 206 - 4 - 0x0002
EXTH item: 207 - 207 - 4 - 0x00027

For both the language flag is:

MOBIHEADER language: 1036 - 12 - 1 - FRENCH -

So, now I'm satisfied because it's working, but I don't know why the "wrong" version does not work properly and how it could be fixed.
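One detail in the two listings is consistent with the encodings reported: the Author item is 18 bytes in the "wrong" file and 20 bytes in the "right" one. A minimal Perl sketch, assuming the length field counts the bytes of the author string (the helper name encoded_length is hypothetical), shows where the two extra bytes would come from:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Author string from the EXTH record, written with escapes so the
# encoding of this script file does not matter.
my $author = "Acad\x{e9}mie Fran\x{e7}aise";

# Number of bytes the text occupies in the given encoding.
sub encoded_length {
    my ($encoding, $text) = @_;
    return length(encode($encoding, $text));
}

printf "latin1: %d bytes\n", encoded_length('ISO-8859-1', $author);
printf "utf8:   %d bytes\n", encoded_length('UTF-8', $author);
```

The two accented characters take one byte each in latin1 and two bytes each in UTF-8, which accounts for 18 versus 20. This only confirms the encoding difference; it does not by itself explain why lookups fail with the latin1 file.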

AZed
11-23-2008, 03:23 PM
I do not understand what you mean. My code is based on the Kindle Java code. The table I gave is taken directly from the Java code.

Where do you get your information from?
Lots of tedious reverse-engineering. I built test files with Mobipocket Creator for every single possible language, and analyzed the results. After I'd done this, I discovered that the values were almost an exact match to the Microsoft CultureInfo Class (http://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo(VS.71).aspx), which confirmed my work.

What Kindle Java code is this, and where did you obtain it?

AZed
11-23-2008, 03:40 PM
OK, I got the files from http://ebooksgratuits.com/. The Académie Française dictionary was the one not working properly (it didn't give results for searches from a French book), but the file I've downloaded now works fine.
I pulled down 'dictionnaire_academie_francaise_1932-35_8e_edition.prc', but the EBook::Tools parser is giving very different results. I'm finding language code 12, region code 4 for main language, dictionary in language, and dictionary out language, not language code 12, region code 12. The encoding is UTF-8.

Did I download the wrong file?

(And ugh, I've got a bug in my HUFF/CDIC unpacker. I'm running out of bits again during the decompress, and I don't know why.)

tompe
11-23-2008, 05:24 PM
Lots of tedious reverse-engineering. I built test files with Mobipocket Creator for every single possible language, and analyzed the results. After I'd done this, I discovered that the values were almost an exact match to the Microsoft CultureInfo Class (http://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo(VS.71).aspx), which confirmed my work.


The CultureInfo web page says that the code for fr-FR is 0x040C, which is 1036. That is what mobi2mobi printed. So why should that be wrong?


What Kindle Java code is this, and where did you obtain it?

I got it from igorsk, so ask him about a link. If that does not work, ask me (I can probably find the files, but I do not remember now where I put them :-)

AZed
11-23-2008, 06:00 PM
The CultureInfo web page says that the code for fr-FR is 0x040C, which is 1036. That is what mobi2mobi printed. So why should that be wrong?
Ah, because I had thought the three numbers represented the same three segments I was using (language/region/unknown). My bad. It looks like what mobi2mobi is printing is "unsplit value" / "language code" / "region code >> 2", and once you look at them like that, the numbers make sense again, and the values match what I'm getting. Never mind, then.
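To make the reconciliation concrete, here is a small Perl sketch of the two readings side by side. The sub name language_views and its hash keys are hypothetical; the byte positions are assumed from the numbers quoted in this thread (language in the low byte, region in the next byte, and mobi2mobi's third number equal to the region shifted right by 2).

```perl
use strict;
use warnings;

# Break a language value into the views discussed in this thread.
sub language_views {
    my ($value) = @_;
    my $language = $value & 0xFF;           # low byte: language code
    my $region   = ($value >> 8) & 0xFF;    # next byte: region code, full byte
    return {
        unsplit  => $value,                 # first number mobi2mobi prints
        language => $language,              # second number mobi2mobi prints
        region   => $region,                # region code in the full-byte reading
        sublang  => $region >> 2,           # third number mobi2mobi prints
    };
}

my $v = language_views(0x040C);   # fr-FR in the CultureInfo table
printf "%d - %d - %d (region byte %d)\n",
    @{$v}{qw(unsplit language sublang region)};
```

For 0x040C this gives language 12 and region byte 4, and shifting the region right by 2 yields the 1 that mobi2mobi prints. By the same arithmetic, language 12 with region 12 (0x0C0C) maps to sub language 3, FRENCH_CANADIAN in the table posted earlier.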

AZed
11-23-2008, 06:02 PM
So, now I'm satisfied because it's working, but I don't know why the "wrong" version does not work properly and how it could be fixed.
Jellby, is there any way you could send me the "wrong" version (by uploading it to Rapidshare or some other file hosting service, for instance) so that I could have a peek at it? I'd like to confirm my theory.