Old 11-21-2008, 02:27 PM   #1
Jellby
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
dictionaries and languages

Hi all,

I have two French dictionaries in mobipocket format, and some French books in my Cybook. The problem is when I look up a word in a French book, only one of the dictionaries is searched, while the other seems to be used for English books (I've had some matches from this other dictionary when looking up a word in English).

I understand this is probably a matter of language settings; however, both dictionaries seem to have the same language according to mobi2mobi:

Code:
MOBIHEADER language: 1036 - 12 - 1 - FRENCH -
while the French books are created with html2mobi, and have:

Code:
MOBIHEADER language: 12 - 12 - 0 - FRENCH -
I didn't find any significant difference between the two dictionaries. The "right" one seems to be utf8-encoded, larger, and I have it on the SD card. The "wrong" one is latin1-encoded, smaller, and it's in the main Cybook memory. (I still haven't tried moving them.) Any ideas on how to fix the "wrong" one?
Old 11-21-2008, 04:06 PM   #2
tompe
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
Quote:
Originally Posted by Jellby
I understand this is probably a matter of language settings, however both dictionaries seem to have the same language, according to mobi2mobi:

Code:
MOBIHEADER language: 1036 - 12 - 1 - FRENCH -
The language code setting consists of two parts. The first (12) is the main language. The second part is the "sub language"; for French the values are:
Code:
                   1 => "FRENCH",
                   2 => "FRENCH_BELGIAN",
                   3 => "FRENCH_CANADIAN",
                   4 => "FRENCH_SWISS",
                   5 => "FRENCH_LUXEMBOURG",
                   6 => "FRENCH_MONACO",
The printout from mobi2mobi is "code - main language - sub language".
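As a rough illustration of that split, here is a minimal Python sketch. It is not mobi2mobi's actual code: the names `split_language_code` and `FRENCH_SUBLANGUAGES` are made up for this example, and it assumes the main language sits in the low byte and the sub language starts at bit 10, which is consistent with the two printouts in the first post.

```python
# Sub language names for main language 12 (French), copied from the table above.
FRENCH_SUBLANGUAGES = {
    1: "FRENCH",
    2: "FRENCH_BELGIAN",
    3: "FRENCH_CANADIAN",
    4: "FRENCH_SWISS",
    5: "FRENCH_LUXEMBOURG",
    6: "FRENCH_MONACO",
}


def split_language_code(code):
    """Return (code, main language, sub language), the order mobi2mobi prints."""
    main_language = code & 0xFF         # low byte
    sub_language = (code >> 10) & 0xFF  # skips the two lowest bits of the second byte
    return code, main_language, sub_language


for code in (1036, 12):
    _, main, sub = split_language_code(code)
    # Sub language 0 has no entry in the table, hence the empty name.
    print(code, main, sub, FRENCH_SUBLANGUAGES.get(sub, ""))
```

With the two values from the first post, 1036 splits into main language 12 and sub language 1, while 12 splits into main language 12 and sub language 0, which has no name in the table.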
Old 11-21-2008, 06:24 PM   #3
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Originally Posted by Jellby
The "right" one seems to be utf8-encoded, larger, and I have it in the SD card. The "wrong" one is latin1-encoded, smaller, and it's in the main Cybook memory.
A problem related to formats or the Cybook would probably be better handled in the Cybook group, but as I recall the Cybook needs the dictionary to be in main memory, not on a card.

Dale
Old 11-23-2008, 12:43 PM   #4
AZed
Connoisseur
Posts: 57
Karma: 307
Join Date: Oct 2008
Device: PalmOS PDA
Quote:
Originally Posted by tompe
The language code setting consists of two parts. The first (12) is the main language. The second part is the "sub language" and they are:
Code:
                   1 => "FRENCH",
                   2 => "FRENCH_BELGIAN",
                   3 => "FRENCH_CANADIAN",
                   4 => "FRENCH_SWISS",
                   5 => "FRENCH_LUXEMBOURG",
                   6 => "FRENCH_MONACO",
The printout from mobi2mobi is "code - main language - sub language".
Huh, you're stripping the bottom two bits from the region code in that example. I'd noticed that they always seemed to be zero, but I hadn't found any particular reason to separate them. I see that the mobi2mobi output is still using the full byte, however.

I also note that the language code of '1036' isn't even valid, and the number makes me think that mobi2mobi has a bad language parser -- 1036 breaks down into 1024+12, meaning that the parser is pulling more than one byte for the language code, and not correctly separating the unknown value. Language code 12, region code 12 is "French (Canada)", however.

I'm more interested in the fact that there is a nonzero unknown value at all, though. Where did you obtain this e-book, and is it freely redistributable (or at least cheap)? I'd be interested in seeing what the EBook::Tools parser makes of it. My offhand guess is that while the main language is set correctly on one of the dictionaries, the dictionary language values are wrong. (There are actually three language codes embedded: one for the main language, one for the dictionary input language, and one for the dictionary output language.)
Old 11-23-2008, 01:16 PM   #5
Jellby
frumious Bandersnatch
Quote:
Originally Posted by AZed
Where did you obtain this e-book, and is it freely redistributable (or at least cheap)?
Well, I did get them from eMule, but I would think they are copyright free. They are the Littré and the Académie Française (I don't know which edition). I see now that they are available from http://ebooksgratuits.com/. I'll download the versions there and try them.
Old 11-23-2008, 01:51 PM   #6
tompe
Grand Sorcerer
Quote:
Originally Posted by AZed
Huh, you're stripping the bottom two bits from the region code in that example. I'd noticed that they always seemed to be zero, but I hadn't found any particular reason to separate them. I see that the mobi2mobi output is still using the full byte, however.

I also note that the language code of '1036' isn't even valid, and the number makes me think that mobi2mobi has a bad language parser -- 1036 breaks down into 1024+12, meaning that the parser is pulling more than one byte for the language code, and not correctly separating the unknown value. Language code 12, region code 12 is "French (Canada)", however.
I do not understand what you mean. My code is based on the Kindle Java code. The table I gave is taken directly from the Java code.

Where do you get your information from?
Old 11-23-2008, 01:56 PM   #7
tompe
Grand Sorcerer
Here is the language parsing:
Code:
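sub get_language_desc {
    my $code = shift;
    # Low byte: main language ID (12 = French).
    my $lid = $code & 0xFF;
    # %mainlanguage and $langmap are lookup tables expected to be defined elsewhere.
    my $lang = $mainlanguage{$lid} // "";
    # Sub language ID: starts at bit 10, so the two lowest bits of the second byte are skipped.
    my $sublid = ($code >> 10) & 0xFF;
    # Fall back to an empty string when there is no entry (e.g. sub language 0).
    my $sublang = $langmap->{$lang}->{$sublid} // "";
    return "$lang - $sublang";
}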
Yes, the two lowest bits of the second byte are ignored. Maybe that is to leave room for a 10-bit main language identifier?
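To see what ignoring those two bits means in practice, here is a small Python sketch (the mask names are made up for illustration). For the two values seen in this thread, an 8-bit and a 10-bit mask give the same main language, which fits the observation that the two skipped bits always seem to be zero. Whether the format really reserves 10 bits is only a guess.

```python
MASK_8_BITS = 0xFF    # main language as the Perl code above reads it
MASK_10_BITS = 0x3FF  # hypothetical wider main language field

for code in (1036, 12):
    narrow = code & MASK_8_BITS
    wide = code & MASK_10_BITS
    skipped_bits = (code >> 8) & 0b11  # the two bits the parser never looks at
    print(code, narrow, wide, skipped_bits)
```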
Old 11-23-2008, 02:30 PM   #8
Jellby
frumious Bandersnatch
Quote:
Originally Posted by Jellby
Well, I did get them from eMule, but I would think they are copyright free. They are the Littré and the Académie Française (I don't know which edition). I see now that they are available from http://ebooksgratuits.com/. I'll download the versions there and try them.
OK, I got the files from http://ebooksgratuits.com/. The Académie Française dictionary was the one not working properly (it didn't give results for searches from a French book), but the file I've downloaded now works fine.

The differences I noticed:

The "wrong" version was latin1-encoded, and mobi2mobi reports:

Code:
EXTH doctype: EXTH
EXTH  length: 96
EXTH n_items: 2
EXTH    item: 100 - Author - 18 - Académie Française
EXTH    item: 300 - 300 - 48 - 0x3000000000000008002000000000000000f9beefe41c91e91c21e8409340a6
The "right" version is utf8-encoded, and mobi2mobi reports:

Code:
EXTH doctype: EXTH
EXTH  length: 144
EXTH n_items: 6
EXTH    item: 100 - Author - 20 - Académie Française
EXTH    item: 300 - 300 - 48 - 0x3000000000000008002000000000000000e4f9beef1e91c21e81c9409340a6
EXTH    item: 204 - 204 - 4 - 0x0002
EXTH    item: 205 - 205 - 4 - 0x0004
EXTH    item: 206 - 206 - 4 - 0x0002
EXTH    item: 207 - 207 - 4 - 0x00027
For both, the language field is:

Quote:
MOBIHEADER language: 1036 - 12 - 1 - FRENCH -
So, now I'm satisfied because it's working, but I don't know why the "wrong" version does not work properly and how it could be fixed.
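One difference in the listings above can be checked directly: the Author item shows 18 for the latin1 file and 20 for the utf8 one. Assuming the third field on each item line is the payload length in bytes (a reading of the output, not something confirmed in this thread), a short Python sketch reproduces both numbers, so that particular difference follows from the encoding alone:

```python
# "Académie Française", written with escapes so the source encoding cannot matter.
author = "Acad\u00e9mie Fran\u00e7aise"

latin1_length = len(author.encode("latin-1"))  # one byte per character
utf8_length = len(author.encode("utf-8"))      # é and ç take two bytes each

print(latin1_length, utf8_length)
```

The items 204 to 207 that appear only in the "right" file are not explained by this check.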
Old 11-23-2008, 03:23 PM   #9
AZed
Connoisseur
Quote:
Originally Posted by tompe
I do not understand what you mean. My code is based on the Kindle Java code. The table I gave is taken directly from the Java code.

Where do you get your information from?
Lots of tedious reverse-engineering. I built test files with Mobipocket Creator for every single possible language, and analyzed the results. After I'd done this, I discovered that the values were almost an exact match to the Microsoft CultureInfo Class, which confirmed my work.

What Kindle Java code is this, and where did you obtain it?

Last edited by AZed; 11-23-2008 at 06:05 PM. Reason: typo
Old 11-23-2008, 03:40 PM   #10
AZed
Connoisseur
Quote:
Originally Posted by Jellby
OK, I got the files from http://ebooksgratuits.com/. The Académie Française dictionary was the one not working properly (it didn't give result for searches from a French book), but the file I've downloaded now works fine.
I pulled down 'dictionnaire_academie_francaise_1932-35_8e_edition.prc', but the EBook::Tools parser is giving very different results. I'm finding language code 12, region code 4 for the main language, the dictionary input language, and the dictionary output language, not language code 12, region code 12. The encoding is UTF-8.

Did I download the wrong file?

(And ugh, I've got a bug in my HUFF/CDIC unpacker. I'm running out of bits again during the decompress, and I don't know why.)
Old 11-23-2008, 05:24 PM   #11
tompe
Grand Sorcerer
Quote:
Originally Posted by AZed
Lots of tedious reverse-engineering. I built test files with Mobipocket Creator for every single possible language, and analyzed the results. After I'd done this, I discovered that the values were almost an exact match to the Microsoft CultureInfo Class, which confirmed my work.
The CultureInfo web page says that the code for fr-FR is 0x040C, which is 1036, and that is what mobi2mobi printed. So why should that be wrong?
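The comparison with CultureInfo can be checked in a few lines of Python. The values below are the standard Windows locale identifiers for the French variants; the name `FRENCH_LCIDS` exists only in this sketch. Shifting each value right by 10 bits gives 1 to 6, the same numbering as the sub language table earlier in the thread.

```python
FRENCH_LCIDS = {
    "fr-FR": 0x040C,
    "fr-BE": 0x080C,
    "fr-CA": 0x0C0C,
    "fr-CH": 0x100C,
    "fr-LU": 0x140C,
    "fr-MC": 0x180C,
}

for name, lcid in FRENCH_LCIDS.items():
    main_language = lcid & 0xFF  # 12 for every French variant
    sub_language = lcid >> 10    # 1 (France) through 6 (Monaco)
    print(f"{name}: {lcid} = 0x{lcid:04X} -> main {main_language}, sub {sub_language}")
```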

Quote:
What Kindle Java code is this, and where did you obtain it?
I got it from igorsk, so ask him about a link. If that does not work, ask me (I can probably find the files, but I do not remember now where I put them :-)
Old 11-23-2008, 06:00 PM   #12
AZed
Connoisseur
Quote:
Originally Posted by tompe
The Cultureinfo web page says that the code for fr-FR is 0x040C which is 1036. Which is what mobi2mobi printed. So why should that be wrong?
Ah, because I had thought the three numbers represented the same three segments I was using (language/region/unknown). My bad. It looks like what mobi2mobi is printing is "unsplit value" / "language code" / "region code >> 2", and once you look at them like that, the numbers make sense again, and the values match what I'm getting. Never mind, then.
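The two readings can be put side by side in a short Python sketch. The function names are hypothetical: `three_segments` follows the language/region/unknown split described in this post, and `mobi2mobi_fields` follows the Perl code posted earlier. Treating everything above the second byte as the "unknown" part is an assumption.

```python
def three_segments(value):
    """Split into (language, region, unknown), one byte each for the first two."""
    language = value & 0xFF
    region = (value >> 8) & 0xFF
    unknown = value >> 16  # assumption: whatever is left above the second byte
    return language, region, unknown


def mobi2mobi_fields(value):
    """Split into (unsplit value, language code, region code >> 2)."""
    return value, value & 0xFF, (value >> 10) & 0xFF


value = 1036  # 0x040C
language, region, unknown = three_segments(value)
print(language, region, unknown)  # language 12, region 4, unknown 0
print(mobi2mobi_fields(value))    # (1036, 12, 1)
print(region >> 2 == mobi2mobi_fields(value)[2])
```

Read this way, region code 4 from EBook::Tools and sub language 1 from mobi2mobi describe the same stored value.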

Last edited by AZed; 11-23-2008 at 06:05 PM.
Old 11-23-2008, 06:02 PM   #13
AZed
Connoisseur
Quote:
Originally Posted by Jellby
So, now I'm satisfied because it's working, but I don't know why the "wrong" version does not work properly and how it could be fixed.
Jellby, is there any way you could send me the "wrong" version (by uploading it to Rapidshare or some other file hosting service, for instance) so that I could have a peek at it? I'd like to confirm my theory.