MobileRead Forums - View Single Post - Libmobi

baf · 11-17-2014, 07:12 AM

Hi again,
I found some time lately to move my project forward a bit.
I updated my post at the beginning of this thread with newer binaries for testing.

What has been added?
I mainly focused on handling dictionaries: I added support for inflections (including the old inflection scheme found in older dictionaries built with mobigen).
I also fixed many bugs related to parsing non-dictionary files.

While the project lacks extensive testing, mobitool successfully unpacked all the files I managed to grab.
For dictionaries, I tested on files found on the MobileRead forums, as well as dictionaries from my own Kindle (de-DRMed).

For developers:
I want to share some of my findings here. I hope I'm not reinventing the wheel, but I couldn't find this information anywhere.

The first thing is the inflection scheme found in some older dictionaries. It uses tag 7 in the inflections index. The tag contains pairs of values: the offset of the inflection rule in the ctoc record and the length of the string. It seems that each rule must be applied to all entries in the orthographic index that end with a string matching the entry label in the inflection index. This old scheme was probably lossy: it is impossible to recreate the exact source of the inflection rules. However, the decompiled source should still produce the same compiled dictionary file. This lossiness can be observed by searching an old-scheme dictionary on a Kindle (I did it in Kindle.app). For example, a search for the nonexistent word "guidarka" in The New Oxford American Dictionary will bring up the entry "guitar" :) Why? I leave it as a riddle.
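The two steps described above (reading tag 7 as pairs, then selecting orthographic entries by suffix) could be sketched roughly as follows. This is only an illustration in Python, not libmobi code: the function names are hypothetical, the tag values are assumed to be already decoded into a flat list of integers, and how a rule is actually applied to a matching entry is left out because it is not described here.

```python
from typing import List, Tuple


def pair_tag_values(values: List[int]) -> List[Tuple[int, int]]:
    """Group flat tag 7 values into (ctoc_offset, string_length) pairs."""
    if len(values) % 2 != 0:
        raise ValueError("tag 7 values are expected to come in pairs")
    return [(values[i], values[i + 1]) for i in range(0, len(values), 2)]


def entries_matching_label(orth_labels: List[str], infl_label: str) -> List[str]:
    """Return orthographic entry labels that end with the inflection entry label."""
    return [label for label in orth_labels if label.endswith(infl_label)]
```

Because the selection is by suffix only, one rule can attach to many unrelated headwords, which would be consistent with the lossy behaviour noted above.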

I also found a number of tags used in orthographic indices. Tags 22 and 25 contain the offset of the main entry in the same index (a link to it). Tags 5, 40, 53, 69, 70, 71 all point to ctoc entries in various indices (orth, names, keys). Generally they match the source tags <idx:string>, <idx:keys>, <idx:orth format> and others. I didn't implement their reconstruction, as this information is probably not very important.

Another thing I discovered is that some orthographic indices substitute certain Latin ligatures with custom replacements. This probably facilitates search. In older dictionaries these replacements are listed in the LIGT section of the header. These are four-byte entries: two bytes for the ligature and two bytes for the replacement. As far as I know, the section always lists the same five ligatures, regardless of the ligatures used in the document. In newer documents the section is missing, but the replacement is still done. In that case I didn't find the list of replacements anywhere in the document, so I assume we still have to deal with the standard set of five ligatures. Reconstructed HTML documents in which the ligatures haven't been repaired contain characters with codes in the range 0x1–0x5.
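As a rough sketch of the four-byte layout just described, the following Python splits an LIGT payload into (ligature, replacement) byte pairs. It assumes the payload has already been sliced out of the header; where the section starts and how its entry count is stored are not covered here. The function name is hypothetical, and the two-byte fields are deliberately left as raw bytes because their encoding is not stated.

```python
from typing import List, Tuple


def split_ligt_entries(payload: bytes) -> List[Tuple[bytes, bytes]]:
    """Split a raw LIGT payload into (ligature_bytes, replacement_bytes) pairs.

    Each entry is four bytes: two for the ligature, two for its replacement.
    """
    if len(payload) % 4 != 0:
        raise ValueError("LIGT payload length must be a multiple of 4")
    return [
        (payload[i:i + 2], payload[i + 2:i + 4])
        for i in range(0, len(payload), 4)
    ]
```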
The ligatures are decomposed and the first character is replaced by a control character. The five cases are:
OE => 0x1,
oe => 0x2,
AE => 0x3,
ae => 0x4,
ss => 0x5.
So instead of the ligature "Œ" there are 2 bytes: 0x1, 0x45.
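Repairing these sequences in reconstructed text could look something like the sketch below. It is Python rather than libmobi's own code, and the names are hypothetical. Only the "Œ" case (0x1, 0x45) is spelled out above; the other four pairs are an assumption derived from the same rule (control character followed by the second letter of the decomposed ligature). The text is assumed to be already decoded to a string.

```python
# Control character + second letter of the decomposed form -> original ligature.
# Only the first pair is given explicitly in the post; the rest are inferred.
LIGATURES = {
    "\x01E": "Œ",
    "\x02e": "œ",
    "\x03E": "Æ",
    "\x04e": "æ",
    "\x05s": "ß",
}


def repair_ligatures(text: str) -> str:
    """Replace each control-character pair with the ligature it stands for."""
    for broken, ligature in LIGATURES.items():
        text = text.replace(broken, ligature)
    return text
```

A control character followed by an unexpected letter is left untouched, so anything that does not fit the five known cases stays visible for inspection.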

That's all for this quick summary. I will be glad to answer any questions if anything is not clear.
