08-22-2014, 06:03 PM | #946 |
Junior Member
Posts: 9
Karma: 166666
Join Date: Aug 2014
Device: Kindle PW2
|
Watermark header bug
Hi there!
First of all, thanks for the wonderful tool! Found a bug in mobi_header.py, namely: the contents of header 208 (watermark) are dumped as a string into the generated .opf. Since under certain circumstances this value can contain code points outside of the range allowable in XML, in those cases the resulting .opf becomes invalid. The fix is very simple: it just involves moving 208 from id_map_strings to id_map_hexstrings. Sample patch against 0.73 below. Code:
--- mobi_header.py.orig	2014-07-14 18:32:44.000000000 +0300
+++ mobi_header.py	2014-08-23 00:50:43.312531211 +0300
@@ -71,7 +71,6 @@
     129 : 'K8_Masthead/Cover_Image_(129)',
     132 : 'RegionMagnification_(132)',
     200 : 'DictShortName_(200)',
-    208 : 'Watermark_(208)',
     501 : 'cdeType_(501)',
     502 : 'last_update_time_(502)',
     503 : 'Updated_Title_(503)',
@@ -113,6 +112,7 @@
     404 : 'Text_to_Speech_Disabled_(404)',
 }
 id_map_hexstrings = {
+    208 : 'Watermark_(208_in_hex)',
     209 : 'Tamper_Proof_Keys_(209_in_hex)',
     300 : 'Font_Signature_(300_in_hex)',
 }
@@ -370,7 +370,6 @@
     129 : 'K8(129)_Masthead/Cover_Image',
     132 : 'RegionMagnification',
     200 : 'DictShortName',
-    208 : 'Watermark',
     501 : 'cdeType',
     502 : 'last_update_time',
     503 : 'Updated_Title',
@@ -412,6 +411,7 @@
     406 : 'Rental_Indicator',
 }
 id_map_hexstrings = {
+    208 : 'Watermark (hex)',
     209 : 'Tamper Proof Keys (hex)',
     300 : 'Font Signature (hex)',
     403 : 'Unknown_(403) (hex)',

Last edited by poxalew; 08-22-2014 at 06:13 PM. Reason: Clarified that the patch is against v0.73 |
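[Editor's note] To see why moving 208 to the hexstring map fixes the problem: XML 1.0 only permits a restricted set of code points, so a raw watermark value with control bytes in it cannot be embedded as text, but its hex encoding always can. A minimal Python 3 sketch of the idea (the helper name is hypothetical; KindleUnpack's hexstring handling simply hex-encodes these fields unconditionally):

```python
import binascii

def xml_safe(value: bytes) -> str:
    """Pass ordinary text through; hex-encode anything containing
    code points outside the XML 1.0 Char production."""
    text = value.decode('utf-8', errors='replace')

    def valid(cp):
        # XML 1.0 Char: tab/LF/CR plus the normal printable ranges
        return cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF \
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF

    if all(valid(ord(ch)) for ch in text):
        return text
    return binascii.hexlify(value).decode('ascii')

print(xml_safe(b'hello'))       # ordinary text is untouched
print(xml_safe(b'\x00\x01WM'))  # control bytes come out hex-encoded
```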
08-23-2014, 12:38 PM | #947 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi poxalew,
Your patch makes good sense. I will include it in the next official release. Thanks! KevinH |
08-26-2014, 04:29 PM | #948 | |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi lglgaigogo,
I really am not good at the dictionary stuff at all; I am not its original author. I did, however, make a change to mobi_dict.py to try to capture and read the ORDT table info. Although I really have no idea what the markup for dictionaries is supposed to look like, the idx:orth info now looks like it is deciphering properly. Will you please download the attached mobi_dict.py.zip, unzip it, and use it to replace its namesake in KindleUnpack_v073/lib/. Then try to unpack your dictionary, see if there is any improvement, and let me know what, if anything, remains to be fixed. Thanks, KevinH Quote:
|
|
08-28-2014, 10:40 AM | #949 |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Hi Kevin,
I tested the latest version of mobi_dict.py (by overwriting the same file in KindleUnpack_v073) with this Mobipocket dictionary and I got an Error: Unpacking Failed message. (The same file unpacked fine with the old mobi_dict.py version, but KindleUnpack wrote non-printable characters (hex 00, 03, 04, 05 etc.) to the .html file.) If your latest mobi_dict.py version requires a patched KindleUnpack_v073 version, can you please attach an unofficial test version that has all the required patches applied and includes your latest mobi_dict.py version? |
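[Editor's note] The stray control characters are easy to spot mechanically; a throwaway Python 3 check along these lines (not part of KindleUnpack) will list them in an unpacked .html file:

```python
import re

# control characters other than tab/LF/CR have no business in the output
CTRL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def find_control_chars(html: str):
    """Return (position, hex value) for every stray control character,
    e.g. the hex 00/03/04/05 bytes mentioned above."""
    return [(m.start(), hex(ord(m.group()))) for m in CTRL.finditer(html)]

sample = 'abc\x03def\x00'
print(find_control_chars(sample))  # [(3, '0x3'), (7, '0x0')]
```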
08-29-2014, 05:49 PM | #950 | |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
This new version of mobi_dict.py was based on stock 0.73; nothing else changed. I tested it against the original poster's Collins dictionary and it helps. I will grab your dictionary and see if I can figure out what I have messed up and try to get something that works for both. Thanks! KevinH Quote:
|
|
08-29-2014, 11:41 PM | #951 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
a better mobi_dict.py at least for your dict
Hi Doitsu,
Okay, here is a version of mobi_dict.py that will work for your sven.prc dictionary. I still have to see whether it will work for dictionaries in general; hopefully it will. Please note: this version of mobi_dict.py has tons of debugging turned on. We will turn that off once we know everything is working. Please give it a try and let me know if it works for you. Take care, KevinH |
08-30-2014, 02:52 AM | #952 | |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
The updated version worked great with my home-made, Mobipocket Creator-generated dictionary, but still seems to have problems with some Kindlegen-compiled dictionaries. For example, when I decompiled the DeDRMed version of the default monolingual Oxford Dictionary of English (B003WUYRGI_EBOK.azw) for testing purposes, the reverse-engineered .html file didn't contain a single <idx:infl> tag or multiple <idx:orth> tags, and I know for sure that this dictionary contains inflections. The script failed with the default monolingual French Kindle dictionary (B005F12G6U_EBOK.azw); I got the following error message: Code:
Read dictionary index data
Parsing dictionary index data 12200
ocnt 0, oentries 0, op1 0, op2 0, otagx 0
parsed INDX header: len C0 nul1 0 type 1 gen 0 start E82C count 9D8 code FFFFFFFF lng FFFFFFFF total 0 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 2520, 'nctoc': 0, 'code': 4294967295L, 'nul1': 0, 'len': 192, 'ligt': 0, 'start': 59436, 'nligt': 0, 'ordt': 0, 'lng': 4294967295L, 'total': 0, 'type': 1, 'gen': 0}
None
None
Error: unpack requires a string argument of length 0
Error: Unpacking Failed |
|
08-30-2014, 10:07 AM | #953 | |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
Does the old mobi_dict create the same error or is this error directly related to the changes I made? There are simply too many struct.unpack calls to know where this might be happening without access to a test case and adding lots of extra print statements. Take care, KevinH Quote:
|
|
08-30-2014, 10:16 AM | #954 |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
|
08-30-2014, 04:21 PM | #955 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
Please try the following change in mobi_dict.py. Change this: Code:
if hordt2 is not None:
    utext = u""
    n = len(text)/2
    offsets = struct.unpack('>%dH' % n, text)
    for off in offsets:
        if off < len(hordt2):
            utext += unichr(hordt2[off])
        else:
            utext += unichr(off)
    text = utext.encode('utf-8')

to: Code:
if hordt2 is not None and len(text) > 0:
    utext = u""
    n = len(text)/2
    offsets = struct.unpack('>%dH' % n, text)
    for off in offsets:
        if off < len(hordt2):
            utext += unichr(hordt2[off])
        else:
            utext += unichr(off)
    text = utext.encode('utf-8')

Just the first line is changed. I think text is a null string, which is freaking out struct.unpack(), so we are screening this case out. Please let me know if this helps. KevinH |
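[Editor's note] The "unpack requires a string argument of length 0" error is what Python 2's struct raises when the buffer length disagrees with the computed '>%dH' format, which is exactly what a degenerate text value produces here. The guard can be exercised in isolation; a standalone Python 3 rework of the snippet (same names, but integer division and a defensive slice added; a sketch, not the file's real context):

```python
import struct

def decode_ordt_text(text: bytes, hordt2):
    """Decode text via 16-bit ORDT offsets, skipping the lookup
    entirely when text is empty (the `len(text) > 0` screen)."""
    if hordt2 is not None and len(text) > 0:
        n = len(text) // 2
        offsets = struct.unpack('>%dH' % n, text[:n * 2])
        return ''.join(
            chr(hordt2[off]) if off < len(hordt2) else chr(off)
            for off in offsets)
    return text.decode('utf-8', errors='replace')

print(decode_ordt_text(b'', (0, 65, 66)))                  # no crash on empty input
print(decode_ordt_text(b'\x00\x01\x00\x02', (0, 65, 66)))  # 'AB'
```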
08-30-2014, 07:00 PM | #956 |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Hi KevinH,
Unfortunately, changing Code:

if hordt2 is not None:

to Code:

if hordt2 is not None and len(text) > 0:

didn't fix the problem. The full error log is below: Spoiler:
(I've got similar, longer messages for many of my other test files.)

I've created a small German-English proof-of-concept test dictionary for you, based on the latest recommendations from the Kindle Publishing Guidelines, that I compiled with both Mobigen and Kindlegen without error messages. Hopefully, this will make it easier for you to reverse-engineer the binary files, since you have access to the actual source files. The .zip file contains the source files and 3 binaries each: generated without compression (c0), with standard compression (c1), and with maximum compression (c2).

BTW, I believe one reason why the Swedish dictionary decompiled fine is that I used a redundant inflection syntax. I used, for example: Code:
<idx:entry>
<b><idx:orth>positiv
<idx:infl><idx:iform value="positivs"/></idx:infl>
<idx:infl><idx:iform value="positivet"/></idx:infl>
<idx:infl><idx:iform value="positivets"/></idx:infl>
<idx:infl><idx:iform value="positiven"/></idx:infl>
<idx:infl><idx:iform value="positivens"/></idx:infl>
<idx:infl><idx:iform value="positiv-"/></idx:infl>
</idx:orth>
</b>
<i>subst</i>
<br/>
absolute
</idx:entry>

instead of: Code:
<idx:entry>
<b><idx:orth>positiv
<idx:infl>
<idx:iform value="positivs"/>
<idx:iform value="positivet"/>
<idx:iform value="positivets"/>
<idx:iform value="positiven"/>
<idx:iform value="positivens"/>
<idx:iform value="positiv-"/>
</idx:infl>
</idx:orth>
</b>
<i>subst</i>
<br/>
absolute
</idx:entry>

Last edited by Doitsu; 08-30-2014 at 07:03 PM. |
08-31-2014, 02:38 PM | #957 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
A few things:

1) Compression has no impact on the issue, so use whatever compression you like.
2) Your sven.prc dictionary acts very differently from kindlegen-generated ones.

It is point 2) that is giving me fits. In your sven.prc, each character in each orth entry is replaced with a two-byte offset into the ORDT table. You take each 16-bit offset and use it to look up the right unicode code point value (stored in the ORDT table) to properly build up each orth entry. Here are the first few orth entries (hex encoded), where you can see the two-byte offset values:

005c 005e 0008 0006 0075 007a 0066 006e 0068 0071 005c 009c 0004 005c 009c 0068

and here is the table of unicode code points that each 16-bit quantity provides an offset into: Code:
(0, 37, 95, 45, 97, 111, 115, 12354, 32, 12353, 12355, 12356, 12357, 12358, 12359, 12360, 12361, 12362, 12363, 12364, 12365, 12366, 12367, 12368, 12369, 12370, 12371, 12372, 12373, 12374, 12375, 12376, 12377, 12378, 12379, 12380, 12381, 12382, 12383, 12384, 12385, 12386, 12387, 12388, 12389, 12390, 12391, 12392, 12393, 12394, 12395, 12396, 12397, 12398, 12399, 12400, 12401, 12402, 12403, 12404, 12405, 12406, 12407, 12408, 12409, 12410, 12411, 12412, 12413, 12414, 12415, 12416, 12417, 12418, 12419, 12420, 12421, 12422, 12423, 12424, 12425, 12426, 12427, 12428, 12429, 12430, 12431, 12432, 12433, 12434, 12435, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 114, 116, 117, 118, 119, 120, 121, 122, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 246, 228, 229, 46, 40, 41, 58, 224, 233, 33, 59, 47, 234, 63, 225, 237, 252, 44, 39, 34, 196, 337, 232, 231, 91, 214, 238, 227, 346, 347, 244, 8230, 197)

Here are the orth entries from your Sample Dictionary, again hex encoded:

03 6e 67 6b 03 67 70 7c 75 65 6a 87 05 70 70 06 74 67 6b 69 67 70

and here is the ORDT table to look these single-byte offsets up in: Code:
(0, 37, 95, 98, 111, 97, 115, 12354, 32, 12353, 12355, 12356, 12357, 12358, 12359, 12360, 12361, 12362, 12363, 12364, 12365, 12366, 12367, 12368, 12369, 12370, 12371, 12372, 12373, 12374, 12375, 12376, 12377, 12378, 12379, 12380, 12381, 12382, 12383, 12384, 12385, 12386, 12387, 12388, 12389, 12390, 12391, 12392, 12393, 12394, 12395, 12396, 12397, 12398, 12399, 12400, 12401, 12402, 12403, 12404, 12405, 12406, 12407, 12408, 12409, 12410, 12411, 12412, 12413, 12414, 12415, 12416, 12417, 12418, 12419, 12420, 12421, 12422, 12423, 12424, 12425, 12426, 12427, 12428, 12429, 12430, 12431, 12432, 12433, 12434, 12435, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 114, 116, 117, 118, 119, 120, 121, 122, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90)

Here is the header that describes the table for sven.prc: Code:
Parsing metaOrthIndex
ocnt 0, oentries 183, op1 1180, op2 1552, otagx 192
len C0 nul1 0 type 0 gen 2 start 458 count 1F code FDEA lng 41D total CC41 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 31, 'nctoc': 0, 'code': 65002, 'nul1': 0, 'len': 192, 'ligt': 0, 'start': 1112, 'nligt': 0, 'ordt': 0, 'lng': 1053, 'total': 52289, 'type': 0, 'gen': 2}

and here is the corresponding header from your Sample Dictionary: Code:
Parsing metaOrthIndex
ocnt 1, oentries 149, op1 248, op2 404, otagx 192
len C6 nul1 0 type 0 gen 2 start F0 count 1 code FDEA lng 407 total 4 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 1, 'nctoc': 0, 'code': 65002, 'nul1': 0, 'len': 198, 'ligt': 0, 'start': 240, 'nligt': 0, 'ordt': 0, 'lng': 1031, 'total': 4, 'type': 0, 'gen': 2}

Furthermore, using 16-bit-wide offsets makes NO SENSE, as you could put the entire actual unicode code point in that 16-bit value, so there is no savings in size at all and no reason to use any ORDT table. The ORDT table itself only has 183 entries, so a single-byte offset would have more than enough room to do what it needs. The whole size/space-saving value comes from having 8-bit offsets into one ORDT table which holds all of the needed multibyte values. Only if more than 256 different single characters are needed would you bother to use 16-bit values, but if you go that far you don't need the ORDT table approach at all; just use utf-16 encoded unicode.

In addition, your sven.prc does not have the correct ocnt value. It should be 1 to indicate that an ORDT table is being used in the first place, and 0 otherwise. It was an "assert" about that that caused it to fail to unpack in the first place.

To save space, the inflections are not actually stored as text. Instead, common prefix and suffix rules are stored, which are used to recreate all of the inflections from the base orth entry without having to store them. But if those orth entries are wrong, then all inflections are wrong as well and you end up with gibberish.

So I can fix things so that your sample works, or I can fix things so that sven works, but until I can find something to indicate which is correct or how to tell them apart, I am at a loss about how to proceed. Furthermore, I think there is something wrong with the sven dictionary, as it uses 16-bit offsets into a table with only 183 entries and so takes up much more room than it needs and actually wastes space.
Please let me know how to proceed. I may be able to hack my way around it by checking whether the length of the ORDT table is less than 256 and, if so, whether every other character is a zero byte; in that case, either ignore every other character or use a 16-bit offset into what is only a small table. Here is that same snippet of code reworked to ignore every null 0 offset byte, though this may cause problems if a 0 offset really should be a true character. Please try replacing that entire snippet with the following: Code:
if hordt2 is not None:
    # print text.encode('hex')
    utext = u""
    for x in text:
        off, = struct.unpack('>B', x)
        if off == 0:
            continue
        if off < len(hordt2):
            utext += unichr(hordt2[off])
        else:
            utext += unichr(off)
    text = utext.encode('utf-8')

Please check it against other dictionaries to see how it responds to them. Please note: the mobi_dict.py code still can't deal with more than one mobi section of rules for doing inflections, so if any dictionary needs more than one section it will stop reporting any inflections. You can see this in the debug output. I am guessing this is what is happening with some dictionaries. I would really need access to such a dictionary to debug how multiple sections of inflection rules are actually used. Take care, KevinH |
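[Editor's note] The two layouts described above, one-byte offsets (the sample dictionary) versus two-byte offsets (sven.prc), can be sketched side by side with a toy table. A Python 3 illustration (the function name and table are made up for this example, not taken from mobi_dict.py):

```python
import struct

def decode_orth(raw: bytes, ordt, width: int) -> str:
    """Decode one orth entry: each `width`-byte value is an offset
    into the ORDT table of unicode code points; offsets past the
    end of the table are treated as literal code points."""
    fmt = {1: '>B', 2: '>H'}[width]
    chars = []
    for i in range(0, len(raw), width):
        (off,) = struct.unpack(fmt, raw[i:i + width])
        code = ordt[off] if off < len(ordt) else off
        chars.append(chr(code))
    return ''.join(chars)

# toy ORDT table, NOT the real 183-entry one quoted above
ordt = (0, 37, 95, 98, 111, 97, 115)

print(decode_orth(b'\x03\x04\x03', ordt, 1))      # 8-bit offsets -> 'bob'
print(decode_orth(b'\x00\x05\x00\x06', ordt, 2))  # 16-bit offsets -> 'as'
```

Note how the 16-bit form spends two bytes per character to index a table that a single byte could cover, which is the wasted space the post complains about.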
08-31-2014, 07:12 PM | #958 | |||
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Hi KevinH,
Quote:
Ideally, KindleUnpack should be able to decompile home-made dictionaries with inflections coded according to the Kindle Guidelines and compiled with Kindlegen, which the version that you included in your latest post definitely does. AFAIK, this is the very first version of mobi_dict.py to do so, which is no small feat!

I've tested your latest version with my test dictionary and the default monolingual Kindle dictionaries. The results were as follows:

1. Test dictionary: almost perfectly reverse-engineered. The only thing missing was the spell="yes" attribute, but I'm not sure if this attribute actually does anything.
2. French dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.
3. Spanish dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.
4. Portuguese dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.
5. Italian dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.
6. German dictionary: unpacking failed shortly after displaying "Error: Dictionary contains multiple inflection index sections, which is not yet supported". The full error log is here: Spoiler:
7. UK English dictionary: unpacking failed shortly after displaying: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" The full error log is here: Spoiler:
Quote:
Quote:
Technically speaking, sven.prc is a test file with multiple inflection index sections, because each inflection is wrapped in its own group, even though the extra groups weren't strictly necessary. |
|||
08-31-2014, 08:04 PM | #959 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
Indentation is critical in Python, so if a line is indented too much it will actually change what the code means. Please verify that the line immediately after the piece we have been working on,

tagMap = getTagMap(controlByteCount, tagTable, data, startPos+1+textLength, endPos)

is indented to the exact same amount as the very beginning of the line:

if hordt2 is not None:

If it is indented further, it will be included in the if statement when in fact we want it after the if. Your other dictionaries worked because they used ORDT tables, but the two that failed do not, which means the working ones took the if path; that tagMap line, however, should be run in both cases (i.e., it is not part of the if). That leads me to believe the indentation of that tagMap line may have been messed up during editing. Please verify if that is the case. If so, when I get a free moment I will clean up the code and include a new mobi_dict in the next release.

I am heading out of town for a few days, so I hope this does the trick. Otherwise, I will look at it when I am back and have some free time. Take care, KevinH

Last edited by KevinH; 08-31-2014 at 08:08 PM. |
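[Editor's note] The indentation pitfall described here is easy to demonstrate in isolation; a generic Python 3 illustration (toy functions, nothing from mobi_dict.py):

```python
def tagmap_inside_if(flag):
    steps = []
    if flag:
        steps.append('ordt-decode')
        steps.append('tagMap')   # over-indented: only runs when flag is True
    return steps

def tagmap_after_if(flag):
    steps = []
    if flag:
        steps.append('ordt-decode')
    steps.append('tagMap')       # dedented: runs on both paths
    return steps

# dictionaries without an ORDT table take the flag=False path
print(tagmap_inside_if(False))  # [] -- tagMap never runs
print(tagmap_after_if(False))   # ['tagMap']
```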
09-01-2014, 06:42 AM | #960 | |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Hi KevinH,
Quote:
In case you want to have another look at dictionaries with multiple inflection groups, I've created another test dictionary that contains two entries with two inflection groups and two entries with one inflection group. This test file decompiled fine, which surprised me a bit, since I had expected to get an "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message for my test file. I'm wondering what kind of dictionary syntax actually triggers this error message.

Since the OP that started this part of the thread reported issues with Asian characters, I also tested the updated mobi_dict.py version with a Japanese test dictionary. Unfortunately, your updated version seems to have problems with non-Latin characters. For example, the original entry definition was: Code:
<idx:entry name="japanese" scriptable="yes">
<idx:orth>猫
<idx:infl>
<idx:iform value="貓"/>
<idx:iform value="ねこ"/>
<idx:iform value="ネコ"/>
</idx:infl>
</idx:orth><br/>
chat (m)
</idx:entry>

which was decompiled as: Code:
<idx:entry scriptable="yes">
<idx:orth value="s+">
<idx:infl>
</idx:infl>
</idx:orth>猫<br/>
chat (m)
</idx:entry>
Spoiler:
A similar problem also occurred with a Greek-English dictionary. Kindleunpack reported: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" and wrote garbage characters in idx:orth throughout the file. For example: Code:
<idx:orth value="しえ-À-¹-¼-*-»-µ-¹-±">
Last edited by Doitsu; 09-01-2014 at 06:45 AM. |
|