Old 08-31-2014, 02:38 PM   #957
KevinH
Sigil Developer
 
Hi Doitsu,

A few things ...

1) Compression has no impact on the issue, so use whatever compression you like.

2) Your sven.prc dictionary behaves very differently from kindlegen-generated ones.

It is point 2) that is giving me fits.

In your sven.prc, each character in each orth entry is replaced with a two byte long offset into the ORDT table. You take each 16 bit offset and use it to look up the right unicode code point value (stored in the ORDT table) to properly build up each orth entry.

Here are the first few orth entries (hex encoded) where you can see it uses two byte long offset values:

Code:
005c 005e 0008 0006 0075 007a 0066 006e 0068 0071
005c 009c 0004
005c 009c 0068

and here is the table of unicode code points that each 16 bit value is an offset into:

Code:
(0, 37, 95, 45, 97, 111, 115, 12354, 32, 12353, 12355, 12356, 12357, 12358, 12359, 12360, 12361, 12362, 12363, 12364, 12365, 12366, 12367, 12368, 12369, 12370, 12371, 12372, 12373, 12374, 12375, 12376, 12377, 12378, 12379, 12380, 12381, 12382, 12383, 12384, 12385, 12386, 12387, 12388, 12389, 12390, 12391, 12392, 12393, 12394, 12395, 12396, 12397, 12398, 12399, 12400, 12401, 12402, 12403, 12404, 12405, 12406, 12407, 12408, 12409, 12410, 12411, 12412, 12413, 12414, 12415, 12416, 12417, 12418, 12419, 12420, 12421, 12422, 12423, 12424, 12425, 12426, 12427, 12428, 12429, 12430, 12431, 12432, 12433, 12434, 12435, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 114, 116, 117, 118, 119, 120, 121, 122, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 246, 228, 229, 46, 40, 41, 58, 224, 233, 33, 59, 47, 234, 63, 225, 237, 252, 44, 39, 34, 196, 337, 232, 231, 91, 214, 238, 227, 346, 347, 244, 8230, 197)
In **all** other dictionaries I have seen (and that is only about 10 or so), each character of each orth entry is replaced with a *single* byte long offset that you use to look up that character's unicode code point in the ORDT table.

Here are your orth entries from your Sample Dictionary again hex encoded:

Code:
03 6e 67 6b 03 67 70
7c 75 65 6a
87 05 70 70
06 74 67 6b 69 67 70

and here is the ORDT table to look these single byte offsets up in:

Code:
(0, 37, 95, 98, 111, 97, 115, 12354, 32, 12353, 12355, 12356, 12357, 12358, 12359, 12360, 12361, 12362, 12363, 12364, 12365, 12366, 12367, 12368, 12369, 12370, 12371, 12372, 12373, 12374, 12375, 12376, 12377, 12378, 12379, 12380, 12381, 12382, 12383, 12384, 12385, 12386, 12387, 12388, 12389, 12390, 12391, 12392, 12393, 12394, 12395, 12396, 12397, 12398, 12399, 12400, 12401, 12402, 12403, 12404, 12405, 12406, 12407, 12408, 12409, 12410, 12411, 12412, 12413, 12414, 12415, 12416, 12417, 12418, 12419, 12420, 12421, 12422, 12423, 12424, 12425, 12426, 12427, 12428, 12429, 12430, 12431, 12432, 12433, 12434, 12435, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 114, 116, 117, 118, 119, 120, 121, 122, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90)
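To make the difference concrete, here is a small standalone sketch of the two possible readings of an orth entry (Python 2 to match the snippet further down; the function names and the toy table are mine, this is not a patch against mobi_dict.py):

Code:
import struct

def decode_ordt_8bit(data, ordt):
    # each byte of the orth entry is an offset into the ORDT code point list
    out = u""
    for x in data:
        off = ord(x)
        out += unichr(ordt[off]) if off < len(ordt) else unichr(off)
    return out

def decode_ordt_16bit(data, ordt):
    # each big-endian 16 bit word of the orth entry is an offset instead
    # (data must have an even number of bytes)
    out = u""
    for i in range(0, len(data), 2):
        off, = struct.unpack('>H', data[i:i+2])
        out += unichr(ordt[off]) if off < len(ordt) else unichr(off)
    return out

# toy table only -- paste in the real ORDT lists dumped above to experiment
toy_ordt = [0, 97, 98, 99]   # 0, 'a', 'b', 'c'
print decode_ordt_8bit("\x01\x02\x03", toy_ordt)                # -> abc
print decode_ordt_16bit("\x00\x01\x00\x02\x00\x03", toy_ordt)   # -> abc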
The problem is that the headers describing the ORDT tables for each dictionary show no significant differences that would tell me how wide each offset is supposed to be.

Here is the header that describes the table for sven.prc

Code:
Parsing metaOrthIndex
ocnt 0, oentries 183, op1 1180, op2 1552, otagx 192

len C0 nul1 0 type 0 gen 2 start 458 count 1F code FDEA lng 41D total CC41 ordt 0 ligt 0 nligt 0 nctoc 0

{'count': 31, 'nctoc': 0, 'code': 65002, 'nul1': 0, 'len': 192, 'ligt': 0, 'start': 1112, 'nligt': 0, 'ordt': 0, 'lng': 1053, 'total': 52289, 'type': 0, 'gen': 2}
and here is the header that describes the ORDT table for your sample dictionary

Code:
Parsing metaOrthIndex
ocnt 1, oentries 149, op1 248, op2 404, otagx 192

len C6 nul1 0 type 0 gen 2 start F0 count 1 code FDEA lng 407 total 4 ordt 0 ligt 0 nligt 0 nctoc 0

{'count': 1, 'nctoc': 0, 'code': 65002, 'nul1': 0, 'len': 198, 'ligt': 0, 'start': 240, 'nligt': 0, 'ordt': 0, 'lng': 1031, 'total': 4, 'type': 0, 'gen': 2}
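For what it is worth, a quick check over the two header dicts dumped above shows which fields actually differ (the dict literals below are copied straight from those dumps):

Code:
sven   = {'count': 31, 'nctoc': 0, 'code': 65002, 'nul1': 0, 'len': 192,
          'ligt': 0, 'start': 1112, 'nligt': 0, 'ordt': 0, 'lng': 1053,
          'total': 52289, 'type': 0, 'gen': 2}
sample = {'count': 1, 'nctoc': 0, 'code': 65002, 'nul1': 0, 'len': 198,
          'ligt': 0, 'start': 240, 'nligt': 0, 'ordt': 0, 'lng': 1031,
          'total': 4, 'type': 0, 'gen': 2}
for key in sorted(sven):
    if sven[key] != sample[key]:
        print key, sven[key], sample[key]
# only count, len, lng, start and total differ -- none of them tells
# you how wide the ORDT offsets are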
So I see nothing that will let me know if the offsets used are 16 bits wide or 8 bits wide.

Furthermore, using 16 bit wide offsets makes NO SENSE: you could simply store the entire unicode code point in that 16 bit value, so there is no savings in size at all and no reason to use any ORDT table. The ORDT table itself only has 183 entries, so a single byte offset would have more than enough room to do what it needs.

The whole size/space saving comes from having 8 bit offsets into one ORDT table which holds all of the needed multibyte values. Only if more than 256 different single characters are needed would you bother to use 16 bit values, but if you go that far you don't need the ORDT table approach at all: just use utf-16 encoded unicode.
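
As a back-of-the-envelope illustration (the numbers are just an example, assuming an entry of 10 characters drawn from a sub-256-entry ORDT table):

Code:
n = 10                                  # characters in one orth entry
print "8 bit offsets into ORDT  :", n * 1, "bytes"
print "16 bit offsets into ORDT :", n * 2, "bytes"
print "plain utf-16 code points :", n * 2, "bytes"
# the 16 bit offset scheme costs exactly as much as storing the code
# points directly, so the ORDT indirection buys nothing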

In addition, your sven.prc does not have the correct ocnt value. It should be 1 to indicate that an ORDT table is being used at all, and 0 otherwise. It was an "assert" on that value that caused it to fail to unpack in the first place.
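
Sketched as a simple consistency check (the function name and return convention are mine; the ocnt/oentries values come straight from the debug dumps above):

Code:
def ordt_header_consistent(ocnt, oentries):
    # ocnt should be 1 whenever ORDT entries are actually present
    if oentries > 0:
        return ocnt == 1
    return ocnt == 0

print ordt_header_consistent(0, 183)    # sven.prc          -> False
print ordt_header_consistent(1, 149)    # sample dictionary -> True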

To save space, the inflections are not actually stored as text. Instead, common prefix and suffix rules are stored, and those rules are used to recreate all of the inflections from the base orth entry without having to store them.
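
Just to illustrate the idea (this is NOT the actual MOBI rule encoding, only a toy version of rebuilding an inflection from a base entry plus a strip/add rule):

Code:
def apply_rule(base, strip, add):
    # drop the given suffix from the base orth entry, then append the new one
    if strip and base.endswith(strip):
        base = base[:-len(strip)]
    return base + add

print apply_rule(u"flicka", u"a", u"or")    # -> flickor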

But if those orth entries are wrong, then all inflections are wrong as well and you end up with gibberish.

So I can fix things so that your sample works, or I can fix things so that sven works, but until I find something that indicates which is correct or how to tell them apart, I am at a loss about how to proceed. Furthermore, I think there is something wrong with the sven dictionary: it uses 16 bit offsets into a table with only 183 entries, so it takes up much more room than it needs and actually wastes space.

Please let me know how to proceed. I may be able to hack my way around it by checking whether the length of the ORDT table is less than 256 and, if so, whether every other byte is a zero byte; in that case I could either ignore every other byte or treat the byte pairs as 16 bit offsets into what is only a small table.
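
Sketched out, that detection heuristic might look something like this (again Python 2; the function name is mine, it is not from mobi_dict.py):

Code:
def looks_like_16bit_offsets(entry, ordt):
    # only suspect 16 bit offsets when the table is small enough that
    # single byte offsets would have sufficed
    if len(ordt) >= 256 or len(entry) % 2 != 0:
        return False
    # every other byte being zero means the high bytes carry no information
    return all(ord(x) == 0 for x in entry[0::2])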

Here is that same snippet of code reworked to ignore every null (0) offset byte, but this may cause problems if a 0 offset really should be a true character.

Please try replacing that entire snippet with the following:

Code:
                    if hordt2 is not None:
                        # hordt2 holds the list of unicode code points from the ORDT table
                        # print text.encode('hex')
                        utext = u""
                        for x in text:
                            off, = struct.unpack('>B', x)
                            if off == 0:
                                # skip the null padding bytes (the 16 bit offset case)
                                continue
                            if off < len(hordt2):
                                # offset falls inside the table: look up the code point
                                utext += unichr(hordt2[off])
                            else:
                                # offset past the end of the table: treat it as a literal value
                                utext += unichr(off)
                        text = utext.encode('utf-8')
This will appear to work for both cases, but it is far too hackish for my liking.
Please check it against other dictionaries to see how it responds to them.

Please note: the mobi_dict.py code still can't deal with more than one mobi section of inflection rules, so if any dictionary needs more than one section, it will not report any inflections at all. You can see this in the debug output. I am guessing this is what is happening with some dictionaries. I would really need access to such a dictionary to debug how multiple sections of inflection rules are actually used.

Take care,

KevinH