MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

Doitsu · 08-31-2014, 07:12 PM

Hi KevinH,

Quote:

Originally Posted by KevinH

So I can fix things so that your sample works or I can fix things so that sven works but until I can find something to indicate which is correct or how to tell them apart, I am at a loss about how to proceed.
[...]
Furthermore, I think there is something wrong with the sven dictionary as it uses 16 bit offsets into a table with only 183 entries and so takes up much more room than it needs and actually wastes space.

Since the source code of sven.prc wasn't written according to Amazon's guidelines, it's no wonder it wastes space; you shouldn't spend your time making KindleUnpack work with it. (I only mentioned it because I created it myself and had the source code for it.)

Ideally, KindleUnpack should be able to decompile home-made dictionaries whose inflections were coded according to the Kindle guidelines and compiled with KindleGen, which the version that you included in your latest post definitely does.

AFAIK, this is the very first version of mobi_dict.py to do so, which is no small feat!

I've tested your latest version with my test dictionary and the default monolingual Kindle dictionaries.

The results were as follows:

1. Test dictionary: almost perfectly reverse-engineered. The only thing missing was the spell="yes" attribute, but I'm not sure if this attribute actually does anything.

2. French dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.

3. Spanish dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.

4. Portuguese dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.

5. Italian dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.

6. German dictionary: unpacking failed shortly after displaying: "Error: Dictionary contains multiple inflection index sections, which is not yet supported"

The full error log is here:

Spoiler:

Code:

Info: Document contains orthographic index, handle as dictionary
Parsing metaInflIndexData
ocnt 0, oentries 0, op1 0, op2 0, otagx 192
parsed INDX header:
len C0 nul1 2 type 0 gen 2 start FC count 4 code 4E4 lng FFFFFFFF total 36FB ordt 0 ligt 0 nligt 0 nctoc 1
{'count': 4, 'nctoc': 1, 'code': 1252, 'nul1': 2, 'len': 192, 'ligt': 0, 'start': 252, 'nligt': 0, 'ordt': 0, 'lng': 4294967295L, 'total': 14075, 'type': 0, 'gen': 2} None None
Error: Dictionary contains multiple inflection index sections, which is not yet supported
Parsing inflIndexData
ocnt 0, oentries 0, op1 0, op2 0, otagx 0
parsed INDX header:
len C0 nul1 2 type 1 gen 0 start DC54 count FD0 code FFFFFFFF lng FFFFFFFF total 0 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 4048, 'nctoc': 0, 'code': 4294967295L, 'nul1': 2, 'len': 192, 'ligt': 0, 'start': 56404, 'nligt': 0, 'ordt': 0, 'lng': 4294967295L, 'total': 0, 'type': 1, 'gen': 0} None None
inflectionTagTable: [(7, 2, 3, 0), (0, 0, 0, 1)]
Error: Dictionary uses obsolete inflection rule scheme which is not yet supported
Parsing metaOrthIndex
ocnt 0, oentries 0, op1 0, op2 0, otagx 192
parsed INDX header:
len C3 nul1 0 type 0 gen 2 start 440 count 3C code 4E4 lng 407 total 23F91 ordt 4BC ligt 5C0 nligt 5 nctoc 1
{'count': 60, 'nctoc': 1, 'code': 1252, 'nul1': 0, 'len': 195, 'ligt': 1472, 'start': 1088, 'nligt': 5, 'ordt': 1212, 'lng': 1031, 'total': 147345, 'type': 0, 'gen': 2} None None
orthIndexCount is 60
orthTagTable: [(1, 1, 1, 0), (2, 1, 2, 0), (5, 1, 4, 0), (22, 1, 8, 0), (25, 1, 16, 0), (8, 2, 32, 0), (69, 1, 192, 0), (0, 0, 0, 1)]
Read dictionary index data
Parsing dictionary index data 15627
ocnt 0, oentries 0, op1 0, op2 0, otagx 0
parsed INDX header:
len C0 nul1 0 type 1 gen 0 start E68C count AB4 code FFFFFFFF lng FFFFFFFF total 0 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 2740, 'nctoc': 0, 'code': 4294967295L, 'nul1': 0, 'len': 192, 'ligt': 0, 'start': 59020, 'nligt': 0, 'ordt': 0, 'lng': 4294967295L, 'total': 0, 'type': 1, 'gen': 0} None None
Error: local variable 'tagMap' referenced before assignment


Error: Unpacking Failed

7. UK English dictionary: unpacking failed shortly after displaying: "Error: Dictionary contains multiple inflection index sections, which is not yet supported"

The full error log is here:

Spoiler:

Code:

Info: Document contains orthographic index, handle as dictionary
Parsing metaInflIndexData
ocnt 0, oentries 0, op1 0, op2 0, otagx 192
parsed INDX header:
len C0 nul1 2 type 0 gen 2 start DC count 1 code 4E4 lng FFFFFFFF total 463 ordt 0 ligt 0 nligt 0 nctoc 1
{'count': 1, 'nctoc': 1, 'code': 1252, 'nul1': 2, 'len': 192, 'ligt': 0, 'start': 220, 'nligt': 0, 'ordt': 0, 'lng': 4294967295L, 'total': 1123, 'type': 0, 'gen': 2} None None
Parsing inflIndexData
ocnt 0, oentries 0, op1 0, op2 0, otagx 0
parsed INDX header:
len C0 nul1 2 type 1 gen 0 start 3B98 count 463 code FFFFFFFF lng FFFFFFFF total 0 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 1123, 'nctoc': 0, 'code': 4294967295L, 'nul1': 2, 'len': 192, 'ligt': 0, 'start': 15256, 'nligt': 0, 'ordt': 0, 'lng': 4294967295L, 'total': 0, 'type': 1, 'gen': 0} None None
inflectionTagTable: [(7, 2, 3, 0), (0, 0, 0, 1)]
Error: Dictionary uses obsolete inflection rule scheme which is not yet supported
Parsing metaOrthIndex
ocnt 0, oentries 0, op1 0, op2 0, otagx 192
parsed INDX header:
len C8 nul1 0 type 0 gen 2 start 2C8 count 28 code 4E4 lng 809 total 23899 ordt 31C ligt 420 nligt 5 nctoc 1
{'count': 40, 'nctoc': 1, 'code': 1252, 'nul1': 0, 'len': 200, 'ligt': 1056, 'start': 712, 'nligt': 5, 'ordt': 796, 'lng': 2057, 'total': 145561, 'type': 0, 'gen': 2} None None
orthIndexCount is 40
orthTagTable: [(1, 1, 1, 0), (5, 1, 2, 0), (22, 1, 4, 0), (25, 1, 8, 0), (69, 1, 16, 0), (0, 0, 0, 1)]
Info: Index doesn't contain entry length tags
Read dictionary index data
Parsing dictionary index data 16740
ocnt 0, oentries 0, op1 0, op2 0, otagx 0
parsed INDX header:
len C0 nul1 0 type 1 gen 0 start DD18 count F69 code FFFFFFFF lng FFFFFFFF total 0 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 3945, 'nctoc': 0, 'code': 4294967295L, 'nul1': 0, 'len': 192, 'ligt': 0, 'start': 56600, 'nligt': 0, 'ordt': 0, 'lng': 4294967295L, 'total': 0, 'type': 1, 'gen': 0} None None
Error: local variable 'tagMap' referenced before assignment


Error: Unpacking Failed
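
For anyone comparing the two logs: each "parsed INDX header:" line prints the fields in hex, and the dict on the following line shows the same values in decimal. Here is a small sketch for converting one into the other. The helper name is made up for illustration and is not part of KindleUnpack; reading `code` as a Windows code page (1252) and `lng` as a Windows locale ID (0x407 German, 0x809 UK English) is my assumption based on the values.

```python
def parse_indx_debug_line(line):
    """Turn 'len C0 nul1 2 ...' into {'len': 192, 'nul1': 2, ...}.

    Hypothetical helper for reading the debug output only.
    """
    tokens = line.split()
    # Tokens alternate between a field name and its hexadecimal value.
    return {name: int(value, 16)
            for name, value in zip(tokens[0::2], tokens[1::2])}


# metaOrthIndex header line copied from the German dictionary log.
german_meta_orth = (
    "len C3 nul1 0 type 0 gen 2 start 440 count 3C code 4E4 "
    "lng 407 total 23F91 ordt 4BC ligt 5C0 nligt 5 nctoc 1"
)

header = parse_indx_debug_line(german_meta_orth)
print(header["code"], header["lng"], header["count"], header["total"])
```

The decimal values agree with the dict printed in the log (code 1252, lng 1031, count 60, total 147345).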

Quote:

Originally Posted by KevinH

Please let me know how to proceed. I may be able to hack my way around it by checking if the length of the ORDT table is less than 256 and if so, if every other character is a zero byte, then either ignore every other character or use a 16 bit offset into what is only a small table.
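
If you do want to try that workaround, the check could look roughly like the sketch below. It is only a guess at what you describe: I'm assuming "length" means the number of 16-bit entries, and since I don't know which byte of each pair is the zero one, the function (a made-up name, not existing KindleUnpack code) tries both.

```python
def collapse_ordt(raw):
    """Return an 8-bit table if `raw` looks like zero-padded 16-bit entries.

    Hypothetical sketch of the heuristic, not code from mobi_dict.py.
    """
    entries = len(raw) // 2
    if len(raw) % 2 == 0 and 0 < entries < 256:
        if not any(raw[0::2]):
            # Zero bytes at even positions (big-endian high bytes).
            return raw[1::2]
        if not any(raw[1::2]):
            # Zero bytes at odd positions.
            return raw[0::2]
    # Otherwise leave the table untouched.
    return raw


print(collapse_ordt(b"\x00a\x00b\x00c"))
```

A table that doesn't match the pattern is returned unchanged, so a check like this shouldn't affect dictionaries compiled the regular way.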

IMHO, your latest version is almost perfect. All that's left is figuring out why the German and UK English dictionaries failed to unpack. Once that's done, you'll only need to remove the debug code and your code is ready for pre-release testing.
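
Both failing logs end with the same Python error about 'tagMap', right after the "obsolete inflection rule scheme" message. I haven't traced mobi_dict.py, so this is only a guess, but that error is what Python raises when a local variable is assigned in only one branch and then used regardless. A minimal, hypothetical reproduction (function and argument names are invented):

```python
def read_index(rules_supported):
    # Hypothetical stand-in, NOT the actual mobi_dict.py logic.
    if rules_supported:
        tagMap = {"inflections": []}
    # If the branch above was skipped, tagMap was never bound.
    return tagMap


def read_index_fixed(rules_supported):
    # Binding a default before the branch avoids the error.
    tagMap = None
    if rules_supported:
        tagMap = {"inflections": []}
    return tagMap


try:
    read_index(False)
    raised = False
except UnboundLocalError:
    raised = True

print(raised, read_index_fixed(False))
```

If something like this is the cause, giving tagMap a default value before the inflection handling might be enough to let the German and UK English dictionaries unpack without inflections.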

Quote:

Originally Posted by KevinH

Please note: the mobi_dict.py code still can't deal with more than one mobi section of rules for doing inflections so if any dictionary needs more than one section it will stop from reporting any inflections. You can see this in the debug output. I am guessing this is what is happening with some dictionaries. I would really need access to such a dictionary to debug how multiple sections of inflection rules are actually used.

IMHO, support for multiple inflection index sections is a nice-to-have feature, but it shouldn't be a top priority.
Technically speaking, sven.prc is a test file with multiple inflection index sections, because each inflection is wrapped in its own group, even though the separate groups weren't strictly necessary.