08-22-2014, 06:03 PM | #946 |
Junior Member
Posts: 9
Karma: 166666
Join Date: Aug 2014
Device: Kindle PW2
|
Watermark header bug
Hi there!
First of all, thanks for the wonderful tool! Found a bug in mobi_header.py, namely: the contents of header 208 (watermark) are dumped as a string into the generated .opf. Since under certain circumstances this value can contain code points outside of the range allowable in XML, in those cases the resulting .opf becomes invalid. The fix is very simple: it just involves moving 208 from id_map_strings to id_map_hexstrings. Sample patch against 0.73 below. Code:
--- mobi_header.py.orig	2014-07-14 18:32:44.000000000 +0300
+++ mobi_header.py	2014-08-23 00:50:43.312531211 +0300
@@ -71,7 +71,6 @@
     129 : 'K8_Masthead/Cover_Image_(129)',
     132 : 'RegionMagnification_(132)',
     200 : 'DictShortName_(200)',
-    208 : 'Watermark_(208)',
     501 : 'cdeType_(501)',
     502 : 'last_update_time_(502)',
     503 : 'Updated_Title_(503)',
@@ -113,6 +112,7 @@
     404 : 'Text_to_Speech_Disabled_(404)',
 }
 id_map_hexstrings = {
+    208 : 'Watermark_(208_in_hex)',
     209 : 'Tamper_Proof_Keys_(209_in_hex)',
     300 : 'Font_Signature_(300_in_hex)',
 }
@@ -370,7 +370,6 @@
     129 : 'K8(129)_Masthead/Cover_Image',
     132 : 'RegionMagnification',
     200 : 'DictShortName',
-    208 : 'Watermark',
     501 : 'cdeType',
     502 : 'last_update_time',
     503 : 'Updated_Title',
@@ -412,6 +411,7 @@
     406 : 'Rental_Indicator',
 }
 id_map_hexstrings = {
+    208 : 'Watermark (hex)',
     209 : 'Tamper Proof Keys (hex)',
     300 : 'Font Signature (hex)',
     403 : 'Unknown_(403) (hex)',

Last edited by poxalew; 08-22-2014 at 06:13 PM. Reason: Clarified that the patch is against v0.73 |
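[Editor's note] To see why moving 208 to the hexstring map fixes the problem: XML 1.0 only permits a restricted set of code points, so a raw watermark value with control bytes in it cannot be embedded as text, but its hex encoding always can. A minimal Python 3 sketch of the idea (the helper name is hypothetical; KindleUnpack's hexstring handling simply hex-encodes these fields unconditionally):

```python
import binascii

def xml_safe(value: bytes) -> str:
    """Pass ordinary text through; hex-encode anything containing
    code points outside the XML 1.0 Char production."""
    text = value.decode('utf-8', errors='replace')

    def valid(cp):
        # XML 1.0 Char: tab/LF/CR plus the normal printable ranges
        return cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF \
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF

    if all(valid(ord(ch)) for ch in text):
        return text
    return binascii.hexlify(value).decode('ascii')

print(xml_safe(b'hello'))       # ordinary text is untouched
print(xml_safe(b'\x00\x01WM'))  # control bytes come out hex-encoded
```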
08-23-2014, 12:38 PM | #947 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi poxalew,
Your patch makes good sense. I will include it in the next official release. Thanks! KevinH |
08-26-2014, 04:29 PM | #948 | |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi lglgaigogo,
I really am not good at the dictionary stuff at all; I am not its original author. I did, however, make a change to mobi_dict.py to try to capture and read the ORDT table info. Although I really have no idea what the markup for dictionaries is supposed to look like, the idx:orth info now looks like it is deciphering properly. Will you please download the attached mobi_dict.py.zip, unzip it, and use it to replace its namesake in KindleUnpack_v073/lib/. Then try to unpack your dictionary, see if there is any improvement, and let me know what, if anything, remains to be fixed. Thanks, KevinH Quote:
|
|
08-28-2014, 10:40 AM | #949 |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Hi Kevin,
I tested the latest version of mobi_dict.py (by overwriting the same file in KindleUnpack_v073) with this Mobipocket dictionary and I got an Error: Unpacking Failed message. (The same file unpacked fine with the old mobi_dict.py version, but KindleUnpack wrote non-printable characters (hex 00, 03, 04, 05 etc.) to the .html file.) If your latest mobi_dict.py version requires a patched KindleUnpack_v073 version, can you please attach an unofficial test version that has all the required patches applied and includes your latest mobi_dict.py version? |
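[Editor's note] The stray control characters are easy to spot mechanically; a throwaway Python 3 check along these lines (not part of KindleUnpack) will list them in an unpacked .html file:

```python
import re

# control characters other than tab/LF/CR have no business in the output
CTRL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def find_control_chars(html: str):
    """Return (position, hex value) for every stray control character,
    e.g. the hex 00/03/04/05 bytes mentioned above."""
    return [(m.start(), hex(ord(m.group()))) for m in CTRL.finditer(html)]

sample = 'abc\x03def\x00'
print(find_control_chars(sample))  # [(3, '0x3'), (7, '0x0')]
```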
08-29-2014, 05:49 PM | #950 | |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
This new version of mobi_dict.py was based on stock 0.73; nothing else changed. I tested it against the original poster's Collins dictionary and it helps. I will grab your dictionary and see if I can figure out what I have messed up and try to get something that works for both. Thanks! KevinH Quote:
|
|
08-29-2014, 11:41 PM | #951 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
a better mobi_dict.py at least for your dict
Hi Doitsu,
Okay, here is a version of mobi_dict.py that will work for your sven.prc dictionary. I still have to see whether it will work for dictionaries in general; hopefully it will. Please note: this version of mobi_dict.py has tons of debugging turned on. We will turn that off once we know everything is working. Please give it a try and let me know if it works for you. Take care, KevinH |
08-30-2014, 02:52 AM | #952 | |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
The updated version worked great with my home-made, Mobipocket Creator-generated dictionary, but still seems to have problems with some Kindlegen-compiled dictionaries. For example, when I decompiled the DeDRMed version of the default monolingual Oxford Dictionary of English (B003WUYRGI_EBOK.azw) for testing purposes, the reverse-engineered .html file didn't contain a single <idx:infl> tag or multiple <idx:orth> tags, and I know for sure that this dictionary contains inflections. The script failed with the default monolingual French Kindle dictionary (B005F12G6U_EBOK.azw); I got the following error message: Code:
Read dictionary index data
Parsing dictionary index data 12200
ocnt 0, oentries 0, op1 0, op2 0, otagx 0
parsed INDX header: len C0 nul1 0 type 1 gen 0 start E82C count 9D8 code FFFFFFFF lng FFFFFFFF total 0 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 2520, 'nctoc': 0, 'code': 4294967295L, 'nul1': 0, 'len': 192, 'ligt': 0, 'start': 59436, 'nligt': 0, 'ordt': 0, 'lng': 4294967295L, 'total': 0, 'type': 1, 'gen': 0}
None
None
Error: unpack requires a string argument of length 0
Error: Unpacking Failed |
|
08-30-2014, 10:07 AM | #953 | |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
Does the old mobi_dict create the same error or is this error directly related to the changes I made? There are simply too many struct.unpack calls to know where this might be happening without access to a test case and adding lots of extra print statements. Take care, KevinH Quote:
|
|
08-30-2014, 10:16 AM | #954 |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
|
08-30-2014, 04:21 PM | #955 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
Please try the following change in mobi_dict.py. Change this: Code:
if hordt2 is not None:
    utext = u""
    n = len(text)/2
    offsets = struct.unpack('>%dH' % n, text)
    for off in offsets:
        if off < len(hordt2):
            utext += unichr(hordt2[off])
        else:
            utext += unichr(off)
    text = utext.encode('utf-8')

to: Code:
if hordt2 is not None and len(text) > 0:
    utext = u""
    n = len(text)/2
    offsets = struct.unpack('>%dH' % n, text)
    for off in offsets:
        if off < len(hordt2):
            utext += unichr(hordt2[off])
        else:
            utext += unichr(off)
    text = utext.encode('utf-8')

Just the first line is changed. I think text is a null string, which is freaking out struct.unpack(), so we are screening this case out. Please let me know if this helps. KevinH |
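[Editor's note] The "unpack requires a string argument of length 0" error is what Python 2's struct raises when the buffer length disagrees with the computed '>%dH' format, which is exactly what a degenerate text value produces here. The guard can be exercised in isolation; a standalone Python 3 rework of the snippet (same names, but integer division and a defensive slice added; a sketch, not the file's real context):

```python
import struct

def decode_ordt_text(text: bytes, hordt2):
    """Decode text via 16-bit ORDT offsets, skipping the lookup
    entirely when text is empty (the `len(text) > 0` screen)."""
    if hordt2 is not None and len(text) > 0:
        n = len(text) // 2
        offsets = struct.unpack('>%dH' % n, text[:n * 2])
        return ''.join(
            chr(hordt2[off]) if off < len(hordt2) else chr(off)
            for off in offsets)
    return text.decode('utf-8', errors='replace')

print(decode_ordt_text(b'', (0, 65, 66)))                  # no crash on empty input
print(decode_ordt_text(b'\x00\x01\x00\x02', (0, 65, 66)))  # 'AB'
```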
08-30-2014, 07:00 PM | #956 |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Hi KevinH,
Unfortunately, changing Code:

if hordt2 is not None:

to Code:

if hordt2 is not None and len(text) > 0:

didn't fix the problem. The full error log is below: Spoiler:
(I've got similar, longer messages for many of my other test files.)

I've created a small German-English proof-of-concept test dictionary for you, based on the latest recommendations from the Kindle Publishing Guidelines, that I compiled with both Mobigen and Kindlegen without error messages. Hopefully, this will make it easier for you to reverse-engineer the binary files, since you have access to the actual source files. The .zip file contains the source files and 3 binaries each: generated without compression (c0), with standard compression (c1), and with maximum compression (c2).

BTW, I believe one reason why the Swedish dictionary decompiled fine is that I used a redundant inflection syntax. I used, for example: Code:
<idx:entry>
<b><idx:orth>positiv
<idx:infl><idx:iform value="positivs"/></idx:infl>
<idx:infl><idx:iform value="positivet"/></idx:infl>
<idx:infl><idx:iform value="positivets"/></idx:infl>
<idx:infl><idx:iform value="positiven"/></idx:infl>
<idx:infl><idx:iform value="positivens"/></idx:infl>
<idx:infl><idx:iform value="positiv-"/></idx:infl>
</idx:orth>
</b>
<i>subst</i>
<br/>
absolute
</idx:entry>

instead of: Code:
<idx:entry>
<b><idx:orth>positiv
<idx:infl>
<idx:iform value="positivs"/>
<idx:iform value="positivet"/>
<idx:iform value="positivets"/>
<idx:iform value="positiven"/>
<idx:iform value="positivens"/>
<idx:iform value="positiv-"/>
</idx:infl>
</idx:orth>
</b>
<i>subst</i>
<br/>
absolute
</idx:entry>

Last edited by Doitsu; 08-30-2014 at 07:03 PM. |
08-31-2014, 02:38 PM | #957 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
A few things:

1) Compression has no impact on the issue, so use whatever compression you like.
2) Your sven.prc dictionary acts very differently from kindlegen-generated ones.

It is point 2) that is giving me fits. In your sven.prc, each character in each orth entry is replaced with a two-byte offset into the ORDT table. You take each 16-bit offset and use it to look up the right unicode code point value (stored in the ORDT table) to properly build up each orth entry. Here are the first few orth entries (hex encoded), where you can see the two-byte offset values:

005c 005e 0008 0006 0075 007a 0066 006e 0068 0071 005c 009c 0004 005c 009c 0068

and here is the table of unicode code points that each 16-bit quantity provides an offset into: Code:
(0, 37, 95, 45, 97, 111, 115, 12354, 32, 12353, 12355, 12356, 12357, 12358, 12359, 12360, 12361, 12362, 12363, 12364, 12365, 12366, 12367, 12368, 12369, 12370, 12371, 12372, 12373, 12374, 12375, 12376, 12377, 12378, 12379, 12380, 12381, 12382, 12383, 12384, 12385, 12386, 12387, 12388, 12389, 12390, 12391, 12392, 12393, 12394, 12395, 12396, 12397, 12398, 12399, 12400, 12401, 12402, 12403, 12404, 12405, 12406, 12407, 12408, 12409, 12410, 12411, 12412, 12413, 12414, 12415, 12416, 12417, 12418, 12419, 12420, 12421, 12422, 12423, 12424, 12425, 12426, 12427, 12428, 12429, 12430, 12431, 12432, 12433, 12434, 12435, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 114, 116, 117, 118, 119, 120, 121, 122, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 246, 228, 229, 46, 40, 41, 58, 224, 233, 33, 59, 47, 234, 63, 225, 237, 252, 44, 39, 34, 196, 337, 232, 231, 91, 214, 238, 227, 346, 347, 244, 8230, 197)

Here are the orth entries from your Sample Dictionary, again hex encoded:

03 6e 67 6b 03 67 70 7c 75 65 6a 87 05 70 70 06 74 67 6b 69 67 70

and here is the ORDT table to look these single-byte offsets up in: Code:
(0, 37, 95, 98, 111, 97, 115, 12354, 32, 12353, 12355, 12356, 12357, 12358, 12359, 12360, 12361, 12362, 12363, 12364, 12365, 12366, 12367, 12368, 12369, 12370, 12371, 12372, 12373, 12374, 12375, 12376, 12377, 12378, 12379, 12380, 12381, 12382, 12383, 12384, 12385, 12386, 12387, 12388, 12389, 12390, 12391, 12392, 12393, 12394, 12395, 12396, 12397, 12398, 12399, 12400, 12401, 12402, 12403, 12404, 12405, 12406, 12407, 12408, 12409, 12410, 12411, 12412, 12413, 12414, 12415, 12416, 12417, 12418, 12419, 12420, 12421, 12422, 12423, 12424, 12425, 12426, 12427, 12428, 12429, 12430, 12431, 12432, 12433, 12434, 12435, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 114, 116, 117, 118, 119, 120, 121, 122, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90)

Here is the header that describes the table for sven.prc: Code:
Parsing metaOrthIndex
ocnt 0, oentries 183, op1 1180, op2 1552, otagx 192
len C0 nul1 0 type 0 gen 2 start 458 count 1F code FDEA lng 41D total CC41 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 31, 'nctoc': 0, 'code': 65002, 'nul1': 0, 'len': 192, 'ligt': 0, 'start': 1112, 'nligt': 0, 'ordt': 0, 'lng': 1053, 'total': 52289, 'type': 0, 'gen': 2}

and here is the corresponding header from your Sample Dictionary: Code:
Parsing metaOrthIndex
ocnt 1, oentries 149, op1 248, op2 404, otagx 192
len C6 nul1 0 type 0 gen 2 start F0 count 1 code FDEA lng 407 total 4 ordt 0 ligt 0 nligt 0 nctoc 0
{'count': 1, 'nctoc': 0, 'code': 65002, 'nul1': 0, 'len': 198, 'ligt': 0, 'start': 240, 'nligt': 0, 'ordt': 0, 'lng': 1031, 'total': 4, 'type': 0, 'gen': 2}

Furthermore, using 16-bit-wide offsets makes NO SENSE, as you could put the entire actual unicode code point in that 16-bit value, so there is no savings in size at all and no reason to use any ORDT table. The ORDT table itself only has 183 entries, so a single-byte offset would have more than enough room to do what it needs. The whole size/space-saving value comes from having 8-bit offsets into one ORDT table which holds all of the needed multibyte values. Only if more than 256 different single characters are needed would you bother to use 16-bit values, but if you go that far you don't need the ORDT table approach at all; just use utf-16 encoded unicode.

In addition, your sven.prc does not have the correct ocnt value. It should be 1 to indicate that an ORDT table is being used in the first place, and 0 otherwise. It was an "assert" about that that caused it to fail to unpack in the first place.

To save space, the inflections are not actually stored as text. Instead, common prefix and suffix rules are stored, which are used to recreate all of the inflections from the base orth entry without having to store them. But if those orth entries are wrong, then all inflections are wrong as well and you end up with gibberish.

So I can fix things so that your sample works, or I can fix things so that sven works, but until I can find something to indicate which is correct or how to tell them apart, I am at a loss about how to proceed. Furthermore, I think there is something wrong with the sven dictionary, as it uses 16-bit offsets into a table with only 183 entries and so takes up much more room than it needs and actually wastes space.
Please let me know how to proceed. I may be able to hack my way around it by checking whether the length of the ORDT table is less than 256 and, if so, whether every other character is a zero byte; in that case, either ignore every other character or use a 16-bit offset into what is only a small table. Here is that same snippet of code reworked to ignore every null 0 offset byte, though this may cause problems if a 0 offset really should be a true character. Please try replacing that entire snippet with the following: Code:
if hordt2 is not None:
    # print text.encode('hex')
    utext = u""
    for x in text:
        off, = struct.unpack('>B', x)
        if off == 0:
            continue
        if off < len(hordt2):
            utext += unichr(hordt2[off])
        else:
            utext += unichr(off)
    text = utext.encode('utf-8')

Please check it against other dictionaries to see how it responds to them. Please note: the mobi_dict.py code still can't deal with more than one mobi section of rules for doing inflections, so if any dictionary needs more than one section it will stop reporting any inflections. You can see this in the debug output. I am guessing this is what is happening with some dictionaries. I would really need access to such a dictionary to debug how multiple sections of inflection rules are actually used. Take care, KevinH |
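[Editor's note] The two layouts described above, one-byte offsets (the sample dictionary) versus two-byte offsets (sven.prc), can be sketched side by side with a toy table. A Python 3 illustration (the function name and table are made up for this example, not taken from mobi_dict.py):

```python
import struct

def decode_orth(raw: bytes, ordt, width: int) -> str:
    """Decode one orth entry: each `width`-byte value is an offset
    into the ORDT table of unicode code points; offsets past the
    end of the table are treated as literal code points."""
    fmt = {1: '>B', 2: '>H'}[width]
    chars = []
    for i in range(0, len(raw), width):
        (off,) = struct.unpack(fmt, raw[i:i + width])
        code = ordt[off] if off < len(ordt) else off
        chars.append(chr(code))
    return ''.join(chars)

# toy ORDT table, NOT the real 183-entry one quoted above
ordt = (0, 37, 95, 98, 111, 97, 115)

print(decode_orth(b'\x03\x04\x03', ordt, 1))      # 8-bit offsets -> 'bob'
print(decode_orth(b'\x00\x05\x00\x06', ordt, 2))  # 16-bit offsets -> 'as'
```

Note how the 16-bit form spends two bytes per character to index a table that a single byte could cover, which is the wasted space the post complains about.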
08-31-2014, 07:12 PM | #958 | |||
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Hi KevinH,
Quote:
Ideally, KindleUnpack should be able to decompile home-made dictionaries with inflections coded according to the Kindle Guidelines and compiled with Kindlegen, which the version that you included in your latest post definitely does. AFAIK, this is the very first version of mobi_dict.py to do so, which is no small feat!

I've tested your latest version with my test dictionary and the default monolingual Kindle dictionaries. The results were as follows:

1. Test dictionary: almost perfectly reverse-engineered. The only thing missing was the spell="yes" attribute, but I'm not sure if this attribute actually does anything.
2. French dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.
3. Spanish dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.
4. Portuguese dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.
5. Italian dictionary: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message was displayed, but unpacking succeeded.
6. German dictionary: unpacking failed shortly after displaying "Error: Dictionary contains multiple inflection index sections, which is not yet supported". The full error log is here: Spoiler:
7. UK English dictionary: unpacking failed shortly after displaying: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" The full error log is here: Spoiler:
Quote:
Quote:
Technically speaking, sven.prc is a test file with multiple inflection index sections, because each inflection is wrapped in its own group, even though the extra groups weren't strictly necessary. |
|||
08-31-2014, 08:04 PM | #959 |
Sigil Developer
Posts: 8,099
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
Indentation is critical in Python, so if a line is indented too much it will actually change what the code means. Please verify that the line immediately after the piece we have been working on,

tagMap = getTagMap(controlByteCount, tagTable, data, startPos+1+textLength, endPos)

is indented to the exact same amount as the very beginning of the line:

if hordt2 is not None:

If it is indented further, it will be included in the if statement when in fact we want it after the if. Your other dictionaries worked because they used ORDT tables, but the two that failed do not, which means the working ones took the if path; that tagMap line, however, should be run in both cases (i.e., it is not part of the if). That leads me to believe the indentation of that tagMap line may have been messed up during editing. Please verify if that is the case. If so, when I get a free moment I will clean up the code and include a new mobi_dict in the next release.

I am heading out of town for a few days, so I hope this does the trick. Otherwise, I will look at it when I am back and have some free time. Take care, KevinH

Last edited by KevinH; 08-31-2014 at 08:08 PM. |
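[Editor's note] The indentation pitfall described here is easy to demonstrate in isolation; a generic Python 3 illustration (toy functions, nothing from mobi_dict.py):

```python
def tagmap_inside_if(flag):
    steps = []
    if flag:
        steps.append('ordt-decode')
        steps.append('tagMap')   # over-indented: only runs when flag is True
    return steps

def tagmap_after_if(flag):
    steps = []
    if flag:
        steps.append('ordt-decode')
    steps.append('tagMap')       # dedented: runs on both paths
    return steps

# dictionaries without an ORDT table take the flag=False path
print(tagmap_inside_if(False))  # [] -- tagMap never runs
print(tagmap_after_if(False))   # ['tagMap']
```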
09-01-2014, 06:42 AM | #960 | |
Grand Sorcerer
Posts: 5,635
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Hi KevinH,
Quote:
In case you want to have another look at dictionaries with multiple inflection groups, I've created another test dictionary that contains two entries with two inflection groups and two entries with one inflection group. This test file decompiled fine, which surprised me a bit, since I had expected to get an "Error: Dictionary contains multiple inflection index sections, which is not yet supported" message for my test file. I'm wondering what kind of dictionary syntax actually triggers this error message.

Since the OP that started this part of the thread reported issues with Asian characters, I also tested the updated mobi_dict.py version with a Japanese test dictionary. Unfortunately, your updated version seems to have problems with non-Latin characters. For example, the original entry definition was: Code:
<idx:entry name="japanese" scriptable="yes">
<idx:orth>猫
<idx:infl>
<idx:iform value="貓"/>
<idx:iform value="ねこ"/>
<idx:iform value="ネコ"/>
</idx:infl>
</idx:orth><br/>
chat (m)
</idx:entry>

which was decompiled as: Code:
<idx:entry scriptable="yes">
<idx:orth value="s+">
<idx:infl>
</idx:infl>
</idx:orth>猫<br/>
chat (m)
</idx:entry>
Spoiler:
A similar problem also occurred with a Greek-English dictionary. Kindleunpack reported: "Error: Dictionary contains multiple inflection index sections, which is not yet supported" and wrote garbage characters in idx:orth throughout the file. For example: Code:
<idx:orth value="しえ-À-¹-¼-*-»-µ-¹-±">
Last edited by Doitsu; 09-01-2014 at 06:45 AM. |
|