EDIT:
Please ignore this post. As explained below, the "strange thing" was caused by the timestamp that gzip stores in its header.
One strange thing I noticed is that some of the gzipped chunks in an original Kobo dictionary (the files ending in ".html") seem either not to have been compressed with gzip or to have been altered after compression.
I uncompressed a couple of them and re-compressed them with gzip and with dictzip. The latter generates a file completely different from the original, so I rule out that Kobo used it for their original files.
On the other hand, in 2 cases out of 7, re-compressing with gzip yields a file that differs from the original in a few bytes. Example:
Original file, taken from the Kobo dictionary:
Code:
$ hexdump -Cv original.si.html | head -n10
00000000 1f 8b 08 08 1d 12 67 50 00 03 73 69 2e 68 74 6d |......gP..si.htm|
00000010 6c 00 c4 fd 4d af 24 d9 95 25 8a fd 15 2f 0e 14 |l...M.$..%.../..|
00000020 37 d0 9e 41 66 91 d5 5d 95 c1 ce 42 30 f2 43 f1 |7..Af..]...B0.C.|
00000030 94 99 cc ce 48 12 85 16 34 38 d7 dc dc ef 69 9a |....H...48....i.|
00000040 db 71 9a b9 79 57 dd 81 40 bc 9e 14 a0 1a 74 43 |.q..yW..@.....tC|
00000050 52 a3 89 d7 0d 54 e3 01 7a 64 23 05 09 59 83 04 |R....T..zd#..Y..|
00000060 1f 34 ca 90 06 6c 02 ef fd 06 f1 97 68 af b5 f7 |.4...l......h...|
00000070 39 76 cc dc cc af f9 cd a8 16 50 c5 0c f7 eb ee |9v........P.....|
00000080 66 76 3e f6 d9 1f 6b af f5 e3 bf fc eb 7d b5 3a |fv>...k......}.:|
00000090 95 4d eb 43 fd cf bf f7 ee b3 1f 7c 6f 55 d6 45 |.M.C.......|oU.E|
The same file, decompressed and recompressed:
Code:
$ hexdump -Cv si.html | head -n10
00000000 1f 8b 08 08 6e 6a 99 50 00 03 73 69 2e 68 74 6d |....nj.P..si.htm|
00000010 6c 00 c4 fd 4d af 24 d9 95 25 8a fd 15 2f 0e 14 |l...M.$..%.../..|
00000020 37 d0 9e 41 66 91 d5 5d 95 c1 ce 42 30 f2 43 f1 |7..Af..]...B0.C.|
00000030 94 99 cc ce 48 12 85 16 34 38 d7 dc dc ef 69 9a |....H...48....i.|
00000040 db 71 9a b9 79 57 dd 81 40 bc 9e 14 a0 1a 74 43 |.q..yW..@.....tC|
00000050 52 a3 89 d7 0d 54 e3 01 7a 64 23 05 09 59 83 04 |R....T..zd#..Y..|
00000060 1f 34 ca 90 06 6c 02 ef fd 06 f1 97 68 af b5 f7 |.4...l......h...|
00000070 39 76 cc dc cc af f9 cd a8 16 50 c5 0c f7 eb ee |9v........P.....|
00000080 66 76 3e f6 d9 1f 6b af f5 e3 bf fc eb 7d b5 3a |fv>...k......}.:|
00000090 95 4d eb 43 fd cf bf f7 ee b3 1f 7c 6f 55 d6 45 |.M.C.......|oU.E|
Yet the two files have exactly the same size:
Code:
$ ls -l si.html original.si.html
-rw-r--r-- 1 xyz xyz 80555 Nov 6 20:52 original.si.html
-rw-r--r-- 1 xyz xyz 80555 Nov 6 20:52 si.html
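To pinpoint exactly which bytes differ, rather than eyeballing two hexdumps, `cmp -l` can be used. The following is only a sketch: it does not use the real Kobo file but a tiny made-up `si.html`, compressed twice with two different modification times (chosen here as an assumption, to mimic the situation above). It also assumes GNU gzip, which stores the input file's modification time in the header.

```shell
#!/bin/sh
# Hypothetical reproduction: identical content, two different modification times.
tmp=$(mktemp -d)
printf '<html>sample entry</html>\n' > "$tmp/si.html"

# First compression, with the file's mtime set to 2012-09-29 15:22:05 UTC.
TZ=UTC touch -t 201209291522.05 "$tmp/si.html"
gzip -c "$tmp/si.html" > "$tmp/a.gz"

# Second compression of the same content, mtime set to 2012-11-06 19:52:14 UTC.
TZ=UTC touch -t 201211061952.14 "$tmp/si.html"
gzip -c "$tmp/si.html" > "$tmp/b.gz"

# cmp -l prints one line per differing byte: 1-based offset, then both
# byte values in octal. Keep only the offsets.
diff_offsets=$(cmp -l "$tmp/a.gz" "$tmp/b.gz" | awk '{print $1}' | tr '\n' ' ')
echo "differing offsets: $diff_offsets"

rm -rf "$tmp"
```

Note that `cmp -l` counts offsets from 1, while `hexdump -C` counts from 0, so offset 5 here corresponds to 0x04 in a hexdump.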
As you can see, the differing bytes (offsets 0x04 to 0x06) are in the header of the gzip file. Perhaps some particular gzip option must be used when compressing. I am not sure whether this actually affects the functionality of the resulting dictionary, as I have not tested it yet.
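For reference, the gzip format (RFC 1952) stores a 4-byte modification time, MTIME, in little-endian order at offsets 4 to 7 of the header, which is exactly where the two dumps differ. A minimal sketch for decoding that field follows. The helper name `gzip_mtime` is made up for illustration, and the two `.hdr` files contain only the first 8 bytes copied from the hexdumps above, not the real dictionary data.

```shell
#!/bin/sh
# Print the MTIME field of a gzip file as seconds since the Unix epoch.
# Per RFC 1952, MTIME is stored little-endian in bytes 4..7 (0-based).
gzip_mtime() {
    # od prints the four bytes as unsigned decimals; reassemble them by hand
    # so the result does not depend on the host byte order.
    set -- $(od -An -tu1 -j4 -N4 "$1")
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Recreate only the first 8 header bytes from the two hexdumps (octal escapes).
tmp=$(mktemp -d)
printf '\037\213\010\010\035\022\147\120' > "$tmp/original.hdr"      # 1f 8b 08 08 1d 12 67 50
printf '\037\213\010\010\156\152\231\120' > "$tmp/recompressed.hdr"  # 1f 8b 08 08 6e 6a 99 50

orig_ts=$(gzip_mtime "$tmp/original.hdr")
new_ts=$(gzip_mtime "$tmp/recompressed.hdr")
echo "original:     $orig_ts"
echo "recompressed: $new_ts"

rm -rf "$tmp"
```

With GNU date, `date -u -d @"$new_ts"` turns the value into a readable date. The recompressed header decodes to 6 Nov 2012, 19:52:14 UTC, which would match the 20:52 shown by `ls` if the machine was one hour ahead of UTC. If byte-identical output is wanted regardless of timestamps, `gzip -n` omits both the timestamp and the original file name from the header, although the result would then still differ from the Kobo files, which do store the name `si.html`.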