MobileRead Forums - View Single Post

AlPe · 11-06-2012, 02:58 PM

EDIT: please ignore this post. As explained below, the "strange thing" was due to the timestamping mechanism of gzip.

One strange thing I noticed is that some of the gzipped chunks in an original Kobo dictionary (the files ending in ".html") seem either not to have been compressed with gzip or to have been altered after compression.

I tried uncompressing a couple of them and re-compressing them with gzip and with dictzip. The latter generates a file completely different from the original, so I rule out that Kobo used it for their original files.

On the other hand, re-compressing with gzip yields, in 2 cases out of 7, a file that differs from the original in a few bytes. Example:

Original file, taken from the Kobo dictionary:

Code:

$ hexdump -Cv original.si.html | head -n10
00000000  1f 8b 08 08 1d 12 67 50  00 03 73 69 2e 68 74 6d  |......gP..si.htm|
00000010  6c 00 c4 fd 4d af 24 d9  95 25 8a fd 15 2f 0e 14  |l...M.$..%.../..|
00000020  37 d0 9e 41 66 91 d5 5d  95 c1 ce 42 30 f2 43 f1  |7..Af..]...B0.C.|
00000030  94 99 cc ce 48 12 85 16  34 38 d7 dc dc ef 69 9a  |....H...48....i.|
00000040  db 71 9a b9 79 57 dd 81  40 bc 9e 14 a0 1a 74 43  |.q..yW..@.....tC|
00000050  52 a3 89 d7 0d 54 e3 01  7a 64 23 05 09 59 83 04  |R....T..zd#..Y..|
00000060  1f 34 ca 90 06 6c 02 ef  fd 06 f1 97 68 af b5 f7  |.4...l......h...|
00000070  39 76 cc dc cc af f9 cd  a8 16 50 c5 0c f7 eb ee  |9v........P.....|
00000080  66 76 3e f6 d9 1f 6b af  f5 e3 bf fc eb 7d b5 3a  |fv>...k......}.:|
00000090  95 4d eb 43 fd cf bf f7  ee b3 1f 7c 6f 55 d6 45  |.M.C.......|oU.E|

The same file, decompressed and recompressed:

Code:

$ hexdump -Cv si.html | head -n10
00000000  1f 8b 08 08 6e 6a 99 50  00 03 73 69 2e 68 74 6d  |....nj.P..si.htm|
00000010  6c 00 c4 fd 4d af 24 d9  95 25 8a fd 15 2f 0e 14  |l...M.$..%.../..|
00000020  37 d0 9e 41 66 91 d5 5d  95 c1 ce 42 30 f2 43 f1  |7..Af..]...B0.C.|
00000030  94 99 cc ce 48 12 85 16  34 38 d7 dc dc ef 69 9a  |....H...48....i.|
00000040  db 71 9a b9 79 57 dd 81  40 bc 9e 14 a0 1a 74 43  |.q..yW..@.....tC|
00000050  52 a3 89 d7 0d 54 e3 01  7a 64 23 05 09 59 83 04  |R....T..zd#..Y..|
00000060  1f 34 ca 90 06 6c 02 ef  fd 06 f1 97 68 af b5 f7  |.4...l......h...|
00000070  39 76 cc dc cc af f9 cd  a8 16 50 c5 0c f7 eb ee  |9v........P.....|
00000080  66 76 3e f6 d9 1f 6b af  f5 e3 bf fc eb 7d b5 3a  |fv>...k......}.:|
00000090  95 4d eb 43 fd cf bf f7  ee b3 1f 7c 6f 55 d6 45  |.M.C.......|oU.E|

While the two files have the same size:

Code:

$ ls -l si.html original.si.html 
-rw-r--r-- 1 xyz xyz 80555 Nov  6 20:52 original.si.html
-rw-r--r-- 1 xyz xyz 80555 Nov  6 20:52 si.html

As you can see, the differing bytes (offsets 0x04 to 0x06 in the dumps above) are in the "header" of the gzip file. Perhaps some particular gzip option must be used when compressing. I am not sure whether this is actually a problem for the functionality of the resulting dictionary, as I have not tested it yet.
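
For anyone who wants to check the header bytes themselves, here is a minimal Python sketch. It relies on the gzip format (RFC 1952) storing a 4-byte little-endian modification time at offsets 4 to 7 of the header, and decodes that field from the first 8 bytes of the two dumps above. The helper names (`gzip_mtime`, `differing_offsets`) and the sample payload are made up for illustration; this is not how Kobo builds its files, just a way to see what the differing bytes mean.

```python
import gzip
import struct
from datetime import datetime, timezone

# First 8 bytes of each file, copied from the two hexdumps above.
ORIGINAL_HEADER = bytes.fromhex("1f 8b 08 08 1d 12 67 50")
RECOMPRESSED_HEADER = bytes.fromhex("1f 8b 08 08 6e 6a 99 50")


def gzip_mtime(header: bytes) -> datetime:
    """Decode the gzip MTIME field: bytes 4-7, little-endian Unix time."""
    (seconds,) = struct.unpack("<I", header[4:8])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def differing_offsets(a: bytes, b: bytes) -> list:
    """Return the offsets at which two equal-length byte strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print("original:    ", gzip_mtime(ORIGINAL_HEADER))
print("recompressed:", gzip_mtime(RECOMPRESSED_HEADER))
print("differing offsets:", differing_offsets(ORIGINAL_HEADER, RECOMPRESSED_HEADER))

# Illustration with a made-up payload (hypothetical data, not a Kobo chunk):
# compressing the same bytes with two different timestamps changes only
# the MTIME field of the output.
payload = b"hypothetical dictionary chunk " * 100
first = gzip.compress(payload, mtime=1)
second = gzip.compress(payload, mtime=2)
print("offsets changed by mtime alone:", differing_offsets(first, second))
```

The two header values decode to 2012-09-29 15:22:05 UTC and 2012-11-06 19:52:14 UTC. The second would match the 20:52 shown by ls if the machine was on UTC+1, which is only a guess on my part. On the command line, gzip's `-n` (`--no-name`) option leaves both the timestamp and the original file name out of the header, so repeated runs give identical output; note that the original Kobo files do carry the name (`si.html`) in the header, so `-n` alone would not reproduce them byte for byte.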