Old 11-06-2012, 02:58 PM   #36
AlPe
Digital Amanuensis
EDIT: please ignore this post. As explained below, the "strange thing" was due to the timestamping mechanism of gzip.

One strange thing I noticed is that some of the gzipped chunks in an original Kobo dictionary (the files ending in ".html") seem either not to have been compressed with gzip or to have been altered after compression.

I decompressed a couple of them and recompressed them with both gzip and dictzip. dictzip produces a file completely different from the original, so I rule out that Kobo used it for their original files.

On the other hand, in 2 cases out of 7, recompressing with gzip produces a file that differs from the original in a few bytes. Example:

Original file, taken from the Kobo dictionary:
Code:
$ hexdump -Cv original.si.html | head -n10
00000000  1f 8b 08 08 1d 12 67 50  00 03 73 69 2e 68 74 6d  |......gP..si.htm|
00000010  6c 00 c4 fd 4d af 24 d9  95 25 8a fd 15 2f 0e 14  |l...M.$..%.../..|
00000020  37 d0 9e 41 66 91 d5 5d  95 c1 ce 42 30 f2 43 f1  |7..Af..]...B0.C.|
00000030  94 99 cc ce 48 12 85 16  34 38 d7 dc dc ef 69 9a  |....H...48....i.|
00000040  db 71 9a b9 79 57 dd 81  40 bc 9e 14 a0 1a 74 43  |.q..yW..@.....tC|
00000050  52 a3 89 d7 0d 54 e3 01  7a 64 23 05 09 59 83 04  |R....T..zd#..Y..|
00000060  1f 34 ca 90 06 6c 02 ef  fd 06 f1 97 68 af b5 f7  |.4...l......h...|
00000070  39 76 cc dc cc af f9 cd  a8 16 50 c5 0c f7 eb ee  |9v........P.....|
00000080  66 76 3e f6 d9 1f 6b af  f5 e3 bf fc eb 7d b5 3a  |fv>...k......}.:|
00000090  95 4d eb 43 fd cf bf f7  ee b3 1f 7c 6f 55 d6 45  |.M.C.......|oU.E|
The same file, decompressed and recompressed:
Code:
$ hexdump -Cv si.html | head -n10
00000000  1f 8b 08 08 6e 6a 99 50  00 03 73 69 2e 68 74 6d  |....nj.P..si.htm|
00000010  6c 00 c4 fd 4d af 24 d9  95 25 8a fd 15 2f 0e 14  |l...M.$..%.../..|
00000020  37 d0 9e 41 66 91 d5 5d  95 c1 ce 42 30 f2 43 f1  |7..Af..]...B0.C.|
00000030  94 99 cc ce 48 12 85 16  34 38 d7 dc dc ef 69 9a  |....H...48....i.|
00000040  db 71 9a b9 79 57 dd 81  40 bc 9e 14 a0 1a 74 43  |.q..yW..@.....tC|
00000050  52 a3 89 d7 0d 54 e3 01  7a 64 23 05 09 59 83 04  |R....T..zd#..Y..|
00000060  1f 34 ca 90 06 6c 02 ef  fd 06 f1 97 68 af b5 f7  |.4...l......h...|
00000070  39 76 cc dc cc af f9 cd  a8 16 50 c5 0c f7 eb ee  |9v........P.....|
00000080  66 76 3e f6 d9 1f 6b af  f5 e3 bf fc eb 7d b5 3a  |fv>...k......}.:|
00000090  95 4d eb 43 fd cf bf f7  ee b3 1f 7c 6f 55 d6 45  |.M.C.......|oU.E|
While...
Code:
$ ls -l si.html original.si.html 
-rw-r--r-- 1 xyz xyz 80555 Nov  6 20:52 original.si.html
-rw-r--r-- 1 xyz xyz 80555 Nov  6 20:52 si.html
As you can see, the two files have the same size, and the differing bytes (offsets 0x04 to 0x06) are in the "header" of the gzip file. Perhaps gzip must be invoked with some particular option when compressing. I am not sure whether this actually affects the functionality of the resulting dictionary, as I have not tested it yet.
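
For anyone who wants to check this without reading hex by eye, here is a minimal Python sketch. It assumes the standard 10-byte gzip header layout from RFC 1952 (magic, method, flags, 4-byte little-endian modification time, extra flags, OS), and it uses only the header bytes shown in the two hexdumps above. The helper names `parse_gzip_header` and `gzip_bytes` and the sample payload are made up for illustration; this is not how Kobo builds its files. The second half compresses the same data twice with Python's own gzip module, changing only the stored timestamp, to show which offsets move.

```python
import gzip
import io
import struct
from datetime import datetime, timezone


def parse_gzip_header(first_bytes: bytes) -> dict:
    """Decode the fixed 10-byte gzip header (layout per RFC 1952)."""
    magic, method, flags, mtime, xfl, os_id = struct.unpack(
        "<2sBBIBB", first_bytes[:10]
    )
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    return {"method": method, "flags": flags, "mtime": mtime,
            "xfl": xfl, "os": os_id}


def gzip_bytes(data: bytes, mtime: int) -> bytes:
    """Compress data in memory, storing the given name and timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename="si.html", mode="wb",
                       fileobj=buf, mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()


# First 10 bytes of each file, copied from the hexdumps in this post.
orig_hdr = parse_gzip_header(bytes.fromhex("1f 8b 08 08 1d 12 67 50 00 03"))
new_hdr = parse_gzip_header(bytes.fromhex("1f 8b 08 08 6e 6a 99 50 00 03"))

for label, hdr in (("original", orig_hdr), ("recompressed", new_hdr)):
    stamp = datetime.fromtimestamp(hdr["mtime"], tz=timezone.utc)
    print(label, hdr, stamp.isoformat())

# Same (hypothetical) payload, two different stored timestamps.
payload = b"<html>example entry</html>" * 100
a = gzip_bytes(payload, orig_hdr["mtime"])
b = gzip_bytes(payload, new_hdr["mtime"])

# Offsets at which the two compressed streams differ.
differing = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
print("differing offsets:", differing)
```

If the goal is a byte-identical file, GNU gzip's `-n` (`--no-name`) option leaves out both the original name and the timestamp, but the Kobo files do store the name, so it would not reproduce them exactly. Setting the input file's modification time to the original value before compressing is probably the closer match, though I have not verified that.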

Last edited by AlPe; 11-06-2012 at 03:30 PM.