Quote:
Originally Posted by Jaapjan
True, but when you start decompressing the data at PDB record 1 (0 being the one holding the PDB 0 header, Mobi header and EXTH header), how do you know when to end that file? For that matter, which file does the HTML you're decoding belong to anyway? The index?
It also holds some other data like a long title and DRM stuff.
There is only one chunk of compressed text, so you decompress every record from record 1 through record n_document_records. The value n_document_records is stored in the header you find in PDB record 0.
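As a rough sketch of where that count comes from: the snippet below unpacks the first 12 bytes of record 0, assuming the usual PalmDOC header layout as I understand it (compression type, two unused bytes, text length, record count, record size, all big-endian). The helper name parse_palmdoc_header and the sample values are made up for illustration. The keys 'version' and 'records' only mirror what the code further down uses, while 'length' and 'recsize' are my own naming.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original module): read the PalmDOC
# header fields at the start of PDB record 0.
# Assumed layout: n = 16-bit big-endian, N = 32-bit big-endian.
sub parse_palmdoc_header {
    my ($rec0) = @_;
    return undef if length($rec0) < 12;
    my ($compression, $unused, $text_length, $record_count, $record_size)
        = unpack('n n N n n', $rec0);
    return {
        version => $compression,    # 1 = uncompressed, 2 = PalmDOC compression
        length  => $text_length,    # size of the uncompressed text
        records => $record_count,   # number of text records (n_document_records)
        recsize => $record_size,    # maximum uncompressed size of one record
    };
}

# Made-up header for illustration: compressed, 10000 bytes of text,
# 3 text records of up to 4096 bytes each.
my $rec0   = pack('n n N n n', 2, 0, 10000, 3, 4096);
my $header = parse_palmdoc_header($rec0);
print "decompress records 1..$header->{records}\n";
```

With a real file you would pass in the raw data of record 0 instead of the packed sample string.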
The Perl code doing the decompression is:
Code:
my $header = $recs->[0];
if( defined _parse_headerrec($header) ) {
    # a proper Doc file should be fine, but if it's not Doc
    # compression like some Mobi docs seem to be we want to
    # bail early. Otherwise we end up with a huge stream of
    # substr() errors and we _still_ don't get any content.
    eval {
        sub min { return ($_[0]<$_[1]) ? $_[0] : $_[1] }
        my $maxi = min($#$recs, $header->{'records'});
        for( my $i = 1; $i <= $maxi; $i ++ ) {
            $body .= _decompress_record( $header->{'version'},
                $recs->[$i]->{'data'} );
        }
    };
    return undef if $@;
}
# algorithm taken from makedoc7.cpp with reference to
# http://patb.dyndns.org/Programming/PilotDoc.htm and
# http://www.pyrite.org/doc_format.html
sub _decompress_record($$) {
    my ($version,$in) = @_;
    return $in if $version == DOC_UNCOMPRESSED;
    my $out = '';
    my $lin = length $in;
    my $i = 0;
    while( $i < $lin ) {
        my $ch = substr( $in, $i ++, 1 );
        my $och = ord($ch);
        if( $och >= 1 and $och <= 8 ) {
            # copy this many bytes... basically a way to 'escape' data
            $out .= substr( $in, $i, $och );
            $i += $och;
        } elsif( $och < 0x80 ) {
            # pass through 0, 9-0x7f
            $out .= $ch;
        } elsif( $och >= 0xc0 ) {
            # 0xc0-0xff are 'space' plus ASCII char
            $out .= ' ';
            $out .= chr($och ^ 0x80);
        } else {
            # 0x80-0xbf is sequence from already decompressed buffer:
            # two bytes holding an 11-bit distance ($m) and a 3-bit
            # length ($n, stored minus 3)
            my $nch = substr( $in, $i ++, 1 );
            $och = ($och << 8) + ord($nch);
            my $m = ($och & 0x3fff) >> 3;
            my $n = ($och & 0x7) + 3;
            # This isn't very perl-like, but a simple
            # substr($out,$lo-$m,$n) doesn't work: when $n > $m the
            # run overlaps bytes that only exist once the copy is
            # under way, so we copy one byte at a time.
            my $lo = length $out;
            for( my $j = 0; $j < $n; $j ++, $lo ++ ) {
                die "bad Doc compression" unless ($lo-$m) >= 0;
                $out .= substr( $out, $lo-$m, 1 );
            }
        }
    }
    return $out;
}
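To make the 0x80-0xbf branch a bit more concrete, here is a small stand-alone sketch of just that step. The helper names decode_backref and copy_backref are hypothetical (they are not in the module above), and rejecting a distance of 0 is my own extra check. It decodes the byte pair 0x80 0x2b and applies it to the already decompressed text 'abcde'.

```perl
use strict;
use warnings;

# Hypothetical helper: split a two-byte back-reference into the
# distance to look back and the number of bytes to copy.
sub decode_backref {
    my ($b1, $b2) = @_;
    my $pair     = ($b1 << 8) | $b2;
    my $distance = ($pair & 0x3fff) >> 3;   # 11-bit distance
    my $length   = ($pair & 0x7) + 3;       # 3-bit length, stored minus 3
    return ($distance, $length);
}

# Hypothetical helper: append $length bytes starting $distance bytes
# back, one byte at a time so that overlapping runs work.
sub copy_backref {
    my ($out, $distance, $length) = @_;
    for (1 .. $length) {
        my $pos = length($out) - $distance;
        die "bad Doc compression" if $distance < 1 or $pos < 0;
        $out .= substr($out, $pos, 1);
    }
    return $out;
}

my ($distance, $length) = decode_backref(0x80, 0x2b);
my $result = copy_backref('abcde', $distance, $length);
print "$distance $length $result\n";   # prints: 5 6 abcdeabcdea
```

The length (6) is larger than the distance (5), so the last copied byte is one that was itself produced by this copy. That is the case where a single substr() over the existing buffer comes up short.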