01-11-2008, 08:08 AM   #139
tompe
Grand Sorcerer
Posts: 7,110
Karma: 4315826
Join Date: Oct 2007
Location: Linköping, Sweden
Device: Nexus 7, Nexus 5, iPad 2, Kindle PW
Quote:
Originally Posted by Jaapjan View Post
True, but when you start decompressing the data at PDB record 1 (0 being the one holding the PDB 0 header, Mobi header and EXTH header), how do you know when to end that file? For that matter, to what file does the HTML you're decoding belong anyway? The index?
Record 0 also holds some other data, like a long title and DRM stuff.

There is only one chunk of compressed text, so you decompress all the records from record 1 to record n_document_records and concatenate the results. n_document_records is a value you find in the header in PDB record 0.
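As a rough sketch of where that count comes from: the PalmDoc header at the start of record 0 is a run of big-endian fields, and the record count is the 16-bit value at offset 8. The helper name parse_palmdoc_header and the sample header are made up for illustration. The hash keys 'version' and 'records' are chosen to match what the code in this post reads, but this is only an assumption about what _parse_headerrec does, not its actual source.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's _parse_headerrec): read the PalmDoc
# header at the start of PDB record 0. All fields are big-endian.
sub parse_palmdoc_header {
    my ($rec0) = @_;
    return if length($rec0) < 12;    # too short to hold the fields below
    my ($compression, $unused, $text_length, $n_records, $record_size)
        = unpack('n n N n n', $rec0);
    return {
        version     => $compression,  # 1 = none, 2 = PalmDoc; Mobi may use 17480 (HUFF/CDIC)
        text_length => $text_length,  # uncompressed length of the whole text
        records     => $n_records,    # text lives in records 1 .. records
        record_size => $record_size,  # max uncompressed size per record, usually 4096
    };
}

# Made-up sample header: compression 2, 10000 bytes of text, 3 records of 4096.
my $rec0 = pack('n n N n n', 2, 0, 10000, 3, 4096);
my $h    = parse_palmdoc_header($rec0);
printf "decompress records 1..%d (compression type %d)\n",
    $h->{records}, $h->{version};
```

Anything after record number 'records' (images, indexes and so on in a Mobi file) is not part of the text stream, which is why the loop stops there.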

The Perl code doing the decompression is:
Code:
                my $header = $recs->[0];
                if( defined _parse_headerrec($header) ) {
                        # a proper Doc file should be fine, but if it's not Doc
                        # compression like some Mobi docs seem to be we want to
                        # bail early. Otherwise we end up with a huge stream of
                        # substr() errors and we _still_ don't get any content.
                        eval {
                                sub min { return ($_[0]<$_[1]) ? $_[0] : $_[1] }
                                my $maxi = min($#$recs, $header->{'records'});
                                for( my $i = 1; $i <= $maxi; $i ++ ) {
                                                $body .= _decompress_record( $header->{'version'},
                                                        $recs->[$i]->{'data'} );
                                }
                        };
                        return undef if $@;
                }


# algorithm taken from makedoc7.cpp with reference to
# http://patb.dyndns.org/Programming/PilotDoc.htm and
# http://www.pyrite.org/doc_format.html
sub _decompress_record($$) {
        my ($version,$in) = @_;
        return $in if $version == DOC_UNCOMPRESSED;

        my $out = '';

        my $lin = length $in;
        my $i = 0;
        while( $i < $lin ) {
                my $ch = substr( $in, $i ++, 1 );
                my $och = ord($ch);

                if( $och >= 1 and $och <= 8 ) {
                        # copy this many bytes... basically a way to 'escape' data
                        $out .= substr( $in, $i, $och );
                        $i += $och;
                } elsif( $och < 0x80 ) {
                        # pass through 0, 9-0x7f
                        $out .= $ch;
                } elsif( $och >= 0xc0 ) {
                        # 0xc0-0xff are 'space' plus ASCII char
                        $out .= ' ';
                        $out .= chr($och ^ 0x80);
                } else {
                        # 0x80-0xbf is sequence from already decompressed buffer
                        my $nch = substr( $in, $i ++, 1 );
                        $och = ($och << 8) + ord($nch);
                        my $m = ($och & 0x3fff) >> 3;
                        my $n = ($och & 0x7) + 3;

                        # This isn't very perl-like, but a simple
                        # substr($out,$lo-$m,$n) doesn't work.
                        my $lo = length $out;
                        for( my $j = 0; $j < $n; $j ++, $lo ++ ) {
                                die "bad Doc compression" unless ($lo-$m) >= 0;
                                $out .= substr( $out, $lo-$m, 1 );
                        }
                }
        }

        return $out;
}
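
To make the back-reference branch above more concrete, here is a small worked example using the same bit arithmetic. The helper names decode_backref and copy_backref are made up for this sketch. It also shows what is presumably the reason a single substr() does not work: the length can be larger than the distance, so the copy has to reread bytes it has only just written. Note that copy_backref also rejects a distance of 0, which is a stricter check than the original loop makes.

```perl
use strict;
use warnings;

# Hypothetical helper: split a two-byte back-reference (first byte 0x80-0xbf)
# into (distance, length), with the same arithmetic as _decompress_record.
sub decode_backref {
    my ($b1, $b2) = @_;
    my $pair     = ($b1 << 8) + $b2;
    my $distance = ($pair & 0x3fff) >> 3;   # 11 bits: how far back in the output
    my $length   = ($pair & 0x7) + 3;       # 3 bits plus 3: copies 3..10 bytes
    return ($distance, $length);
}

# Hypothetical helper: append $length bytes taken from $distance bytes before
# the end of $out, one byte at a time so the copy may overlap its own output.
sub copy_backref {
    my ($out, $distance, $length) = @_;
    die "bad Doc compression" if $distance < 1 or $distance > length $out;
    for (1 .. $length) {
        $out .= substr($out, length($out) - $distance, 1);
    }
    return $out;
}

my ($distance, $length) = decode_backref(0x80, 0x0b);
print "distance=$distance length=$length\n";           # distance=1 length=6
print copy_backref('ab', $distance, $length), "\n";    # abbbbbbb
```

With distance 1 and length 6, a plain substr('ab', 1, 6) would return just 'b', while the byte-by-byte copy repeats the last character six times.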