Old 10-05-2019, 06:46 PM   #1
mariushm
Junior Member
Questions about the Palmdoc (MOBI) compressor implementation, improvements maybe?

Hello,

I've written a Palmdoc compressor from scratch in PHP, using the information on this website and following the Calibre Palmdoc routines (both the Python and C versions).

There are some things I'm puzzled about in the Calibre code and I may also have some suggestions to speed up the compression.

To expedite things, some links to information that you may want to access:

Palmdoc compression algorithm: https://wiki.mobileread.com/wiki/Pal...ir_compression

GitHub folder for the source code: https://github.com/kovidgoyal/calibr...ks/compression

From this point on, I'm going to refer only to the palmdoc.c file in the above GitHub folder.

At line 123 there's this:

Code:
   if ( i > 10 && (b->len - i) > 10) {
In case it's not clear: the algorithm can encode a sequence of 3 to 10 bytes as two bytes, if that sequence is found in the already parsed data.
The byte pair is [2 bits: 10][11 bits: distance 1..2047][3 bits: sequence length - 3].

It seems to me that this condition is correct but too restrictive, and only because further down, on line 125, the sequence length is "hardwired" to 10 from the start, so there must always be at least 9 bytes after the current position or the code won't work.
Let's say I have this text:
"xSmithSmith" - this should compress to xSmith[5,2] (go back 5 bytes, copy 2+3 = 5 bytes)
but if I understand correctly, the text would not compress because there's less than 10 bytes left when reaching the second "Smith"
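
To make the byte layout concrete, here's a tiny sketch (just my own illustration of the format as I understand it, not code from palmdoc.c or from my class; the function name is made up) that packs a distance/length pair into the two-byte form and checks it against the example above:

PHP Code:
// Illustration only: pack a back-reference into the two-byte PalmDoc form:
// 0b10 marker, 11-bit distance, 3-bit (length - 3).
function palmdoc_encode_pair(int $distance, int $length): string {
    // valid ranges: distance 1..2047, length 3..10
    $pair = 0x8000 | ($distance << 3) | ($length - 3);
    return chr($pair >> 8) . chr($pair & 0xFF);
}

// "xSmithSmith": literal "xSmith", then copy 5 bytes from 5 bytes back
echo bin2hex(palmdoc_encode_pair(5, 5)); // prints 802a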

Also, I don't think i has to be bigger than 10; it seems it was hardcoded to a less than optimal value because of the rfind function (line 103).

The only conditions I see would be i > 0 (because the minimum distance for a pair encoding is 1 byte backwards) and i <= length - 3 (because the minimum sequence length is 3).

For example, with the text AAAA, I think it's valid to write it as A[1,0] - 3 bytes. Or AJOHNJOHN should be saved as AJOHN[4,1], saving 2 bytes.
With the palmdoc.c code, neither would compress, because i is not bigger than 10 and there are not more than 10 bytes left after i.
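
For what it's worth, a decoder handles these cases naturally as long as it copies byte by byte; overlapping copies are exactly what turn A[1,0] back into AAAA. Here's a rough sketch of just that decode step (my own illustration, with a made-up function name, assuming the pair has already been split into distance and length):

PHP Code:
// Rough sketch of decoding a back-reference pair, copying byte by byte so
// that overlapping copies work (the distance may be smaller than the length).
function palmdoc_copy_pair(string $out, int $distance, int $length): string {
    $start = strlen($out) - $distance;
    for ($j = 0; $j < $length; $j++) {
        $out .= $out[$start + $j]; // may re-read bytes appended by this same loop
    }
    return $out;
}

echo palmdoc_copy_pair("A", 1, 3) . "\n";     // AAAA
echo palmdoc_copy_pair("AJOHN", 4, 4) . "\n"; // AJOHNJOHN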

Are there compatibility issues with some decoders out there that make it better not to allow distances of less than 10?

Also, it seems to me that the rfind function is very unoptimized:
* The search never considers match positions closer than i - sequence length [10 down to 3], when it really should go as close as i - 1 (to compress runs of repeated characters like "-------" or "zzzz...").
* It always scans backwards all the way to 0, when it costs very little to compute the limit before the for loop... you can only go back 2047 bytes.
If you're compressing the bytes from, let's say, position 3000 to 4096, you're going to waste CPU cycles testing 1000-2000 bytes that can't be used even if a match is found there, because the distance would be higher than 2047.
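
As a sketch of what I mean (an illustration of those two points, not the calibre code; the function name is made up, and the caller would pass maxLength = min(10, bytes remaining)):

PHP Code:
// Illustration of a bounded backwards search: candidate positions start at
// $i - 1 so overlapping matches ("zzzz...") are found, and the scan never
// goes back more than 2047 bytes.
function bounded_rfind(string $buff, int $i, int $maxLength) {
    $limit = max(0, $i - 2047);             // compute the bound once, before the loop
    $best = false;                          // will hold array(distance, length)
    for ($pos = $i - 1; $pos >= $limit; $pos--) {
        $len = 0;
        while ($len < $maxLength && $buff[$pos + $len] === $buff[$i + $len]) {
            $len++;
        }
        if ($len >= 3 && ($best === false || $len > $best[1])) {
            $best = array($i - $pos, $len);
            if ($len === $maxLength) break; // can't do better than this
        }
    }
    return $best;
}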


In my version, I'm not even scanning the whole buffer from my current position back to 0.
As I know the maximum size of the data to be compressed is 4096 bytes, it's possible to build an index of where each byte value occurs, on the fly, as I advance through the buffer compressing bytes.
Basically I have an array[0..255], where each element is another array holding the positions where that byte value was found.

When I want to search for a 3..10 byte sequence in the already parsed data, I can simply look in this array and compare only at the positions that are within 2047 bytes of my current position in the buffer. That saves a lot of byte compares.

The CPU cycles spent updating this index are far fewer than doing tens of thousands of backwards byte comparisons... and the extra memory used by the index is minimal, a few KB.
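
A toy illustration of what that index looks like (not the exact code from my class, which is below):

PHP Code:
// Toy illustration of the byte-position index: one list of positions per byte value.
$map  = array();
$data = "xSmithSmith";
for ($i = 0; $i <= 6; $i++) {          // scan up to and including the second 'S'
    $map[ord($data[$i])][] = $i;
}
print_r($map[ord('S')]);               // Array ( [0] => 1 [1] => 6 )
// When compressing the 'S' at position 6, only position 1 needs an actual
// byte-by-byte compare (and only if it's within 2047 bytes).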

Maybe I'm being silly and looked at some outdated code, or maybe this bit of code was tested as working even though it doesn't produce the most compressed data possible (suboptimal compression)... but I'd like to know whether these limitations are intentional. I'd like my compressor to produce the most compatible compressed text.

Thank you all for any answers,
Marius

Here's my code for the rfind and compare equivalents, for comparison... it works much, much faster.
If there are any concerns about taking "inspiration" from the code below to improve Calibre's... consider it 100% public domain, or whatever license is convenient for you guys.


PHP Code:
// functions extracted from a bigger class I'm working on

// every time I advance in the buffer, update the index array with the position of this byte
    private function nextByte() {
        $this->i++;
        if ($this->i < $this->blen) {
            $this->c = ord($this->buff[$this->i]); // ascii code of character
            if (isset($this->map[$this->c]) == false) $this->map[$this->c] = array();
            array_push($this->map[$this->c], $this->i);
        }
    }

    private function palmdoc_compare_fast($compare_start, $compareto_start, $length) {
        $count = 1;
        for ($i = 1; $i < $length; $i++) {
            if ($this->buff[$compare_start + $i] != $this->buff[$compareto_start + $i]) return $count;
            $count++;
        }
        return $count;
    }

    private function palmdoc_rsearch_fast($maxLength) {
        $count = count($this->map[$this->c]);
        if ($count == 1) return false; // only our character indexed in map, so quit early

        $result = array(0 => 0, 1 => 0);
        $scanLimit = $this->i - 2047; if ($scanLimit < 0) $scanLimit = 0;
        for ($k = $count - 2; $k > -1; $k--) { // go through occurrences of this character backwards
            if ($this->map[$this->c][$k] < $scanLimit) break; // outside of backwards limit, stop searching
            $length = $this->palmdoc_compare_fast($this->map[$this->c][$k], $this->i, $maxLength);

            if (($length > $result[1]) && ($length > 2)) {
                $result[0] = $this->i - $this->map[$this->c][$k];
                $result[1] = $length;
            }
            if ($length == 10) break; // found a maximum possible match, no point searching further
        }

        if ($result[1] == 0) return false;
        return $result;
    }

    function compress_fast(string $inBuffer) {
        if ($inBuffer === '') return '';
        $this->buff = $inBuffer;
        $this->blen = strlen($inBuffer);
        $this->map = array();
        $this->lastError = 0;
        $this->output = '';
        $this->i = -1;
        $this->nextByte();
        $output = '';
        while ($this->i < $this->blen) {
            $result = false;
            // Check if a sequence of 3..10 bytes is already decoded, in the previous 1..2047 bytes
            // Note: Checks are required to be able to compress sequences like #### = #[1:3]
            if (($this->i > 0) && ($this->blen > 3) && ($this->i <= ($this->blen - 3))) {
                $maxLength = $this->blen - $this->i; if ($maxLength > 10) $maxLength = 10;
                if ($maxLength > 2) {
                    $result = $this->palmdoc_rsearch_fast($maxLength);
                    if ($result !== false) {
                        $bytes = 0x8000 + ($result[0] << 3) + $result[1] - 3;
                        $output .= chr($bytes >> 8) . chr($bytes % 256);
                        //for ($j=1;$j<$result[1];$j++) $segments[$i+$j]['type'] = 'x';
                        //$this->i = $this->i + $result[1];
                        for ($j = 0; $j < $result[1]; $j++) $this->nextByte();
                        $result = true;
                    }
                }
            }

// [ ....... ]  == cut out the rest


Old 10-05-2019, 08:45 PM   #2
kovidgoyal
creator of calibre
Sorry, I'm kinda busy right now, dealing with v4-related things. As I recall, the C version is just a simple clone of the Python version and can definitely stand to be optimized further, although in practice compression speed is not really a bottleneck.