MobileRead Forums > E-Book Software > Calibre > Development
10-05-2019, 07:46 PM   #1
mariushm (Junior Member, Join Date: Oct 2019)

Question: Palmdoc (MOBI) compressor implementation, improvements maybe?

Hello,

I've written a Palmdoc compressor from scratch in PHP, using the information on this website and following the Calibre Palmdoc routines (both the Python and C versions).

There are some things I'm puzzled about in the Calibre code and I may also have some suggestions to speed up the compression.

To expedite things, some links to information that you may want to access:

Palmdoc compression algorithm : https://wiki.mobileread.com/wiki/Pal...ir_compression

Github folder for the source code: https://github.com/kovidgoyal/calibr...ks/compression

From this point on, I'm going to refer only to the palmdoc.c file in the GitHub folder above.

At line 123 there's this:

Code:
   if ( i > 10 && (b->len - i) > 10) {
To recap, in case it's not clear: a sequence of 3 to 10 bytes can be encoded as two bytes by the algorithm, if that sequence is found in the already-parsed data.
The byte pair layout is: [2 bits: 10][11 bits: distance, 1..2047][3 bits: sequence length - 3]

This condition seems correct but too restrictive, only because further down, on line 125, the sequence length is "hardwired" to 10 from the start, so there must always be at least 9 bytes after the current position or the code won't work.
Let's say I have this text:
"xSmithSmith" - this should compress to xSmith[5,2] (go back 5 bytes, copy 2+3 = 5 bytes)
but if I understand correctly, the text would not compress because there's less than 10 bytes left when reaching the second "Smith"
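To make the pair layout concrete, here's a minimal Python sketch (my own illustration, not calibre code; `pack_pair`/`unpack_pair` are names I made up) that packs and unpacks the two-byte form and checks the "xSmithSmith" example:

```python
def pack_pair(distance, length):
    # layout: 0b10 marker (2 bits) | distance (11 bits, 1..2047) | length - 3 (3 bits)
    assert 1 <= distance <= 2047 and 3 <= length <= 10
    word = 0x8000 | (distance << 3) | (length - 3)
    return bytes([word >> 8, word & 0xFF])

def unpack_pair(two_bytes):
    word = (two_bytes[0] << 8) | two_bytes[1]
    return (word >> 3) & 0x7FF, (word & 7) + 3  # (distance, length)

# the second "Smith" in "xSmithSmith" is distance 5 back, length 5 (stored as 5 - 3 = 2)
encoded = pack_pair(5, 5)
print(encoded.hex())          # 802a
print(unpack_pair(encoded))   # (5, 5)
```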

Also, I don't think i has to be bigger than 10; it seems it was hardcoded to a less optimal value because of the rfind function (line 103).

The only conditions I see would be i > 0 (there must be at least one byte before the current position, since the minimum distance for pair encoding is 1 byte backwards) and i <= length - 3 (because the minimum sequence length is 3).

For example, if I have the text AAAA, I think it's valid to write it as A[1,0] (3 bytes). Or AJOHNJOHN should be saved as AJOHN[4,1], saving 2 bytes.
With the palmdoc.c code neither would compress, because i is not bigger than 10 and i is also not 10 less than the length.
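What makes A[1,0] valid is that the decoder copies one byte at a time, so a pair may overlap the bytes it is itself producing. A small Python sketch of that decode step (my own illustration; `apply_pair` is a made-up name):

```python
def apply_pair(out, distance, length):
    # copy byte-by-byte, so the source region may overlap the bytes being written
    out = bytearray(out)
    for _ in range(length):
        out.append(out[-distance])
    return bytes(out)

print(apply_pair(b"A", 1, 3))      # b'AAAA'       (A[1,0]: length field 0 means 0+3 bytes)
print(apply_pair(b"AJOHN", 4, 4))  # b'AJOHNJOHN'  (AJOHN[4,1])
```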

Are there compatibility issues with some decoders out there that make it better not to allow matches when i is less than 10?

Also, the rfind function seems very unoptimized:
* It always starts comparing from i - sequence length [10 down to 3], when it really should start from i - 1, so that runs of repeated characters like "-------" or "zzzz..." can be compressed.
* It always scans backwards all the way to 0, when it costs very little to compute the limit before the for loop: you can only go back 2047 bytes.
If you're compressing, say, from byte 3000 to byte 4096, you waste CPU cycles testing 1000-2000 bytes that can't be used even if a match is found there, because the distance would be greater than 2047.
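Both fixes are cheap: clamp the lower bound once before the loop, and start the scan at i - 1 so self-overlapping runs are found. A hedged Python sketch (assumed names, not the calibre implementation):

```python
def rfind_bounded(data, i, max_len=10, window=2047):
    """Find the longest match (>= 3 bytes) for data[i:] that starts in
    data[lo:i], where lo is clamped so the distance never exceeds window."""
    lo = max(0, i - window)                  # computed once, not re-checked per step
    best_dist, best_len = 0, 0
    for start in range(i - 1, lo - 1, -1):   # start at i - 1 to catch runs like "zzzz"
        n = 0
        while n < max_len and i + n < len(data) and data[start + n] == data[i + n]:
            n += 1
        if n >= 3 and n > best_len:
            best_dist, best_len = i - start, n
            if n == max_len:                 # can't do better, stop early
                break
    return best_dist, best_len

print(rfind_bounded(b"xSmithSmith", 6))   # (5, 5)
```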


In my version, I don't scan the whole buffer from the current position back to 0.
Since I know the maximum size of the data to be compressed is 4096 bytes, it's possible to build an index of where each byte occurs, on the fly, as I advance through the buffer.
Basically I have an array[0..255], where each element is another array that holds the positions where that character was found.

When I want to search for a 3..10 byte sequence in the already-parsed data, I can simply look in this array and check only the positions within 2047 bytes of my current position in the buffer. That saves a lot of byte compares.

The CPU cycles spent updating this index are far fewer than what's spent comparing backwards tens of thousands of times... and the extra memory used by the index is minimal, a few KB.
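The idea can be sketched in a few lines of Python (my own paraphrase, not the PHP class itself; `index_byte`/`find_candidates` are made-up names): only positions where the first byte of the sequence already occurred need to be compared at all.

```python
def index_byte(positions, data, i):
    # positions maps byte value -> ascending list of indices seen so far
    positions.setdefault(data[i], []).append(i)

def find_candidates(data, i, positions, window=2047):
    # only earlier occurrences of data[i] can start a match, and only
    # those within the backward window are encodable as a pair
    lo = i - window
    return [p for p in positions.get(data[i], []) if p >= lo]

data = b"abcabc"
positions = {}
for i in range(3):                           # index bytes as we pass over them
    index_byte(positions, data, i)
print(find_candidates(data, 3, positions))   # [0] -- only the 'a' at position 0
```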

Maybe I'm being silly and looked at outdated code, or this bit of code was tested as working even though it doesn't produce the most compressed data possible (suboptimal compression)... but I'd like to know whether these issues were intentional. I'd like my compressor to produce the most compatible compressed text.

Thank you all for any answers,
Marius

Here's my code for the rfind and compare routines, for comparison... it works much, much faster.
If there are any concerns about "taking inspiration" from the code below to improve Calibre's, consider it 100% public domain, or whatever's convenient for you guys.


PHP Code:
// functions extracted from a bigger class I'm working on

// every time I advance in buffer, update the index array and put the position of this byte in array
private function nextByte() {
    $this->i++;
    if ($this->i < $this->blen) {
        $this->c = ord($this->buff[$this->i]); // ascii code of character
        if (isset($this->map[$this->c]) == false) $this->map[$this->c] = array();
        array_push($this->map[$this->c], $this->i);
    }
}

private function palmdoc_compare_fast($compare_start, $compareto_start, $length) {
    $count = 1;
    for ($i = 1; $i < $length; $i++) {
        if ($this->buff[$compare_start + $i] != $this->buff[$compareto_start + $i]) return $count;
        $count++;
    }
    return $count;
}

private function palmdoc_rsearch_fast($maxLength) {
    $count = count($this->map[$this->c]);
    if ($count == 1) return false; // only our character indexed in map, so quit early

    $result = array(0 => 0, 1 => 0);
    $scanLimit = $this->i - 2047;
    if ($scanLimit < 0) $scanLimit = 0;
    for ($k = $count - 2; $k > -1; $k--) { // go through occurrences of this character backwards
        if ($this->map[$this->c][$k] < $scanLimit) break; // outside of backwards limit, stop searching
        $length = $this->palmdoc_compare_fast($this->map[$this->c][$k], $this->i, $maxLength);

        if (($length > $result[1]) && ($length > 2)) {
            $result[0] = $this->i - $this->map[$this->c][$k];
            $result[1] = $length;
        }
        if ($length == 10) break; // found a maximum possible match, no point searching further
    }

    if ($result[1] == 0) return false;
    return $result;
}

function compress_fast(string $inBuffer) {
    if ($inBuffer === '') return '';
    $this->buff = $inBuffer;
    $this->blen = strlen($inBuffer);
    $this->map = array();
    $this->lastError = 0;
    $this->output = '';
    $this->i = -1;
    $this->nextByte();
    $output = '';
    while ($this->i < $this->blen) {
        $result = false;
        // Check if a sequence of 3..10 bytes is already decoded, in the previous 1..2047 bytes
        // Note: Checks are required to be able to compress sequences like #### = #[1:3]
        if (($this->i > 0) && ($this->blen > 3) && ($this->i <= ($this->blen - 3))) {
            $maxLength = $this->blen - $this->i;
            if ($maxLength > 10) $maxLength = 10;
            if ($maxLength > 2) {
                $result = $this->palmdoc_rsearch_fast($maxLength);
                if ($result !== false) {
                    $bytes = 0x8000 + ($result[0] << 3) + $result[1] - 3;
                    $output .= chr($bytes >> 8) . chr($bytes % 256);
                    //for ($j = 1; $j < $result[1]; $j++) $segments[$i + $j]['type'] = 'x';
                    //$this->i = $this->i + $result[1];
                    for ($j = 0; $j < $result[1]; $j++) $this->nextByte();
                    $result = true;
                }
            }
        }

// [ ....... ]  == cut out the rest

10-05-2019, 09:45 PM   #2
kovidgoyal (creator of calibre, Join Date: Oct 2006, Location: Mumbai, India)
Sorry, I'm kinda busy right now dealing with v4-related things. As I recall, the C version is just a simple clone of the Python version and can definitely stand to be optimized further, although in practice compression speed is not really a bottleneck.