|
|
Thread Tools | Search this Thread |
10-05-2019, 06:46 PM | #1 |
Junior Member
Posts: 2
Karma: 10
Join Date: Oct 2019
Device: none
|
Questions Palmdoc (MOBI) compressor implementation, improvements maybe?
Hello,
I've written a Palmdoc compressor from scratch in PHP using the information on this website and following the Calibre Palmdoc routines (both the python and c versions) There are some things I'm puzzled about in the Calibre code and I may also have some suggestions to speed up the compression. To expedite things, some links to information that you may want to access: Palmdoc compression algorithm : https://wiki.mobileread.com/wiki/Pal...ir_compression Github folder for the source code: https://github.com/kovidgoyal/calibr...ks/compression From this point, I'm going to refer only to palmdoc.c file in the above github folder. At line 123 there's this: Code:
if ( i > 10 && (b->len - i) > 10) { Byte pair is distance [2 bits : 10][11 bits: 1..2047][3 bits : sequence length] It seems to me like this condition is correct but too restrictive, only because further down on line 125, the sequence length is "hardwired" to 10 from the start, so there must always be at lest 9 bytes after current position otherwise code won't work. Let's say I have this text: "xSmithSmith" - this should compress to xSmith[5,2] (go back 5 bytes, copy 2+3 = 5 bytes) but if I understand correctly, the text would not compress because there's less than 10 bytes left when reaching the second "Smith" Also, I don't think i has to be bigger than 10, but seems like it was hardcoded to a less optimal value because of the rfind function. (line 103) The only conditions I see would be i>1 (because the minimum distance for pair encoding is 1 byte backwards) and i < length -3 (because minimum length is 3) For example, if I have this text : AAAA - I think it's valid to write it A[1,0] - 3 bytes. Or AJOHNJOHN should be saved as AJOHN[4,1] saving 2 bytes. With the palmdoc.c code it would not compress because i is not bigger than 10 and also i is not 10 less than length. Are there some compatibility issues with some decoders out there which make it better to not allow distance to be less than 10? Also, it seems to me like the rfind function is very unoptimized. * You're always going from i - sequence length [10 down to 3] when it really should from i-1 (to compress repeated characters like ------- or "zzzz..." * It's always going backwards to 0 , when it costs very little to compute where the limit is before the for ... you can only go back 2047 bytes. If you're compressing bytes from let's say 3000 bytes to 4096 bytes, you're gonna waste cpu cycles testing 1000-2000 bytes that can't be used if a result is found, because the distance is higher than 2047. In my version, I'm not even scanning the whole buffer, from my current position to 0. As I know the maximum size of the data to be compressed is 4096, it's possible to build an index of where each byte on the fly, as I advance through the buffer trying to compress bytes. Basically I have an array[0..255], where each element of the array is another array that holds the positions where the character was found. When I want to search for 3..10 byte sequence in the data that was already parsed, I can simply look in this array and check only those that are within 2047 bytes from my current position in the buffer. I'm saving a lot of byte compares. The amount of cpu cycles used updating this index is much less than backwards comparing tens of thousands of times the data... and the extra memory usage for the index array is minimal, a few KB worth of memory. Maybe I'm silly and looked at some outdated code, or this bit of code was tested as working even though it's not producing the most compressed data possible (suboptimal compression) ... but I'd like to know if these issues were intentional or not. I'd like my compressor to produce most compatible compressed text. Thank you all for any answers, Marius Here's my code for these rfind and compare for comparison... it works much much faster: If there's some issues about "inspiring" from the below code to improve Calibre's... consider the code below 100%, public domain or whatever's convenient for you guys. PHP Code:
|
10-05-2019, 08:45 PM | #2 |
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Sorry, I'm kinda busy right now, dealing with v4 related things. As I recall, the C version is just a simple clone of the python version and can definitely stand to be optimized further, although in practice compression speed is not really a bottleneck.
|
Advert | |
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
epub Compressor / Optimizer | crackhammer | ePub | 6 | 03-08-2018 08:53 AM |
Image Compressor Built-in in Editor | yonkyunior | Editor | 1 | 06-03-2017 08:00 PM |
Another pdb (PalmDoc) to epub conversion error | CharsDox | Calibre | 2 | 12-28-2010 07:06 AM |
Makedoc: An Online Text to Palmdoc Converter | Bob Russell | Other formats | 0 | 08-02-2005 06:49 AM |
Using Publish eBook to Create PalmDoc Books | Bob Russell | Other formats | 2 | 07-09-2005 10:55 AM |