Reverse-engineering the .IMP format


nrapallo
12-17-2008, 11:35 PM
A primer on the .IMP specification has never been published, but a very detailed explanation of the .IMP file format can be found here (http://krausyaoj.tripod.com/reb1200.htm). It was reverse-engineered by Jeffrey Kraus-yao back in 2002. Jeffrey indicated that he reverse-engineered the .imp format by building .oeb test ebooks with eBook Publisher and then dropping each .oeb onto a desktop shortcut to the imp viewer.exe. While the viewer was still running, he would examine his temp folder, where he noticed that one of four .RES folders was complete. He wrote down the changed bits and then repeatedly made small changes to the .oeb ebook. He started this even though he didn't own a REB1200 yet.

It was quite the accomplishment! Bear in mind that back then only the REB1200 used the .IMP format, as there weren't a lot of GEB1150s (predecessor to the EBW1150) in 2002/3. Oh, how the tables have turned: now, for every REB1200 in use, there are tens or hundreds more EBW1150s!

Jeffrey's website is a great start to understanding the .IMP file format, but lacks such basic information as:
clarification about all image types allowed to be stored in a .imp ebook;
byte ordering (LittleEndian vs. BigEndian) differences between the EBW1150 and REB1200 .imp types;
how to identify the .imp type i.e. for EBW1150/GEB1150 or for REB1200/GEB2150;
the fact that DATA.FRK (compressed or uncompressed) is the same for the EBW1150 AND REB1200;
the fact that images are stored in their original resolution and in color even though they may be reduced to fit the device screen and color depth;
some more "unknown" .RES filetypes like 'Devm', 'Form', etc.
You may have even noticed that the EBW1150 .imp is slightly bigger in file size than the same REB1200 .imp when no images are present. That is because the EBW1150 .imp uses two additional (irrelevant/unused) bytes for most records which, when multiplied by thousands of records, results in a larger file! I think a different "programming team" came up with the EBW1150 .imp file format, as most .RES filetype records are byte-reversed (i.e. BE vs. LE). That makes this .IMP reverse-engineering unnecessarily more difficult!
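
To make the byte-order point concrete, here is a minimal Python sketch (it is not taken from imp_dump.pl) that reads the same four bytes as a 32-bit value in both orders. The helper name read_u32 and the sample bytes are made up for illustration; which order a given .RES record really uses has to be checked against an actual file.

```python
import struct

def read_u32(buf: bytes, offset: int, big_endian: bool) -> int:
    """Read an unsigned 32-bit integer from buf at offset in the given byte order."""
    fmt = ">I" if big_endian else "<I"
    return struct.unpack_from(fmt, buf, offset)[0]

# Made-up sample bytes, not copied from a real .imp file.
sample = bytes([0x00, 0x00, 0x01, 0x2C])

as_big = read_u32(sample, 0, big_endian=True)      # 0x0000012C
as_little = read_u32(sample, 0, big_endian=False)  # 0x2C010000
print(hex(as_big), hex(as_little))
```

When HEX-exploring, a value that looks absurdly large in one order and sensible in the other is usually a good hint about which order that record uses.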

I would like to build here a knowledge base for the "definitive" understanding of the .IMP file format. As others have already told me about their own forays into .imp "nuts & bolts" investigations, I propose to start off this knowledge base with my preliminary findings, written as a Perl script. That script is imp_dump.pl (along with its required support files), and it can be used to explode any unencrypted .imp ebook into its (decompressed) text and image components.

Now, take note, that I said text and images NOT .html and images.

The original HTML is not stored in the .imp file. Only the basic components are: one record tells you where all the font/style changes are located in the file, another indicates where to end each line so that it doesn't spill past the screen size of that .imp, and other records store the images, the hyperlinks used, etc. Basically all the building blocks are there (scattered), and those components need to be re-assembled somehow into an .html file!

BTW, release v4.0 of EBook-Tools should have basic .imp support for .html generation with image linking, but will initially lack table/hyperlink/styles support. Those are planned for future releases.

I plan to collect postings from this thread and compile a wiki page with the relevant parts of the .IMP file format specification as reverse-engineered by ALL of us!

Below are all the .RES filetypes known to exist (thus far); volunteers can pick the undocumented .RES filetypes on a "first come, first served" basis.

.IMP file comprises these groups of .RES filetypes:
text
page_line
page_header_footer
links
misc_info
formatting
tables
images
markups
form_data

where:

text:
'!!cm'
'!!ky'
DATA.FRK - decompressor written in Perl, C and soon to be C#.

page_line:
'BPgz'
'BPgZ'
'ImRn' - written in Perl (see imp_dump.pl)
'Pcz0' - written in Perl (see imp_dump.pl)
'PcZ0' - written in Perl (see imp_dump.pl)
'Pcz1' - written in Perl (see imp_dump.pl)
'PcZ1' - written in Perl (see imp_dump.pl)

page_header_footer:
'HfPz'
'HfPZ'

links:
'AncT' - written in Perl (see imp_dump.pl)
'AnTg'
'Lnks'
'eLnk'

misc_info:
'Batr'
'Binf'
'BGcl' - written in Perl (see imp_dump.pl)
'BPos'
'Clos'
'Devm' - written in Perl (see imp_dump.pl)
'Dict'
'FRgs'
'Glos'
'MASK'
'Mrgn' - written in Perl (see imp_dump.pl)
'Hyp2'
'Hyph'
'Offs'
'pInf' - written in Perl (see imp_dump.pl)
'Pc31'
'PPic' - written in Perl (see imp_dump.pl)
'SKtb'
'SMnu'
'stbd'
'!!sw' - written in Perl (see imp_dump.pl)

formatting:
'ESts' - written in Perl (see imp_dump.pl)
'HRle'
'Styl'
'StRn' - written in Perl (see imp_dump.pl)
'StR#'
'StR2'

tables:
'Tabl'
'TCel'
'TRow'

images:
'GIF ' - written in Perl (see imp_dump.pl)
'JPEG' - written in Perl (see imp_dump.pl)
'PIC2' - written in Perl (see imp_dump.pl)
'PICT' - written in Perl (see imp_dump.pl)
'PNG ' - written in Perl (see imp_dump.pl)

markups:
'MRPs'
'Ano2'
'Hlts'
'BTok'
'BMks'

form_data:
'TGNt'
'Form'
'FItm'
'FIDt'
'FrDt'

I'm calling out to any and all interested in this detailed examination of the .IMP file format! We need your findings summarized as either (1) Perl code or (2) offset / length with short narrative like Jeffrey's website uses. If you're so inclined, consider updating my imp_dump.pl to incorporate your findings and re-upload it here.

What you'll need is a test .imp, a good binary/hex editor (I use XVi32 Edit) and a lot of elbow grease and desire. Post here what you find out and I'll update the "un-documented" list above to reflect that!

Thanks in advance!

p.s. after unzipping the attachment, just place any and all your .imp files in the folder therein called 'place imp file here' :snicker: and execute the 'extract imp files.bat'. Look at the generated file 'imp_dump.output.txt' for the parsing output info for all the .imp files placed in that folder. Then, look in that folder to see a directory for each .imp that will contain the compressed & decompressed text and any images. Have fun HEX-exploring!

EDIT: 06Jan2009: added a compiled windows executable to a separate .zip

nrapallo
12-18-2008, 09:06 AM
Please refer to my original posting of deimp.exe here (http://www.mobileread.com/forums/showthread.php?t=22894) for some background info.

One of the component records in the .RES directory, when the .IMP file is exploded with unimp.exe, is the DATA.FRK file. It contains the basic text used in the ebook and is the same for both the Color VGA (REB 1200) & Grayscale Half-VGA (EBW 1150) .IMP files. If DATA.FRK was LZSS-compressed when the ebook was created, deimp.exe decompresses it and also substitutes/expands the control characters (see below).

DATA.FRK File

Element text is extracted and placed in this file. Element tags are replaced with control characters. This file can be compressed and encrypted, with compression occurring before encryption. This file is compressed when the element <meta name="x-SBP-compress" content="on"/> is included in the <x-metadata> element of the package file. The compression algorithm used is LZSS. This file is encrypted when the element <meta name="x-SBP-encrypt" content="on"/> is included in the <x-metadata> element of the package file. The encryption algorithm used is DES. The 8 byte encryption key is in the SoftBook Edition Encryption Key File (.key) at offset 0x0C.

Characters less than 0x20 are removed except for line break, which is replaced with 0x20. Multiple 0x20 characters are replaced with a single 0x20.
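
The clean-up rule above can be sketched in a few lines of Python. This is only my reading of the rule, not code from eBook Publisher: the function name normalize_text is made up, and I am assuming that "line break" means the bytes 0x0A and 0x0D in the source text.

```python
import re

def normalize_text(raw: bytes) -> bytes:
    """Apply the whitespace rule described above to raw element text."""
    # Assumption: a "line break" is CR and/or LF; each run becomes one space.
    spaced = re.sub(rb"[\r\n]+", b" ", raw)
    # Every other character below 0x20 is dropped.
    stripped = re.sub(rb"[\x00-\x1f]", b"", spaced)
    # Runs of spaces collapse to a single 0x20.
    return re.sub(rb" {2,}", b" ", stripped)
```

For example, normalize_text(b"a\tb\r\nc   d") gives b"ab c d": the tab is dropped, the line break becomes a space and the run of spaces is collapsed.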

Control characters
0x0A end of document, forced page break
0x0B start of element except <span>
0x0D line break element <br/>
0x0E start of table element <table>
0x0F image element <img/>
0x13 end of table cell </td> tag
0x14 horizontal rule element <hr/>
0x15 before and after page header content
0x16 before and after page footer content

As previously stated, my deimp.exe program used as its base the lzss-0.6 code by Michael Dipperstein (http://michael.dipperstein.com/lzss), with tweaks by me to get it to decode the .imp text. I added the ability to insert/substitute some characters that are not part of the LZSS decompression so that the resulting .imp text looked better. If you want a pure decompressor, just remove those substitutions from the code and apply them yourself after decompression.

In addition to those control characters above, characters to "substitute/convert" would be: HEX => Should be (actual char)
0x8E => "&eacute;", (i.e. "é"),
0xA0 => "&nbsp;", (i.e. " "),
0xA5 => "&bull;", (i.e. "•"),
0xA8 => "&reg;", (i.e. "®"),
0xA9 => "&copy;", (i.e. "©"),
0xAA => "&trade;", (i.e. "™"),
0xAE => "&AElig;", (i.e. "Æ"),
0xC7 => "&laquo;", (i.e. "«"),
0xC8 => "&raquo;", (i.e. "»"),
0xC9 => "&hellip;", (i.e. "…"),
0xD0 => "&ndash;", (i.e. "–"),
0xD1 => "&mdash;", (i.e. "—"),
0xD2 => "&ldquo;", (i.e. "“"),
0xD3 => "&rdquo;", (i.e. "”"),
0xD4 => "&lsquo;", (i.e. "‘"),
0xD5 => "&rsquo;", (i.e. "’"),
0xE1 => "&middot;", (i.e. "·"),
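
For anyone who wants to apply these substitutions outside of deimp.exe, here is a minimal Python sketch. The table is copied from the list above; the names SUBSTITUTIONS and to_html_text are made up, and passing every other byte through unchanged as a Latin-1 character is my assumption, not something taken from the deimp source.

```python
# Byte -> HTML entity, copied from the list above.
SUBSTITUTIONS = {
    0x8E: "&eacute;", 0xA0: "&nbsp;",   0xA5: "&bull;",  0xA8: "&reg;",
    0xA9: "&copy;",   0xAA: "&trade;",  0xAE: "&AElig;", 0xC7: "&laquo;",
    0xC8: "&raquo;",  0xC9: "&hellip;", 0xD0: "&ndash;", 0xD1: "&mdash;",
    0xD2: "&ldquo;",  0xD3: "&rdquo;",  0xD4: "&lsquo;", 0xD5: "&rsquo;",
    0xE1: "&middot;",
}

def to_html_text(data: bytes) -> str:
    """Replace the listed bytes with entities; pass all other bytes through."""
    out = []
    for byte in data:
        if byte in SUBSTITUTIONS:
            out.append(SUBSTITUTIONS[byte])
        else:
            # Assumption: unlisted bytes are kept as-is (Latin-1 code points).
            out.append(chr(byte))
    return "".join(out)
```

For example, to_html_text(b"caf\x8e \xd2ok\xd3") returns "caf&eacute; &ldquo;ok&rdquo;". Note that this sketch does not escape "&", "<" or ">" in the text itself, so that would still need handling before writing real .html.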


I attach the source code to my deimp.exe (and original lzss-0.6) below for your use and further study. Please excuse the coding hacks as this was a work-in-progress until I "nailed" the decompression algorithm. It didn't lend itself to good programming style. :o

p.s. as an exercise, would anyone want to try tweaking this code to allow the LZSS (re-)compression of text for use as the DATA.FRK in the .imp?

mscott161
12-18-2008, 05:51 PM
Hello All,

With Nick's help, I was able to put together a simple de-imp program to pull the text out of the imp file. I apologize for the lack of documentation and plan to add that to the code soon. Please feel free to submit any changes to the code. I am working on a GUI front end so you can select a single file or folder and have the program de-imp the files to text and, hopefully, with some help, to the lrf (Sony) file format.

Thanks to all.
-Michael

nrapallo
12-18-2008, 09:38 PM
:eek: GUI? GUI! That would be a welcomed addition to the current dos tools! :thumbsup:

Nice to see you are already adding to our knowledge-base!

Thanks again for sharing this!

mscott161
12-19-2008, 12:28 AM
Nick,

I agree with you on the lack of something to look at besides the console. So here is a GUI version of my previous ConvertIMP.

I know it is not much, but it is something to start with.

I will try to put acknowledgements in the next update to the program, but until then I would like to thank Nick and Michael Dipperstein for their libraries that helped me produce the program.

-Michael

New addition to v.1.0.1 - Image Viewing

New addition to v.1.0.2 - Image Viewing (including PNG, GIF, JPEG)

New addition to v.1.1.0 - Editing Book Properties. --- I am removing this pending further testing. Sorry for the delay.

nrapallo
12-19-2008, 12:59 AM
Looks very promising!!!

So when will we be able to edit the metadata? v1.1? ;)

:thanks:

mscott161
12-19-2008, 11:55 AM
What would you like to edit?

--Michael

nrapallo
12-19-2008, 12:44 PM
Where do I begin (as shouts can be heard from around the world from .imp users! :rofl:):

Per the .imp specs:
Edit - Book properties start at offset 0x30
YES* - ID: null terminated C string
YES - Bookshelf Category: null terminated C string
N/A - Subcategory: null terminated C string, not displayed on REB1200
YES - Title: null terminated C string
N/A - Last name: null terminated C string
N/A - Middle name: null terminated C string
YES - First name: null terminated C string

Note: * = there is a way to auto-generate this ID; N/A = not allowed; YES = allow edits

Afterwards, the length of Book Properties (including the 7 nulls) needs to be updated so that BytesRemainingInHeader is set to the length of Book Properties + 24!
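
To show what that bookkeeping looks like, here is a hedged Python sketch that walks the seven null terminated strings starting at offset 0x30 and recomputes the value described above. The field order and the "+ 24" come from this post; the function names, the dictionary keys and the Latin-1 decoding are my own assumptions, and nothing here writes to a file.

```python
BOOK_PROPERTIES_OFFSET = 0x30

# Field order as listed above; the key names are made up for this sketch.
FIELDS = ("id", "category", "subcategory", "title",
          "last_name", "middle_name", "first_name")

def parse_book_properties(data: bytes):
    """Return (properties dict, total length including the 7 nulls)."""
    props = {}
    pos = BOOK_PROPERTIES_OFFSET
    for name in FIELDS:
        end = data.index(b"\x00", pos)  # raises ValueError if a terminator is missing
        # Assumption: strings are single-byte text; Latin-1 never fails to decode.
        props[name] = data[pos:end].decode("latin-1")
        pos = end + 1
    return props, pos - BOOK_PROPERTIES_OFFSET

def bytes_remaining_in_header(properties_length: int) -> int:
    """Length of Book Properties + 24, as described above."""
    return properties_length + 24
```

With a made-up buffer such as bytes(0x30) + b"ID1\0Fiction\0\0My Title\0\0\0Jane\0", the properties come to 29 bytes, so BytesRemainingInHeader would be 53. As always, try this on backup copies only.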

Also, it would be nice if the Name of .RES directory could be changed to the .imp filename (minus .ext) or even auto-generated to be 'Author-Title' as you used for the (decompressed) text filename. The DictionaryLength (length of directory name) in the 48 byte header would have to be updated then as well. (p.s. you called it Dictionary, but I think you meant Directory...)

That's it for now... :snicker:

Check please!

DaleDe
12-19-2008, 12:49 PM
I think the big things a user would like to correct include: category!!!, Author, Title.

Dale

mscott161
12-19-2008, 12:55 PM
I have updated the ConvertIMPGUI program to include viewing of the images in the IMP file. Please look at the attachments in a previous message for the update.

I will look into the category, author, and title changes.

--Michael

nrapallo
12-19-2008, 12:58 PM
I think the big things a user would like to correct include: category!!!, Author, Title.

Dale

And just to add that eBookwise/GEB Librarian do an excellent job allowing users to edit and setup their own categories to use when editing the category.

Basically, users choose the category (when editing the metadata) from a drop-down list that includes the predefined categories plus any the user has previously used/defined.

They do say that imitation is the greatest form of flattery...

Just food for thought!

mscott161
12-19-2008, 01:24 PM
Sorry about the image viewing: before, I only included PNG. I have updated the code in the previous message (see the new attachment to that message for v.1.0.2), which now includes PNG, GIF, and JPEG.

--Michael

mscott161
12-21-2008, 09:47 PM
Nick and Everyone,

I have put book property editing into release v.1.1.0. Hope you enjoy. I am all ears if something does not work correctly or if you have ideas on the GUI arrangement. If you have a specific idea for the GUI, send a screenshot of what you want it to look like and I will take a look.

I updated the ConvertIMPGUI Post above with the new version.

Happy Holidays
--Michael

nrapallo
12-22-2008, 08:47 AM
As we are still testing the waters with .imp editing, please try using ConvertIMP on backup copies only until you verify editing works for you. I've had a few issues already that I notified Michael about.

Please proceed with caution with these early releases!

mscott161
12-22-2008, 02:28 PM
I have started a new thread with Nick's help. I will post all releases and changes to it.

-- Michael

nrapallo
01-06-2009, 12:15 PM
Is there any demand for an imp_dump.exe (Windows executable)?

I don't include one in the .zip file attached to post #1 above, since I expect that the imp_dump.pl would get modified frequently and best be left as a Perl script to be repeatedly invoked.

Would some MR members like to try this, without having to have Perl installed on their computer first? Just ask and I will update imp_dump.pl and post an imp_dump.exe. OK, OK, I've now added a compiled windows executable to a separate .zip in post #1 above.

On a related note, does anyone (other than Michael) have any additions to make to that Perl script that I could incorporate into this .exe?

Thanks in advance!

nrapallo
01-06-2009, 04:13 PM
imp_dump.exe (Windows executable) just provided... see post #1!

Enjoy!

nrapallo
01-08-2009, 04:49 PM
I've started to add the details of the .IMP file format to our IMP wiki and will continue to expand it as more and more information is "discovered" by reverse engineering. See here (http://wiki.mobileread.com/wiki/IMP#IMP_File_Format_.28Technical_Specs.29) for the specifics, thus far.

KRavEN
02-12-2009, 09:49 AM
So now that you can get the text out of an imp, do you have a perl script that will create an imp from opf or html without needing eBook Publisher?

nrapallo
02-12-2009, 10:27 AM
Not likely. Without eBook Publisher, you would have to do all the font metric calculations and page dissections that IT does to create the .imp. Now that would be like re-inventing the wheel!

However, we are looking to create a .imp DIRECTLY from a set of images/pictures (and little text/styles). This has already been done by the togoWare Photo Album (http://www.togoware.com/) program, but unfortunately, no source code exists, as it is commercial software.

But once you extract the .html/images/styles (see ConvertIMP (http://www.mobileread.com/forums/showthread.php?t=34548); it can do this now), you can use Perl scripts (like those here (http://www.mobileread.com/forums/showthread.php?t=20050)) to drive the eBook Publisher .dll through its COM/OLE interface. This doesn't use the eBook Publisher GUI at all.

mscott161
02-12-2009, 11:23 AM
I have been working on the converting of ebooks formats to take any purchased ebook and create the format needed for your device. IMP is one format that I am still working on putting together. The idea would be to have a complete set of source code with no third-party COM/OLE or DLL to do this.

Michael

nrapallo
02-12-2009, 11:49 AM
I know that rbmake (hosted on sourceforge) can generate a Rocket eBook .rb ebook directly (predecessor to the .imp format, but not similar at all). This bypassed the need to use the Rocket/eBook Librarian software by Gemstar.

Since the .imp format has very little official documentation, replacing its sole generating software (eBook Publisher) will be a challenge to say the least... :eek:

llasram
02-12-2009, 01:27 PM
I have been working on the converting of ebooks formats to take any purchased ebook and create the format needed for your device. IMP is one format that I am still working on putting together. The idea would be to have a complete set of source code with no third-party COM/OLE or DLL to do this.

That's largely in line with the goal of the calibre project as well, which is well on its way there with a large user-base. Any way to convince you to contribute to calibre instead of rolling your own?

nrapallo
02-12-2009, 01:44 PM
Ooohhh, a merger, of sorts, or takeover!

While I would welcome the marriage of these two tools, ConvertIMP is really an "extraction" tool right now, to get the underlying .html/images/styles. The results can then be easily converted to .prc / .lrf / .imp / .epub using existing command line programs, calibre included.

ashkulz was thinking about doing the native Calibre .imp support a while back, but .imp generation is not easily supported cross-platform (Windows/MacOS/Linux) so that would be a major hurdle. Unfortunately, we need, first and foremost, the ETI eBook Publisher software to accomplish .imp generation.

When, and if, Michael can generate .imp ebooks directly without the eBook Publisher .dlls, this will become useful for "converting to .imp". For now, I think only "converting from .imp" would be possible.

What are your thoughts, Michael?

BTW, I'm not proficient enough in python to contribute direct code, but can be available for the tough questions. :)