Old 12-17-2008, 11:35 PM   #1
nrapallo
GuteBook/Mobi2IMP Creator
Reverse-engineering the .IMP format

A primer on the .IMP specification has never been published, but a very detailed explanation of the .IMP file format can be found here. It was reverse-engineered by Jeffrey Kraus-yao back in 2002. Jeffrey indicated that he reverse-engineered the .imp format by building .oeb test ebooks with eBook Publisher and then dropping each .oeb onto a desktop shortcut to the imp viewer.exe. While the viewer was still running, he would examine his temp folder, where he noticed that one of four .RES folders was complete. He wrote down the changed bits and then repeated the process, making small changes to the .oeb ebook each time. He started this even though he didn't own the REB1200 yet.

It was quite the accomplishment! Now please realize that back then practically only the REB1200 used the .IMP format, as there weren't a lot of GEB1150s (predecessor to the EBW1150) in 2002/3. Oh, how the tables have turned: now, for every REB1200 in use, there are tens or hundreds of EBW1150s!

Jeffrey's website is a great start to understanding the .IMP file format, but lacks such basic information as:
  • clarification about all image types allowed to be stored in a .imp ebook;
  • byte ordering (LittleEndian vs. BigEndian) differences between the EBW1150 and REB1200 .imp types;
  • how to identify the .imp type, i.e. whether it is for the EBW1150/GEB1150 or for the REB1200/GEB2150;
  • the fact that DATA.FRK (compressed or uncompressed) is the same for the EBW1150 AND the REB1200;
  • the fact that images are stored in their original resolution and in color, even though they may be reduced to fit the device screen and color depth;
  • some more "unknown" .RES filetypes like 'Devm', 'Form', etc.
You may have even noticed that the EBW1150 .imp is slightly bigger (in file size) than the same REB1200 .imp when no images are present. That is because the EBW1150 .imp uses two additional (irrelevant/unused) bytes for most records, which, multiplied by thousands of records, results in a larger file! I think a different "programming team" came up with the EBW1150 .imp file format, as most .RES filetype records are byte-reversed (i.e. BE vs. LE). That makes this .IMP reverse-engineering unnecessarily more difficult!
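
To illustrate the byte-order point, here is a minimal Python sketch that reads the same field in both byte orders. The record layout (a 4-character tag followed by a 32-bit length) and the helper names read_u32 and read_tag are hypothetical, chosen only for illustration; the reversal of the tag bytes is likewise an assumption, and the sketch does not claim which device uses which order.

```python
import struct

def read_u32(buf: bytes, offset: int, big_endian: bool) -> int:
    """Read an unsigned 32-bit integer at offset in the requested byte order."""
    fmt = ">I" if big_endian else "<I"
    return struct.unpack_from(fmt, buf, offset)[0]

def read_tag(buf: bytes, offset: int, reversed_bytes: bool) -> str:
    """Read a 4-character type tag, optionally undoing a byte-wise reversal."""
    raw = buf[offset:offset + 4]
    if reversed_bytes:
        raw = raw[::-1]
    return raw.decode("latin-1")

# Hypothetical record: tag 'StRn' followed by the length 0x120 (288).
be_record = b"StRn" + struct.pack(">I", 0x120)
le_record = b"StRn"[::-1] + struct.pack("<I", 0x120)

# Both lines print the same tag and length once the right order is chosen.
print(read_tag(be_record, 0, False), read_u32(be_record, 4, True))
print(read_tag(le_record, 0, True), read_u32(le_record, 4, False))
```

Reading a field with the wrong order gives a wildly different number (0x120 becomes 0x20010000), which is usually a quick way to spot which variant a test file is.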

I would like to build a knowledge-base here for the "definitive" understanding of the .IMP file format. As others have already told me about their own forays into .imp "nuts & bolts" investigations, I propose to start off this knowledge-base with my preliminary findings written as a Perl script. That script is imp_dump.pl (along with its required support files) and can be used to explode any un-encrypted .imp ebook into its (decompressed) text and image components.

Now, take note that I said text and images, NOT .html and images.

The original html is not stored in the .imp file. Only the basic components are: one record tells you where all the font/style changes are located in the file, another indicates where to end each line so that it doesn't spill over the screen size of that .imp, and other records store the images, the hyperlinks used, etc. Basically, all the building blocks are there (scattered) and we need to re-assemble those components somehow into a .html!
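
As a rough idea of what such re-assembly could look like, here is a minimal Python sketch that wraps plain text in spans based on a list of style-change offsets. The (offset, class name) pairs are a hypothetical stand-in for whatever the real style records contain, and rebuild_html is an illustrative name, not a function from imp_dump.pl or EBook-Tools.

```python
from html import escape

def rebuild_html(text, style_runs):
    """Wrap text in <span> elements according to offset-based style runs.

    style_runs: list of (start_offset, css_class) sorted by offset.
    This shape is an assumption, not the decoded record format.
    """
    # Each run extends to the start of the next one (or the end of the text).
    bounds = [start for start, _ in style_runs] + [len(text)]
    parts = []
    for (start, css), end in zip(style_runs, bounds[1:]):
        parts.append(f'<span class="{css}">{escape(text[start:end])}</span>')
    return "".join(parts)

sample = "Chapter 1It was a dark night."
print(rebuild_html(sample, [(0, "title"), (9, "body")]))
```

Images, hyperlinks and tables would need the same kind of offset-driven merging, each from its own record type.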

BTW, release v4.0 of EBook-Tools should have basic .imp support for .html generation with image linking, but will initially lack table/hyperlink/styles support. Those are planned for future releases.

I plan to collect postings from this thread and compile a wiki page with the relevant parts of the .IMP file format specification as reverse-engineered by ALL of us!

Below are all the .RES filetypes known to exist (thus far); volunteers can pick the un-documented .RES filetypes on a "first come, first served" basis.

Code:
.IMP file comprises these groups of .RES filetypes:
	text
	page_line
	page_header_footer
	links
	misc_info
	formatting
	tables
	images
	markups
	form_data

where:

text:
	'!!cm'
	'!!ky'
	DATA.FRK - decompressor written in Perl, C and soon to be C#.

page_line:
	'BPgz'
	'BPgZ'
	'ImRn' - written in Perl (see imp_dump.pl)
	'Pcz0' - written in Perl (see imp_dump.pl)
	'PcZ0' - written in Perl (see imp_dump.pl)
	'Pcz1' - written in Perl (see imp_dump.pl)
	'PcZ1' - written in Perl (see imp_dump.pl)

page_header_footer:
	'HfPz'
	'HfPZ'

links:
	'AncT' - written in Perl (see imp_dump.pl)
	'AnTg'
	'Lnks'
	'eLnk'

misc_info:
	'Batr'
	'Binf'
	'BGcl' - written in Perl (see imp_dump.pl)
	'BPos'
	'Clos'
	'Devm' - written in Perl (see imp_dump.pl)
	'Dict'
	'FRgs'
	'Glos'
	'MASK'
	'Mrgn' - written in Perl (see imp_dump.pl)
	'Hyp2'
	'Hyph'
	'Offs'
	'pInf' - written in Perl (see imp_dump.pl)
	'Pc31'
	'PPic' - written in Perl (see imp_dump.pl)
	'SKtb'
	'SMnu'
	'stbd'
	'!!sw' - written in Perl (see imp_dump.pl)

formatting:
	'ESts' - written in Perl (see imp_dump.pl)
	'HRle'
	'Styl'
	'StRn' - written in Perl (see imp_dump.pl)
	'StR#'
	'StR2'

tables:
	'Tabl'
	'TCel'
	'TRow'

images:
	'GIF ' - written in Perl (see imp_dump.pl)
	'JPEG' - written in Perl (see imp_dump.pl)
	'PIC2' - written in Perl (see imp_dump.pl)
	'PICT' - written in Perl (see imp_dump.pl)
	'PNG ' - written in Perl (see imp_dump.pl)

markups:
	'MRPs'
	'Ano2'
	'Hlts'
	'BTok'
	'BMks'

form_data:
	'TGNt'
	'Form'
	'FItm'
	'FIDt'
	'FrDt'
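
For anyone who wants the listing above in machine-readable form, here is a small Python sketch of it as a lookup table. The tags and groups are copied from the listing; the names RES_GROUPS, HANDLED, group_of and unhandled are illustrative only, and "handled" simply means the tag is marked "written in Perl (see imp_dump.pl)" above.

```python
# Groups and tags as given in the listing (DATA.FRK is a file, not a tag).
RES_GROUPS = {
    "text": ["!!cm", "!!ky"],
    "page_line": ["BPgz", "BPgZ", "ImRn", "Pcz0", "PcZ0", "Pcz1", "PcZ1"],
    "page_header_footer": ["HfPz", "HfPZ"],
    "links": ["AncT", "AnTg", "Lnks", "eLnk"],
    "misc_info": ["Batr", "Binf", "BGcl", "BPos", "Clos", "Devm", "Dict",
                  "FRgs", "Glos", "MASK", "Mrgn", "Hyp2", "Hyph", "Offs",
                  "pInf", "Pc31", "PPic", "SKtb", "SMnu", "stbd", "!!sw"],
    "formatting": ["ESts", "HRle", "Styl", "StRn", "StR#", "StR2"],
    "tables": ["Tabl", "TCel", "TRow"],
    "images": ["GIF ", "JPEG", "PIC2", "PICT", "PNG "],
    "markups": ["MRPs", "Ano2", "Hlts", "BTok", "BMks"],
    "form_data": ["TGNt", "Form", "FItm", "FIDt", "FrDt"],
}

# Tags marked "written in Perl (see imp_dump.pl)" in the listing.
HANDLED = {"ImRn", "Pcz0", "PcZ0", "Pcz1", "PcZ1", "AncT", "BGcl", "Devm",
           "Mrgn", "pInf", "PPic", "!!sw", "ESts", "StRn",
           "GIF ", "JPEG", "PIC2", "PICT", "PNG "}

def group_of(tag):
    """Return the group name for a 4-character tag, or None if unknown."""
    for group, tags in RES_GROUPS.items():
        if tag in tags:
            return group
    return None

# Tags still waiting for a volunteer.
unhandled = [t for tags in RES_GROUPS.values() for t in tags if t not in HANDLED]
print(len(unhandled), "tags still un-documented:", unhandled)
```
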
I'm calling out to any and all interested in this detailed examination of the .IMP file format! We need your findings summarized as either (1) Perl code or (2) offset / length with short narrative like Jeffrey's website uses. If you're so inclined, consider updating my imp_dump.pl to incorporate your findings and re-upload it here.

What you'll need is a test .imp, a good binary/hex editor (I use XVi32 Edit) and a lot of elbow grease and desire. Post here what you find out and I'll update the "un-documented" list above to reflect that!
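
If you don't have a hex editor handy, a few lines of Python will do for a first look. This is only a minimal sketch: hex_dump is an illustrative helper, and the sample bytes are made up rather than taken from a real .imp.

```python
def hex_dump(data: bytes, start: int = 0, length: int = 64, width: int = 16) -> str:
    """Format bytes as offset, hex columns and printable ASCII, one row per line."""
    stop = min(start + length, len(data))
    lines = []
    for off in range(start, stop, width):
        row = data[off:min(off + width, stop)]
        hexpart = " ".join(f"{b:02X}" for b in row)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        # Pad the hex column so the ASCII column lines up on short rows.
        lines.append(f"{off:08X}  {hexpart:<{width * 3 - 1}}  {text}")
    return "\n".join(lines)

# Made-up sample bytes; for a real file, read it in binary mode first,
# e.g. data = open(path_to_your_imp, "rb").read()
sample = b"StRn\x00\x00\x01\x20" + bytes(range(16))
print(hex_dump(sample))
```

Dumping the same region of two test ebooks that differ by one small change is a quick way to see which bytes moved.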

Thanks in advance!

p.s. After unzipping the attachment, just place any and all of your .imp files in the folder called 'place imp file here' and run 'extract imp files.bat'. Look at the generated file 'imp_dump.output.txt' for the parsing output for all the .imp files placed in that folder. Then look in that folder to see a directory for each .imp, which will contain the compressed & decompressed text and any images. Have fun HEX-exploring!

EDIT: 06Jan2009: added a compiled windows executable to a separate .zip
Attached Files
Imp_dump_v0.1.zip (1.02 MB)
Imp_dump_v0.1_windows_executable.zip (1.97 MB)

Last edited by nrapallo; 01-06-2009 at 03:49 PM. Reason: added compiled windows executable to a separate .zip