Mobiperl - Perl tools for handling MobiPocket files - Page 10

Jaapjan · 01-10-2008, 07:23 AM

I am curious actually if you wrote the decompression / decryption for the records after the EXTH / PDB record 0 yourself or if you let Perl do that for you? Or did Mobipocket make something special out of it?

Maybe you'll indulge me and also tell me how you decide the actual number of PDB records that need to be decompressed for the content, since the last three(?) PDB records clearly aren't part of the content itself. They're way too small for that.

And.. why would they split the content in such small blocks of 2000 characters or less? Easier handling for small mobile devices?

tompe · 01-10-2008, 12:11 PM

Quote:

Originally Posted by Jaapjan

I am curious actually if you wrote the decompression / decryption for the records after the EXTH / PDB record 0 yourself or if you let Perl do that for you? Or did Mobipocket make something special out of it?

The Perl modules Palm::PDB and Palm::Doc take care of the compression and the decompression, but this will not work for the highest compression because that compression is a secret MobiPocket scheme. I had to override some code in one of these modules for it to work on some DRM'ed files (also decompress if version was 5).

Quote:

Maybe you'll indulge me and also tell me how you decide the actual number of PDB records that need to be decompressed for the content, since the last three(?) PDB records clearly aren't part of the content itself. They're way too small for that.

The Palm Doc header tells how many records there are in the document. I think that the last two small records that are there sometimes and images are just add ons to the original format and these add ons are not compressed.
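The record count mentioned here sits in the Palm Doc header at the very start of PDB record 0. As a rough sketch only, assuming the commonly documented big-endian layout (compression, unused, text length, record count, record size), it could be read like this. The function name and hash keys are made up for illustration and are not taken from Palm::Doc:

```perl
use strict;
use warnings;

# Hypothetical helper: decode the Palm Doc header at the start of PDB record 0.
# Assumed layout (big-endian): compression (2 bytes), unused (2),
# uncompressed text length (4), text record count (2), max record size (2).
sub parse_palmdoc_header {
    my ($rec0) = @_;
    return unless defined $rec0 && length($rec0) >= 12;
    my ($compression, undef, $text_length, $record_count, $record_size)
        = unpack('n n N n n', $rec0);
    return {
        compression  => $compression,    # 1 = none, 2 = Palm Doc compression
        text_length  => $text_length,
        record_count => $record_count,   # text lives in records 1 .. record_count
        record_size  => $record_size,    # uncompressed size of a full text record
    };
}
```

Under that assumption, anything after record number record_count (images and the small trailing records) is not part of the compressed text.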

Quote:

And.. why would they split the content in such small blocks of 2000 characters or less? Easier handling for small mobile devices?

I have kind of wondered about that also. The number of records for my Oxford Concise Dictionary was ridiculous. Maybe it speeds up the searching or something like that. If you have one word per record for example...

Jaapjan · 01-11-2008, 02:55 AM

Quote:

Originally Posted by tompe

The Perl modules Palm::PDB and Palm::Doc take care of the compression and the decompression, but this will not work for the highest compression because that compression is a secret MobiPocket scheme. I had to override some code in one of these modules for it to work on some DRM'ed files (also decompress if version was 5).

I found a document from 2002 or something akin to that which described the regular compression scheme used and I implemented those now, more or less. (Since I hardly have the Perl modules available. Non perl code.) I have yet to run into any odd Mobi-format.. but then, I have not been looking. Maybe when the code can do a little more.

Quote:

Originally Posted by tompe

The Palm Doc header tells how many records there are in the document. I think that the last two small records that are there sometimes and images are just add ons to the original format and these add ons are not compressed.

True, but when you start decompressing the data at PDB record 1 (0 being the one holding the PDB 0 header, Mobi Header and EXTH header), how do you know where that file ends? For that matter, which file does the HTML you're decoding belong to anyway? The index?

Quote:

Originally Posted by tompe

I have kind of wondered about that also. The number of records for my Oxford Concise Dictionary was ridiculous. Maybe it speeds up the searching or something like that. If you have one word per record for example...

Perhaps memory constrained devices read these blocks in memory as sort of cache and move to the next and / or previous one only when needed.

tompe · 01-11-2008, 08:08 AM

Quote:

Originally Posted by Jaapjan

True, but when you start decompressing the data at PDB record 1 (0 being the one holding the PDB 0 header, Mobi Header and EXTH header), how do you know where that file ends? For that matter, which file does the HTML you're decoding belong to anyway? The index?

It also holds some other data like a long title and DRM stuff.

There is only one chunk of text that is compressed, so you decompress all the records from record 1 to record n_document_records, where n_document_records is the record count you find in the Palm Doc header in PDB record 0.

The Perl code doing the decompression is:

Code:

                my $header = $recs->[0];
                if( defined _parse_headerrec($header) ) {
                        # a proper Doc file should be fine, but if it's not Doc
                        # compression like some Mobi docs seem to be we want to
                        # bail early. Otherwise we end up with a huge stream of
                        # substr() errors and we _still_ don't get any content.
                        eval {
                                sub min { return ($_[0]<$_[1]) ? $_[0] : $_[1] }
                                my $maxi = min($#$recs, $header->{'records'});
                                for( my $i = 1; $i <= $maxi; $i ++ ) {
                                                $body .= _decompress_record( $header->{'version'},
                                                        $recs->[$i]->{'data'} );
                                }
                        };
                        return undef if $@;
                }


# algorithm taken from makedoc7.cpp with reference to
# http://patb.dyndns.org/Programming/PilotDoc.htm and
# http://www.pyrite.org/doc_format.html
sub _decompress_record($$) {
        my ($version,$in) = @_;
        return $in if $version == DOC_UNCOMPRESSED;

        my $out = '';

        my $lin = length $in;
        my $i = 0;
        while( $i < $lin ) {
                my $ch = substr( $in, $i ++, 1 );
                my $och = ord($ch);

                if( $och >= 1 and $och <= 8 ) {
                        # copy this many bytes... basically a way to 'escape' data
                        $out .= substr( $in, $i, $och );
                        $i += $och;
                } elsif( $och < 0x80 ) {
                        # pass through 0, 9-0x7f
                        $out .= $ch;
                } elsif( $och >= 0xc0 ) {
                        # 0xc0-0xff are 'space' plus ASCII char
                        $out .= ' ';
                        $out .= chr($och ^ 0x80);
                } else {
                        # 0x80-0xbf is sequence from already decompressed buffer
                        my $nch = substr( $in, $i ++, 1 );
                        $och = ($och << 8) + ord($nch);
                        my $m = ($och & 0x3fff) >> 3;
                        my $n = ($och & 0x7) + 3;

                        # This isn't very perl-like, but a simple
                        # substr($out,$lo-$m,$n) doesn't work.
                        my $lo = length $out;
                        for( my $j = 0; $j < $n; $j ++, $lo ++ ) {
                                die "bad Doc compression" unless ($lo-$m) >= 0;
                                $out .= substr( $out, $lo-$m, 1 );
                        }
                }
        }

        return $out;
}

Jaapjan · 01-13-2008, 12:43 PM

Thanks to you as well as a few Palm documentation files from 2001 (and lots of use of a hex editor, UltraEdit) I managed to make some prototype C# code that reads out all the raw HTML content. To share some information with you, if you like, the file is actually laid out like this:

Palm header
Palm record index
MOBI header (Kind of obvious)
EXTH header (A dictionary format set of information about the book)
Content (Compressed)
Images (Uncompressed)
FLIS header (license information)
FCIS header (images information)

I remain unsure how to determine where the images start and how long they are. Nor do I know what the 2-byte record between content and images is, or the 4-byte one at the end. Maybe some sort of checksum.

tompe · 01-13-2008, 01:01 PM

Quote:

Originally Posted by Jaapjan

Thanks to you as well as a few Palm documentation files from 2001 (and lots of use of a hex editor, UltraEdit) I managed to make some prototype C# code that reads out all the raw HTML content. To share some information with you, if you like, the file is actually laid out like this:

Palm header
Palm record index
MOBI header (Kind of obvious)
EXTH header (A dictionary format set of information about the book)
Content (Compressed)
Images (Uncompressed)
FLIS header (license information)
FCIS header (images information)

I remain unsure how to determine where the images start and how long they are. Nor do I know what the 2-byte record between content and images is, or the 4-byte one at the end. Maybe some sort of checksum.

In my EXTH.pm I have tried to document as much as I have learned about the possible information in the EXTH. You can also have DRM information before or after the EXTH.

Are you sure that FLIS and FCIS are license information?

I think the record format contains information about how long the record is.
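For what it is worth, in the standard Palm database layout the record index stores only a start offset per record, so a record's length is normally inferred from where the next record begins (or from the end of the file for the last one). A minimal sketch under that assumption; the function name is hypothetical and this is not code from Palm::PDB:

```perl
use strict;
use warnings;

# Hypothetical helper: return one [offset, length] pair per PDB record.
# Assumes the standard layout: a 78-byte PDB header whose last two bytes
# hold the record count, followed by 8-byte index entries
# (4-byte offset, 1-byte attributes, 3-byte unique ID), all big-endian.
sub pdb_record_spans {
    my ($file) = @_;                       # whole file as a byte string
    return unless length($file) >= 78;
    my $count = unpack('n', substr($file, 76, 2));
    return unless length($file) >= 78 + 8 * $count;

    my @offsets;
    for my $i (0 .. $count - 1) {
        push @offsets, unpack('N', substr($file, 78 + 8 * $i, 4));
    }

    my @spans;
    for my $i (0 .. $#offsets) {
        # A record ends where the next one starts; the last one ends at EOF.
        my $end = $i < $#offsets ? $offsets[$i + 1] : length $file;
        push @spans, [ $offsets[$i], $end - $offsets[$i] ];
    }
    return @spans;
}
```

With the spans in hand, substr($file, $offset, $length) gives the raw bytes of any record, image records included.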

I think that the first image record index is in MOBI+0x5B. When I decode a file I just check each record to see if it is an image.

Jaapjan · 01-13-2008, 01:09 PM

Quote:

Originally Posted by tompe

In my EXTH.pm I have tried to document as much as I have learned about the possible information in the EXTH. You can also have DRM information before or after the EXTH.

Are you sure that FLIS and FCIS are license information?

I think the record format contains information about how long the record is.

I think that the first image record index is in MOBI+0x5B. When I decode a file I just check each record to see if it is an image.

MOBI +0x5B for the image start? What if you have content much shorter than 0x5B records?

Actually the length of the content can simply be read from the header after which the images start. There's a 2 byte header in between. As for the images, do you also assume each image is 2 records long? Or get the length from elsewhere?

No, FCIS isn't about license information. It contains information about the images and the content size at least.

JeffElkins · 01-13-2008, 01:31 PM

Code:

lit2mobi lembert_02_-_Stranglers_Moon.lit
Unpack file lembert_02_-_Stranglers_Moon.lit in dir ctmp
+---[ ConvertLIT (Version 1.8) ]---------------[ Copyright (c) 2002,2003 ]---
ConvertLIT comes with ABSOLUTELY NO WARRANTY; for details
see the COPYING file or visit "http://www.gnu.org/license/gpl.html".
This is free software, and you are welcome to redistribute it under
certain conditions.  See the GPL license for details.
LIT INFORMATION.........
DRM         =  1
Timestamp   =  6ac89519
Creator     =  00000000
Language    =  00000409
Writing out "d'Alembert_2_-_Stranglers_Moon" as "d'Alembert 2 - Stranglers Moon.htm" ...
Successfully written to "ctmp/d'Alembert 2 - Stranglers Moon.htm".

Writing out "RW_~Cover01" as "~Cover01.jpg" ...
Successfully written to "ctmp/~Cover01.jpg".

Writing out "RW_~Cover02" as "~Cover02.jpg" ...
Successfully written to "ctmp/~Cover02.jpg".

Writing out "RW_~Cover03" as "~Cover03.jpg" ...
Successfully written to "ctmp/~Cover03.jpg".

Writing out "RW_~Cover04" as "~Cover04.jpg" ...
Successfully written to "ctmp/~Cover04.jpg".

Writing out "RW_~Cover05" as "~Cover05.jpg" ...
Successfully written to "ctmp/~Cover05.jpg".

Exploded "lembert_02_-_Stranglers_Moon.lit" into "ctmp/".
Read in HTML tree from opf
Opf: Initialize from file: lembert_02_-_Stranglers_Moon.opf
CONTENT: <?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE package
  PUBLIC "+//ISBN 0-9673008-1-9//DTD OEB 1.0.1 Package//EN"
  "http://openebook.org/dtds/oeb-1.0.1/oebpkg101.dtd">
<package unique-identifier="OverDriveGUID">
 <metadata>
  <dc-metadata xmlns:dc="http://purl.org/dc/elements/1.0/" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
   <dc:Title>d'Alembert 2 - Stranglers Moon</dc:Title>
   <dc:Identifier id="OverDriveGUID" scheme="GUID">{6ECBF068-47B6-49B7-838E-CB056DA516B7}</dc:Identifier>
  </dc-metadata>
  <x-metadata>
   <meta name="rwver-ReaderWorks-SDK-Control" content="2, 0, 2, 0215 (02/15/2002)" />
   <meta name="rwver-HTML-Input-Filter" content="2, 0, 2, 0215 (02/15/2002)" />
   <meta name="rwver-Image-Input-Filter" content="2, 0, 2, 0215 (02/15/2002)" />
   <meta name="rwver-Text-Input-Filter" content="2, 0, 2, 0215 (02/15/2002)" />
   <meta name="rwver-Word-Doc-Input-Filter" content="2.0.2.0215 (02/15/2002)" />
   <meta name="rwver-LIT-file-generator" content="1.5.1.0280 (10/05/2000)" />
   <meta name="rw-License-Key" content="RWPTL" />
  </x-metadata>
 </metadata>
 <manifest>
  <item id="d'Alembert_2_-_Stranglers_Moon" href="d'Alembert 2 - Stranglers Moon.htm" media-type="text/html" />
  <item id="RW_~Cover01" href="~Cover01.jpg" media-type="image/jpeg" />
  <item id="RW_~Cover02" href="~Cover02.jpg" media-type="image/jpeg" />
  <item id="RW_~Cover03" href="~Cover03.jpg" media-type="image/jpeg" />
  <item id="RW_~Cover04" href="~Cover04.jpg" media-type="image/jpeg" />
  <item id="RW_~Cover05" href="~Cover05.jpg" media-type="image/jpeg" />
 </manifest>
 <spine>
  <itemref idref="d'Alembert_2_-_Stranglers_Moon" />
 </spine>
 <guide>
  <reference type="other.ms-thumbimage-standard" href="~Cover01.jpg" />
  <reference type="other.ms-coverimage-standard" href="~Cover02.jpg" />
  <reference type="other.ms-titleimage-standard" href="~Cover03.jpg" />
  <reference type="other.ms-thumbimage" href="~Cover04.jpg" />
  <reference type="other.ms-coverimage" href="~Cover05.jpg" />
 </guide>
</package>
OPF: TITLE: d'Alembert 2 - Stranglers Moon
OPF: CREATOR:
Init from manifest
d'Alembert_2_-_Stranglers_Moon - d'Alembert 2 - Stranglers Moon.htm - text/html
RW_~Cover01 - ~Cover01.jpg - image/jpeg
Could not read image file: ~Cover01.jpg
RW_~Cover02 - ~Cover02.jpg - image/jpeg
Could not read image file: ~Cover02.jpg
RW_~Cover03 - ~Cover03.jpg - image/jpeg
Could not read image file: ~Cover03.jpg
RW_~Cover04 - ~Cover04.jpg - image/jpeg
Could not read image file: ~Cover04.jpg
RW_~Cover05 - ~Cover05.jpg - image/jpeg
Could not read image file: ~Cover05.jpg
Warning, RW_~Cover01 missing from spine, adding
Warning, RW_~Cover02 missing from spine, adding
Warning, RW_~Cover03 missing from spine, adding
Warning, RW_~Cover04 missing from spine, adding
Warning, RW_~Cover05 missing from spine, adding
Init from guide
OPFTITLE: d'Alembert 2 - Stranglers Moon
OPFAUTHOR:
Coverimage: ~Cover02.jpg
SPINE: adding d'Alembert_2_-_Stranglers_Moon - d'Alembert 2 - Stranglers Moon.htm - text/html
Adding: d'Alembert 2 - Stranglers Moon.htm - d'Alembert_2_-_Stranglers_Moon
+++.+SPINE: adding RW_~Cover01 - ~Cover01.jpg - image/jpeg
SPINE: adding RW_~Cover02 - ~Cover02.jpg - image/jpeg
SPINE: adding RW_~Cover03 - ~Cover03.jpg - image/jpeg
SPINE: adding RW_~Cover04 - ~Cover04.jpg - image/jpeg
SPINE: adding RW_~Cover05 - ~Cover05.jpg - image/jpeg
All spine elements have been added
Have Read in HTML tree from opf
Saving mobi file (version 4): lembert_02_-_Stranglers_Moon.mobi
COVEROFFSET: 0
THUMBOFFSET: 1
EXTH setting data: author - 100 -  - 0x
EXTH add: author - 100 -
EXTH setting data: coveroffset - 201 - 0 - 0x30
EXTH add: coveroffset - 201 - 0 - 0x30
EXTH setting data: thumboffset - 202 - 1 - 0x31
EXTH add: thumboffset - 202 - 1 - 0x31
MOBIHDR: imgrecpointer: 112
EXTH setting data: author - 100 -  - 0x
EXTH add: author - 100 -
EXTH setting data: coveroffset - 201 - 0 - 0x30
EXTH add: coveroffset - 201 - 0 - 0x30
EXTH setting data: thumboffset - 202 - 1 - 0x31
EXTH add: thumboffset - 202 - 1 - 0x31
New record for image 112: ~Cover02.jpg
Reading data from file: ~Cover02.jpg
[Image::BMP] ERROR: Not a bitmap: [~Cover02.jpg] at /usr/local/bin/MobiPerl/Util.pm line 486

I'm still finding that lit2mobi chokes on a certain percentage of books I'm trying to convert. The common denominator seems to be that they all were processed by Microsoft Word (all the jpegs are MS Word Graphics). Rather than just crash, could lit2mobi call html2mobi to try to process the html file in ctmp or something? These crashes play hob with batch processing.

Edit: This was with version .25

tompe · 01-13-2008, 01:32 PM

Quote:

Originally Posted by Jaapjan

MOBI +0x5B for the image start? What if you have content much shorter than 0x5B records?

In the address MOBI+0x5C (wrote B by mistake, that is byte 0x5C-0x5F in the MOBI header) you have four bytes that are the index for the first image record.
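To make the offset concrete: 0x5C is a byte position inside the MOBI header, not a record number. A small sketch of reading that field, assuming the MOBI header starts at byte 16 of PDB record 0 (right after the Palm Doc header) and that the value is big-endian. The function name is invented for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: read the first image record index from PDB record 0.
# Assumes the MOBI header begins at byte 16, right after the Palm Doc header,
# and that the index is a 4-byte big-endian value at MOBI+0x5C.
sub first_image_index {
    my ($rec0) = @_;
    my $mobi_start = 16;
    my $field      = $mobi_start + 0x5C;
    return unless length($rec0) >= $field + 4;
    return unless substr($rec0, $mobi_start, 4) eq 'MOBI';
    return unpack('N', substr($rec0, $field, 4));
}
```

The function returns undef when record 0 is too short or does not carry the MOBI identifier, so a caller can fall back to scanning the records one by one.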

Quote:

As for the images, do you also assume each image is 2 records long? Or get the length from elsewhere?

Each image can only be one record. That is the reason for the 64K limit on images.

Quote:

No, FCIS isn't about license information. It contains information about the images and the content size at least.

How do you know what FLIS and FCIS are about?

Jaapjan · 01-13-2008, 01:36 PM

Quote:

How do you know what FLIS and FCIS are about?

Well, the FCIS I know because it contains a field with the content length as well as a field that indicates the number of images available in the file. It also grows and shrinks depending on how many image files are present in the file so it suggests it is something of an information record where most information is for the images.

I really need more Mobipocket files before I can be sure about the FLIS, because I only have a batch of DRM-free files to test with. However, it is a fixed-size record with remarkably many 0x0's and 0xFFFFFFFF's in it, suggesting unused fields. And since all the other relevant information is elsewhere, it seems very probable it is for the DRM. As mentioned, I need a DRM'ed file to be sure.

tompe · 01-13-2008, 01:41 PM

Quote:

Originally Posted by Jaapjan

Well, the FCIS I know because it contains a field with the content length as well as a field that indicates the number of images available in the file. It also grows and shrinks depending on how many image files are present in the file so it suggests it is something of an information record where most information is for the images.

OK. But this record is not obligatory. Maybe it is some kind of optimization.

Which record is cover image and which is thumb nail is given by data in EXTH.
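As a rough illustration of where that EXTH data lives, here is a sketch that walks an EXTH block and collects its records by type number. It assumes the commonly described layout: the identifier 'EXTH', a 4-byte header length, a 4-byte record count, then records made of a 4-byte type, a 4-byte length that includes those 8 bytes, and the payload. The function name is hypothetical, and the payload is returned as raw bytes because its encoding is not settled in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: parse an EXTH block (starting at the 'EXTH' identifier)
# into a hash of type number => array of raw payload strings.
# Assumed record layout: type (4 bytes), length (4 bytes, including these
# 8 header bytes), then length-8 bytes of data; all integers big-endian.
sub parse_exth {
    my ($exth) = @_;
    return {} unless length($exth) >= 12 && substr($exth, 0, 4) eq 'EXTH';
    my (undef, $count) = unpack('N N', substr($exth, 4, 8));

    my %records;
    my $pos = 12;
    for (1 .. $count) {
        last if $pos + 8 > length $exth;
        my ($type, $len) = unpack('N N', substr($exth, $pos, 8));
        last if $len < 8 || $pos + $len > length $exth;   # malformed record
        push @{ $records{$type} }, substr($exth, $pos + 8, $len - 8);
        $pos += $len;
    }
    return \%records;
}
```

In the lit2mobi log earlier on this page, types 201 and 202 are the ones labelled coveroffset and thumboffset, and 100 is the author.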

Jaapjan · 01-13-2008, 01:43 PM

Quote:

Originally Posted by tompe

OK. But this record is not obligatory. Maybe it is some kind of optimization.

Which record is cover image and which is thumb nail is given by data in EXTH.

Every file I have seen so far, including those made by the Creator, has these two records as well.

But it is interesting that you say they're optional. That means there must be an indication somewhere if they're included or not.

tompe · 01-13-2008, 02:01 PM

Quote:

Originally Posted by Jaapjan

Every file I have seen so far, including those made by the Creator, has these two records as well.

But it is interesting that you say they're optional. That means there must be an indication somewhere if they're included or not.

I have not generated them and every file I have generated works on reading devices. Why must there be an indication if they are included? You just have to check the first bytes of every record that is not content to see if they are there.
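Checking the first bytes of a record could look roughly like the sketch below. The image signatures are the usual file magic numbers; FLIS and FCIS are matched simply on the four letters that give those records their names in this thread. The function name and the returned labels are made up for illustration, and the two-byte BMP test in particular is weak:

```perl
use strict;
use warnings;

# Hypothetical helper: guess what a non-text record holds from its first bytes.
sub classify_record {
    my ($data) = @_;
    my $head = substr($data, 0, 8);
    return 'jpeg' if index($head, "\xFF\xD8\xFF") == 0;
    return 'gif'  if $head =~ /\AGIF8[79]a/;
    return 'png'  if index($head, "\x89PNG\r\n\x1A\n") == 0;
    return 'bmp'  if index($head, 'BM') == 0;      # weak: only two bytes
    return 'FLIS' if index($head, 'FLIS') == 0;
    return 'FCIS' if index($head, 'FCIS') == 0;
    return 'unknown';
}
```

A reader can then loop over every record after the last text record and treat 'unknown' ones as data it does not understand, rather than relying on a flag that says whether FLIS or FCIS is present.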

I would be more than happy to generate these record if I manage to find out the format of them and what they are for.

alexxxm · 01-14-2008, 02:48 AM

Have you ever thought about implementing a conversion to the Sony LRF format in your suite? I'd really like to write it myself, but that format is not well documented, and/or relies on some Windows DLL...

Alessandro

Jaapjan · 01-14-2008, 04:22 AM

Quote:

Originally Posted by alexxxm

Have you ever thought about implementing a conversion to the Sony LRF format in your suite? I'd really like to write it myself, but that format is not well documented, and/or relies on some Windows DLL...
Alessandro

Personally I am hardly at that stage. I am just doing some hobby programming left and right and my current interest lies in what tompe is already doing.

Currently he is right more often than I am. But he might want to add some Sony code to his program. Though aren't there any LRF projects that convert to HTML? From HTML it is easy to get to Mobipocket.

Speaking of which,

What kind of test batch of files do you use, Tompe? Can you provide me with URLs to them? I am afraid you were right about the images too. Maybe, anyway. I discovered that there are many more images in the file than I thought (also JPGs, not only GIFs).

Back to the Mansion!
