KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 19

DiapDealer · 01-18-2012, 09:00 AM

Quote:

Originally Posted by whoqwerty

Hello, I am getting an error on line 886, in <module>, sys.exit(main()), and
line 886, in main, infile, outdir = argv[1:]
ValueError: too many values to unpack

I saw in another thread that you may be trying to unpack a print replica book (azw4). Is that the file that's generating this error? I don't know for sure, but we may have temporarily broken the azw4 support in this script with the latest version. Give the previous version a try if you just want to retrieve a PDF from an azw4 file. That version can be found in post #5 in this thread.

EDIT: Nope. I was wrong. The latest experimental version should work for Print Replica books. Just make sure you have the command line right:

Code:

python mobi_unpack.py filename.azw4 output/

"output" being the name of the folder where you want mobi_unpack to dump all its output.

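For readers hitting the same error: the quoted line, infile, outdir = argv[1:], unpacks exactly two command-line arguments, so Python raises "too many values to unpack" whenever more than two are passed. One plausible trigger (an assumption, not confirmed in this thread) is a file name containing spaces that was not quoted. A minimal sketch of a friendlier check follows; parse_args is a hypothetical helper, not a function in mobi_unpack.py.

```python
def parse_args(argv):
    """Return (infile, outdir), or exit with a usage message.

    Hypothetical helper for illustration only; not part of mobi_unpack.py.
    """
    args = argv[1:]
    if len(args) != 2:
        # An unquoted file name with spaces arrives as several arguments.
        raise SystemExit("usage: python mobi_unpack.py infile outdir "
                         "(got %d arguments)" % len(args))
    infile, outdir = args
    return infile, outdir
```

Putting quotes around the file name on the command line, for example python mobi_unpack.py "my book.azw4" output/, keeps it as a single argument.
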
Neonical · 01-19-2012, 02:02 AM

Quote:

Originally Posted by DiapDealer

I saw in another thread that you may be trying to unpack a print replica book (azw4). Is that the file that's generating this error? I don't know for sure, but we may have temporarily broken the azw4 support in this script with the latest version. Give the previous version a try if you just want to retrieve a PDF from an azw4 file. That version can be found in post #5 in this thread.

EDIT: Nope. I was wrong. The latest experimental version should work for Print Replica books. Just make sure you have the command line right:

Code:

python mobi_unpack.py filename.azw4 output/

"output" being the name of the folder where you want mobi_unpack to dump all its output.

I'm sorry, I'm still completely new to this whole process, and I know this is a dumb question.

I've installed ActivePython 2.7.2 (64-bit) from the link given in this thread.
I have downloaded your latest zip, mobi_unpack_updated4.zip.
I don't know what to do next with the things I have. I'm just confused; this is all new to me.
I've tried to fish through all the pages of this thread, but it's not much help to me.

Can you please explain to me exactly what I need to do? Hopefully I've understood at least enough to know that this is how to convert my azw4 file.

Thank you

nickredding · 01-27-2012, 03:17 AM

Following the discussion in the thread Generating mobi and KF8 parts of Kindle file from separate sources I have updated mobi_unpack to (a) create mobi7-only and mobi8-only variants of the input file in the case where it is a combo mobi7/mobi8 file and (b) properly handle unpacking mobi8-only files. The updated zip folder is attached.

I've tested with the kindle previewer and my Fire. The only wrinkle I've seen is if I try to open the mobi7-only file in the previewer with device=Fire it chokes. This doesn't happen if I open the mobi7-only file on my actual Fire. The previewer also works properly with the mobi7-only file if I set device=Kindle. This may be an issue with the zero-length RESC and FONT records in the mobi7-only files, who knows. I'm leaving that issue open because I don't think it's important and might actually reflect a bug in the previewer. Incidentally I don't have my Kindle3 with me so I haven't checked the mobi7-only files on it.

KevinH · 01-27-2012, 12:16 PM

Hi Nick,

I am integrating your changes into my own version of mobi_unpack_update5 (a few minor updates to what DiapDealer had posted to increase robustness when no css is provided, no ncx exists, etc) and I can't figure out the following.

In your split version you use as mobi header offsets:

first_content_index = 192 (or 0xc0 hex)
last_content_index = 194 (or 0xc2 hex)

You never access first_content_index but you do access last_content_index via >H to find the lastimage as follows:

lastimage = getint(datain_rec0,last_content_index,'H')

Yet my updated mobi_unpack code which was based on testing kindlegen output both when no css is provided (so rawml need never be split since there are no flow pieces), and when multiple css sheets are provided (multiple flow pieces (or svg pieces)) makes use of the following:

# need to use the FDST record to find out how to properly unpack
# the rawML into pieces
# it is simply a table of start and end locations for each flow piece
self.fdst = 0xffffffff
self.fdst, = struct.unpack_from('>L', self.header, 0xc0)
self.fdstcnt, = struct.unpack_from('>L', self.header, 0xc4)
# if cnt is 1 or less, fdst section number can be garbage
if self.fdstcnt <= 1:
    self.fdst = 0xffffffff
if self.fdst != 0xffffffff:
    self.fdst += self.start

But *only* if this is inside a KF8 Mobi header:

mobi unpack code uses:

# Offset  Format  Meaning
# ------  ------  -------------
# 0xc0    >L      FDST start
# 0xc4    >L      Number of records inside FDST

So it appears to me that 0xc0 is either a variable-length field in a structure for which we have yet to find a proper indicator, or its size and meaning are different inside older mobi headers and newer mobi headers.

older mobi header

# Offset  Format  Meaning
# ------  ------  -------------
# 0xc0    >H      first_content_index
# 0xc2    >H      last_content_index

kf8 mobi header

# Offset  Format  Meaning
# ------  ------  -------------
# 0xc0    >L      FDST start
# 0xc4    >L      Number of records inside FDST

Is this your understanding as well?

Thanks,

KevinH
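
The two layouts in the tables above can be sketched as one reader that interprets the bytes at 0xc0 according to the header type. This is only an illustration under stated assumptions: read_c0_fields and its is_kf8 flag are hypothetical, and how to decide whether a header is KF8 is left to the caller because this post does not say. Unlike the snippet above, the sketch does not add self.start to the FDST value.

```python
import struct

def read_c0_fields(header, is_kf8):
    """Interpret the bytes at offset 0xc0 of a mobi header.

    Hypothetical helper; the caller supplies is_kf8.
    """
    if is_kf8:
        # KF8 header: two big-endian unsigned longs.
        fdst, fdstcnt = struct.unpack_from('>LL', header, 0xc0)
        if fdstcnt <= 1:
            # With one flow piece or fewer, the FDST value can be garbage.
            fdst = 0xffffffff
        return {'fdst': fdst, 'fdstcnt': fdstcnt}
    # Older header: two big-endian unsigned shorts.
    first, last = struct.unpack_from('>HH', header, 0xc0)
    return {'first_content_index': first, 'last_content_index': last}
```

Reading an older header with the KF8 layout would merge both 16-bit indexes into one 32-bit number, which is why the two cases have to be kept apart.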

KevinH · 01-27-2012, 12:30 PM

Hi,

1. You need to properly install ActiveState ActivePython 2.7.2 on your machine.

2. You need to unzip the mobi_unpack_update4.zip archive.

3. To make things simpler, copy the mobi_unpack_update4 folder to your Desktop.

4. Then copy the .azw4 (DRM-free) ebook into the mobi_unpack_update4 folder now on your Desktop.

5. You need to open a terminal session (command line) to run the program;
this can typically be done by running cmd.exe.

6. Use the "cd" command to change directory to your Desktop and then into the mobi_unpack_update4 folder, and then run the following commands (one per line, hitting Return after each):

mkdir output
python .\mobi_unpack.py YOUREBOOKNAMEHERE.azw4 output\

Once it completes, you should be able to find the .pdf file inside of the output folder that was created. You might have to fish around to find it.
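
To save some of that fishing around, a short script can list every PDF under the output folder. This is a minimal sketch, not something shipped with mobi_unpack; find_pdfs is a hypothetical helper name.

```python
import os

def find_pdfs(outdir):
    """Return a sorted list of paths to all .pdf files under outdir."""
    found = []
    for root, dirs, files in os.walk(outdir):
        for name in files:
            # Compare case-insensitively so .PDF is matched too.
            if name.lower().endswith('.pdf'):
                found.append(os.path.join(root, name))
    return sorted(found)
```

Calling find_pdfs("output") from inside the unpack folder returns the matching paths, including any in subfolders.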

FYI: if you do not know command line tools, then you would be better off getting the MobiUnpack.pyw gui program that interfaces to the version 32 mobiunpack.py program and using it. If you want instructions on how to do that instead of using the command line, just let us know.

KevinH

Quote:

Originally Posted by Neonical

I'm sorry, I'm still completely new to this whole process, and I know this is a dumb question.

I've installed ActivePython 2.7.2 (64-bit) from the link given in this thread.
I have downloaded your latest zip, mobi_unpack_updated4.zip.
I don't know what to do next with the things I have. I'm just confused; this is all new to me.
I've tried to fish through all the pages of this thread, but it's not much help to me.

Can you please explain to me exactly what I need to do? Hopefully I've understood at least enough to know that this is how to convert my azw4 file.

Thank you

nickredding · 01-27-2012, 01:22 PM

Quote:

Originally Posted by KevinH

But *only* if this is inside a KF8 Mobi header:

mobi unpack code uses:

# Offset  Format  Meaning
# ------  ------  -------------
# 0xc0    >L      FDST start
# 0xc4    >L      Number of records inside FDST

So it appears to me that 0xc0 is either a variable-length field in a structure for which we have yet to find a proper indicator, or its size and meaning are different inside older mobi headers and newer mobi headers.

Kevin - I just looked at 2 different KF8 files and you are correct (I wondered why "first_content_index" in the KF8 header was zero instead of 1). For KF8 headers 0xC0 is an >L indicating the FDST record number and 0xC4 is an >L indicating the number of sections in the FDST record. I guess the kindle previewer and Fire just don't need first_content_index, assuming it's 1.

Quite strange that Amazon is formatting the header differently for KF8--most software people would agree that's a dumb thing to do because it's error-prone.

One thing that's still not clear is what (if any) is the significance of the >L at 0xC4 in the mobi7 header? There is no doubt that for these headers the first and last content indexes are two >H starting at 0xC0 but the following 4 bytes are 0x1 for a kindlegen2-generated file and 0x2 for Jerome.prc.

Do you want to fix the mobi-split code or shall I?

KevinH · 01-27-2012, 01:41 PM

Hi Nick,

Quote:

Originally Posted by nickredding

Do you want to fix the mobi-split code or shall I?

Will you please fix it? I am not completely up to speed on what you have written yet and I don't want to mess it up. Our approaches and notations are different enough that I have to think about what you wrote and translate it into how I think about things.

I will take your next version and incorporate my minor robustness changes back into your code base.

Thanks,

Kevin

ps. besides updating the FDST start I think we also need to update the DATP start which is at 0x100 if I am not mistaken. I still have no idea what the DATP section is and if it is really needed.

KevinH · 01-27-2012, 01:53 PM

Hi,

I have one other test case that has multiple css sheets and svg pieces and that grows the number of flow items and therefore the fdst count.

I will check to see what it puts for those 4 bytes in the older mobi header, to see if it clears up anything.

KevinH

Quote:

Originally Posted by nickredding

Kevin - I just looked at 2 different KF8 files and you are correct (I wondered why "first_content_index" in the KF8 header was zero instead of 1). For KF8 headers 0xC0 is an >L indicating the FDST record number and 0xC4 is an >L indicating the number of sections in the FDST record. I guess the kindle previewer and Fire just don't need first_content_index, assuming it's 1.

Quite strange that Amazon is formatting the header differently for KF8--most software people would agree that's a dumb thing to do because it's error-prone.

One thing that's still not clear is what (if any) is the significance of the >L at 0xC4 in the mobi7 header? There is no doubt that for these headers the first and last content indexes are two >H starting at 0xC0 but the following 4 bytes are 0x1 for a kindlegen2-generated file and 0x2 for Jerome.prc.

Do you want to fix the mobi-split code or shall I?

KevinH · 01-27-2012, 02:34 PM

Hi,

Even my new testcase showed just a 1 for those 4 bytes in the older mobi header. I have no idea what those bytes might mean.

My testcase did freak out the split code. I printed the "ofs" and "it" values used in deletesectionrange and ended up with an invalid ofs (it is negative), possibly because one of the original start values was 0xffffffff? I am not sure.

I will try to track this down.

in split 9444436 418
in split 4220892 194
in split 976 196
in split -9443580 198

Traceback (most recent call last):
  File "./mobi_unpack.py", line 919, in <module>
    sys.exit(main())
  File "./mobi_unpack.py", line 910, in main
    unpackBook(infile, outdir)
  File "./mobi_unpack.py", line 575, in unpackBook
    mobisplit = mobi_split(infile)
  File "/Users/kbhend/Desktop/nick_mobi_unpack_update5/mobi_split.py", line 242, in __init__
    self.result_file8 = deletesectionrange(datain,0,datain_kf8-1)
  File "/Users/kbhend/Desktop/nick_mobi_unpack_update5/mobi_split.py", line 105, in deletesectionrange
    dataout = dataout[:first_pdb_record+i*8] + struct.pack('>L',ofs) + struct.pack('L',it) + dataout[first_pdb_record+i*8+8:]
struct.error: integer out of range for 'L' format code
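
A struct.error like the one above is what the struct module raises when a value does not fit the unsigned 32-bit 'L' code, and a negative ofs such as -9443580 cannot fit. A minimal sketch of a guard that names the offending value before packing follows; pack_u32_be is a hypothetical helper, not a function in mobi_split.py.

```python
import struct

def pack_u32_be(value, label="value"):
    """Pack value as a big-endian unsigned 32-bit integer.

    Raises ValueError with a readable message when value is out of range.
    Hypothetical helper for illustration only.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("%s=%d does not fit in an unsigned 32-bit field"
                         % (label, value))
    return struct.pack('>L', value)
```

Calling pack_u32_be(ofs, "ofs") in place of a bare struct.pack would report the bad offset by name, which can make a negative value easier to trace back to the arithmetic that produced it.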

nickredding · 01-27-2012, 02:53 PM

Quote:

Originally Posted by KevinH

Hi,

Even my new testcase showed just a 1 for those 4 bytes in the older mobi header. I have no idea what those bytes might mean.

My testcase did freak out the split code. I printed the "ofs" and "it" values used in deletesectionrange and ended up with an invalid ofs (it is negative), possibly because one of the original start values was 0xffffffff? I am not sure.

I will try to track this down.

in split 9444436 418
in split 4220892 194
in split 976 196
in split -9443580 198

Traceback (most recent call last):
  File "./mobi_unpack.py", line 919, in <module>
    sys.exit(main())
  File "./mobi_unpack.py", line 910, in main
    unpackBook(infile, outdir)
  File "./mobi_unpack.py", line 575, in unpackBook
    mobisplit = mobi_split(infile)
  File "/Users/kbhend/Desktop/nick_mobi_unpack_update5/mobi_split.py", line 242, in __init__
    self.result_file8 = deletesectionrange(datain,0,datain_kf8-1)
  File "/Users/kbhend/Desktop/nick_mobi_unpack_update5/mobi_split.py", line 105, in deletesectionrange
    dataout = dataout[:first_pdb_record+i*8] + struct.pack('>L',ofs) + struct.pack('L',it) + dataout[first_pdb_record+i*8+8:]
struct.error: integer out of range for 'L' format code

Kevin - if you pm me the file I'll debug the mobi_split code

nickredding · 01-27-2012, 03:18 PM

Quote:

Originally Posted by KevinH

My testcase did freak out the split code. I printed the "ofs" and "it" values used in deletesectionrange and ended up with an invalid ofs (it is negative), possibly because one of the original start values was 0xffffffff? I am not sure.

I just noticed that I should have checked the offsets being adjusted for the mobi8 file, so that could have been the problem. I've attached an updated mobi_split.py with all the fixes noted so far.

EDIT: found a couple of problems in deletesectionrange and updated the attachment to fix them

KevinH · 01-27-2012, 03:24 PM

Hi Nick,

Bug was ...

Code:

dataout = dataout[:first_pdb_record+i*8]+\
                          struct.pack('>L',ofs)+struct.pack('L',it)+\
                          dataout[first_pdb_record+i*8+8:]

needs a '>L', not a plain 'L', to keep things properly big-endian

Code:

dataout = dataout[:first_pdb_record+i*8]+\
                          struct.pack('>L',ofs)+struct.pack('>L',it)+\
                          dataout[first_pdb_record+i*8+8:]

Once I made that change, my earlier error message went away.

Take care,

KevinH
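
Some background on why the prefix matters: without a byte-order character, struct uses the platform's native byte order, size and alignment, so a plain 'L' can be 8 bytes wide on some 64-bit systems, whereas '>L' is always 4 bytes and big-endian. The sketch below packs one 8-byte entry the way the corrected line does; pack_entry and field_sizes are hypothetical names used only for illustration.

```python
import struct

def pack_entry(ofs, it):
    """Pack ofs and it as two big-endian unsigned 32-bit integers (8 bytes)."""
    return struct.pack('>L', ofs) + struct.pack('>L', it)

def field_sizes():
    """Return (size of native 'L' on this platform, size of standard '>L')."""
    return struct.calcsize('L'), struct.calcsize('>L')
```

On a platform where field_sizes() returns (8, 4), the original line would have written 12 bytes into an 8-byte slot.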

nickredding · 01-27-2012, 03:30 PM

Quote:

Originally Posted by KevinH

Hi Nick,

Bug was ...

Code:

dataout = dataout[:first_pdb_record+i*8]+\
                          struct.pack('>L',ofs)+struct.pack('L',it)+\
                          dataout[first_pdb_record+i*8+8:]

needs a '>L', not a plain 'L', to keep things properly big-endian

Code:

dataout = dataout[:first_pdb_record+i*8]+\
                          struct.pack('>L',ofs)+struct.pack('>L',it)+\
                          dataout[first_pdb_record+i*8+8:]

Once I made that change, my earlier error message went away.

Take care,

KevinH

Yes, I just discovered that, plus another issue, so get the new mobi_split.py from my previous message. Both issues are fixed, as well as the others you mentioned.

nickredding · 01-27-2012, 07:18 PM

attached, with fixes by Kevin and myself

KevinH · 01-29-2012, 02:26 PM

Hi All,

We have fixed a few more bugs for mobi_unpack.py and added a GUI front-end for people who do not like or know how to use command lines.

On Windows, this program requires a proper and full install of ActiveState Active Python 2.7.X (free community edition) to get the proper graphical user interface widgets installed.

On Linux and Mac, your machines should work out-of-the-box.

1. Download the attached Mobi_Unpack.zip

2. Unzip it (right-click and "Extract All" in Windows)

3. Inside the newly extracted Mobi_Unpack folder

double-click Mobi_Unpack.pyw

4. In the window that pops up:

- Hit the first Browse... button and select your input mobi ebook file

- Hit the second Browse... button and select a destination folder for the unpacked files

- If you want to split combination mobis, examine the raw markup language, or turn on verbose debugging, check the appropriate boxes

- Hit the "Start" button

The unpacking will start and progress messages and any errors will be indicated in the scrollable Log window. If you run into problems, this Log output may be useful in finding and fixing the issue.

Then look in your destination folder for the results.

Please give it a try and let us know if you run into any bugs.

Last edited by KevinH; 02-12-2012 at 02:11 PM. Reason: remove old version to prevent confusion
