[Old Thread] HTML to MOBI for Kindle - Page 2

KevinH · 04-15-2011, 02:50 PM

Hi Kovid

I was basing that on the parsing done by mobiunpack.py to get the starting offset of each section. The difference in starting offsets determines the section length.

Since section 0 contains the extended header, its size is the difference between the starting positions of section 0 and section 1.

For my test case under Calibre this provides:

going to load section 0 now
loading section 0
before: 2912 and after: 3472

as the starting and ending offsets. This gives a size of 3472 - 2912 = 560 bytes for section 0, which holds the extended header.

For my test case under KindleGen this provides:

loading section 0
before: 3816 and after: 12484

as the starting and ending offsets. This gives a size of 12484 - 3816 = 8668 bytes.

Perhaps there is a bug in how mobiunpack.py handles sections, but if you open the KindleGen-produced book in emacs, you can see almost 8000 bytes of nulls right where it says they should be.

Here is the code snippet that does the sectioning in mobiunpack.py (for what it is worth).

Code:

class Sectionizer:
        def __init__(self, filename, perm):
                self.f = file(filename, perm)
                header = self.f.read(78)
                self.ident = header[0x3C:0x3C+8]
                self.num_sections, = struct.unpack_from('>H', header, 76)
                print "number of sections ", self.num_sections
                sections = self.f.read(self.num_sections*8)
                self.sections = struct.unpack_from('>%dL' % (self.num_sections*2), sections, 0)[::2] + (0xfffffff, )
                for z in xrange(self.num_sections):
                        print z, " ", self.sections[z]

        def loadSection(self, section):
                print "loading section ", section
                before, after = self.sections[section:section+2]
                print "before: ", before, " and after: ", after
                self.f.seek(before)
                return self.f.read(after - before)
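
The snippet above is Python 2. A minimal Python 3 sketch of the same idea follows, assuming the usual Palm Database layout that the original code relies on (78-byte header, record count at offset 76, 8 bytes per table entry). The method name `section_length` and the choice to end the last section at end of file are illustrative assumptions, not part of mobiunpack.py.

```python
import struct


class Sectionizer:
    """Python 3 sketch of the section table reader (not the mobiunpack.py code)."""

    def __init__(self, filename):
        with open(filename, 'rb') as f:
            self.data = f.read()
        # The PDB header is 78 bytes: ident at 0x3C, record count at offset 76.
        self.ident = self.data[0x3C:0x3C + 8]
        (self.num_sections,) = struct.unpack_from('>H', self.data, 76)
        # Each table entry is 8 bytes; the first 4 bytes are the starting offset.
        fields = struct.unpack_from('>%dL' % (self.num_sections * 2), self.data, 78)
        # Assumption: the last section runs to the end of the file.
        self.sections = fields[::2] + (len(self.data),)

    def section_length(self, section):
        # Length is the difference between consecutive starting offsets.
        before, after = self.sections[section:section + 2]
        return after - before

    def load_section(self, section):
        before, after = self.sections[section:section + 2]
        return self.data[before:after]
```

Calling `Sectionizer('book.mobi').section_length(0)` on a real file should give the same "after minus before" figure that the log lines above report for section 0.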

KevinH · 04-15-2011, 03:02 PM

Hi Kovid,

Could your term "record 0" and my term "section 0" be talking about different things? I think your record 0 includes everything up to where my section 0 starts. Mobiunpack numbers sections starting from 0 and does not offset them by 1.

According to mobiunpack for my test case (a randomly chosen epub converted to mobi) my section 0 starts at 2912 (Calibre version) and at 3816 (KindleGen version), which are quite close, and both values are greater than your record 0. But the next section does not begin until 3472 for the Calibre version versus 12484 for the KindleGen version.

Mobiunpack starts to read the code for the extended header 16 bytes inside of section 0 (provided above).

Perhaps the code just references things in different ways? When I talk about section 0, I am talking about where the extended header information is stored in the file. I think that must be record "1" in your code? I will grab the calibre src and take a look to see.

KevinH

kovidgoyal · 04-15-2011, 03:12 PM

Do you mean the length of the EXTH header?

By record 0, I just mean the first record in the Palm Database.

record 0 contains a Palmdoc header, a MOBI header and an EXTH header, for details, see: https://wiki.mobileread.com/wiki/MOBI

KevinH · 04-15-2011, 03:26 PM

Hi Kovid,

Section 0 in mobiunpack terms is where the EXTH information is stored. I am not exactly referring to the size of the exth information itself but instead to the size of the section where the exth info is stored.

I will stare at the Calibre mobi code and try to see where we differ here.

kovidgoyal · 04-15-2011, 03:43 PM

I'm attaching a trivial test.mobi file. Can you tell me what the length of your section 0 is in this file.

KevinH · 04-15-2011, 03:50 PM

Hi Kovid,

Here is what Mobiunpack says about your test.mobi file:

Unpacking Book ...
number of sections 6

Section Number, Starting Offset
0, 128
1, 2580
2, 2754
3, 2756
4, 2792
5, 2836

# Prints this just before it parses the exth header info
going to load section 0 now
loading section 0
before: 128 and after: 2580

# Offsets within this section
book title offset 384
offset to start of extended header 248
extended header length 132
extended header num_items 5

# Metadata keys and values stored in exth
Creator -> Unknown
Updated Title -> t
ASIN -> fa943046-2dcc-4be7-8ac0-888ef7e626ae
501 -> 45424f4b
Published -> 2011-04-15T16:50:21.482609+00:00

That means the length of my section 0 for this file is 2580 - 128 = 2452 bytes
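
The same subtraction can be applied to the whole offset listing. This is only a worked example using the starting offsets printed above; the length of the last section is left out because the file size is not given in this post.

```python
# Starting offsets reported by mobiunpack for test.mobi (copied from the listing above).
offsets = [128, 2580, 2754, 2756, 2792, 2836]

# Each section's length is the gap to the next section's starting offset.
# The last section is skipped because the total file size is not known here.
lengths = [after - before for before, after in zip(offsets, offsets[1:])]

for number, length in enumerate(lengths):
    print(number, length)
```

The first value printed is the 2452 bytes for section 0; the remaining sections come out at 174, 2, 36 and 44 bytes.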

kovidgoyal · 04-15-2011, 04:10 PM

That is what I would expect. Can you attach the MOBI you generated?

kovidgoyal · 04-15-2011, 04:12 PM

I think the size you are referring to is the number of extra bytes in section 0 after the end of the EXTH header? I ask because the length of a calibre-produced section 0 is always at least 2452 bytes.
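
To make "extra bytes after the end of the EXTH header" concrete, here is a rough sketch that measures it for a given record 0. It assumes the layout described on the MobileRead wiki: a 16-byte PalmDOC header, then a MOBI header whose length is stored at offset 20, then an EXTH header whose length field sits 4 bytes after its `EXTH` marker. The function name `bytes_after_exth` is made up for this example.

```python
import struct


def bytes_after_exth(record0):
    """Return how many bytes of record 0 follow the end of the EXTH header."""
    # Assumed layout: 16-byte PalmDOC header, then the MOBI header.
    if record0[16:20] != b'MOBI':
        raise ValueError('record 0 has no MOBI header')
    (mobi_length,) = struct.unpack_from('>L', record0, 20)
    # Assumed: the EXTH header starts right after the MOBI header.
    exth_start = 16 + mobi_length
    if record0[exth_start:exth_start + 4] != b'EXTH':
        raise ValueError('record 0 has no EXTH header')
    (exth_length,) = struct.unpack_from('>L', record0, exth_start + 4)
    return len(record0) - (exth_start + exth_length)
```

With the test.mobi figures quoted earlier (section length 2452, EXTH starting at offset 248, EXTH length 132) this would come to 2452 - 380 = 2072 bytes. Note that this count also includes the book title stored at offset 384, not only null padding.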

Hitch · 04-15-2011, 04:22 PM

Quote:

Originally Posted by ldolse

@Hitch, I don't think anyone disbelieves you, I think the OP is just getting a bit confused between the discussion of formatting vs. DRM.

@eggheadbooks1, what you're seeing in the Kindle Previewer is exactly what you'll see on a real Kindle (note that's not true of KindleforPC/KindleforMac). People use Calibre generated mobis all the time on real Kindles, and many have noted the same thing you noted in your testing, namely that Calibre generally does a more accurate conversion from epub to mobi than Amazon's own tools. AZW and Mobi are exactly the same to the best of my knowledge, Amazon just changes the extension, and the DRM is very slightly changed over mobipocket's original DRM scheme. If you're not planning on publishing to Amazon you don't need to worry about the rest of the discussion.

Hi, Idolse:

No, she's not confused. I was quite clear about the difference in the KDP-forum thread, and I'm pretty sure the "...some forum user who may or may not have his own agenda" crack is fairly precise language--and the OP is, after all, an author (who has published several discussions about her expertise in Kindle formatting, as well as blog articles, etc. all in the last 2 weeks on the KDP forums). Pile on the authoritative "I for one have never heard of this" and Kovid's comment about Amazon making a "ridiculous excuse" and I could either appear stupid, ill-informed OR someone who has an agenda, and I simply want to stop this in its tracks before it goes further.

I also told her that mobi and azw are precisely the same (along with prc, for that matter), and that azw is naught more than--literally--a MBPC-generated book with a proprietary file extension, and that the mobi/prc/azw extension had absolutely nothing to do with the issue. The entire discussion came about because the OP does not wish to use CSS to eliminate Kindle's default first-line indent on paragraphs, and announced to the KDP that using Calibre "fixed" the issue. I didn't want some poor noob author using Calibre to produce a book and then have books returned by unhappy users (as did my client, way back when this came up), because everyone on that list is publishing to the KDP, everyone is doing commercial production, even if it is only one book at a time. So I responded on that list, in an attempt to head off someone else's misfortune. I shouldn't have let snark get up my nose, here on MR, but it's been a long few weeks here at Booknook, so my "diplomacy-low light" is flashing. ;-)

@Kovid: and yes, I reported there as well that you were not motivated to chase it down. The OP retorted that she'd ask you herself. So I guess you've been asked, and now replied on the topic a second time.

Thanks, @Idolse! @Kovid, nice to see ya.

Hitch

KevinH · 04-15-2011, 04:36 PM

Quote:

Originally Posted by kovidgoyal

I think the size you are referring to is the number of extra bytes in section 0 after the end of the EXTH header? I ask because the length of a calibre-produced section 0 is always at least 2452 bytes.

Hi Kovid,

Yes, this is the extra size in section 0 after the EXTH header. However, my version of calibre did not allocate 2452 bytes at all. Is that new in 0.7.55?

I privately emailed you the test2.mobi file created via Calibre since the book I randomly chose was a commercial ebook (and therefore should not be posted in the forum).

To be specific, to create that test2.mobi I used Calibre on a Mac (0.7.54) to import a non-drm .epub file. I then converted it to .mobi via Calibre (using the default settings only) and then used the "Save to Disk" to save the .mobi and .epub to my hard drive where I copied the resulting .mobi out and renamed it to test2.mobi which I sent to you.

Either way, the KindleGen version of the book has a total section 0 size of 8668 bytes. This allows them to pretty much extend or change the exth section in any way they want without having to change any of the other records in the pdb format.

If you want the KindleGen version, please let me know and I will e-mail it to you (it is twice the size).

Thanks,

KevinH

kovidgoyal · 04-15-2011, 04:43 PM

I think what's happening is the set MOBI metadata code in calibre is truncating record 0 to the minimum possible size. You can confirm this by not using save to disk to get the MOBI out of the calibre library.

KevinH · 04-15-2011, 04:56 PM

Quote:

Originally Posted by kovidgoyal

I think what's happening is the set MOBI metadata code in calibre is truncating record 0 to the minimum possible size. You can confirm this by not using save to disk to get the MOBI out of the calibre library.

Yes, you are correct.

Here are the section starting offsets in the output if I simply copy the .mobi file manually out of the Calibre Library.

Unpacking Book ...
number of sections 354
0 2912
1 5364
2 6548
3 7612
4 8676
5 9736
6 10799
7 12194
8 14307

So in this case Section 0 starts at 2912 and ends at 5364 for a size of exactly the 2452 bytes you said it would be.

Perhaps the Amazon rejection came because of using "Save to Disk", which made the exth region so tight that there was not enough space to easily change it without rewriting the entire file.

My guess is they keep exactly one binary DRM image of each book and write the EXTH info on the fly to correspond to the version of Kindle you have (K4Mac, K4PC, etc) in order to encode the PID information needed for that reader/platform/device.

kovidgoyal · 04-15-2011, 05:04 PM

If that's the case, I'm really surprised at Amazon. Do they lack the ability to write a simple script to insert space into a MOBI record0 and update the offsets table accordingly? calibre has been able to do that for *years*.

Oh well, I'll have the next release of calibre guarantee that there is always 8KB worth of null bytes immediately after the exth header in record0, when creating MOBI files and when updating mobi metadata.
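
As an illustration of "insert space into record0 and update the offsets table", here is a simplified sketch. It is not calibre's code. It assumes the standard PDB record table (record count at offset 76, 8-byte entries starting at offset 78) and, to keep things short, appends the null bytes at the end of record 0 rather than immediately after the EXTH header; inserting them right after EXTH would presumably also mean adjusting the title offset stored in the MOBI header. The name `pad_record0` is hypothetical.

```python
import struct


def pad_record0(data, extra):
    """Return a copy of a PDB/MOBI file with `extra` null bytes appended to record 0."""
    (num_records,) = struct.unpack_from('>H', data, 76)
    if num_records < 1:
        raise ValueError('file has no records')
    out = bytearray(data)
    # Shift the starting offset of every record that comes after record 0.
    for i in range(1, num_records):
        pos = 78 + 8 * i
        (offset,) = struct.unpack_from('>L', out, pos)
        struct.pack_into('>L', out, pos, offset + extra)
    # Record 0 ends where record 1 starts, or at end of file if it is the only record.
    if num_records > 1:
        (end0,) = struct.unpack_from('>L', data, 78 + 8)
    else:
        end0 = len(data)
    # Insert the padding at the old end of record 0.
    out[end0:end0] = bytes(extra)
    return bytes(out)
```

For the 8KB figure mentioned above, the call would be `pad_record0(data, 8192)`, where `data` is the raw bytes of the file.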

KevinH · 04-15-2011, 05:19 PM

Hi Kovid,

Quote:

Originally Posted by kovidgoyal

If that's the case, I'm really surprised at Amazon. Do they lack the ability to write a simple script to insert space into a MOBI record0 and update the offsets table accordingly? calibre has been able to do that for *years*.

Oh well, I'll have the next release of calibre guarantee that there is always 8KB worth of null bytes immediately after the exth header in record0, when creating MOBI files and when updating mobi metadata.

I am not sure if that is the reason or not. All of this is just a guess on my part! And one I can't even test since I do not publish on Amazon (or anywhere else!).

Assuming that encoding a book with DRM is costly on a large scale, I would only encrypt each book once and then have extra space in record 0 where I could write the real encryption key encrypted with a set of PIDs and the metadata values needed to figure out at least one of those PIDs based on information stored on the device itself, and my own registration information, as well as metadata information taken from the book itself.

Then a simple rewrite of section 0 can handle all devices/apps/readers and no other changes need be made.

As you said, it is very easy to redo the offset tables so all of this might just be a waste of time!

KevinH

KevinH · 04-15-2011, 05:30 PM

Hi Kovid,

I ran mobiunpack on 3 encrypted Kindle for Mac books and of course it failed (since the books were encrypted) but the section 0 offsets did get printed and every one of the 3 was over 8000 bytes (8756, ...) long.

Maybe they were all generated by KindleGen? So it looks like 8K of null bytes added to the end is a safe value? Then again, I may be way off base here.

Take care,

KevinH
