View Full Version : Test oeb2mobi output for me?


llasram
01-06-2009, 09:40 AM
I think I'm getting close with oeb2mobi, but I'd like to have some people with actual Mobipocket-supporting devices give a book a test and provide feedback. The output I'm getting now looks pretty good (if I may say so myself) in Mobipocket Desktop and is legible in the Palm version (run under POSE), with the following caveats:


I'm unsure about the cover and general image file sizes. Mobipocket Desktop seems to display them regardless of file size, while the Palm version seems not to display them, regardless of file size.
Table support. I'm currently turning CSS tables into Mobipocket tables, but I'm not sure if that's the right thing to do. They seem to look fine when viewed moving forward through the text, but get all smashed together when viewed moving backwards. If the table of contents in the attached book is illegible on most devices, I should probably stick to turning explicit HTML tables into tables, with the option to rasterize.


Thanks in advance! :)

tompe
01-06-2009, 10:02 AM
I'm unsure about the cover and general image file sizes. Mobipocket Desktop seems to display them regardless of file size, while the Palm version seems not to display them, regardless of file size.



The Palm version should display images. They have to be less than 64k and I think they have to be in the correct format also. jpg worked if I remember correctly.

Jellby
01-06-2009, 10:25 AM
The Cybook has no problem with the cover, and from other files I've tried, it has no problem with images larger than 64kB. It also will resize large images "on the fly" to make them fit in the screen. Other devices may have problems, that's why I'm now uploading illustrated books in two versions, one with the "original" images, and another with images further compressed to take less than 64kB. Also, they have to be in JPG or GIF formats, not in PNG, apparently.

As for the table, the Cybook renders it (but no nested tables, I believe). The problem I see is that it does not cut it at "whole" lines. For instance, in the first page of Contents I see up to the top half of line "12" (but this line is complete in the second page), the third page includes the very top of line "53", and the fourth page has the very bottom of line "52". Nothing is lost, but it's ugly.

Ah... and the "header" line shows the author as "Alexandre Dumas, p& #232;re" (no space after &), I guess it does not like HTML entities there (I regularly use latin1 characters without problems).

pdurrant
01-06-2009, 10:35 AM
Images in Mobipocket files need to be under around 63KiB in order to be displayed on Palm versions of the Mobipocket reader.

See http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=images.htm

The author name in the EXTH records uses the entity &#232; for è. This doesn't get displayed correctly in the CyBook library, and it also doesn't get displayed properly in the Desktop Mobipocket Reader properties dialog (right-click/Properties), although it does in the preview. I've tested changing this to be encoded as UTF-8 (i.e. hex C3 A8 for è), and this displays correctly in both places in the Desktop reader, and also in the Cybook library display. So I think it would be best to encode all the text in EXTH records as UTF-8, if that's the encoding specified in the MOBI header.
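As a rough illustration of the difference between the two encodings, here is a minimal Python sketch. It is not oeb2mobi's actual code; the variable names are hypothetical, and it assumes the metadata arrives as a string that still contains the numeric character reference.

```python
import html

# Hypothetical metadata value as it ended up in the test book.
author_with_entity = "Alexandre Dumas, p&#232;re"

# Decode the numeric character reference into the real character...
author = html.unescape(author_with_entity)

# ...then encode as UTF-8, which is what the suggestion above amounts to
# when the MOBI header declares UTF-8.
exth_bytes = author.encode("utf-8")

print(author)
print(exth_bytes[-4:].hex())  # the last four bytes: e-grave (2 bytes), "r", "e"
```

The e-grave becomes the two bytes C3 A8, and no ampersand or semicolon is left in the stored value.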

The table of contents displays OK in the Cybook, but it's a bit odd to have two links for each chapter - one on the number and one on the title. It would be better, given the navigation limitations, to have either one link covering both, or just a link on the title of each chapter.

In addition, the chapter numbers aren't lining up exactly - and some of them have the left side of the first number clipped. I'm using a non-standard font (Fontin), but I see that Georgia shows the same fault, and Verdana even more - I suspect it's just a table fault - not really enough space for the two numerals and full stop for the higher chapter numbers. This problem doesn't show up in the Windows Desktop reader.

But otherwise, on a brief examination, it seems to be a very nicely formatted and well-formed Mobipocket book. I like the care taken on the first paragraph of each chapter, and the indents and (lack of) spacing of other paragraphs.

Are there any particular sections you'd like checked?

llasram
01-06-2009, 11:53 AM
The Cybook has no problem with the cover, and from other files I've tried, it has no problem with images larger than 64kB.

Images in Mobipocket files need to be under around 63KiB in order to be displayed on Palm versions of the Mobipocket reader.

Ah, ok. I thought there was a different size limit for covers, but I probably just missed a warning from mobigen when it couldn't reencode the cover image below 64k. If it's just a PalmOS limitation, then I'll make image file size reduction an option rather than required.

Ah... and the "header" line shows the author as "Alexandre Dumas, p& #232;re" (no space after &), I guess it does not like HTML entities there (I regularly use latin1 characters without problems).

The author name in the EXTH records uses &#232; for è. This doesn't get displayed correctly in the CyBook library, and it also doesn't get displayed properly in the Desktop Mobipocket Reader properties dialog

Ah, oops. That's my fault, and easily enough fixed.

The table of contents displays OK in the Cybook, but it's a bit odd to have two links for each chapter - one on the number and one on the title. It would be better, given the navigation limitations, to have either one link covering both, or just a link on the title of each chapter.

The reason for the "two links" is that the source mark-up uses the CSS 2 'display: table-row' and 'display: table-cell' to create a virtual table where the <a/> element is a row. In CSS 2 compliant renderers the whole contents line is a single link. The only sane way to handle that sort of thing in Mobipocket markup is, alas, to reproduce the link in each block-level element the source <a/> element encloses.

As for the table, the Cybook renders it (but no nested tables, I believe). The problem I see is that it does not cut it at "whole" lines. For instance, in the first page of Contents I see up to the top half of line "12" (but this line is complete in the second page), the third page includes the very top of line "53", and the fourth page has the very bottom of line "52". Nothing is lost, but it's ugly.

Eww. Ok, I'll ignore 'display: table-*' then.

Are there any particular sections you'd like checked?

Other than what I mentioned I mostly just wanted to make sure the file actually worked. Maybe check to make sure the "uncrossable" sections like the footnotes and CC license work properly. And ogle the poetry bits (e.g. in chapter 26), which in the source are formatted entirely with CSS :).

Jellby
01-06-2009, 12:00 PM
And ogle the poetry bits (e.g. in chapter 26), which in the source are formatted entirely with CSS :).

I believe mobipocket supports a special <p align="poetry">, where if a line is wrapped, the broken part is right-aligned... but last time I tried it didn't work in the Cybook.

When I code poetry, I do something like:

<P WIDTH="2em">Yew are ther boys of the Empire,</P>
<P WIDTH="4em" HEIGHT="0em">Steady an’ brave an’ trew.</P>
<P WIDTH="2em" HEIGHT="0em">Yew are the wuns</P>
<P WIDTH="4em" HEIGHT="0em">She calls ’er sons</P>
<P WIDTH="2em" HEIGHT="0em">An’ I luv yew.</P>

It's a pain to have a <p></p> for every line, but at least I can control the indent.

llasram
01-06-2009, 12:19 PM
It's a pain to have a <p></p> for every line, but at least I can control the indent.

My strategy is somewhat different. The source mark-up uses a combination of 'display: block' and 'display: inline' elements with various margins to produce something that is at least recognizable poetry even with CSS-less rendering. My conversion process turns all 'margin-left' properties of block-level elements into levels of nested <blockquote/> tags and 'margin' properties of inline elements into non-breaking spaces. I need to test it with more mark-up samples, but it works pretty well with what I've thrown at it so far.

Unfortunately, it is still a case of "as close as possible"... Even though I figured out how to do hanging indents in converted markup, the Mobipocket renderer doesn't allow a block with a hanging indent to have a left margin. And yet another quirk of the Mobipocket renderer is that when you have something like <blockquote width="0pt"><font size="2">Text....</font></blockquote> then the second and later lines in the "Text...." are indented less than the first line. This seems to be because <blockquote/> is indenting by "1em" -- so the first line is indented at the default font size, then lines which begin within the <font/> tag are indented at the <font/>-specified font-size.
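The margin-to-markup strategy described in this post could look roughly like the following Python sketch. It is not the real oeb2mobi code: the function names are hypothetical, and both the one-<blockquote/>-per-em rule and the `spaces_per_em` value are assumptions made only for illustration.

```python
def indent_block(html_fragment: str, margin_left_em: float) -> str:
    """Wrap a block-level fragment in nested <blockquote> tags.

    Assumes one <blockquote> level per em of left margin (hypothetical rule).
    """
    levels = max(0, round(margin_left_em))
    return "<blockquote>" * levels + html_fragment + "</blockquote>" * levels


def pad_inline(text: str, margin_left_em: float, spaces_per_em: int = 2) -> str:
    """Approximate an inline element's left margin with non-breaking spaces.

    spaces_per_em is a guess; the real ratio depends on the renderer's font.
    """
    return "&nbsp;" * max(0, round(margin_left_em * spaces_per_em)) + text


print(indent_block("<p>Yew are ther boys of the Empire,</p>", 2))
print(pad_inline("She calls 'er sons", 1))
```

Because a <blockquote/> indents relative to the current font size, nesting a <font/> tag inside it would still show the second-line quirk described above.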

nrapallo
01-06-2009, 01:02 PM
I tried to convert your .mobi creation to .imp using my Mobi2IMP which is based on tompe's mobi2html and extracted a complete, but somewhat corrupt .html. See below .zip.

I also ran Three_Musketeers,_The.mobi through tompe's latest mobi2html and extracted the below .zip for you to see.

The links (filepos) have a problem and there is some corruption in the TOC.

Please check your source .html to make sure they are not so corrupt.

nrapallo
01-06-2009, 01:09 PM
Images in Mobipocket files need to be under around 63KiB in order to be displayed on Palm versions of the Mobipocket reader.

:offtopic: :) (and not personally addressed to pdurrant {the messenger...})



Why wouldn't a mobi2mobi with the gen3imagefix take care of the image size restrictions for those users that need to use PDAs to display ebooks? I think the larger size can easily be "downconverted" to the 63K image size restricted version, thereby only necessitating one .mobi upload.

Is there any reason to still cater to PDA users? (Sorry, but my Sony TH55 has never been my main ebook viewer.) :dunno:

llasram
01-06-2009, 01:39 PM
Please check your source .html to make sure they are not so corrupt.

The issue with the links is a detail of the way I'm generating beginning-of-file links in oeb2mobi. The Mobipocket renderers seem to handle them fine, but I'll make sure it generates something Mobiperl's mobi2html can handle cleanly.

llasram
01-06-2009, 08:08 PM
The issue with the links is a detail of the way I'm generating beginning-of-file links in oeb2mobi. The Mobipocket renderers seem to handle them fine, but I'll make sure it generates something Mobiperl's mobi2html can handle cleanly.

Actually, I take it back. The Mobiperl 'mobi2html' errors with the Mobipocket book I've generated appear to be errors with Mobiperl's handling of UTF-8 encoded books. With UTF-8 encoding, each text record is followed by 0 or more "overlapping" bytes finishing the current multibyte character, plus an 8-bit integer count of the overlapping bytes as an additional byte. These additional bytes are not counted as part of the content length for the purposes of computing the "filepos" of link targets.

tompe
01-06-2009, 08:44 PM
Actually, I take it back. The Mobiperl 'mobi2html' errors with the Mobipocket book I've generated appear to be errors with Mobiperl's handling of UTF-8 encoded books. With UTF-8 encoding, each text record is followed by 0 or more "overlapping" bytes finishing the current multibyte character, plus an 8-bit integer count of the overlapping bytes as an additional byte. These additional bytes are not counted as part of the content length for the purposes of computing the "filepos" of link targets.

Well, UTF-8 encoded books do not seem to be so common so I have not noticed this before. Why use UTF-8 and not entities?

Coding UTF-8 or other character set related things is probably the most boring thing I know. When I started using computers there were problems with Swedish characters and ISO 8859-1 support in programs. It is so depressing to have similar kinds of problems 25 years later...

kovidgoyal
01-06-2009, 08:56 PM
Well, UTF-8 encoded books do not seem to be so common so I have not noticed this before. Why use UTF-8 and not entities?

Coding UTF-8 or other character set related things is probably the most boring thing I know. When I started using computers there were problems with Swedish characters and ISO 8859-1 support in programs. It is so depressing to have similar kinds of problems 25 years later...

You shouldn't need to do anything codec related directly. Surely Perl has builtin codecs for coding/decoding UTF-8, etc?

DaleDe
01-06-2009, 09:00 PM
Well, UTF-8 encoded books do not seem to be so common so I have not noticed this before. Why use UTF-8 and not entities?

Coding UTF-8 or other character set related things is probably the most boring thing I know. When I started using computers there were problems with Swedish characters and ISO 8859-1 support in programs. It is so depressing to have similar kinds of problems 25 years later...

Well, if that depresses you then you must really be depressed by the recent Zune fiasco. How long have we known about leap year? It was present in the Julian calendar in 70 bce!

Dale

llasram
01-06-2009, 09:10 PM
Well, UTF-8 encoded books do not seem to be so common so I have not noticed this before. Why use UTF-8 and not entities?

I might... I was just trying to implement support for this in Calibre's mobi2oeb (which doesn't have it either :)) and I realized that I don't have the description of what it's doing quiiiite right. Ugh.

llasram
01-06-2009, 09:15 PM
I might... I was just trying to implement support for this in Calibre's mobi2oeb (which doesn't have it either :)) and I realized that I don't have the description of what it's doing quiiiite right. Ugh.

Ah! It's only the 2 (or maybe 3 or 4...?) low bits which are used for the overlap data. Mobigen sets higher bits to mean... er... something else -- and don't I wish I knew what.
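Going only by the description in this thread (text, then the overlapping bytes, then a final byte whose low 2 bits hold the overlap count), removing the multibyte trailing entry might be sketched in Python as below. This is an illustrative sketch, not Calibre's or Mobiperl's code; the function name is hypothetical, it assumes the multibyte-overlap flag (0x1) is the only extra data flag set, and it ignores the higher bits of the final byte, whose meaning is unknown here.

```python
from typing import Tuple


def strip_multibyte_overlap(record: bytes) -> Tuple[bytes, bytes]:
    """Split a decompressed text record into (text, overlap_bytes).

    Assumed layout: text | overlap bytes | 1 byte of size and flags.
    """
    if not record:
        return record, b""
    count = record[-1] & 0x03        # low 2 bits: number of overlap bytes
    end = len(record) - 1            # position of the size-and-flags byte
    start = end - count
    if start < 0:
        raise ValueError("record too short for its declared overlap")
    return record[:start], record[start:end]


# A record whose text stops in the middle of a two-byte character:
print(strip_multibyte_overlap(b"caf\xc3" + b"\xa9" + b"\x01"))
# A record with no split character just ends in a NUL count byte:
print(strip_multibyte_overlap(b"abc\x00"))
```

Only the first element should be counted when computing "filepos" offsets, since the trailing bytes are not part of the content length.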

wallcraft
01-06-2009, 09:41 PM
The table of contents displays OK in the Cybook, but it's a bit odd to have two links for each chapter - one on the number and one on the title. It would be better, given the navigation limitations, to have either one link covering both, or just a link on the title of each chapter.

In addition, the chapter numbers aren't lining up exactly - and some of them have the left side of the first number clipped.

I get the same thing on the EZ Reader (Hanlin V3). The TOC menu item just takes you to the in-lined TOC, and then you follow standard links. There are 2 links per TOC entry, and the TOC page cuts are not clean. See the 2nd and 3rd attached image (scans of the device). The 1st image shows the cover page, with the bottom obscured by the system footer. This is primarily a limitation of the Hanlin's MOBI image processing, but I think an image smaller than 600x800 would be better (best for the Kindle would be 525x640, although on the Kindle the image is resized automatically). The 4th scan shows the half page left at the start of each Chapter, but the Chapter start looks right.

The last scan is from OpenInkPot (FBReader based), it does not honor the CSS for a chapter start, and the text starts with all caps. In OpenInkPot the TOC works either via following links from the in-lined TOC or by bringing up the TOC menu item and selecting the chapter from a list. This is the same behavior you see with Desktop FBReader.

tompe
01-06-2009, 09:46 PM
You shouldn't need to do anything codec related directly. Surely Perl has builtin codecs for coding/decoding UTF-8, etc?

Yes, but the character count is in the raw data and I do not know if there is support for this. I cannot decode it and then count the positions.

tompe
01-06-2009, 09:52 PM
Ah! It's only the 2 (or maybe 3 or 4...?) low bits which are used for the overlap data. Mobigen sets higher bits to mean... er... something else -- and don't I wish I knew what.

I am waiting for the description on the Wiki before thinking about supporting this:)

I thought UTF-8 was standardized and could only be done one way...

kovidgoyal
01-06-2009, 09:56 PM
Yes, but the character count is in the raw data and I do not know if there is support for this. I cannot decode it and then count the positions.

You can still do this (albeit rather inefficiently) by converting to whatever universal encoding you use, then iterating character by character and re-encoding each character into utf-8 to see how many bytes it takes.
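The approach suggested here is shown below in Python rather than Perl, purely as an illustration. The function name is hypothetical; it walks the decoded text, re-encodes each character as UTF-8, and stops once the running byte count reaches the byte offset.

```python
def char_index_for_byte_offset(text: str, byte_offset: int) -> int:
    """Return the index of the character that starts at byte_offset.

    If byte_offset falls inside a multibyte character, the index of the
    following character is returned.
    """
    consumed = 0
    for index, ch in enumerate(text):
        if consumed >= byte_offset:
            return index
        consumed += len(ch.encode("utf-8"))
    return len(text)


sample = "p\u00e8re<a>"               # the e-grave takes two bytes in UTF-8
idx = char_index_for_byte_offset(sample, 5)
print(idx, sample[idx])
```

This is linear in the length of the text for every lookup, which is the inefficiency mentioned; for a book with many links, building one table of cumulative byte counts would avoid rescanning.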

tompe
01-06-2009, 10:01 PM
How do I create a correct UTF-8 encoded book? Is -unicode to mobigen enough?

llasram
01-06-2009, 10:14 PM
The 1st image shows the cover page, with the bottom obscured by the system footer. This is primarily a limitation of the Hanlin's MOBI image processing, but I think an image smaller than 600x800 would be better (best for the Kindle would be 525x640, although on the Kindle the image is resized automatically).

Mobipocket's developer documentation suggested that 600x800 was optimal, but apparently not then...

The 4th scan shows the half page left at the start of each Chapter, but the Chapter start looks right.

That's really weird... The chapter headers are created with a 'height="5em"' attribute, which in Mobipocket speak should produce a top margin of only 5 or so lines (Mobi seems to be quite fuzzy on what an "em" means). Is this happening on the Cybook etc too?

The last scan is from OpenInkPot (FBReader based), it does not honor the CSS for a chapter start, and the text starts with all caps.

Eee -- that bad? There actually isn't any CSS in the mobibook -- just Mobipocket-extended HTML 3.2. If FBReader isn't providing formatting for this then it isn't providing formatting for any HTML content. Nothing at all I can really do there.

llasram
01-06-2009, 10:15 PM
How do I create a correct UTF-8 encoded book? Is -unicode to mobigen enough?

That's what I've been doing, yah. Although you are correct that it seems to be quite, quite rare. I don't have all that many Mobipocket books, but of the ones I do have only one is built that way (one of the most recent Tor freebies).

nrapallo
01-06-2009, 11:18 PM
Actually, I take it back. The Mobiperl 'mobi2html' errors with the Mobipocket book I've generated appear to be errors with Mobiperl's handling of UTF-8 encoded books. With UTF-8 encoding, each text record is followed by 0 or more "overlapping" bytes finishing the current multibyte character, plus an 8-bit integer count of the overlapping bytes as an additional byte. These additional bytes are not counted as part of the content length for the purposes of computing the "filepos" of link targets.

*Thank you* for finally confirming my suspicion that the byte count to the filepos/link is "off" in mobi2html (and consequently in Mobi2IMP). I've had to sometimes add up to 200 extra bytes to find the "anchor" tag the filepos was referring to in my conversions from .prc to .imp. I had no idea why I had to do this and never would have thought the UTF-8 decoding could have precipitated this, but it does make awful good sense to me now that you mentioned this!

My Mobi2IMP solution (which was a brute force naive approach) was to scan forward in the uncompressed text (html) from the stated filepos position and look for the first '<' to plop the anchor (for that filepos)! 99% of the time it worked, but it was neither elegant nor foolproof!

wallcraft
01-06-2009, 11:24 PM
Mobipocket's developer documentation suggested that 600x800 was optimal, but apparently not then...

I agree that is what they suggest, but most EInk devices don't use the full screen for a MOBI image. I'm not sure about the Cybook though.

That's really weird... The chapter headers are created with a 'height="5em"' attribute, which in Mobipocket speak should produce a top margin of only 5 or so lines (Mobi seems to be quite fuzzy on what an "em" means). Is this happening on the Cybook etc too?

I don't have a Cybook, but it is happening on the iLiad; see the attached screen shots. It is treating the space as "ems" (smaller space with a smaller font), and part of the problem may be the very wide line spacing used by the Java MobiPocket Reader. However, it looks like there is extra space over 5em.

If FBReader isn't providing formatting for this then it isn't providing formatting for any HTML content. Nothing at all I can really do there.

I was not complaining; the only criterion for FBReader is not to cause strange formatting, and this is fine.

wallcraft
01-06-2009, 11:39 PM
The TOC does not work on the iLiad. It is possible to call up the TOC, but stylus tapping in the "links" does nothing. In a standard TOC this would take you to the chapter.

nrapallo
01-06-2009, 11:47 PM
The TOC does not work on the iLiad. It is possible to call up the TOC, but stylus tapping in the "links" does nothing. In a standard TOC this would take you to the chapter.

Then, is there something wrong with the original .html source or llasram's conversion to .mobi?

@llasram

Can you post your original .html source file used to create this .mobi ebook in post #1? This way we can verify which is the culprit...:)

llasram
01-07-2009, 12:07 AM
Then, is there something wrong with the original .html source or llasram's conversion to .mobi?

It's definitely my generation code. It may be because the links are in a table? But I'm ditching the CSS-table-as-Mobi-table conversion anyway.

Can you post your original .html source file used to create this .mobi ebook in post #1?

Sure -- it's my EPUB edition of The Three Musketeers (http://www.mobileread.com/forums/showthread.php?t=28200).

pdurrant
01-07-2009, 03:49 AM
The Cybook displays Mobipocket covers at full screen (so 600x800 is good for the covers on a CyBook), but not in-line images. I don't know what the optimum size is for in-line images on a CyBook, and I suspect it could change with new firmware.

I agree that is what they suggest, but most EInk devices don't use the full screen for a MOBI image. I'm not sure about the Cybook though.

Jellby
01-07-2009, 04:29 AM
The Cybook displays Mobipocket covers at full screen (so 600x800 is good for the covers on a CyBook), but not in-line images. I don't know what the optimum size is for in-line images on a CyBook, and I suspect it could change with new firmware.

I hope new firmwares (or OpenInkpot) will allow changing the margin size and even opening an in-line image in full screen. In the meantime, the Cybook resizes an image if it's too large, so it's not that bad to have 600x800 in-line images. The usable text block is around 500x650, I'd say...

tompe
01-07-2009, 09:05 AM
*Thank you* for finally confirming my suspicion that the byte count to the filepos/link is "off" in mobi2html (and consequently in Mobi2IMP). I've had to sometimes add up to 200 extra bytes to find the "anchor" tag the filepos was referring to in my conversions from .prc to .imp. I had no idea why I had to do this and never would have thought the UTF-8 decoding could have precipitated this, but it does make awful good sense to me now that you mentioned this!

My Mobi2IMP solution (which was a brute force naive approach) was to scan forward in the uncompressed text (html) from the stated filepos position and look for the first '<' to plop the anchor (for that filepos)! 99% of the time it worked, but it was neither elegant nor foolproof!

If you give me a pointer to or a file with this problem I can see if it is easy to fix.

nrapallo
01-07-2009, 09:50 AM
If you give me a pointer to or a file with this problem I can see if it is easy to fix.

Sure can!

Most Feedbooks.com Mobipocket/Kindle offerings have this problem (as they usually are UTF-8 encoded).

For example, see The Ant King and Other Stories (http://www.feedbooks.com/book/2872) ebook (.mobi).

The extracted files/results of Mobi2IMP are included in the .zip below. The below .txt file shows the DOS window output of Mobi2IMP. In particular, look at the section following:

Adding name attributes
FIXED 3: 0000026250 (6) - Wasn't an anchor: reak/><a

Note the number in parentheses i.e. 6 shows how many characters over that I had to go to find the first "<". If I could find one within the first 200-300 bytes, then I would print FIXED, otherwise I would just issue a WARNING.
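The scan-forward workaround described in these posts can be sketched in Python as follows. This is a reconstruction from the description only, not Mobi2IMP's actual code; the function name, the `max_scan` default, and the sample markup are assumptions.

```python
from typing import Optional


def find_anchor_position(html_bytes: bytes, filepos: int,
                         max_scan: int = 300) -> Optional[int]:
    """Return the offset of the first '<' at or after filepos.

    Gives up (returns None) if no '<' is found within max_scan bytes,
    which corresponds to the WARNING case.
    """
    pos = html_bytes.find(b"<", filepos, filepos + max_scan)
    return None if pos == -1 else pos


# Hypothetical sample in which filepos lands six bytes short of the anchor.
data = b'<mbp:pagebreak/><a name="c1">Chapter 1</a>'
filepos = 10                              # data[10:] starts with b"reak/><a"
fixed = find_anchor_position(data, filepos)
print(fixed, fixed - filepos)             # new position and how far it moved
```

As noted, this is a heuristic: if the real target is preceded by another tag within the scan window, the anchor lands on the wrong element.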

I've always seen this behaviour with Feedbooks.com .prc/.mobi ebooks.

llasram
01-07-2009, 10:36 AM
If you give me a pointer to or a file with this problem I can see if it is easy to fix.

It looks like Mobiperl also isn't handling the "standard" variable-width integer encoded trailing data indicated by the other bits of the extra data flags field.

pdurrant
01-07-2009, 11:21 AM
I think these must be the bytes that were messing up the first version of the Mobipocket decoder, and that the second version tried to fix without really understanding them. (The fifth version seems to handle them correctly, although I still haven't followed exactly what's going on here.)

It would be very nice to get the controlling bits and format of the trailing bytes set out clearly in the wiki...


It looks like Mobiperl also isn't handling the "standard" variable-width integer encoded trailing data indicated by the other bits of the extra data flags field.

llasram
01-07-2009, 11:55 AM
I think these must be the bytes that were messing up the first version of the Mobipocket decoder, and that the second version tried to fix without really understanding them.

Indeed -- I'm not sure the community as a whole really understands them. Calibre's rules for parsing them look to be the same as mobidedrm 0.5's, except that Calibre will ignore the extra data flags field if the MOBI header is shorter than 0xe4 bytes or *longer* than 0xe8 bytes. I vaguely recall being responsible for the test case which led to that one, but I'm now suspecting it may have been an interaction with an earlier version of mobidedrm. And neither handles bit 1 of the extra data flags, although Calibre will when Kovid gets around to pulling from lp:~llasram/calibre/staging :p.

It would be very nice to get the controlling bits and format of the trailing bytes set out clearly in the wiki...

I've got a whole ream of stuff I figured out writing oeb2mobi that I need to add to the wiki... The effect of bit 1 of extra data flags, the format of "uncrossable" boundary records, the format of the FCIS record, and the format of the index records (although I'm still working on that...).

kovidgoyal
01-07-2009, 12:06 PM
And neither handles bit 1 of the extra data flags, although Calibre will when Kovid gets around to pulling from lp:~llasram/calibre/staging :p.


Done. I've also refactored mobi2oeb to use lxml instead of BeautifulSoup for a significant speedup

tompe
01-07-2009, 01:59 PM
It looks like Mobiperl also isn't handling the "standard" variable-width integer encoded trailing data indicated by the other bits of the extra data flags field.

I will wait for the wiki description of this...

llasram
01-07-2009, 02:12 PM
I will wait for the wiki description of this...

Done (http://wiki.mobileread.com/wiki/MOBI#Trailing_entries) ;).

Elsi
01-07-2009, 02:30 PM
You may already know how this displays in the Kindle, but here are some scans of the screen. (Kindle's screenshot function wasn't working; not sure why.) I scanned @ 300dpi, then reduced the image, exported to JPG with 15% optimization, so the fuzziness is due to the scanning, not the screen itself.

On two of the images, I circled a portion of the author field that doesn't display properly. Also, I really like the way the chapters begin 1/2 way down the page & hope to get your CSS so I can apply it to the books I'm making.

llasram
01-07-2009, 02:59 PM
You may already know how this displays in the Kindle, but here are some scans of the screen.

Awesome! Thank you :). Looks pretty good... I may leave the cover generation at 600x800. Hmm...

On two of the images, I circled a portion of the author field that doesn't display properly.

Yarh... Already fixed. Although it looks like it's been re-arranged -- does the Kindle treat the ',' or '&' character specially, like trying to rearrange "Last, First" to "First Last"?

Also, I really like the way the chapters begin 1/2 way down the page & hope to get your CSS so I can apply it to the books I'm making.

Heh. I think that's actually a bug. It's supposed to be only 5 or so lines down, and shows up that way in Mobipocket Desktop. I'm actually hoping I'll have fixed it when I post another build of the file :). But if that's what you want, you should be able to achieve the effect with a 'margin-top: 50%' property on your chapter headers.

Jellby
01-07-2009, 03:04 PM
On two of the images, I circled a portion of the author field that doesn't display properly.

Funny, it seems it takes the semicolon as a separator for different authors, and the comma as the separator between first and last name, so:

Alexandre Dumas, p&#232;re

(considering that the entity is not decoded, it has a semicolon then) is parsed as:

First author: p (first name) Alexandre Dumas (last name)
Second author: re (last name)
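The guess above about how the author field gets split can be written out as a small Python sketch. This only models the guess; how the Kindle really parses the field is not known from this thread, and the function name is hypothetical.

```python
def split_authors(field: str):
    """Split on ';' into authors, then on ',' into (first, last) names."""
    authors = []
    for part in field.split(";"):
        part = part.strip()
        if "," in part:
            last, first = (s.strip() for s in part.split(",", 1))
        else:
            last, first = part, ""
        authors.append((first, last))
    return authors


# The undecoded entity supplies the semicolon that splits the field in two.
print(split_authors("Alexandre Dumas, p&#232;re"))
```

With the entity decoded first, the field contains no semicolon and the same function yields a single author.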

Also, I really like the way the chapters begin 1/2 way down the page & hope to get your CSS so I can apply it to the books I'm making.

I usually try to get something a bit more space-efficient...

wallcraft
01-07-2009, 03:18 PM
I may leave the cover generation at 600x800.

I suggest making this tunable. A default of 600x800 would be ok, although 525x640 works well on a wider range of devices.

Elsi
01-07-2009, 04:56 PM
Also, I really like the way the chapters begin 1/2 way down the page & hope to get your CSS so I can apply it to the books I'm making.

Heh. I think that's actually a bug. It's supposed to be only 5 or so lines down, and shows up that way in Mobipocket Desktop. I'm actually hoping I'll have fixed it when I post another build of the file :). But if that's what you want, you should be able to achieve the effect with a 'margin-top: 50%' property on your chapter headers.

I usually try to get something a bit more space-efficient...

I'll agree that 1/2 the page is too far down, but I've not been happy with the default placement as shown in this first image (from The Moving Picture Girls). I also would like to try something like the pseudo-watermark used in the commercial book Fortune and Fate by Sharon Shinn as shown in the second image.

pdurrant
01-07-2009, 05:19 PM
Oh - thank you. Do you have a sample file with bit one set? It looks to me like the mobipocket decoder will need adjustment to cope with such files.

Done (http://wiki.mobileread.com/wiki/MOBI#Trailing_entries) ;).

tompe
01-07-2009, 05:39 PM
Sure can!

Most Feedbooks.com Mobipocket/Kindle offerings have this problem (as they usually are UTF-8 encoded).

For example, see The Ant King and Other Stories (http://www.feedbooks.com/book/2872) ebook (.mobi).


I looked at the output from --rawhtml but could not find any UTF-8 characters... But there are null characters in the file. But that is the data directly from the Perl module unpacking the compressed data so this is probably related to something else. UTF-8 ought not to produce null characters.

I thought that the unpacking of the record was totally independent of the character set used. Right or wrong?

tompe
01-07-2009, 05:58 PM
Done (http://wiki.mobileread.com/wiki/MOBI#Trailing_entries) ;).

Concerning multibyte character overlap. How do you know which byte is the size byte?

Are these characters and the trailing data part of the record size or are they outside the specified record size?

tompe
01-07-2009, 06:00 PM
I looked at the output from --rawhtml but could not find any UTF-8 characters... But there are null characters in the file. But that is the data directly from the Perl module unpacking the compressed data so this is probably related to something else. UTF-8 ought not to produce null characters.


The extra data flag is set to 0x31 for this file. So the extra characters are probably something from the unpacking of the data. The unpacking does not know about the extra data.

Hadrien
01-07-2009, 07:03 PM
Most Feedbooks.com Mobipocket/Kindle offerings have this problem (as they usually are UTF-8 encoded).


We use UTF-8 on 100% of our files actually.

llasram
01-07-2009, 07:03 PM
The extra data flag is set to 0x31 for this file. So the extra characters are probably something from the unpacking of the data. The unpacking does not know about the extra data.

If the '1' bit is set, and there are no actual multibyte characters in the text, then each record will end with a NUL byte indicating 0 overlapping bytes. (Well, unless one of bits 4-8 is set on the "size & flags" byte.)

Concerning multibyte character overlap. How do you know which byte is the size byte?

It's the last byte of that trailing entry.

Are these characters and the trailing data part of the record size or are they outside the specified record size?

As I understood it, the "record size" was just the distance to the next record. In which case yes, they are part of the record they follow.

llasram
01-07-2009, 07:10 PM
The extra data flag is set to 0x31 for this file. So the extra characters are probably something from the unpacking of the data. The unpacking does not know about the extra data.

Wait, where are you getting 0x31 from? I see it as 0x1 (offset 0x58c in the file).

tompe
01-07-2009, 07:30 PM
Wait, where are you getting 0x31 from? I see it as 0x1 (offset 0x58c in the file).

Ah, it is 0x1. My routine converting to hex is not working properly...

But that explains all the extra null characters.

tompe
01-07-2009, 09:08 PM
If the '1' bit is set, and there are no actual multibyte characters in the text, then each record will end with a NUL byte indicating 0 overlapping bytes. (Well, unless one of bits 4-8 is set on the "size & flags" byte.)


I am not sure I get it totally. If bit "1" is set, is the last byte in the record then always related to multibyte characters?

My code now is the following and I wondered if this is a correct understanding of it:

eval {
    sub min { return ($_[0]<$_[1]) ? $_[0] : $_[1] }
    my $maxi = min($#$recs, $header->{'records'});
    for( my $i = 1; $i <= $maxi; $i ++ ) {
        my $data = $recs->[$i]->{'data'};
        my $len = length($data);
        my $overlap = "";
        if ($self->{multibyteoverlap}) {
            my $c = chop $data;
            print STDERR "I:$i - $len - ", int($c), "\n";
            my $n = $c & 7;
            foreach (0..$n-1) {
                $overlap .= chop $data;
            }
        }

        $body .= _decompress_record( $header->{'version'},
                                     $data );
        $body .= $overlap;
    }
};


Why is three bits used for the size if the maximum size is 3? (I see now that I have reversed the order in $overlap).

llasram
01-07-2009, 09:42 PM
I am not sure I get it totally. If bit "1" is set, is the last byte in the record then always related to multibyte characters?

Almost. It's the *first* trailing entry, which means it immediately follows the text, but may be followed by other trailing entries. If bit 1 is set, plus other bits, you'll have:

<trailing multibyte bytes><multibyte size & flags><trailing data><size>

My code now is the following and I wondered if this is a correct understanding of it:

My Perl is pretty rusty, but I think mostly... Except instead of needing to preserve the overlap, you actually need to just chop it off -- it appears again at the beginning of the next record.

Why is three bits used for the size if the maximum size is 3? (I see now that I have reversed the order in $overlap).

My error. I did byte & 3 to get the size, and for some reason when I was translating the info into the wiki I turned that into 3 bits. It is only 2 bits (which I have updated the wiki to reflect).
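
For anyone following along, here is a minimal Python sketch of the simple case discussed above, where the extra data flags are exactly 0x1 and the multibyte entry is therefore the only trailing entry. The function name strip_multibyte_entry is made up for illustration; it is not taken from mobi2html or Calibre. It reads the low two bits of the last byte as the overlap count and drops those bytes along with the size byte, rather than preserving them:

```python
def strip_multibyte_entry(record: bytes) -> bytes:
    """Return the record without its trailing multibyte-overlap entry.

    Assumption: extra data flags == 0x1, so this entry is the only
    trailing entry and its size byte is the very last byte.
    """
    if not record:
        return record
    overlap = record[-1] & 0x3  # low two bits: count of overlapping bytes
    # Drop the overlapping bytes plus the size byte itself.
    keep = max(0, len(record) - (overlap + 1))
    return record[:keep]


# A record with no overlap ends in a single NUL size byte.
print(strip_multibyte_entry(b"plain text\x00"))
# A record with two overlapping bytes followed by the size byte 0x02.
print(strip_multibyte_entry(b"caf\xe2\x82\xac\x02"))
```

The stripped result is what would then be handed to the decompression routine. When other flag bits are set as well, the later trailing entries have to be skipped first, so this sketch would not be enough on its own.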

nrapallo
01-07-2009, 11:05 PM
Most Feedbooks.com Mobipocket/Kindle offerings have this problem (as they usually are UTF-8 encoded).

...

I've always seen this behaviour with Feedbooks.com .prc/.mobi ebooks.

I'm happy to report that all previous "issues" I've had with Feedbooks.com .mobi ebook conversions using my Mobi2IMP have now been resolved by the recent update (http://www.mobileread.com/forums/showthread.php?p=322232#post322232) to tompe's mobi2html (which I have incorporated into a beta Mobi2IMP). :thumbsup:

tompe
01-08-2009, 06:39 AM
Almost. It's the *first* trailing entry, which means it immediately follows the text, but may be followed by other trailing entries. If bit 1 is set, plus other bits, you'll have:

<trailing multibyte bytes><multibyte size & flags><trailing data><size>


But how do I then detect how many bytes there are in the trailing multibyte bytes? How can I know for sure which byte is the one giving the number of bytes? Or can you parse it in reverse order so that it is not ambiguous?

llasram
01-08-2009, 08:16 AM
But how do I then detect how many bytes there are in the trailing multibyte bytes? How can I know for sure which byte is the one giving the number of bytes? Or can you parse it in reverse order so that it is not ambiguous?

Right. You parse each trailing entry backwards. So if all 16 were present, you'd parse #16 at the end of the record, then #15, and so on through #1 last. I may have complicated things on the Wiki by leaving out the distinction between what I'm calling "forwards-encoded" variable-width integers and "backwards-encoded" ones. The sizes of trailing entries 2-16 are backwards-encoded variable-width integers, encoded with only the high (first) byte having bit 8 set, which means you can most easily read them backwards. So yeah -- start from the end and work backwards :).

This is Calibre's current code for finding the total size of the trailing entries:


def sizeof_trailing_entries(self, data):
    def sizeof_trailing_entry(ptr, psize):
        bitpos, result = 0, 0
        while True:
            v = ord(ptr[psize-1])
            result |= (v & 0x7F) << bitpos
            bitpos += 7
            psize -= 1
            if (v & 0x80) != 0 or (bitpos >= 28) or (psize == 0):
                return result

    num = 0
    size = len(data)
    flags = self.book_header.extra_flags >> 1
    while flags:
        if flags & 1:
            num += sizeof_trailing_entry(data, size - num)
        flags >>= 1
    if self.book_header.extra_flags & 1:
        num += (ord(data[size - num - 1]) & 0x3) + 1
    return num


HTH!
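
To make the "backwards-encoded" idea concrete, here is a small standalone Python 3 sketch of both directions. The decoder mirrors the inner loop of the Calibre code above (7 bits per byte, read from the end, stop at the byte with the high bit set). The encoder is only inferred from that decoding loop, not taken from any Mobipocket documentation, and both function names are made up for illustration:

```python
def encode_backward_varint(value: int) -> bytes:
    """Encode value so that it can be read from the end of a record.

    Assumption: 7 data bits per byte, most significant group first,
    and only the first byte has its high bit (0x80) set.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()       # most significant 7-bit group first
    groups[0] |= 0x80      # marker bit on the first byte only
    return bytes(groups)


def decode_backward_varint(data: bytes) -> int:
    """Read a backwards-encoded integer from the end of data."""
    result, bitpos = 0, 0
    for pos in range(len(data) - 1, -1, -1):
        byte = data[pos]
        result |= (byte & 0x7F) << bitpos
        bitpos += 7
        # Stop at the marker byte, or after four bytes as Calibre does.
        if byte & 0x80 or bitpos >= 28:
            break
    return result


payload = b"some trailing data"
print(decode_backward_varint(payload + encode_backward_varint(129)))
```

Because only the first byte of the integer carries the marker bit, a reader starting at the last byte of the record knows where the size field stops without knowing its width in advance.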

tompe
01-08-2009, 11:16 AM
Thanks, it was as complicated as I suspected then... These kinds of complications seem very odd, and I suspect that a specification of the MobiPocket format has not been released because either it does not exist or they do not want to show the world how bad the format really is.

Is there any test file available somewhere where the extraflags value is something other than 0x1?

llasram
01-08-2009, 12:11 PM
Thanks, it was as complicated as I suspected then... These kinds of complications seem very odd, and I suspect that a specification of the MobiPocket format has not been released because either it does not exist or they do not want to show the world how bad the format really is.

This really isn't that complicated compared to LIT's internal indices -- hash tables and multi-level look-up tables and tree lists oh my. At least Mobipocket realized they needed a backwards-compatible way to specify new trailing entries before they added more than two.

Is there any test file available somewhere where the extraflags value is something other than 0x1?

Attached is one I've generated with mobigen.

pdurrant
01-08-2009, 12:22 PM
Oh - very useful stuff. And it turns out that the Mobipocket decoder will need some fixes for cases where bit position 1 is set. I can only suppose that very few commercial DRMed eBooks are out there with that bit set.

Happily easy to fix given this code, once such a book turns up.


This is Calibre's current code for finding the total size of the trailing entries:


def sizeof_trailing_entries(self, data):
    def sizeof_trailing_entry(ptr, psize):
        bitpos, result = 0, 0
        while True:
            v = ord(ptr[psize-1])
            result |= (v & 0x7F) << bitpos
            bitpos += 7
            psize -= 1
            if (v & 0x80) != 0 or (bitpos >= 28) or (psize == 0):
                return result

    num = 0
    size = len(data)
    flags = self.book_header.extra_flags >> 1
    while flags:
        if flags & 1:
            num += sizeof_trailing_entry(data, size - num)
        flags >>= 1
    if self.book_header.extra_flags & 1:
        num += (ord(data[size - num - 1]) & 0x3) + 1
    return num


HTH!

Jellby
01-08-2009, 01:05 PM
How do you deal with "font-variant: small-caps"? Do you convert <span class="small-caps">Foo Bar</span> into F<font size="-1">OO</font> B<font size="-1">AR</font> ?

I guess "text-transform: uppercase" is easier... (I once found an HTML book where many capital letters were "created" with this property, which meant that copy-pasting gave lowercase letters, it was a pain...)

llasram
01-08-2009, 02:13 PM
How do you deal with "font-variant: small-caps"? Do you convert <span class="small-caps">Foo Bar</span> into F<font size="-1">OO</font> B<font size="-1">AR</font> ?

That's my plan, although I'm not doing it yet. Because so few formats/renderers support anything like 'font-variant: small-caps', I'm planning to add it as a general content transform. The spit-and-polish phase of finishing all this stuff is taking a while, but will hopefully be faster for the next format I tackle, since I'll have everything like 'font-variant' degradation, SVG rasterization, etc. already finished.

-Marshall
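
A rough Python sketch of the small-caps degradation Jellby describes, for plain text only. This is not oeb2mobi's code (which, as noted above, does not do this yet); the function name degrade_small_caps is hypothetical, and a real transform would have to walk the document tree rather than a bare string. Runs of lowercase letters are uppercased and wrapped in a smaller font, while everything else passes through unchanged:

```python
from html import escape
from itertools import groupby


def degrade_small_caps(text: str) -> str:
    """Approximate small-caps with uppercase text in a smaller font."""
    parts = []
    # Group consecutive characters by whether they are lowercase letters.
    for is_lower, chars in groupby(text, key=str.islower):
        run = "".join(chars)
        if is_lower:
            parts.append('<font size="-1">' + escape(run.upper()) + "</font>")
        else:
            parts.append(escape(run))
    return "".join(parts)


print(degrade_small_caps("Foo Bar"))
```

One design note: the original capitals stay at full size, which is what keeps the result looking like small-caps rather than plain shrunken uppercase text.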