Old 01-07-2009, 05:58 PM   #46
tompe
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
Quote:
Originally Posted by llasram View Post
Concerning multibyte character overlap. How do you know which byte is the size byte?

Are these characters and the trailing data part of the record size or are they outside the specified record size?
Old 01-07-2009, 06:00 PM   #47
tompe
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
Quote:
Originally Posted by tompe View Post
I looked at the output from --rawhtml but could not find any UTF-8 characters... But there are null characters in the file. That is the data directly from the Perl module unpacking the compressed data, so this is probably related to something else. UTF-8 ought not to produce null characters.
The extra data flag is set to 0x31 for this file. So the extra characters are probably something from the unpacking of the data. The unpacking does not know about the extra data.
Old 01-07-2009, 07:03 PM   #48
Hadrien
Feedbooks.com Co-Founder
Posts: 2,263
Karma: 145123
Join Date: Nov 2006
Location: Paris, France
Device: Sony PRS-t-1/350/300/500/505/600/700, Nexus S, iPad
Quote:
Originally Posted by nrapallo View Post
Most Feedbooks.com Mobipocket/Kindle offerings have this problem (as they usually are UTF-8 encoded).
We use UTF-8 on 100% of our files actually.
Old 01-07-2009, 07:03 PM   #49
llasram
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
Quote:
Originally Posted by tompe View Post
The extra data flag is set to 0x31 for this file. So the extra characters are probably something from the unpacking of the data. The unpacking does not know about the extra data.
If the '1' bit is set, and there are no actual multibyte characters in the text, then each record will end with a NUL byte indicating 0 overlapping bytes. (Well, unless one of bits 4-8 is set on the "size & flags" byte.)

Quote:
Originally Posted by tompe View Post
Concerning multibyte character overlap. How do you know which byte is the size byte?
It's the last byte of that trailing entry.

Quote:
Originally Posted by tompe View Post
Are these characters and the trailing data part of the record size or are they outside the specified record size?
As I understood it, the "record size" was just the distance to the next record. In which case yes, they are part of the record they follow.
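
A minimal sketch of why stray NULs show up when the trailing byte is not stripped, written in Python like the Calibre code quoted in this thread. It assumes only bit 1 of the extra data flags is set and uses uncompressed byte strings as stand-ins for real (compressed) records; `strip_multibyte_entry` and `EXTRA_FLAG_MULTIBYTE` are names made up for illustration.

```python
EXTRA_FLAG_MULTIBYTE = 0x1  # the '1' bit of the extra data flags (illustrative name)


def strip_multibyte_entry(record: bytes, extra_flags: int) -> bytes:
    """Drop the multibyte-overlap trailing entry from one record.

    Assumption: no other extra-data flag bits are set, so this entry is
    the very last thing in the record.
    """
    if not extra_flags & EXTRA_FLAG_MULTIBYTE or not record:
        return record
    overlap = record[-1] & 0x3       # low 2 bits: number of overlapping bytes
    return record[:-(overlap + 1)]   # overlap bytes plus the size byte itself


# Pure-ASCII text: nothing crosses a record boundary, so each entry is a lone NUL.
records = [b"Hello, \x00", b"world\x00"]
naive = b"".join(records)
clean = b"".join(strip_multibyte_entry(r, 0x1) for r in records)
print(naive)
print(clean)
```

An unpacker that ignores the flag passes that final NUL through with the text, which would match the symptom described earlier in the thread.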
Old 01-07-2009, 07:10 PM   #50
llasram
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
Quote:
Originally Posted by tompe View Post
The extra data flag is set to 0x31 for this file. So the extra characters are probably something from the unpacking of the data. The unpacking does not know about the extra data.
Wait, where are you getting 0x31 from? I see it as 0x1 (offset 0x58c in the file).
Old 01-07-2009, 07:30 PM   #51
tompe
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
Quote:
Originally Posted by llasram View Post
Wait, where are you getting 0x31 from? I see it as 0x1 (offset 0x58c in the file).
Ah, it is 0x1. My routine converting to hex is not working properly...

But that explains all the extra null characters.
Old 01-07-2009, 09:08 PM   #52
tompe
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
Quote:
Originally Posted by llasram View Post
If the '1' bit is set, and there are no actual multibyte characters in the text, then each record will end with a NUL byte indicating 0 overlapping bytes. (Well, unless one of bits 4-8 is set on the "size & flags" byte.)
I am not sure I get it totally. If bit "1" is set, is the last byte in the record then always related to multibyte characters?

My code now is the following and I wondered if this is a correct understanding of it:
Code:
                eval {
                    sub min { return ($_[0]<$_[1]) ? $_[0] : $_[1] }
                    my $maxi = min($#$recs, $header->{'records'});
                    for( my $i = 1; $i <= $maxi; $i ++ ) {
                        my $data = $recs->[$i]->{'data'};
                        my $len = length($data);
                        my $overlap = "";
                        if ($self->{multibyteoverlap}) {
                            my $c = chop $data;
                            print STDERR "I:$i - $len - ", int($c), "\n";
                            my $n = $c & 7;
                            foreach (0..$n-1) {
                                $overlap .= chop $data;
                            }
                        }

                        $body .= _decompress_record( $header->{'version'},
                                                     $data );
                        $body .= $overlap;
                    }
                };
Why are three bits used for the size if the maximum size is 3? (I see now that I have reversed the order in $overlap.)
Old 01-07-2009, 09:42 PM   #53
llasram
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
Quote:
Originally Posted by tompe View Post
I am not sure I get it totally. If bit "1" is set, is the last byte in the record then always related to multibyte characters?
Almost. It's the *first* trailing entry, which means it immediately follows the text, but may be followed by other trailing entries. If bit 1 is set, plus other bits, you'll have:

Code:
<trailing multibyte bytes><multibyte size & flags><trailing data><size>
Quote:
Originally Posted by tompe View Post
My code now is the following and I wondered if this is a correct understanding of it:
My Perl is pretty rusty, but I think mostly... Except instead of needing to preserve the overlap, you actually need to just chop it off -- it appears again at the beginning of the next record.

Quote:
Originally Posted by tompe View Post
Why are three bits used for the size if the maximum size is 3? (I see now that I have reversed the order in $overlap.)
My error. I did byte & 3 to get the size, and for some reason when I was translating the info into the wiki I turned that into 3 bits. It is only 2 bits (which I have updated the wiki to reflect).
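
For illustration, here is a rough Python rendering of the per-record loop with the two corrections from this reply applied: the size is masked with 3 (2 bits), and the overlap bytes are discarded rather than appended. It is only a sketch under assumptions: only bit 1 of the extra flags is set, and `decompress` is a hypothetical placeholder for the real per-record decompressor.

```python
def strip_overlap(record: bytes) -> bytes:
    """Remove the multibyte trailing entry (overlap bytes + size byte)."""
    overlap = record[-1] & 0x3                  # 2 bits, so 0..3 overlap bytes
    return record[: len(record) - overlap - 1]


def decompress(data: bytes) -> bytes:
    """Hypothetical stand-in: the records below are already uncompressed."""
    return data


def join_records(records: list) -> bytes:
    body = b""
    for record in records:
        body += decompress(strip_overlap(record))
    return body


# "é" is C3 A9 in UTF-8 and straddles the record boundary here.
rec1 = b"caf\xc3" + b"\xa9" + b"\x01"   # text, 1 overlap byte, size byte
rec2 = b"\xa9 au lait" + b"\x00"        # overlap byte reappears; nothing overlaps at the end
text = join_records([rec1, rec2]).decode("utf-8")
print(text)
```

Because the overlap byte shows up again at the start of the second record, dropping it from the first record is what keeps the joined text valid UTF-8; keeping it would duplicate the byte.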
Old 01-07-2009, 11:05 PM   #54
nrapallo
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
Quote:
Originally Posted by nrapallo View Post
Most Feedbooks.com Mobipocket/Kindle offerings have this problem (as they usually are UTF-8 encoded).

...

I've always seen this behaviour with Feedbooks.com .prc/.mobi ebooks.
I'm happy to report that all previous "issues" I've had with Feedbooks.com .mobi ebook conversions using my Mobi2IMP have now been resolved by the recent update to tompe's mobi2html (which I have incorporated into a beta Mobi2IMP).
Old 01-08-2009, 06:39 AM   #55
tompe
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
Quote:
Originally Posted by llasram View Post
Almost. It's the *first* trailing entry, which means it immediately follows the text, but may be followed by other trailing entries. If bit 1 is set, plus other bits, you'll have:

Code:
<trailing multibyte bytes><multibyte size & flags><trailing data><size>
But how do I then detect how many bytes there are in the trailing multibyte bytes? How can I know for sure which byte is the one giving the number of bytes? Or can you parse it in reverse order so that it is not ambiguous?
Old 01-08-2009, 08:16 AM   #56
llasram
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
Quote:
Originally Posted by tompe View Post
But how do I then detect how many bytes there are in the trailing multibyte bytes? How can I know for sure which byte is the one giving the number of bytes? Or can you parse it in reverse order so that it is not ambiguous?
Right. You parse each trailing entry backwards. So if all 16 were present, you'd parse #16 at the end of the record, then #15, and so on down to #1 last. I may have made this harder to follow by leaving out of the wiki the distinction between what I'm calling "forwards-encoded" variable-width integers and "backwards-encoded" ones. The sizes of trailing entries 2-16 are backwards-encoded variable-width integers, in which only the high (first) byte has bit 8 set, which means you can most easily read them backwards. So yeah -- start from the end and work backwards.

This is Calibre's current code for finding the total size of the trailing entries:

Code:
def sizeof_trailing_entries(self, data):
    def sizeof_trailing_entry(ptr, psize):
        bitpos, result = 0, 0
        while True:
            v = ord(ptr[psize-1])
            result |= (v & 0x7F) << bitpos
            bitpos += 7
            psize -= 1
            if (v & 0x80) != 0 or (bitpos >= 28) or (psize == 0):
                return result
    
    num = 0
    size = len(data)
    flags = self.book_header.extra_flags >> 1
    while flags:
        if flags & 1:
            num += sizeof_trailing_entry(data, size - num)
        flags >>= 1
    if self.book_header.extra_flags & 1:
        num += (ord(data[size - num - 1]) & 0x3) + 1
    return num
HTH!
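
For anyone who wants to poke at this outside Calibre, below is a standalone Python 3 sketch of the same logic. It is a port for illustration, not Calibre's actual code: `extra_flags` is passed in as a plain argument instead of being read from `self.book_header`, indexing `bytes` already yields integers so `ord()` is gone, and `encode_backward_size` is a hypothetical helper added only to build test data.

```python
def sizeof_trailing_entry(data: bytes, end: int) -> int:
    """Decode the backwards-encoded size whose last byte is data[end - 1]."""
    bitpos, result = 0, 0
    while True:
        v = data[end - 1]
        result |= (v & 0x7F) << bitpos   # 7 payload bits per byte, low bits last
        bitpos += 7
        end -= 1
        if (v & 0x80) or bitpos >= 28 or end == 0:
            return result                # bit 8 marks the first (high) byte


def sizeof_trailing_entries(data: bytes, extra_flags: int) -> int:
    """Total number of trailing bytes to strip from one record."""
    num = 0
    size = len(data)
    flags = extra_flags >> 1             # every flag bit except the multibyte one
    while flags:
        if flags & 1:
            num += sizeof_trailing_entry(data, size - num)
        flags >>= 1
    if extra_flags & 1:                  # multibyte overlap entry sits closest to the text
        num += (data[size - num - 1] & 0x3) + 1
    return num


def encode_backward_size(n: int) -> bytes:
    """Hypothetical test helper: encode n so it can be read from the end."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(n & 0x7F)
        n >>= 7
    out[-1] |= 0x80                      # flag the high byte
    return bytes(reversed(out))


# <text><overlap bytes><multibyte size & flags><trailing data><size>
payload = b"ABCD"
entry = payload + encode_backward_size(len(payload) + 1)  # size counts its own byte
record = b"caf\xc3" + b"\xa9" + b"\x01" + entry
trailing = sizeof_trailing_entries(record, 0x3)
print(trailing, record[: len(record) - trailing])
```

Note that the decoded size has to include the size bytes themselves, since the result is added directly to the count of bytes to strip.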
Old 01-08-2009, 11:16 AM   #57
tompe
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
Thanks, it was as complicated as I suspected then... These kinds of complications seem very odd, and I suspect that a specification of the MobiPocket format has not been released because either it does not exist or they do not want to show the world how bad the format really is.

Is there any test file available somewhere where the extra flags value is something other than 0x1?
Old 01-08-2009, 12:11 PM   #58
llasram
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
Quote:
Originally Posted by tompe View Post
Thanks, it was as complicated as I suspected then... These kinds of complications seem very odd, and I suspect that a specification of the MobiPocket format has not been released because either it does not exist or they do not want to show the world how bad the format really is.
This really isn't that complicated compared to LIT's internal indices -- hash tables and multi-level look-up tables and tree lists oh my. At least Mobipocket realized they needed a backwards-compatible way to specify new trailing entries before they added more than two.

Quote:
Originally Posted by tompe View Post
Is there any test file available somewhere where the extra flags value is something other than 0x1?
Attached is one I've generated with mobigen.
Attached Files
File Type: mobi trailing.mobi (6.8 KB, 324 views)
Old 01-08-2009, 12:22 PM   #59
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 74,412
Karma: 318076944
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Oasis
Oh - very useful stuff. And it turns out that the Mobipocket decoder will need some fixes for cases where bit position 1 is set. I can only suppose that very few commercial DRMed eBooks are out there with that bit set.

Happily easy to fix given this code, once such a book turns up.

Quote:
Originally Posted by llasram View Post
This is Calibre's current code for finding the total size of the trailing entries:

Code:
def sizeof_trailing_entries(self, data):
    def sizeof_trailing_entry(ptr, psize):
        bitpos, result = 0, 0
        while True:
            v = ord(ptr[psize-1])
            result |= (v & 0x7F) << bitpos
            bitpos += 7
            psize -= 1
            if (v & 0x80) != 0 or (bitpos >= 28) or (psize == 0):
                return result
    
    num = 0
    size = len(data)
    flags = self.book_header.extra_flags >> 1
    while flags:
        if flags & 1:
            num += sizeof_trailing_entry(data, size - num)
        flags >>= 1
    if self.book_header.extra_flags & 1:
        num += (ord(data[size - num - 1]) & 0x3) + 1
    return num
HTH!
Old 01-08-2009, 01:05 PM   #60
Jellby
frumious Bandersnatch
Posts: 7,570
Karma: 20150435
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
How do you deal with "font-variant: small-caps"? Do you convert <span class="small-caps">Foo Bar</span> into F<font size="-1">OO</font> B<font size="-1">AR</font> ?

I guess "text-transform: uppercase" is easier... (I once found an HTML book where many capital letters were "created" with this property, which meant that copy-pasting gave lowercase letters; it was a pain...)
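
To make the small-caps question concrete, here is a minimal Python sketch of the transformation being asked about. It is purely illustrative and not taken from any converter in this thread: `fake_small_caps` is a made-up name, and it only handles the plain text inside the span, not nested markup.

```python
from itertools import groupby


def fake_small_caps(text: str) -> str:
    """Uppercase each run of lowercase letters and wrap it in a smaller font."""
    parts = []
    for is_lower, group in groupby(text, key=str.islower):
        chunk = "".join(group)
        if is_lower:
            chunk = f'<font size="-1">{chunk.upper()}</font>'
        parts.append(chunk)
    return "".join(parts)


print(fake_small_caps("Foo Bar"))
```

As with "text-transform: uppercase", the underlying text changes case, so copy-pasting the result gives different letters from the source.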