KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 63

DiapDealer · 07-24-2014, 03:48 PM

Quote:

Originally Posted by davidnwelton

I am sorry to butt in, but I was just curious if you guys had considered putting the source code for mobiunpack up on github or something like that that makes it easier to collaborate on.

Thanks

It's been talked about several times before in the thread. I don't think anyone really wants to be saddled with being the maintainer, or with making sure there ARE current maintainers who can guard the kingdom. Maybe some day, but unless Amazon introduces new features into the format, development often comes to a standstill for long periods of time. We could come back from a long hiatus and find that no one with a key to the front door is around anymore.

KevinH · 07-24-2014, 03:56 PM

Hi,

Already tried that and it did not work. We had it on google code for years and nothing was ever done or even used. So we took it down.

The universe of potential users of this project is actually quite small, and they often don't visit any sourcecode/google code/github sites. Actual editors and authors find us here. There are only ever 2 or 3 active developers at one time.

So we get better input on KindleUnpack from this forum and exchanging patches than we ever did from having a repository. Except for a recent flourish of new features this past month, KindleUnpack is considered reasonably stable.

If you have something to contribute simply ask about it here and then post a patch.

If your patch does not impede the intent of the package it will most likely be accepted. But please note: KindleUnpack is a Kindle mobi/azw3 diagnostic tool that allows ebook editors/authors to see how their code (and the code of others) has been changed by kindlegen and to fix minor bugs, and that provides Python-based documentation of the compiled mobi/azw4/azw3 format.

It is not a standalone Mobi to epub generator. If you want that, please look towards Calibre instead.

Hope this helps,

KevinH

tkeo · 07-25-2014, 10:03 PM

Hi,

Quote:

Originally Posted by davidnwelton

I am sorry to butt in, but I was just curious if you guys had considered putting the source code for mobiunpack up on github or something like that that makes it easier to collaborate on.

Others have already explained the reasons for not using something like github. There is a repository of KindleUnpack on github created by quiris; however, it has not been updated since v0.71 (the latest version is v0.73).
https://github.com/quiris11/KindleUnpack

I use git locally a little but do not have a github account.
Thanks,

tkeo · 07-26-2014, 09:04 AM

Hi,

I have modified KindleUnpack v0.73. Modifications are as follows:

added refines metadata processing
fixed language code in the ncx and title in the navigation document
added F (force to fit to epub2) option to epubver for removing epub3 attribute to fit to epub2 definition

I feel we need to discuss whether removing the epub3 attributes is necessary and how to switch it on and off.

In addition, I am considering moving the step that adds metaguidetext to guidetext from mobi_opf.py to processMobi7() in kindleunpack.py.

Please give your opinions if you have any.
Thanks,

CAUTION This update is under development and not intended for end users because the specification is not fixed.

KevinH · 07-26-2014, 11:24 AM

Hi tkeo,

Quote:

Originally Posted by tkeo

Hi,
I have modified KindleUnpack v0.73. Modifications are as follows:

1. added refines metadata processing

Is this only for single creator? How did you deal with multiple creators?

Quote:

2. fixed language code in the ncx and title in the navigation document
3. added F (force to fit to epub2) option to epubver for removing epub3 attribute to fit to epub2 definition

Great! These are useful additions.

Quote:

I feel we need to discuss whether removing the epub3 attributes is necessary and how to switch it on and off.

I agree, here are a few things to consider during this forced conversion:

1. replace the section tag with <div data-tag="section">, and similarly for the closing tag
2. replace epub:type="blah" attributes with data-epub-type="blah" to keep the semantic meaning
3. allow video and audio tags to go through as is
4. deal with <aside> in some sane manner
5. add epub_type vocabulary to guide elements where crossover exists and to nav if possible
6. convert nav to toc if toc does not exist
7. convert meta data from new format back to using older format (with opf:scheme, opf:fileas, opf:role) replacing refines with something more sane
8. remove cover manifest property and add in required meta name="cover"
9. there are probably a few other new tags we should convert as well
10. ...

Please add to the list above. Once we agree on the best way to force epub 2, I would also like a similar way to reverse all of this (including reversing div back to section, data-epub-type back to epub:type, etc.) to force generation of a valid epub 3 from an epub 2 starting point with extras added.
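Items 1 and 2 of the list, and their reversal, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `force_epub2` and `restore_epub3` are hypothetical and not part of KindleUnpack, the `<!-- /section -->` marker used to make the closing tag reversible is an assumed convention, and regular expressions are used for brevity rather than a real XHTML parser.

```python
import re

# Hypothetical helpers, not KindleUnpack code. Regexes are a simplification;
# a real implementation would need to cope with arbitrary XHTML.
SECTION_OPEN = re.compile(r'<section\b([^>]*)>')
SECTION_CLOSE = re.compile(r'</section\s*>')
EPUB_TYPE = re.compile(r'\bepub:type=("[^"]*"|\'[^\']*\')')


def force_epub2(xhtml):
    """Down-convert section tags and epub:type, leaving markers for reversal."""
    xhtml = SECTION_OPEN.sub(r'<div data-tag="section"\1>', xhtml)
    # Assumed convention: tag the closing div with a comment so it can be found again.
    xhtml = SECTION_CLOSE.sub('</div><!-- /section -->', xhtml)
    return EPUB_TYPE.sub(r'data-epub-type=\1', xhtml)


def restore_epub3(xhtml):
    """Undo force_epub2 using the data-* attributes and comment markers."""
    xhtml = re.sub(r'<div data-tag="section"([^>]*)>', r'<section\1>', xhtml)
    xhtml = xhtml.replace('</div><!-- /section -->', '</section>')
    return re.sub(r'\bdata-epub-type=', 'epub:type=', xhtml)
```

Because every change leaves a marker behind, a file that was only edited elsewhere in calibre or Sigil should convert back unchanged in those spots; whether editors preserve the comment markers is something that would have to be tested.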

Quote:

In addition, I am considering moving the step that adds metaguidetext to guidetext from mobi_opf.py to processMobi7() in kindleunpack.py.

That matches what we do for kf8 so that is a good idea.

I am also considering adding in my mobiml2html.py code to convert mobi 7 to something importable into calibre and Sigil, that can be further edited.

I would also like to add a feature that provides the best single output format using the following scheme:

1. if the mobi includes a SRCS record (i.e. kindlegensrc.zip is provided), return it
2. if the mobi has no source but has a kf8 (azw3) part, return our unpacked epub from mobi8
3. if the mobi has no source and no kf8 part, use the mobiml2html.py code to return at least a parseable, proper xhtml version of the mobi 7 output.
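The three-step preference order above can be written down as a small decision function. This is a sketch only: `pick_best_output` and its two boolean parameters are hypothetical names, and the returned labels are placeholders rather than anything KindleUnpack produces.

```python
def pick_best_output(has_srcs, has_kf8):
    """Return a label for the preferred single output, following the scheme above.

    has_srcs: True if the mobi carries a SRCS record (kindlegensrc.zip).
    has_kf8:  True if the mobi has a KF8 (azw3) part.
    """
    if has_srcs:
        return "kindlegensrc.zip"   # the original source is the best result
    if has_kf8:
        return "mobi8 epub"         # unpacked epub from the KF8 part
    return "mobi7 xhtml"            # fall back to a mobiml2html-style conversion
```

For example, `pick_best_output(False, True)` selects the unpacked mobi8 epub.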

Please let me know what you think.

Quote:

CAUTION This update is under development and not intended for end users because the specification is not fixed.

So don't ask for versions of it for a plugin or for your bug reporting JSWolf! ;-)

KevinH

tkeo · 07-28-2014, 09:18 AM

Hi,

Quote:

Originally Posted by KevinH

Is this only for single creator? How did you deal with multiple creators?

For each of title, publisher and creator, if there is exactly one entry paired with a corresponding EXTH record for furigana, its refines tags are not commented out. Otherwise they are commented out. The following is an example.

Code:

<?xml version="1.0" encoding="utf-8"?>
<package version="3.0" xmlns="http://www.idpf.org/2007/opf" prefix="rendition: http://www.idpf.org/vocab/rendition/#" unique-identifier="uid">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:title id="title01">AAAA</dc:title>
<dc:language>ja</dc:language>
<dc:identifier id="uid">3232262294</dc:identifier>
<meta property="dcterms:modified">2014-07-26T08:09:39Z</meta>
<dc:creator id="creator01">XXX</dc:creator>
<dc:creator id="creator02">YYY</dc:creator>
<dc:publisher id="publisher01">BBBB</dc:publisher>
<dc:date opf:event="publication">2011-05-30</dc:date>
<!-- Refines MetaData from EXTH -->
<meta property="file-as" refines="#title01">aaaa</meta>
<meta property="file-as" refines="#publisher01">bbbb</meta>
<!-- THE FOLLOWINGS ARE REQUIRED TO EDIT IDS MANUALLY
<meta property="file-as" refines="#creator01">yyy</meta>
<meta property="file-as" refines="#creator02">xxx</meta>
<meta scheme="marc:relators" property="role" refines="#creator01">aut</meta>
<meta property="display-seq" refines="#creator01">1</meta>
-->

Quote:

I agree, here are a few things to consider during this forced conversion:

1. replace the section tag with <div data-tag="section">, and similarly for the closing tag
2. replace epub:type="blah" attributes with data-epub-type="blah" to keep the semantic meaning
3. allow video and audio tags to go through as is
4. deal with <aside> in some sane manner
5. add epub_type vocabulary to guide elements where crossover exists and to nav if possible
6. convert nav to toc if toc does not exist
7. convert meta data from new format back to using older format (with opf:scheme, opf:fileas, opf:role) replacing refines with something more sane
8. remove cover manifest property and add in required meta name="cover"
9. there are probably a few other new tags we should convert as well
10. ...

6 and 8 are already done. In our KindleUnpack, the nav section in nav.xhtml is generated from the ncx (i.e. NCXProcessor), so the ncx always exists.
For 7, the older attributes need to be converted from the refines metadata, so the problem is resolving the id correspondence, the same as for the refines metadata itself.
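As a rough illustration of item 7, the sketch below folds `<meta refines="#id">` entries back onto the element that carries the matching id, using opf:file-as and opf:role attributes. It is not the code from the modified KindleUnpack: `fold_refines` and `PROP_TO_ATTR` are hypothetical names, only two properties are mapped, and anything whose id or property cannot be resolved is left in place for manual editing. Note that ElementTree drops XML comments when parsing, so commented-out entries are simply ignored.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
# epub3 refines property -> epub2 attribute name (hypothetical, partial mapping)
PROP_TO_ATTR = {"file-as": "file-as", "role": "role"}


def fold_refines(metadata):
    """Move <meta refines="#id"> values onto the element carrying that id."""
    by_id = {el.get("id"): el for el in metadata if el.get("id")}
    for meta in list(metadata):
        if not (meta.tag == "meta" or meta.tag.endswith("}meta")):
            continue
        target = by_id.get((meta.get("refines") or "").lstrip("#"))
        attr = PROP_TO_ATTR.get(meta.get("property"))
        if target is None or attr is None:
            continue  # unresolved id or unmapped property: leave untouched
        target.set("{%s}%s" % (OPF_NS, attr), (meta.text or "").strip())
        metadata.remove(meta)
    return metadata
```

The id lookup is the whole difficulty mentioned above: when the ids cannot be trusted (for example with several creators), nothing is resolved automatically and the entries stay as they were.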

JSWolf · 07-28-2014, 10:33 AM

In forcing to ePub2, the video and audio tags might cause a problem. Consider dropping them.

tkeo · 07-29-2014, 09:28 AM

Hi,

I would like to confirm that, for the F option, we are going to convert the K8 epub-like structure into one with epub2 tags that is accepted as a source by kindlegen version 2.?. Is that right?

Thanks,

KevinH · 07-29-2014, 05:45 PM

Hi tkeo,

If the user specifies F to force to epub2, my guess is they want the epub version 2 for their own use and most probably won't be passing it back through kindlegen, which can deal with the epub 3 features just fine. My guess is they probably want to load it into calibre or Sigil for further editing, but neither really supports epub 3.

So if we can take the epub 3 features and convert them as little as possible, making liberal use of the data-* attribute and comments specially marked to be reversible, down-convert the epub3 metadata and the like, the user may be able to edit it in calibre or Sigil and get it to validate, and yet make it easy to auto convert back to epub 3 if possible.

That is the plan anyway.

Take care,

KevinH

Quote:

Originally Posted by tkeo

Hi,

I would like to confirm that, for the F option, we are going to convert the K8 epub-like structure into one with epub2 tags that is accepted as a source by kindlegen version 2.?. Is that right?

Thanks,

tkeo · 07-31-2014, 09:03 AM

Hi Kevin,

Quote:

Originally Posted by KevinH

1. replace the section tag with <div data-tag="section">, and similarly for the closing tag
2. replace epub:type="blah" attributes with data-epub-type="blah" to keep the semantic meaning
3. allow video and audio tags to go through as is
4. deal with <aside> in some sane manner
5. add epub_type vocabulary to guide elements where crossover exists and to nav if possible
6. convert nav to toc if toc does not exist
7. convert meta data from new format back to using older format (with opf:scheme, opf:fileas, opf:role) replacing refines with something more sane
8. remove cover manifest property and add in required meta name="cover"
9. there are probably a few other new tags we should convert as well
10. ...

Although we are not yet sure about the best way to force-convert to epub2 tags, or whether the list is complete, I have made modifications to fulfill item 7 in the list, in order to confirm whether this conversion matches the purpose.

Compared with v0.73, v0.73b (and maybe v0.74) is a rather minor functional improvement, so I would like to proceed at a slower pace.

Thanks,

lglgaigogo · 08-01-2014, 08:03 AM

The program currently does not handle dictionaries encoded in utf-8 well.
When I use the kindlegen option -western, this problem does not occur.

1. Tag <idx:orth>: the value attribute seems to be garbled
2. Tag <idx:iform>: the value attribute seems to be garbled

It seems to use a weird table to index the characters.
The table :

the dictionary can be found at:
https://github.com/lglgaigogo/AI2KD/...ish%205th.mobi

DiapDealer · 08-01-2014, 08:53 AM

Dictionary support has always been a bit touch and go.

KevinH · 08-01-2014, 11:46 AM

I'm out of town and out of touch for the next 10 days or so. When I get back, I will download the dictionary and try to reproduce the issue. ORDT sections like that represent a byte mapping of one character encoding into another, typically multi-byte. I have seen this issue in some sample ebooks. It is caused by the generating machine using a strange charset like 65002 versus the more typical 65001 (utf-8).
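For anyone wanting to check which charset a given file declares, here is a minimal sketch. It assumes that `record0` holds the raw bytes of the first PalmDB record and that the text-encoding field sits at offset 0x1C as a big-endian 32-bit value; both are assumptions about the layout, and `text_codec` and `CODEC_MAP` are hypothetical names, so verify against the KindleUnpack sources before relying on it.

```python
import struct

# Code pages with a known Python codec; anything else (e.g. 65002) is reported as None.
CODEC_MAP = {1252: "windows-1252", 65001: "utf-8"}


def text_codec(record0):
    """Return (codepage, codec name or None) read from the assumed header offset."""
    (codepage,) = struct.unpack_from(">L", record0, 0x1C)
    return codepage, CODEC_MAP.get(codepage)
```

A result such as `(65002, None)` would flag exactly the kind of unusual charset described above.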

If you want to play around, look in the mobi_index.py file for the strings horde and ORDT. As the code comments say, there are two ORDT provided but we could only figure out what the second one was for. We may now figure it out from your testcase.

KevinH

lglgaigogo · 08-02-2014, 02:06 AM

Quote:

Originally Posted by KevinH

I'm out of town and out of touch for the next 10 days or so. When I get back, I will download the dictionary and try to reproduce the issue. ORDT sections like that represent a byte mapping of one character encoding into another, typically multi-byte. I have seen this issue in some sample ebooks. It is caused by the generating machine using a strange charset like 65002 versus the more typical 65001 (utf-8).

If you want to play around, look in the mobi_index.py file for the strings horde and ORDT. As the code comments say, there are two ORDT provided but we could only figure out what the second one was for. We may now figure it out from your testcase.

KevinH

Thank you for paying attention to my issue. I am now trying to understand the non-western character encoding pattern.
Thank you.

So far, I have figured out:

1. Every character has a 2-byte index
2. For western letters it looks like 00 XX; for example, 'a' is 00 03 and 'b' is 00 64, and looking them up in the ORDT table:
ORDT[3*2+1] is 'a'
ORDT[64*2+1] is 'b'
3. For non-western letters it looks like XX XX; for example, '潘' is 6F 58, and in python:

Code:

print(u"\u6F58")  # prints exactly the character '潘'
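Taken literally, the three observations above suggest a decoding loop like the following. This is only a sketch of that reading, not verified against the real format: `decode_index_bytes` is a hypothetical name, `ordt` is assumed to be the raw table bytes with 2-byte slots, and the slot numbers from the post (3 and 64) are treated as the hexadecimal values of the XX byte.

```python
import struct


def decode_index_bytes(index_bytes, ordt):
    """Decode 2-byte big-endian values following the observations above.

    00 XX : XX is a slot number; the character is the low byte of that
            2-byte slot, i.e. ordt[XX * 2 + 1].
    XX XX : the two bytes are taken directly as the Unicode code point.
    """
    chars = []
    for (value,) in struct.iter_unpack(">H", index_bytes):
        if value < 0x100:
            chars.append(chr(ordt[value * 2 + 1]))
        else:
            chars.append(chr(value))
    return "".join(chars)
```

If the second case later turns out to be a table lookup as well, only the `else` branch would need to change.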

tkeo · 08-12-2014, 08:09 AM

This is an experimental version, just to show that audio and video can be extracted from a mobi.

I have modified kindleunpack.py to extract AUDI and VIDE sections, which contain audio and video respectively.

They are extracted to the HDImage folder;
however, the extracted files are not linked in the xhtml files, and
the suffixes of the files are hard-coded to '.mp3' and '.mp4'.
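The section detection and hard-coded suffixes described above might look roughly like this. It is a sketch, not the actual patch: `media_filename` and the `media%05d` naming pattern are hypothetical, and it only classifies a section by its first four bytes. Where the media payload starts inside the section is not covered here.

```python
# Section magic -> hard-coded suffix, as described in the post.
MEDIA_SUFFIX = {b"AUDI": ".mp3", b"VIDE": ".mp4"}


def media_filename(section, number):
    """Return an output file name for an AUDI/VIDE section, or None otherwise."""
    suffix = MEDIA_SUFFIX.get(bytes(section[:4]))
    if suffix is None:
        return None
    # Hypothetical naming pattern, used only for illustration.
    return "media%05d%s" % (number, suffix)
```

A less brittle version would sniff the payload type instead of assuming mp3 and mp4.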

Thanks,
