KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 60

DiapDealer · 07-10-2014, 10:46 AM

When it comes to opf tag attributes, is there any logic built-in to exclude the stuff from a version 3 opf that isn't valid in a version 2 opf?

I know an un-influenced re-creation of the original source has always been (and should be) a higher priority than epub spec adherence, so perhaps it would make more sense if some of these epub3-only properties/attributes could be used to enhance the new auto-detection feature? Last time I checked, it seemed auto-detect only checked for the presence of a couple of fixed-layout properties to make its decision.

I have no idea what it might entail, but I'd like to see a more robust (possibly even heuristic) approach to ensure that the same sort of source that went in is coming back out (if auto-detect is selected).

KevinH · 07-10-2014, 11:11 AM

Hi tkeo,

Quote:

Originally Posted by tkeo

I have never seen one, but an id attribute might appear in an itemref tag.
http://www.idpf.org/epub/301/spec/ep...c-itemref-elem

I would remove all of those id= attributes, since we would not have a complete set and therefore they can't be used for actual references. The same is true for the "refines" use of id=: the refines can't be decoded without guessing what the id properties would have been on the title, the creator, etc. Since those are lost, we can't rebuild the refines either.

That is why I store ALL of the extra metadata from the RESC inside a comment. So no need to strip out the original coverpage info as well.
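
To make the idea concrete, here is a minimal Python 3 sketch of stripping id= from itemref tags and wrapping the unusable RESC metadata in a comment. The helper names strip_itemref_ids and as_comment are made up for illustration and are not taken from mobi_k8resc.py; the sketch also assumes the spine is available as a plain XML string.

```python
import re

# Hypothetical helpers, not the actual mobi_k8resc.py code.
# Matches a standalone id="..." or id='...' attribute (but not idref=).
_ID_ATTR = re.compile(r'''\s+id\s*=\s*("[^"]*"|'[^']*')''')


def strip_itemref_ids(spine_xml):
    """Drop id= from every itemref tag, since the id set is incomplete."""
    def _clean(match):
        return _ID_ATTR.sub('', match.group(0))
    return re.sub(r'<itemref\b[^>]*>', _clean, spine_xml)


def as_comment(extra_metadata):
    """Wrap leftover metadata in an XML comment for later hand editing."""
    # XML comments may not contain "--", so soften it before wrapping.
    return '<!-- ' + extra_metadata.replace('--', '- -') + ' -->'
```

Keeping the leftovers in a comment means nothing from the RESC is thrown away, while the generated opf stays free of dangling references.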

Quote:

Besides, it makes it simpler to generate itemref tags in mobi_opf.py.
So, I prefer to go with B).

I am fine with that, but as I said, incomplete content.opf pieces with incomplete id= attributes make using those ids impossible in general.

Quote:

As for properties, it is allowed to have multiple values, e.g.

Code:

<itemref idref="titlepage" properties="page-spread-right rendition:layout-pre-paginated"/>

This is just a note. We can store it as a string in a dict.

I have replaced spine_pageprops and spine_linear with spine_pageattributes, prepending 'x_' to the cover_name and id attributes.

It was parsed properly and stored as a string anyway. But I understand your motivations and I am okay with that approach, although I would rather we strip out any id= since they are unusable as I explained above.

Quote:

Although it is a slightly irregular approach, I have also modified the code to insert a cover page when the RESC does not exist or the spine is not in the RESC.

No, I don't like that approach, as it abuses the fileinfo key field. I would rather simply hard-code linear="no" for this case inside the opf itself and not abuse the key field in fileinfo that way. Since we are creating the cover page ourselves, we can always mark it as auxiliary (the "no"), as it truly does not need to be shown first in the flow.
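
A rough Python sketch of what hard-coding linear="no" for a generated cover page could look like. The function build_spine, its arguments, and the simplified spine tag (no toc= attribute) are assumptions for illustration only, not the mobi_opf.py code.

```python
def build_spine(itemrefs, cover_id=None):
    """Return spine lines; itemrefs is a list of (idref, attribute-string) pairs."""
    lines = ['<spine>']
    if cover_id is not None:
        # The generated cover page is auxiliary, so it is always linear="no".
        lines.append('<itemref idref="%s" linear="no"/>' % cover_id)
    for idref, attrs in itemrefs:
        extra = ' ' + attrs if attrs else ''
        lines.append('<itemref idref="%s"%s/>' % (idref, extra))
    lines.append('</spine>')
    return lines
```

The cover entry is decided entirely inside the opf writer here, so no special key has to be smuggled through the file list.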

Thanks,

KevinH

KevinH · 07-10-2014, 11:33 AM

Hi DiapDealer,

Good point! I have not looked at the epub3 and auto-detection code yet, and it will need to be updated as well. The package tag and its version are included in some RESC sections, and the current k8resc can easily parse the tag if present and use its value to help auto-detection.

I also think that epub version (or A for auto) should be passed into k8resc as well to help it clean up and remove any epub 3 pieces if the user requests epub2, since most of them come in via the RESC info.

One main problem is the damn refines in epub3 metadata, they use and reference the original "id=" properties on the title, the creator, and on other things but these are all stripped away when that info becomes the EXTH equivalent. The remnants do seem to make it into the RESC but the ids being referenced by the refines are long gone and we can only guess as to which creator or title or whatever they actually refer to.

So that is something that can only be fixed by hand editing after the reconstruction.

I will try to take a look at passing epubver into mobi_k8resc.py and see if I can add some auto-detect code and clean-up code for when we are down-versioning.
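
As an illustration of that clean-up step, here is a small Python sketch that drops epub3-only itemref attributes when epub2 is requested. The name clean_itemref, the string values '2', '3' and 'A' for epubver, and the whitelist of kept attributes are all assumptions, not the real mobi_k8resc.py interface.

```python
import re

# Assumed whitelist of itemref attributes to keep for an epub2 opf.
EPUB2_ITEMREF_ATTRS = ('idref', 'linear')

# Matches name="value" or name='value' pairs inside a tag.
_ATTR = re.compile(r'''([\w:.-]+)\s*=\s*("[^"]*"|'[^']*')''')


def clean_itemref(tag, epubver):
    """Hypothetical clean-up: drop epub3-only attributes when epubver is '2'."""
    if epubver != '2':
        return tag
    kept = [name + '=' + value for name, value in _ATTR.findall(tag)
            if name in EPUB2_ITEMREF_ATTRS]
    return '<itemref ' + ' '.join(kept) + '/>'
```

With epubver set to '3' or 'A' the tag passes through untouched, so an un-influenced re-creation of the source remains the default.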

Thanks,

KevinH

ps, the new mobi_k8resc parse code should look familiar as it is a tweaked version of the old mobiml2html parser we used!


DiapDealer · 07-10-2014, 01:06 PM

Quote:

Originally Posted by KevinH

ps, the new mobi_k8resc parse code should look familiar as it is a tweaked version of the old mobiml2html parser we used!

Good times! I'll have to check it out.

KevinH · 07-10-2014, 01:55 PM

Hi DiapDealer and tkeo, and all:

I need some help on what should be considered an epub 3 feature when auto-detect is used?

Here is what I think so far based on what tkeo had previously:

1. "fixed-layout" (EXTH item 122) in the metadata
2. "page-progression-direction" (EXTH item 527) in the metadata
3. "primary-writing-mode" (EXTH item 525) in the metadata and it ends with "rl"
4. RESC itemrefs have "properties"

I would like to add:
5. RESC package version exists and startswith "3"
6. RESC "spine" has "page-progression-direction" (think tkeo used that as well?)
7. RESC metadata uses "refines"
8. RESC metadata uses meta property= attributes

Are there any others we should add? Are there particular version tags or strings in the metadata that only exist/work for epub3 that we could look for when parsing the RESC?
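
The eight checks above could be folded into a single predicate. This is only a heuristic Python sketch: it assumes metadata is a dict mapping names to lists of strings, that the RESC section is available as a plain XML string, and the name looks_like_epub3 plus the regex-based matching are purely illustrative.

```python
import re


def looks_like_epub3(metadata, resc_xml=''):
    """Heuristic sketch of checks 1-8; metadata maps names to lists of strings."""
    if 'fixed-layout' in metadata:                         # 1 (EXTH 122)
        return True
    if 'page-progression-direction' in metadata:           # 2 (EXTH 527)
        return True
    modes = metadata.get('primary-writing-mode', [])       # 3 (EXTH 525)
    if any(mode.endswith('rl') for mode in modes):
        return True
    patterns = (
        r'<itemref\b[^>]*\bproperties\s*=',                # 4
        r'<package\b[^>]*\bversion\s*=\s*["\']3',          # 5
        r'<spine\b[^>]*\bpage-progression-direction\s*=',  # 6
        r'<meta\b[^>]*\brefines\s*=',                      # 7
        r'<meta\b[^>]*\bproperty\s*=',                     # 8
    )
    return any(re.search(p, resc_xml) for p in patterns)
```

Any single hit is treated as enough to pick epub3 in this sketch; a weighted score would be the next step if false positives turn up.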

Thanks,

KevinH

KevinH · 07-10-2014, 03:38 PM

Hi tkeo,

I would like to remove your conversion of Amazon metadata to epub3 as it seems wasteful to re-parse the metadata string we just created (and thereby eliminate taglist).

Instead, I would like to determine the epub version by looking in metadata and k8resc first to determine the target version, then properly building the correct metadata the very first time for that specific version.

That should make everything easier I think.

I have taken a shot at pre-determining the epub version when the opf object is created (if not already specified), building the metadata only once to meet the target version, removing the taglist and mobi_taglist.py completely, and polishing up a few things so that epub3 should now at least work.

I think we are getting close to having a finished product once we remove the remaining redundancy in mobi_opf.py.

Hopefully within another day or two we will have something to release.

Please see the attached KindleUnpack_v072y_test.zip and let me know what you think.

Take care,

KevinH

tkeo · 07-11-2014, 09:11 AM

Hi,

Quote:

Originally Posted by KevinH

Here is what I think so far based on what tkeo had previously:

1. "fixed-layout" (EXTH item 122) in the metadata
2. "page-progression-direction" (EXTH item 527) in the metadata
3. "primary-writing-mode" (EXTH item 525) in the metadata and it ends with "rl"
4. RESC itemrefs have "properties"

I would like to add:
5. RESC package version exists and startswith "3"
6. RESC "spine" has "page-progression-direction" (think tkeo used that as well?)
7. RESC metadata uses "refines"
8. RESC metadata uses meta property= attributes

Are there any others we should add? Are there particular version tags or strings in the metadata that only exist/work for epub3 that we could look for when parsing the RESC?

I have not used 6. yet.
Additionally, we can probably use,

9. "orientation-lock" (EXTH item 124) in the metadata has "portrait" or "landscape"
Note: "original-resolution" (EXTH item 126) requires "fixed-layout" to be "true"
10. "Title file-as" (EXTH item 508) in the metadata
11. "Creator file-as" (EXTH item 517) in the metadata
12. "Publisher file-as" (EXTH item 522) in the metadata
13. RESC metadata uses "rendition:" prefix
14. RESC metadata tag is <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"> instead of <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf" xmlns="http://www.idpf.org/2007/opf">

Thanks,

KevinH · 07-11-2014, 10:47 AM

Hi tkeo,

Since the only possible values of EXTH orientation-lock are portrait or landscape, I will simply look to see if it is in the metadata.keys().

Items 10, 11, 12 are really just extensions of epub2 metadata, so I will not force things to epub3 for using them.

As for 14, I already check that any of the new meta tags with "property" are present so I guess this would catch all of these as well. I do look for the rendition namespace in the package attributes though.

Thanks,

Kevin

ps, I will be working on removing the redundancy from mobi_opf.py and then focusing on meta data more fully.

Take care,

KevinH


tkeo · 07-11-2014, 11:10 AM

Hi Kevin,

I have also been working on reducing redundancy.
I have not finished yet and there are bugs.
I have attached my version just for reference.

Take care,
tkeo

KevinH · 07-11-2014, 02:28 PM

Hi tkeo, (FYI: other co-developers and testers)

Thanks for that bug fix in decoding package tags in mobi_k8resc.py

I have tried to incorporate your mobi_opf.py changes into mine. We were pretty close on many things. I see you want to break out epub3 metadata from the general case and we can do that later, but right now I went with an integrated one we already had.

So attached is KindleUnpack_v072z_test.zip which we can start heavy testing on to make sure nothing is broken for PrintReplica and older Mobis as well as epub2 and epub3.

For epub3 I have added automatic generation of dcterms:modified to meet the minimum epub3 metadata spec.
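
For reference, a minimal Python 3 sketch of producing that tag. The function name dcterms_modified is hypothetical and this is not the code in the test zip; the only firm requirement is that epub3 wants a UTC timestamp of the form CCYY-MM-DDThh:mm:ssZ.

```python
from datetime import datetime, timezone


def dcterms_modified(now=None):
    """Return the epub3 last-modified meta tag; 'now' is an optional aware datetime."""
    if now is None:
        now = datetime.now(timezone.utc)
    # epub3 expects UTC in the form CCYY-MM-DDThh:mm:ssZ.
    stamp = now.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
    return '<meta property="dcterms:modified">%s</meta>' % stamp
```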

So hopefully, all that remains is some bug hunting and corner cases to resolve and we can make this a public release!

After that, if you want, you can start on your epub3-specific metadata changes and hopefully try to figure out a way to fix the refines info and integrate the RESC extra metadata into the final product without needing to comment it out.

I have run out of free time recently so hopefully you can take the lead on all of that after we get any bugs ironed out and a stable v073 release made and available to all.

Thanks for all of your hard work on this!

Edit: I just finished studying the epub 3 metadata and it disallows all opf: prefixes like file-as, role and schemes. Therefore the dc:identifier is different under epub3 for urn:uuid, isbn, etc.

So you were right: we do need an epub 3 specific metadata routine even for the basics, just to handle the refines of file-as and role and the identifiers properly, even for the most basic EXTH values and not just for fixed-layout and related things.
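
To show the difference in practice, here is a Python sketch that emits a dc:creator in either style. The function creator_tags and the id value creator01 are invented for this example; only the two output shapes (opf: attributes for epub2, refines metas for epub3) come from the specs.

```python
from xml.sax.saxutils import escape, quoteattr


def creator_tags(name, file_as, role, epubver, ident='creator01'):
    """Sketch: emit dc:creator in epub2 or epub3 style; 'ident' is a made-up id."""
    if epubver == '3':
        # epub3: no opf: attributes; details hang off the id via refines.
        return [
            '<dc:creator id=%s>%s</dc:creator>' % (quoteattr(ident), escape(name)),
            '<meta refines=%s property="file-as">%s</meta>'
            % (quoteattr('#' + ident), escape(file_as)),
            '<meta refines=%s property="role" scheme="marc:relators">%s</meta>'
            % (quoteattr('#' + ident), escape(role)),
        ]
    # epub2: the same details are opf:-prefixed attributes on the element.
    return ['<dc:creator opf:file-as=%s opf:role=%s>%s</dc:creator>'
            % (quoteattr(file_as), quoteattr(role), escape(name))]
```

Because the epub3 form needs a fresh id per element, the writer has to generate ids itself rather than rely on whatever survived in the RESC.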

I will play around with this a bit too.

Take care,

KevinH


tkeo · 07-12-2014, 01:24 AM

Hi,

In the Calibre KindleUnpack Plugin thread, an error is reported.
https://www.mobileread.com/forums/sho...&postcount=215
https://www.mobileread.com/forums/sho...&postcount=225

Quote:

calibre, version 1.43.0
ERROR: KindleUnpack - The Plugin v0.67.0: cannot fit 'long' into an index-sized integer

Traceback (most recent call last):
File "calibre_plugins.kindleunpack_plugin.extraction", line 192, in unpack_ebook
File "calibre_plugins.kindleunpack_plugin.utilities", line 283, in unpackMOBI
File "calibre_plugins.kindleunpack_plugin.kindleunpack.kindleunpack", line 1638, in unpackBook
File "calibre_plugins.kindleunpack_plugin.kindleunpack.kindleunpack", line 1335, in process_all_mobi_headers
IndexError: cannot fit 'long' into an index-sized integer

Quote:

calibre, version 1.44.0
ERROR: KindleUnpack - The Plugin v0.72.1: cannot fit 'long' into an index-sized integer

Traceback (most recent call last):
File "calibre_plugins.kindleunpack_plugin.extraction", line 192, in unpack_ebook
File "calibre_plugins.kindleunpack_plugin.utilities", line 283, in unpackMOBI
File "calibre_plugins.kindleunpack_plugin.kindleunpack.kindleunpack", line 895, in unpackBook
File "calibre_plugins.kindleunpack_plugin.kindleunpack.kindleunpack", line 805, in process_all_mobi_headers
File "calibre_plugins.kindleunpack_plugin.kindleunpack.kindleunpack", line 168, in renameCoverImage
IndexError: cannot fit 'long' into an index-sized integer

My guess is that the error occurs because the CoverOffset EXTH value is out of the range of an int. I think it can be fixed by replacing

Code:

i = int(metadata['CoverOffset'][0])
if imgnames[i] is not None:

to

Code:

i = int(metadata['CoverOffset'][0])
if i >= 0 and i < len(imgnames) and imgnames[i] is not None:

and

Code:

imageNumber = int(metadata['CoverOffset'][0])
cover_image = self.imgnames[imageNumber]

to

Code:

imageNumber = int(metadata['CoverOffset'][0])
if imageNumber >= 0 and imageNumber < len(self.imgnames):
    cover_image = self.imgnames[imageNumber]

The reported versions are v0.67 and v0.72.1 (core version v.72a); however, the latest version probably has the same code.
I will post a fixed test version to the Calibre Plugin thread to see whether it works.

I have no idea why the CoverOffset EXTH has such a value. Does this need to be fixed in the latest version?
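
The two range checks above could share one helper. A minimal Python sketch, assuming metadata maps names to lists of strings and imgnames is a list; the name cover_image_name is made up for illustration.

```python
def cover_image_name(metadata, imgnames):
    """Return the cover image name, or None if CoverOffset is absent or out of range."""
    try:
        index = int(metadata['CoverOffset'][0])
    except (KeyError, IndexError, ValueError):
        return None
    # A placeholder such as 4294967295 simply falls outside the list.
    if 0 <= index < len(imgnames):
        return imgnames[index]
    return None
```

Callers then only need to test the result for None instead of repeating the bounds check at each site.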

Thanks,

tkeo · 07-12-2014, 05:36 AM

Hi,

Quote:

Originally Posted by KevinH

Since the only possible values of EXTH orientation-lock are portrait or landscape, I will simply look to see if it is in the metadata.keys().

I have made a book whose EXTH orientation-lock is none.

EDIT: We can set any value for EXTH orientation-lock through

Code:

<meta name="orientation-lock" content="XXXX"/>

But valid values are portrait and landscape.

And do you know how to set values for EXTH item 508 (Title file-as), EXTH item 517 (Creator file-as) and EXTH item 522 (Publisher file-as)? They do not seem to be converted from refines meta tags.

Thanks,

KevinH · 07-12-2014, 08:41 AM

Hi tkeo,

Quote:

Originally Posted by tkeo

Hi,

I have made a book whose EXTH orientation-lock is none.

EDIT: We can set any value for EXTH orientation-lock through

Code:

<meta name="orientation-lock" content="XXXX"/>

But valid values are portrait and landscape.

Understood, we can change that setting to check only for valid values to determine if EPUB 3.

Quote:

And do you know how to set values for EXTH item 508 (Title file-as), EXTH item 517 (Creator file-as) and EXTH item 522 (Publisher file-as)? They do not seem to be converted from refines meta tags.

I found that out through my own testing yesterday. I don't know where those tag meanings came from or who reverse-engineered them. I have searched and none of my Amazon ebooks have any of those EXTH tags set. I also ran "strings" on kindlegen and grepped the output, and the term "file-as" is not even found in that binary. Perhaps some other kindlegen version, or kindletool, or kdp sets that EXTH. That, or whoever added that info was incorrect. Let's just ignore those EXTH values until we know something more.

Take care,

Kevin

AcidWeb · 07-12-2014, 08:45 AM

orientation-lock has three valid values: portrait, landscape and none

primary-writing-mode has four: horizontal-lr, horizontal-rl, vertical-lr, vertical-rl

There is also a boolean RegionMagnification that informs the reader whether pages in the book have Panel View code embedded.
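
Since arbitrary strings can end up in these EXTH fields, a value check against the lists above might look like the following Python sketch. The constant and function names are hypothetical, and metadata is assumed to map names to lists of strings.

```python
# Valid values as listed in this post.
ORIENTATION_LOCK = ('portrait', 'landscape', 'none')
WRITING_MODES = ('horizontal-lr', 'horizontal-rl', 'vertical-lr', 'vertical-rl')


def first_valid(metadata, key, allowed):
    """Return the first value stored under key if it is allowed, else None."""
    values = metadata.get(key, [])
    if values and values[0] in allowed:
        return values[0]
    return None
```

For epub 3 detection one would probably pass only ('portrait', 'landscape') as the allowed set, since a lock of none says nothing about the source.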

KevinH · 07-12-2014, 08:46 AM

Hi,

My bet is that the CoverOffset EXTH was set to 0xffffffff, which is often used as a placeholder for missing values in MobiHeaders. The size field of the EXTH value must be corrupt or broken. I would rather detect that during EXTH parsing and leave the code as is.

Please try running DumpMobiHeader_v016 or later on the problem ebook to check the field size and the unsigned hex value.
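
Catching this during EXTH parsing could be sketched in Python as below. The function exth_number and the choice to return None are assumptions for illustration, not the existing KindleUnpack parser; the sketch relies only on Mobi header fields being big-endian.

```python
import struct

MISSING = 0xffffffff  # placeholder commonly used for "no value" in Mobi headers


def exth_number(content):
    """Decode a 1-, 2- or 4-byte big-endian EXTH value; None if missing or odd-sized."""
    formats = {1: '>B', 2: '>H', 4: '>L'}
    fmt = formats.get(len(content))
    if fmt is None:
        # Unexpected field size: treat the record as unusable.
        return None
    value, = struct.unpack(fmt, content)
    if len(content) == 4 and value == MISSING:
        return None
    return value
```

Returning None at parse time keeps the placeholder out of the metadata dict, so the cover code downstream would never see an out-of-range index.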

Thanks,

KevinH

