KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 57

KevinH · 06-29-2014, 10:56 AM

Hi,

Yes, please go ahead and post just the new mobi_split.py code if there is a significant speed improvement. That way people can drop it in whatever version of KindleUnpack they currently use for testing purposes.

I will take a look at the opf code and all of your new routines and let you know what I think. I'll be traveling for the next week and out of reach of the internet. So please take a shot at whatever approach you feel is best and I will do the same and we can compare approaches upon my return.

Thanks,

KevinH

Quote:

Originally Posted by tkeo

Hi Kevin,

I feel the necessity of discussion to modify 'mobi_opf.py,' so I have not touched.
Followings are in my mind,

1. The most redundant part is manifest generation. Is it better to merge buildOPFManifest() and buildEPUB2OPFManifest() to buildEPUB3OPFManifest(), and rename buildEPUB3OPFManifest()? Personally, I prefer to merge.

2. I think it is better to merge writeK8OPF() to writeOPF().

3. Keeping buildOPF() to be separated because merging it cause many 'if statments' and decrease readability, it may better to merge buildEPUB2OPF() and buildEPUB3OPF() to buildK8OPF().

4. removing constants OPF_NAME, TOC_NCX and NAVIGATION_DOCUMENT, and changing to hard code due to almost no necessity.

Besides, I have modified 'mobi_split.py' to make processing faster. I have considered to post it after v0.73 release in oder to avoid more testing. But I will post it if some one want.

Thanks,

JSWolf · 06-29-2014, 10:59 AM

Quote:

Originally Posted by KevinH

Hi,

In short no. I need that info to help debug new metadata and features being implemented by Amazon. It will not hurt or impact anything in the epub.

Why do you want it removed? It just shows what the metadata was inside the original mobi. Kindlegen will ignore it if you pass that epub back through kindlegen.

KevinH

I don't care about that metadata in the AZW3 itself. I'm just looking for things to be automated so when the code shift to ePub is done, that data is dumped as ePub has no need of it at all. All it is for the ePub is code bloat.

KevinH · 06-29-2014, 11:09 AM

Hi,
As I said, it is not going to happen. The extra metadata will not hurt anything and shows what was in the original azw3 which helps when diagnosing new Kindlegen features.

This tool is not really an azw3 to epub converter because it is not guaranteed to even generate an epub that meets spec. It is meant to unpack the AZW3/Mobi file so that modifications can be made, html/css code differences can be detected, etc, and then passed back thru kindlegen to create a new azw3/mobi. Any epub-like structure generated by KindleUnpack should be tested, and edited in Sigil (or any text editor) and fixed. During that process feel free to hack any unwanted metadata out.

KevinH

Quote:

Originally Posted by JSWolf

I don't care about that metadata in the AZW3 itself. I'm just looking for things to be automated so when the code shift to ePub is done, that data is dumped as ePub has no need of it at all. All it is for the ePub is code bloat.

DiapDealer · 06-29-2014, 11:15 AM

Quote:

Originally Posted by JSWolf

I don't care about that metadata in the AZW3 itself. I'm just looking for things to be automated so when the code shift to ePub is done, that data is dumped as ePub has no need of it at all. All it is for the ePub is code bloat.

As mentioned in my post, KindleUnpack's main goal is not (directly) about handy-dandy format shifting (nor is it about creating the sleekest ePubs). The metadata is staying. That's pretty-much all there is to say. You'll just have to delete it if it bugs you that badly.

KevinH · 06-29-2014, 03:19 PM

Hi tkeo,

I have started studying the RESC, new OPF, and taglist code, and I had some questions:

1. It looks like there is a lot of code that is simply grabbing the original idref from the RESC section and then trying to make sure that none of that duplicates anything we use. Why is it important or at all useful to keep the idrefs from the RESC and re-use them in the new opf?

Given we do not know the original file names, the original idrefs seem more than a bit meaningless. It would make the code much easier to follow and support if we simply ignore all of these original idrefs from the RESC and simply sequentially number ours as was done by KindleUnpack originally. That should simplify or eliminate the need for the code to get all of the skeleton/partinfo from k8proc, shouldn't it?

2. Can we please move parseK8RESC() out of mobi_opf and into k8resc so that the k8resc object encapsulates all RESC decoding. The opf routine can call into k8resc to get and add the extra metadata information as needed. We do not need to keep the distinction as to the source (it will either be from the mobi exth header or the RESC, and so it shouldn't matter).

3. Do we really need to use a full blown HTMLParser() and all of the additional regular expression code just to parse the RESC section? This seems overkill at best. The problem with HTMLParsers in general is that they are not robust and can easily freak out over improper bytes (i.e leave an imbedded null lying around in the html/xml file and watch them barf all over it). And it almost looks like your regular expression metadata parsing code is trying to act like a full blown HTMLParser of some sort instead of just extracting what you want. Perhaps this is the only way but I am not convinced yet. So isn't some way to simply walk the RESC data and extract what we want more simply with much less code in general? Edit: Please see my attached simple proof of concept code.

4. There really is way too much overlap in all of the code for the various use cases in mobi_opf. There really is no reason to split opf writing out for mobi7 vs mobi8 for epub 2. Our original routine handled that case just fine.

So I envision one routine for both mobi7 and mobi8 epub 2 with calls out to shared support routines which we will pull out and identify much like you have already done. And a second separate routine for mobi8 epub3 with calls out to many of the same shared support routines. How does that sound?

We want to keep KindleUnpack as simple and straightforward as possible with as little fluff as possible to make support and learning from the code easy to do.

Please let me know what you think?

ps.:

As a ***very rough*** proof of concept I threw together a very simple program to parse the RESCXXXXX.dat files generated by KindleUnpack.py. It does not do anything other than parse the RESCXXXX.dat file. But it returns the prefix and path and a dictionary with all of the attributes and any related content. Once we throw out all of the main routine, and utf8 parsing of the command line nonsense, you will see it is quite small and easy to adapt as we see fit.
So we simply walk the RESC file tag by tag checking for the tags and things we want to further process in some way and build the tables and things needed inside k8resc.

For example, a list of skelids (which are kf8 skeleton numbers) and their page-properties could easily be generated on the fly, as could extracting the meta data into whatever form we want (epub 2 or epub 3) and storing it in the correct form in the k8resc object to be called for either in the main routine or in the opf as needed.

Please run it on one of the RESCXXXXX.dat files from one of your more complicated epub3 based ebooks converted to mobi and let me know if you think this type of very simple approach will work for us.

Thanks,

KevinH

NiLuJe · 06-29-2014, 09:42 PM

@Doitsu: Yup, it's a byproduct of the new 'PiP' chapter browsing since FW 5.4. Tap the lower left corner of the screen to toggle between Locations/Pages/Time Left in Chapter/Time Left In Book/Nothing

.

Doitsu · 06-30-2014, 05:10 AM

Quote:

Originally Posted by NiLuJe

@Doitsu: Yup, it's a byproduct of the new 'PIP' chapter browsing since FW 5.4. Tap the lower left corner of the screen to toggle between Locations/Pages/Time Left in Chapter/Time Left In Book/Nothing

.

Thanks! Yet another mystery solved.

tkeo · 06-30-2014, 07:44 AM

Quote:

Originally Posted by KevinH

1. It looks like there is a lot of code that is simply grabbing the original idref from the RESC section and then trying to make sure that none of that duplicates anything we use. Why is it important or at all useful to keep the idrefs from the RESC and re-use them in the new opf?

Given we do not know the original file names, the original idrefs seem more than a bit meaningless. It would make the code much easier to follow and support if we simply ignore all of these original idrefs from the RESC and simply sequentially number ours as was done by KindleUnpack originally. That should simplify or eliminate the need for the code to get all of the skeleton/partinfo from k8proc, shouldn't it?

Your are quite right. Retrieval of original idrefs causes much complication of the code; however, I'd like to keep it in KindleUnpack, in order to reconstruct the source more precisely.
In addition, to get cover page creation condition needs skelid in RESC and part in k8proc.

Quote:

2. Can we please move parseK8RESC() out of mobi_opf and into k8resc so that the k8resc object encapsulates all RESC decoding. The opf routine can call into k8resc to get and add the extra metadata information as needed. We do not need to keep the distinction as to the source (it will either be from the mobi exth header or the RESC, and so it shouldn't matter).

I will do so.

Quote:

3. Do we really need to use a full blown HTMLParser() and all of the additional regular expression code just to parse the RESC section? This seems overkill at best. The problem with HTMLParsers in general is that they are not robust and can easily freak out over improper bytes (i.e leave an imbedded null lying around in the html/xml file and watch them barf all over it). And it almost looks like your regular expression metadata parsing code is trying to act like a full blown HTMLParser of some sort instead of just extracting what you want. Perhaps this is the only way but I am not convinced yet. So isn't some way to simply walk the RESC data and extract what we want more simply with much less code in general? Edit: Please see my attached simple proof of concept code.

(Though the code I have written has limitations and cannot parse HTML fully. It can parse single layered elements only...)

I have run your code. It's fine!! Please change to your code although more improvement of the code is required to parse RESC. Be careful to parse comments especially multi-line ones.

At the start point, I have no knowledge what is necessary to retrieve from RESC, so I have made that.But now I know there is no necessity of it. I have once considered to change to use functions in mobi_taglist.py a little bit simpler than Metadata class.

Quote:

4. There really is way too much overlap in all of the code for the various use cases in mobi_opf. There really is no reason to split opf writing out for mobi7 vs mobi8 for epub 2. Our original routine handled that case just fine.

So I envision one routine for both mobi7 and mobi8 epub 2 with calls out to shared support routines which we will pull out and identify much like you have already done. And a second separate routine for mobi8 epub3 with calls out to many of the same shared support routines. How does that sound?

We want to keep KindleUnpack as simple and straightforward as possible with as little fluff as possible to make support and learning from the code easy to do.

I prefer to split mobi7 vs mobi8(epub2 and epub3) because the epub2 opf is more similar to epub3 opf than mobi7.
But I think the main reason of the difference of the opinion between mine and yours is because I am a new comer and not familiar to the older versions of the KindleUnpack. So, please choose one as you like.

Thanks,

tkeo · 06-30-2014, 08:17 AM

Hi,

This is the faster version of mobi_split.py.

It can be switched original code and modified one, by FAST_MODE constant.

Below is the example of the improvement.

test file: HDimage_test.mobi, 16MB including src.
Attached in https://www.mobileread.com/forums/showpost.php?p=2851879&postcount=779

org
mobi_split: mobi7 processing time 0.08s
mobi_split: mobi8 processing time 0.24s

modified
mobi_split: mobi7 processing time 0.05s
mobi_split: mobi8 processing time 0.06s

Thanks,

KevinH · 06-30-2014, 10:01 AM

Hi tkeo,

Quote:

Originally Posted by tkeo

Your are quite right. Retrieval of original idrefs causes much complication of the code; however, I'd like to keep it in KindleUnpack, in order to reconstruct the source more precisely.
In addition, to get cover page creation condition needs skelid in RESC and part in k8proc.

But why keep it? The idrefs are just temporary tags and they are truly meaningless in that they convey no information other than uniqueness. They don't help make reconstructing the source more precisely since any unique id work to do that. You can not use them to deduce anything new that we actually need.

That means the code that tries to generate uniqueness of idrefs is just a source of potential problems, and adds code complications and code size for no true added benefit. Please remove it unless you can demonstrate that using the idrefs actually improves functionality in some way. For the cover, you can parse the idref and then let opf give it our own unique idref with no extra code.

Thanks,

KevinH

KevinH · 06-30-2014, 10:19 AM

Hi tkeo,

Quote:

Originally Posted by tkeo

I have run your code. It's fine!! Please change to your code although more improvement of the code is required to parse RESC. Be careful to parse comments especially multi-line ones.

Please let me know what improvements you think we still need to just *PARSE* the RESC? Please be specific or post a problem testcase and I will be happy to modify it to add that support. I realize that much code must be added after the parsing stage to do the real work so I am just asking about the parsing stage itself.

Edit: Ah, I see tricks can be played using a multiline comment to hide or invalidate a block of tags! So I will have to special case comments even more when parsing to grab everything up to the "-->" no matter how far ahead it is.

If you have an RESCxxxxx.dat file that is very complicated, I would love to have it as a testcase. I won't need the entire book just the RESCxxxxx.dat file. Thanks.

My idea is that we change the parseRESC code to use a loop with yield so we can then create our own RESC iterator. Then using one loop in k8resc we check each tag name sequentially that exists in the RESC, and process the meta data we need building up an epub2 or 3 version on the fly, find the cover info, get the spine attributes, and grab the skelids and any properties associated with them and store all of this away in the k8resc object for later retrieval.

That should handle everything we need correct?

Quote:

I prefer to split mobi7 vs mobi8(epub2 and epub3) because the epub2 opf is more similar to epub3 opf than mobi7.

The mobi7 is just a much simpler subset of the mobi8 epub2 one. If you want to keep it separate then please do. But I thought you said that splitting out epub3 from epub2 was more important to create fewer ifs and simpler/cleaner code?

I am willing to do whatever you feel is best in the opf code as long as it reduces redundancies and creates simpler code overall.

Thanks!

KevinH

tkeo · 07-01-2014, 08:10 AM

Hi Kevin,

Quote:

Originally Posted by KevinH

But why keep it? The idrefs are just temporary tags and they are truly meaningless in that they convey no information other than uniqueness. They don't help make reconstructing the source more precisely since any unique id work to do that. You can not use them to deduce anything new that we actually need.

I have books in which idrefs have meaningful names, for example: p-cover, p-titlepage, p-caution, p-copyright.
Indeed they do not improve functionality, but is it not enough for reason to keep them?

Thanks,

tkeo · 07-01-2014, 09:16 AM

Hi Kevin,

Quote:

Originally Posted by KevinH

If you have an RESCxxxxx.dat file that is very complicated, I would love to have it as a testcase. I won't need the entire book just the RESCxxxxx.dat file. Thanks.

I attach a RESC.dat. Please test it.

Quote:

My idea is that we change the parseRESC code to use a loop with yield so we can then create our own RESC iterator. Then using one loop in k8resc we check each tag name sequentially that exists in the RESC, and process the meta data we need building up an epub2 or 3 version on the fly, find the cover info, get the spine attributes, and grab the skelids and any properties associated with them and store all of this away in the k8resc object for later retrieval.

That should handle everything we need correct?

I think so perphaps. I will consider more carefully later.

Could you use OrderedDict in order to keep attribues order?
It will make easier to check bugs by comparing reconstructed opfs between KindleUnpack versions using diff.

Quote:

The mobi7 is just a much simpler subset of the mobi8 epub2 one. If you want to keep it separate then please do. But I thought you said that splitting out epub3 from epub2 was more important to create fewer ifs and simpler/cleaner code?

I am willing to do whatever you feel is best in the opf code as long as it reduces redundancies and creates simpler code overall.

I think I meant to said splitting out mobi7 from epub2.

I will reconsider this.
But I have forgotten about auto detection of epub version at all.
We might come to one opf generation function which calls sub-functions in the end.

Thanks,

tkeo · 07-02-2014, 06:34 AM

Hi Kevin,

Quote:

Originally Posted by Kevin

So I envision one routine for both mobi7 and mobi8 epub 2 with calls out to shared support routines which we will pull out and identify much like you have already done. And a second separate routine for mobi8 epub3 with calls out to many of the same shared support routines. How does that sound?

I come to the conclusion that this is the best too.
I am sorry for bothering you.

Have a good travel!
tkeo

tkeo · 07-02-2014, 08:00 AM

Quote:

Originally Posted by tkeo

This is the faster version of mobi_split.py.

I test larger mobi. Here is the result.

test file: about 50MB, 300images, nosouce

org
mobi_split: mobi7 processing time 0.70s
mobi_split: mobi8 processing time 22.80s

modified
mobi_split: mobi7 processing time 0.56s
mobi_split: mobi8 processing time 0.28s

06-29-2014, 09:42 PM	#846
NiLuJe BLAM! Posts: 13,477 Karma: 26012494 Join Date: Jun 2010 Location: Paris, France Device: Kindle 2i, 3g, 4, 5w, PW, PW2, PW5; Kobo H2O, Forma, Elipsa, Sage, C2E	@Doitsu: Yup, it's a byproduct of the new 'PiP' chapter browsing since FW 5.4. Tap the lower left corner of the screen to toggle between Locations/Pages/Time Left in Chapter/Time Left In Book/Nothing . Last edited by NiLuJe; 06-30-2014 at 12:33 PM.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Can i rotate text and insert images in Mobi and EPUB?	JanGLi	Kindle Formats	5	02-02-2013 04:16 PM
PDF to Mobi with text and images	pocketsprocket	Kindle Formats	7	05-21-2012 07:06 AM
Mobi files - images	DWC	Introduce Yourself	5	07-06-2011 01:43 AM
pdf to mobi... creating images rather than text	Dumhed	Calibre	5	11-06-2010 12:08 PM
Transfer of images on text files	anirudh215	PDF	2	06-22-2009 09:28 AM

Advert

Advert