MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 06-29-2014, 03:19 PM

Hi tkeo,

I have started studying the RESC, new OPF, and taglist code, and I had some questions:

1. It looks like there is a lot of code that is simply grabbing the original idref from the RESC section and then trying to make sure that none of that duplicates anything we use. Why is it important or at all useful to keep the idrefs from the RESC and re-use them in the new opf?

Given we do not know the original file names, the original idrefs seem more than a bit meaningless. It would make the code much easier to follow and support if we simply ignore all of these original idrefs from the RESC and simply sequentially number ours as was done by KindleUnpack originally. That should simplify or eliminate the need for the code to get all of the skeleton/partinfo from k8proc, shouldn't it?

2. Can we please move parseK8RESC() out of mobi_opf and into k8resc so that the k8resc object encapsulates all RESC decoding? The opf routine can call into k8resc to get and add the extra metadata information as needed. We do not need to keep the distinction as to the source (it will come from either the mobi exth header or the RESC, so it shouldn't matter).

3. Do we really need to use a full-blown HTMLParser() and all of the additional regular expression code just to parse the RESC section? This seems like overkill at best. The problem with HTMLParsers in general is that they are not robust and can easily freak out over improper bytes (e.g. leave an embedded null lying around in the html/xml file and watch them barf all over it). And it almost looks like your regular expression metadata parsing code is trying to act like a full-blown HTMLParser of some sort instead of just extracting what you want. Perhaps this is the only way, but I am not convinced yet. So isn't there some way to simply walk the RESC data and extract what we want with much less code in general? Edit: Please see my attached simple proof of concept code.

4. There really is way too much overlap in all of the code for the various use cases in mobi_opf. There really is no reason to split opf writing out for mobi7 vs mobi8 for epub 2. Our original routine handled that case just fine.

So I envision one routine for both mobi7 and mobi8 epub 2 with calls out to shared support routines which we will pull out and identify much like you have already done. And a second separate routine for mobi8 epub3 with calls out to many of the same shared support routines. How does that sound?
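A rough sketch of the sequential numbering suggested in point 1, just to make the idea concrete. The function names, the "item" prefix, and the sample file names are placeholders made up for illustration; they are not taken from the existing KindleUnpack code.

```python
def assign_sequential_ids(filenames, prefix="item"):
    """Pair each file name with a generated id: item0, item1, ...

    The ids depend only on position, so nothing from the RESC
    needs to be looked up or checked for duplicates.
    """
    return [("%s%d" % (prefix, i), name) for i, name in enumerate(filenames)]


def spine_lines(id_pairs):
    """Build the <itemref> lines for the spine from the generated ids."""
    return ['<itemref idref="%s"/>' % item_id for item_id, _ in id_pairs]


# Hypothetical file names, for illustration only.
pairs = assign_sequential_ids(["part0000.xhtml", "part0001.xhtml"])
for line in spine_lines(pairs):
    print(line)
```

Because the ids are generated in order, they are unique by construction and no clash checking against the original idrefs is required.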

We want to keep KindleUnpack as simple and straightforward as possible with as little fluff as possible to make support and learning from the code easy to do.

Please let me know what you think.

ps.:

As a ***very rough*** proof of concept I threw together a very simple program to parse the RESCXXXXX.dat files generated by KindleUnpack.py. It does not do anything other than parse the RESCXXXXX.dat file. But it returns the prefix and path and a dictionary with all of the attributes and any related content. Once we throw out the main routine and the utf8 command-line parsing nonsense, you will see it is quite small and easy to adapt as we see fit.
So we simply walk the RESC file tag by tag checking for the tags and things we want to further process in some way and build the tables and things needed inside k8resc.

For example, a list of skelids (which are KF8 skeleton numbers) and their page-properties could easily be generated on the fly, as could extracting the metadata into whatever form we want (epub 2 or epub 3) and storing it in the correct form in the k8resc object, to be called for either in the main routine or in the opf as needed.
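The tag-by-tag walk and the skelid table described above could look roughly like the sketch below. This is not the attached proof of concept, only a minimal illustration of the same idea. The names walk_tags and skel_properties, the sample data, and the assumption that each itemref carries "skelid" and "properties" attributes are all hypothetical. It also assumes the RESC has already been decoded to text and that no attribute value or comment contains a ">" character.

```python
import re

# Matches name="value" or name='value' inside a tag.
_ATTR_RE = re.compile(r'''([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')''')


def walk_tags(data):
    """Yield (path, name, attrs) for each opening or self-closing tag.

    path is the dot-joined list of enclosing tag names. Stray bytes
    between tags (such as an embedded null) are simply skipped.
    """
    path = []
    pos = 0
    while True:
        start = data.find('<', pos)
        if start < 0:
            break
        end = data.find('>', start)
        if end < 0:
            break
        body = data[start + 1:end].strip()
        pos = end + 1
        if not body or body[0] in '?!':
            continue  # skip xml declarations, doctypes, and comments
        if body.startswith('/'):
            # closing tag: pop it if it matches the innermost open tag
            if path and path[-1] == body[1:].strip():
                path.pop()
            continue
        self_closing = body.endswith('/')
        if self_closing:
            body = body[:-1].rstrip()
        name = body.split(None, 1)[0]
        attrs = {}
        for m in _ATTR_RE.finditer(body[len(name):]):
            attrs[m.group(1)] = m.group(2) if m.group(2) is not None else m.group(3)
        yield '.'.join(path), name, attrs
        if not self_closing:
            path.append(name)


def skel_properties(resc_text):
    """Map each skelid to its properties string ('' when absent)."""
    table = {}
    for _path, name, attrs in walk_tags(resc_text):
        if name == 'itemref' and 'skelid' in attrs:
            table[int(attrs['skelid'])] = attrs.get('properties', '')
    return table


# Made-up sample data, not taken from a real RESC section.
sample = ('<?xml version="1.0"?><package><metadata>'
          '<meta name="cover" content="x"/></metadata>'
          '<spine page-progression-direction="rtl">'
          '<itemref idref="a1" skelid="0" properties="page-spread-left"/>'
          '<itemref idref="a2" skelid="1"/></spine></package>')
print(skel_properties(sample))
```

The same loop could collect the metadata tags into whatever structure the opf code needs, since each tag arrives with its path and a plain dictionary of attributes.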

Please run it on one of the RESCXXXXX.dat files from one of your more complicated epub3 based ebooks converted to mobi and let me know if you think this type of very simple approach will work for us.

Thanks,

KevinH