MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

tkeo · 06-30-2014, 07:44 AM

Quote:

Originally Posted by KevinH

1. It looks like there is a lot of code that is simply grabbing the original idref from the RESC section and then trying to make sure that none of that duplicates anything we use. Why is it important or at all useful to keep the idrefs from the RESC and re-use them in the new opf?

Given we do not know the original file names, the original idrefs seem more than a bit meaningless. It would make the code much easier to follow and support if we simply ignore all of these original idrefs from the RESC and simply sequentially number ours as was done by KindleUnpack originally. That should simplify or eliminate the need for the code to get all of the skeleton/partinfo from k8proc, shouldn't it?

You are quite right. Retrieving the original idrefs complicates the code considerably; however, I'd like to keep it in KindleUnpack in order to reconstruct the source more precisely.
In addition, deciding whether to create a cover page requires the skelid from RESC and the part information from k8proc.
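To make the trade-off concrete, here is a minimal sketch of how the two approaches could be combined: keep an original idref when it is usable, and fall back to sequential numbering otherwise. The function name `assign_idrefs`, its parameters, and the `part` prefix are hypothetical and are not taken from the KindleUnpack source.

```python
def assign_idrefs(part_count, original_idrefs=(), prefix="part"):
    """Return one unique idref per part (hypothetical helper).

    original_idrefs holds the ids recovered from RESC, in spine order;
    entries may be None or missing when nothing was recovered.
    """
    # Names already claimed by RESC must not be handed out as generated ids.
    reserved = {name for name in original_idrefs if name}
    used = set()
    result = []
    counter = 0
    for index in range(part_count):
        original = original_idrefs[index] if index < len(original_idrefs) else None
        if original and original not in used:
            idref = original
        else:
            # Fall back to sequential numbering, skipping any clashing name.
            while True:
                idref = "%s%04d" % (prefix, counter)
                counter += 1
                if idref not in used and idref not in reserved:
                    break
        used.add(idref)
        result.append(idref)
    return result
```

Calling `assign_idrefs(n)` with no RESC data gives plain sequential numbering, which is the simpler behaviour suggested in the quote; the extra bookkeeping only matters when original ids are passed in.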

Quote:

2. Can we please move parseK8RESC() out of mobi_opf and into k8resc so that the k8resc object encapsulates all RESC decoding. The opf routine can call into k8resc to get and add the extra metadata information as needed. We do not need to keep the distinction as to the source (it will either be from the mobi exth header or the RESC, and so it shouldn't matter).

I will do so.

Quote:

3. Do we really need to use a full blown HTMLParser() and all of the additional regular expression code just to parse the RESC section? This seems overkill at best. The problem with HTMLParsers in general is that they are not robust and can easily freak out over improper bytes (e.g. leave an embedded null lying around in the html/xml file and watch them barf all over it). And it almost looks like your regular expression metadata parsing code is trying to act like a full blown HTMLParser of some sort instead of just extracting what you want. Perhaps this is the only way but I am not convinced yet. So isn't there some way to simply walk the RESC data and extract what we want more simply with much less code in general? Edit: Please see my attached simple proof of concept code.

(Though the code I have written has limitations and cannot parse HTML fully. It can parse single layered elements only...)

I have run your code, and it works fine. Please switch to it, although it needs further improvement before it can parse RESC completely. Take care with comments, especially multi-line ones.

At the outset I did not know what needed to be retrieved from RESC, so I wrote a general-purpose parser. Now I know that it is not necessary. I had once considered switching to the functions in mobi_taglist.py, which are a little simpler than the Metadata class.
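As an illustration of walking the RESC data without a full parser, here is a minimal sketch that strips comments (including multi-line ones) and then yields each opening or self-closing tag with its attributes. It is not the attached proof of concept code; the name `walk_tags` is hypothetical, and the sketch assumes the RESC payload has already been decoded to a string. Like the approach discussed above, it handles single-layer elements only and does not track nesting.

```python
import re

# Comments may span several lines, so "." must also match newlines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Opening or self-closing tags only; "</x>" and "<?xml ...?>" do not match.
TAG = re.compile(r"<([A-Za-z][\w:.-]*)([^<>]*)>")
# name="value" or name='value'
ATTR = re.compile(r"""([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def walk_tags(data):
    """Yield (tag_name, attribute_dict) for each tag outside comments."""
    text = COMMENT.sub("", data)
    for match in TAG.finditer(text):
        name, attr_text = match.groups()
        attrs = {}
        for attr in ATTR.finditer(attr_text):
            key, double_quoted, single_quoted = attr.groups()
            attrs[key] = double_quoted if double_quoted is not None else single_quoted
        yield name, attrs


sample = """<?xml version="1.0"?>
<spine toc="ncx">
<!-- <itemref idref="old"
     skelid="9"/> -->
<itemref idref="x_cover" skelid="0"/>
</spine>"""

for name, attrs in walk_tags(sample):
    print(name, attrs)
```

Because it relies only on regular expressions, a stray byte such as an embedded null inside an attribute value passes through as ordinary data instead of stopping the walk.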

Quote:

4. There really is way too much overlap in all of the code for the various use cases in mobi_opf. There really is no reason to split opf writing out for mobi7 vs mobi8 for epub 2. Our original routine handled that case just fine.

So I envision one routine for both mobi7 and mobi8 epub 2 with calls out to shared support routines which we will pull out and identify much like you have already done. And a second separate routine for mobi8 epub3 with calls out to many of the same shared support routines. How does that sound?

We want to keep KindleUnpack as simple and straightforward as possible with as little fluff as possible to make support and learning from the code easy to do.

I prefer to split mobi7 vs mobi8 (epub2 and epub3), because the epub2 opf is more similar to the epub3 opf than to the mobi7 one.
But I think the main reason our opinions differ is that I am a newcomer and not familiar with the older versions of KindleUnpack. So please choose whichever you prefer.
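For reference, the layout proposed in point 4 (one epub 2 routine, one epub 3 routine, both calling shared support routines) could look roughly like the sketch below. All function names are hypothetical, the metadata section is omitted for brevity, and this does not reflect the actual mobi_opf code.

```python
from xml.sax.saxutils import quoteattr

OPF_NS = "http://www.idpf.org/2007/opf"


def build_manifest(items):
    """Shared helper: items is a list of (id, href, media_type) tuples."""
    lines = ["<manifest>"]
    for item_id, href, media_type in items:
        lines.append("<item id=%s href=%s media-type=%s/>" % (
            quoteattr(item_id), quoteattr(href), quoteattr(media_type)))
    lines.append("</manifest>")
    return lines


def build_spine(idrefs, toc_id=None):
    """Shared helper: the toc attribute is written only when toc_id is given."""
    lines = ["<spine toc=%s>" % quoteattr(toc_id) if toc_id else "<spine>"]
    lines.extend("<itemref idref=%s/>" % quoteattr(idref) for idref in idrefs)
    lines.append("</spine>")
    return lines


def build_package(version, body_lines):
    """Shared helper: wrap the body in a package element (metadata omitted)."""
    head = '<package version=%s xmlns=%s unique-identifier="uid">' % (
        quoteattr(version), quoteattr(OPF_NS))
    lines = ['<?xml version="1.0" encoding="utf-8"?>', head]
    return "\n".join(lines + body_lines + ["</package>"]) + "\n"


def write_opf_epub2(items, idrefs):
    """Single routine for both mobi7 and mobi8 when the target is epub 2."""
    return build_package("2.0", build_manifest(items) + build_spine(idrefs, toc_id="ncx"))


def write_opf_epub3(items, idrefs):
    """Separate routine for mobi8 as epub 3, reusing the same helpers."""
    return build_package("3.0", build_manifest(items) + build_spine(idrefs))
```

Splitting by mobi7 vs mobi8 instead would keep the same helpers and only change which top-level routine calls them.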

Thanks,