MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

tkeo · 02-11-2014, 10:24 AM

Hi KevinH,

Quote:

Originally Posted by KevinH

One other thing, I am not a big fan of minidom at all. It seems generally bloated and barfs if any true unicode is used (at least on 2.X). I see you wrote both a xml.dom.minidom version and a regular expression version of things. Every time I have used a xml elementTree or some other XML parser (either standard package or add-ons) in python 2.X I have run into problem cases that simply do not parse well or get confused with encodings, resulting in non-robust operation on some platforms (Mac, Win, or Linux).

So unless you feel strongly about it (and given the re vs dom code sizes are about the same), I would rather stick with the regular expression version, as it is easier for people to modify and fix and is robust to most encoding issues.

I see you have also written a metadata parsing routine that supports epub 3 like "refines" on named items. This is quite nice but using it in epub 2 spec devices might cause problems.

Firstly, I have no reason to stick with the dom, so I will revert the code to use re. Parsing the RESC section with the dom makes the code look simpler and shorter than with re. (The re version needs a Metadata class I wrote, whereas the dom version does not.) I think the dom is well suited to representing an epub structure; however, it is currently less familiar and less stable than re.
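To give a rough feel for the two approaches, here is a minimal sketch that pulls the itemref entries out of a made-up spine fragment, once with minidom and once with re. The fragment, the function names and the regular expressions are assumptions for illustration only; this is not the actual KindleUnpack code, and the real RESC layout may differ.

```python
import re
from xml.dom import minidom

# Hypothetical spine fragment, invented for this example.
SAMPLE = (
    '<spine page-progression-direction="rtl">'
    '<itemref idref="x_cover" properties="page-spread-left"/>'
    '<itemref idref="x_p001" properties="page-spread-right"/>'
    '<itemref idref="x_p002"/>'
    '</spine>'
)


def itemrefs_with_dom(xml_text):
    """Return (idref, properties) pairs using xml.dom.minidom."""
    doc = minidom.parseString(xml_text)
    return [
        # getAttribute returns '' for a missing attribute; map that to None.
        (node.getAttribute('idref'), node.getAttribute('properties') or None)
        for node in doc.getElementsByTagName('itemref')
    ]


# Matches an opening (or self-closing) itemref tag and captures its attributes.
ITEMREF_RE = re.compile(r'<itemref\b([^>]*)>')
# Matches name="value" pairs inside the captured attribute text.
ATTR_RE = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')


def itemrefs_with_re(xml_text):
    """Return (idref, properties) pairs using regular expressions only."""
    result = []
    for match in ITEMREF_RE.finditer(xml_text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        result.append((attrs.get('idref'), attrs.get('properties')))
    return result
```

Both functions return the same list for this sample. The re version only handles double-quoted attributes and would need more work for single quotes, comments or CDATA, which is the usual trade-off against a real parser.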

In my environment (Python 2.7.6, Windows 32-bit), minidom is able to parse UTF-8 input, but it stores the result as Unicode strings, so re-encoding to UTF-8 is necessary before using them. This is quite confusing to me. (UTF-8 is one of the encodings of Unicode, isn't it?) If minidom stored elements as UTF-8 strings, it would be much easier, I think.
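The behaviour described above can be reproduced with a small sketch like the following. The sample XML is made up. UTF-8 is indeed one encoding of Unicode: minidom decodes the bytes into text (the unicode type on 2.x, str on 3.x), and the caller has to encode again to get UTF-8 bytes back.

```python
# -*- coding: utf-8 -*-
from xml.dom import minidom

# UTF-8 encoded bytes, as they might be read from a file (hypothetical sample).
raw = u'<metadata><title>日本語の本</title></metadata>'.encode('utf-8')

doc = minidom.parseString(raw)
title = doc.getElementsByTagName('title')[0].firstChild.data

# minidom hands back decoded text, not a UTF-8 byte string.
is_bytes = isinstance(title, bytes)

# Re-encode explicitly before mixing with byte strings or writing to a file.
title_utf8 = title.encode('utf-8')
```

On 2.x, mixing the decoded title with non-ASCII byte strings without this explicit encode is a common source of UnicodeDecodeError.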

Quote:

I really think we should incorporate your code and try and create an epub 3 generator version of KindleUnpack to stay in epub 3 space and not try to mix private extensions into what is primarily epub 2 code.

What do you think?

I have considered it. It might be better to create a pure epub 3 version of KindleUnpack, separate from the epub 2 version, since pure epub 3 ebooks will become popular while many epub 2 books will remain because they have no need for the epub 3 features.

But right now there are many epub 2 ebooks available and epub 3 books are not yet so popular, and, as you mentioned, vendors are publishing books that are basically epub 2 with some epub 3 definitions included. So I think it is not yet time to create a pure epub 3 version of KindleUnpack.

Quote:

Thank you for the examples, I will play around with them. I can't believe that a Kindle device supports the spine/page spread properties by keeping and parsing the RESC section on the fly during reading. My guess is they must include or encode that information in some other way, but that is just a guess and I could be wrong.

As for the RESC section, my guess is that it exists to store information that the K8 format does not define. That is only a guess; I have no evidence.

Thanks,
tkeo