MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 02-09-2014, 01:18 PM

Hi tkeo,

One other thing: I am not a big fan of minidom at all. It seems generally bloated and barfs if any true unicode is used (at least on Python 2.X). I see you wrote both an xml.dom.minidom version and a regular expression version of things. Every time I have used ElementTree or some other XML parser (either a standard package or an add-on) in Python 2.X, I have run into problem cases that simply do not parse well or that get confused by encodings, resulting in non-robust operation on some platforms (Mac, Win, or Linux).

So unless you feel strongly about it (and given that the re and dom code sizes are about the same), I would rather stick with the regular expression version, as it is easier for people to modify and fix and is robust to most encoding issues.
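
To make the regular expression approach concrete, here is a minimal sketch of pulling Dublin Core items out of OPF text with re alone. It is not the code from tkeo's patch or from KindleUnpack; the sample OPF fragment and the names SAMPLE_OPF, DC_TAG and extract_dc_items are made up for illustration, and it assumes the OPF has already been decoded to a unicode string.

```python
import re

# Hypothetical OPF fragment, already decoded to unicode (an assumption of this sketch).
SAMPLE_OPF = u'''<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title id="t1">Caf\u00e9 Stories</dc:title>
  <dc:creator id="c1">Jane Doe</dc:creator>
  <dc:language>en</dc:language>
</metadata>'''

# Match <dc:NAME ...>TEXT</dc:NAME>; the backreference \1 forces the
# closing tag to carry the same name as the opening tag.
DC_TAG = re.compile(r'<dc:(\w+)([^>]*)>(.*?)</dc:\1>', re.DOTALL | re.UNICODE)


def extract_dc_items(opf_text):
    """Return a list of (name, text) pairs for every dc: element found."""
    return [(m.group(1), m.group(3).strip()) for m in DC_TAG.finditer(opf_text)]


if __name__ == '__main__':
    for name, text in extract_dc_items(SAMPLE_OPF):
        print(name + u': ' + text)
```

Because the pattern works on an already-decoded string and never consults an XML declaration, there is no separate encoding step for a parser to get confused by, which is the robustness argument above. The trade-off is that it only handles the simple, non-nested elements an OPF metadata block normally contains.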

I see you have also written a metadata parsing routine that supports epub 3-style "refines" on named items. This is quite nice, but using it on devices that only follow the epub 2 spec might cause problems.
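
For anyone following along, this is roughly what epub 3 "refines" handling looks like when done with regular expressions. It is only a sketch under my own assumptions, not tkeo's routine: the sample metadata and the names SAMPLE_OPF3, META_TAG, ATTR and collect_refines are hypothetical. In epub 3 a meta element with refines="#someid" attaches a property to the element carrying that id.

```python
import re
from collections import defaultdict

# Hypothetical epub 3 style metadata; the self-closing meta is an
# epub 2 style entry included to show that it is skipped.
SAMPLE_OPF3 = u'''<metadata>
  <dc:creator id="creator01">Jane Doe</dc:creator>
  <meta name="cover" content="img1"/>
  <meta refines="#creator01" property="role" scheme="marc:relators">aut</meta>
  <meta refines="#creator01" property="file-as">Doe, Jane</meta>
</metadata>'''

# Match <meta ...>TEXT</meta>; the lookbehind rejects self-closing <meta .../>.
META_TAG = re.compile(r'<meta\b([^>]*)(?<!/)>(.*?)</meta>', re.DOTALL)
# Match name="value" attribute pairs (double-quoted values only).
ATTR = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')


def collect_refines(opf_text):
    """Group refining properties by the id of the element they refine."""
    refinements = defaultdict(dict)
    for m in META_TAG.finditer(opf_text):
        attrs = dict(ATTR.findall(m.group(1)))
        target = attrs.get('refines', '')
        prop = attrs.get('property')
        if target.startswith('#') and prop:
            refinements[target[1:]][prop] = m.group(2).strip()
    return dict(refinements)


if __name__ == '__main__':
    print(collect_refines(SAMPLE_OPF3))
```

The result maps each refined id to its properties, for example the role and file-as values for creator01. An epub 2 reader has no notion of these refining meta elements, which is why mixing them into epub 2 output is a concern.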

I really think we should incorporate your code and try to create an epub 3 generator version of KindleUnpack, so that it stays in epub 3 space, rather than trying to mix private extensions into what is primarily epub 2 code.

What do you think?

KevinH
