MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

DiapDealer · 11-15-2012, 04:01 PM

I'm starting to get the idea that I'm chasing my own tail with regard to ensuring compliant OPF files.

I thought the escape method from the standard xml.sax library (xml.sax.saxutils) was working quite well on metadata items, and it is, in fact, converting every '&', '<' and '>' to XML-compliant entities as intended. But I'm discovering that a lot of metadata out there (especially KF8 subjects/descriptions) seems to contain HTML entities. This, by itself, wouldn't pose a problem. The problem is that my XML escape method is dutifully whacking all the ampersands in those poor defenseless entities and turning them into gibberish, basically.
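To make the problem concrete, here is a minimal sketch of the double-escaping. The sample subject string is made up for illustration and is not taken from any real book's metadata.

```python
from xml.sax.saxutils import escape

# Hypothetical metadata value that already contains an HTML entity.
subject = "Science &amp; Nature"

# escape() treats the '&' of the existing entity as a bare ampersand
# and escapes it a second time.
escaped = escape(subject)
print(escaped)  # Science &amp;amp; Nature
```

Anything that later reads the OPF would decode that back to the literal text "Science &amp; Nature" rather than "Science & Nature".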

So in one more attempt to overthink a process... enter the criminally underutilized (not to mention unsung) "unescape" method of Python's HTMLParser module. The unescape method first converts any entities present in the data to their Unicode character representations (OPF files are UTF-8/16 by spec, after all). Only then does the XML escape method fix up any stray ampersands and/or left/right angle brackets.
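Roughly, the unescape-then-escape idea looks like the sketch below. This is not the code from mobi_opf.py: the function name clean_metadata is hypothetical, and it assumes Python 3, where the same job is done by html.unescape instead of the unescape method on HTMLParser.

```python
from html import unescape              # Python 3 stand-in (assumption)
from xml.sax.saxutils import escape


def clean_metadata(value):
    """Hypothetical helper: decode HTML entities, then XML-escape."""
    # Step 1: turn entities such as &amp; or &eacute; into real characters.
    decoded = unescape(value)
    # Step 2: escape only what XML requires: '&', '<' and '>'.
    return escape(decoded)


print(clean_metadata("Science &amp; Nature"))      # Science &amp; Nature
print(clean_metadata("Caf&eacute; &lt;Noir&gt;"))  # Café &lt;Noir&gt;
print(clean_metadata("Tom & Jerry"))               # Tom &amp; Jerry
```

The order is the whole point: unescaping first means an existing entity and a stray ampersand both end up escaped exactly once.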

All this rambling means that I have an updated mobi_opf.py script for you to consider, pdurrant.