MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

DiapDealer · 12-31-2012, 02:11 PM

Quote:

Originally Posted by Sergey Dubinets

3. escape/unescape in OPF. You recently added HTMLParser.unescape(). Are you sure that original values are escaped? Unescaping on not escaped values would be a bug.
Using saxutils.escape() is correct for text nodes:
data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))

Quote:

Originally Posted by KevinH

I am not up-to-speed on what he wants to do here so I will ask DiapDealer to look at this again to make sure your concerns are dealt with.

I'm not certain I understand the logic of this statement:

Quote:

"Are you sure that original values are escaped? Unescaping on not escaped values would be a bug."

There is no "bug" that I can discern. HTMLParser's mostly undocumented "unescape" method is perhaps a bit misleadingly named. It's essentially an un-entity routine, and it's perfectly capable of dealing with "not escaped values."

Many Kindle books are starting to come down the pike with HTML and/or entities in the MOBI/KF8 EXTH metadata. While that may be acceptable in a MOBI/KF8 file, entities other than the five standard XML ones are unacceptable according to the XML/OPF specs. I see no point in creating a non-compliant OPF file, so...
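As a rough illustration of why leftover HTML entities are a problem in an OPF, the sketch below feeds two tiny XML fragments to Python's standard ElementTree parser. The helper name `is_well_formed` and the sample titles are made up for this example; they are not part of KindleUnpack.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# &amp; is one of the five entities XML predefines, so this parses.
print(is_well_formed("<title>Tom &amp; Jerry</title>"))   # True

# &eacute; is an HTML named entity that plain XML does not define.
print(is_well_formed("<title>Caf&eacute;</title>"))       # False
```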

If there are no named/numbered entities in the contents of the metadata, then HTMLParser.unescape() will simply have no effect on it. Nothing. No bug. If there ARE any named/numbered entities, however, HTMLParser.unescape() will first convert them all to their Unicode character equivalents. saxutils.escape() then takes care of xml-escaping the mandatory characters (< > &) to complete XML/OPF compliance.
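A minimal sketch of that two-step pipeline, assuming Python 3, where `html.unescape()` stands in for the older `HTMLParser.unescape()` method. The function name `to_opf_text` and the sample strings are hypothetical.

```python
from html import unescape               # Python 3 stand-in for HTMLParser.unescape()
from xml.sax.saxutils import escape

def to_opf_text(value):
    """Resolve any entities to characters, then xml-escape & < >."""
    return escape(unescape(value))

# No entities present: the value passes through unchanged.
print(to_opf_text("A plain title"))
# A plain title

# Named and numbered entities become characters; the bare & is re-escaped.
print(to_opf_text("Caf&eacute; &amp; Bar &#169; 2012"))
# Café &amp; Bar © 2012
```

Note that `&amp;` makes the entity-to-character-to-entity round trip here, while `&eacute;` and `&#169;` end up as literal characters.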

Descriptions often contain html paragraph formatting and the current method ensures that all html tags will be properly xml-escaped while at the same time, not completely destroying the intention of any unsupported (unsupported in XML/OPF) entities that may have been present in the MOBI/KF8 EXTH metadata.
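To make the description case concrete, here is a small sketch (again assuming Python 3's `html.unescape()` in place of `HTMLParser.unescape()`) that escapes a made-up description containing both an HTML tag and an HTML-only entity, then parses the result back to confirm nothing was lost. The element name `description` is a simplified placeholder for the real OPF tag.

```python
from html import unescape
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "<p>A na&iuml;ve tale.</p>"        # hypothetical EXTH description value

escaped = escape(unescape(raw))
print(escaped)
# &lt;p&gt;A naïve tale.&lt;/p&gt;

# The escaped value is now safe inside a text node and parses back intact.
elem = ET.fromstring("<description>%s</description>" % escaped)
print(elem.text)
# <p>A naïve tale.</p>
```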

I agree it may be overkill (some things could conceivably go from entity to character and back to entity, for instance). But I see no other method (meaning no other Python standard library method) to ensure that every potentially non-compliant hodge-podge of text, HTML, and entities becomes a docile, XML/OPF-compliant entry.

Quote:

And it is not sufficient for attribute values:
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value))))

In the latter case you also need to escape " as &quot; and ' as &apos;
I suggest you use quoteattr() for attributes instead of escape()

I take your point here. I've just not really run into any standard quotes (character or entity) bound for OPF meta attribute values before. I've only ever encountered them in stuff bound for OPF dc:metadata tags where they're not part of any quoted attribute values. That certainly doesn't mean they can't show up and blow things up, though.

I'm not certain quoteattr() is the right approach, though, as it picks the surrounding quote character itself and can switch from double quotes to single quotes and vice-versa, depending on the contents of the value. In such a case, I think it would make more sense to extend the escape() method by passing it the optional "entities" dictionary parameter, so that " and ' are xml-escaped as well as the three mandatory < > and &, rather than potentially changing double quotes to single quotes.
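For comparison, a short sketch of how the two `xml.sax.saxutils` functions treat quotes. The sample values are invented for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

ENTITIES = {'"': '&quot;', "'": '&apos;'}

# quoteattr() returns the value already wrapped in quotes, and it
# chooses the delimiter based on which quotes the value contains.
print(quoteattr('plain'))             # "plain"
print(quoteattr('The "Big" Book'))    # 'The "Big" Book'
print(quoteattr('It\'s "Big"'))       # "It's &quot;Big&quot;"

# escape() with an entities dict leaves the delimiters to the caller
# and always turns both quote characters into entities.
print(escape('The "Big" Book', ENTITIES))   # The &quot;Big&quot; Book
```

Because quoteattr() supplies its own delimiters, a format string using it would need `content=%s` rather than `content="%s"`.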

Code:

ENTITIES = {'"':'&quot;', "'":"&apos;"}
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value), ENTITIES)))
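A self-contained sketch of the same idea, for anyone who wants to try it outside KindleUnpack. It assumes Python 3 (`html.unescape()` in place of `self.h.unescape()`); the function name `meta_line` and the sample metadata are hypothetical. As in the snippet above, only the value is escaped, not the name.

```python
from html import unescape
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

ENTITIES = {'"': '&quot;', "'": '&apos;'}

def meta_line(name, value):
    """Build an OPF-style meta line with a fully escaped content attribute."""
    content = escape(unescape(value), ENTITIES)
    return '<meta name="%s" content="%s" />\n' % (name, content)

line = meta_line("note", 'Tom &amp; Jerry&#39;s "Best" <b>Hits</b>')
print(line, end="")
# <meta name="note" content="Tom &amp; Jerry&apos;s &quot;Best&quot; &lt;b&gt;Hits&lt;/b&gt;" />

# Parsing the line back recovers the intended characters.
print(ET.fromstring(line).get("content"))
# Tom & Jerry's "Best" <b>Hits</b>
```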