MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

DiapDealer · 01-01-2013, 10:10 PM

Quote:

Originally Posted by Sergey Dubinets

To DiapDealer about double unescaping.
It is not as innocent as it may appear.
Of course, if the value doesn't contain any '&', additional unescaping would not do any harm.
The problem happens when the unescaped value contains a known entity. For example, suppose the title of the article is "Don't double unescape &amp; in metadata".
The escaped string would be "Don't double unescape &amp;amp; in metadata".
If you unescape it twice, or unescape the original string, you would get "Don't double unescape & in metadata", and this is not what the original title was.

In short: double unescaping is a bug and it results in "data loss".

I think you still might be mistaken. I'm not "double unescaping" anything. I'm first unescaping everything and then re-escaping only the three characters that must be escaped. Under the current code:

Code:

xmlescape(self.h.unescape(value))

HTMLParser.unescape() first takes: "Don't double unescape &amp; in metadata".

And makes it: "Don't double unescape & in metadata".

Then saxutils.escape() makes it: "Don't double unescape &amp; in metadata".

No data loss. And you can't create a mobi with kindlegen that preserves and displays the literal text "&amp;" in the title anyway.
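The round trip described above can be reproduced with a minimal sketch. It assumes Python 3, where html.unescape() stands in for the HTMLParser.unescape() method used in the quoted code (that method is no longer available in recent Python 3 releases). The name normalize_metadata is hypothetical and is not a function in KindleUnpack.

```python
from html import unescape
from xml.sax.saxutils import escape


def normalize_metadata(value):
    # Hypothetical helper: decode every entity first, then re-escape
    # only the three characters XML requires (&, < and >).
    return escape(unescape(value))


raw = "Don't double unescape &amp; in metadata"
decoded = unescape(raw)            # the intermediate, fully decoded string
result = normalize_metadata(raw)   # decoded, then re-escaped

print(decoded)
print(result)
print(result == raw)
```

If the sketch matches the behavior described in the post, the final string is identical to the input, which is the "no data loss" claim.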

Your example is a perfect illustration of why I've chosen to do it the way I have. Without HTMLParser's initial unescape(), using the saxutils escape() method alone (which is required to handle any html tags or unescaped ampersands) would result in a valid "&amp;" being turned into "&amp;amp;". Just like you described.

The current method will preserve all pre-existing &lt; &gt; and &amp; entities while converting any other entities encountered to their character representations and properly escaping any html tags and naked ampersands.
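To make the difference between the two approaches concrete, here is a small comparison sketch. It again assumes Python 3's html.unescape() in place of HTMLParser.unescape(); the function names escape_only and unescape_then_escape and the sample strings are made up for illustration.

```python
from html import unescape
from xml.sax.saxutils import escape


def escape_only(value):
    # saxutils escape() alone: every '&' is escaped, including ones
    # that already begin an entity.
    return escape(value)


def unescape_then_escape(value):
    # Decode all entities first, then escape &, < and > exactly once.
    return escape(unescape(value))


samples = [
    "Tom &amp; Jerry",               # already-escaped ampersand
    "Tom & Jerry",                   # naked ampersand
    "caf&eacute; <b>menu</b>",       # other named entity plus an html tag
]

for s in samples:
    print(repr(s))
    print("  escape only:         ", repr(escape_only(s)))
    print("  unescape then escape:", repr(unescape_then_escape(s)))
```

With escape() alone, the first sample picks up a second level of escaping and the named entity in the third is left as mangled text; with the unescape-first approach, the existing entity is preserved, the naked ampersand and the tag are escaped, and the named entity becomes its character.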