Originally Posted by Sergey Dubinets
To DiapDealer about double unescaping.
It is not as innocent as it may appear.
Of course, if the value doesn't contain any '&', additional unescaping would not do any harm.
The problem happens when the unescaped value contains a known entity. For example, suppose the title of the article is "Don't double unescape &amp; in metadata".
The escaped string would be "Don't double unescape &amp;amp; in metadata".
If you unescape it twice, or unescape the original string, you would get "Don't double unescape & in metadata", and this is not what the original title was.
In short: double unescaping is a bug and it results in "data loss".
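The data loss described in the quote can be reproduced with a minimal sketch. It assumes Python 3, where `html.unescape` stands in for the HTMLParser unescape() discussed in this thread; the variable names are illustrative only.

```python
import html
from xml.sax.saxutils import escape

# The literal title text; it really does contain the characters "&amp;".
title = "Don't double unescape &amp; in metadata"

# Escaping once for storage turns the leading "&" into "&amp;".
stored = escape(title)

# One unescape recovers the title; a second one eats the entity.
once = html.unescape(stored)
twice = html.unescape(once)

print(stored)
print(once)
print(twice)
```

Only the second unescape changes the text: `once` still equals the title, while `twice` has collapsed "&amp;" into a bare "&".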
I think you still might be mistaken. I'm not "double unescaping" anything. I'm first unescaping everything and then re-escaping only the three characters that must be escaped. Under the current code:
unescape() first takes: "Don't double unescape &amp; in metadata".
And makes it: "Don't double unescape & in metadata".
escape() makes it: "Don't double unescape &amp; in metadata".
No data loss. And you can't create a mobi with kindlegen that preserves and displays the literal text "&amp;" in the title anyway.
Your example is a perfect illustration of why I've chosen to do it the way I have. Without HTMLParser's initial unescape(), using the saxutils escape() method alone (which is required to handle any HTML tags or unescaped ampersands) would result in a valid "&amp;" being turned into "&amp;amp;". Just like you described.
The current method will preserve all pre-existing &lt; &gt; and &amp; entities while converting any other entities encountered to their character representations and properly escaping any HTML tags and naked ampersands.
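The unescape-then-escape approach can be sketched roughly as follows. This is not the actual code under discussion: `normalize` is a hypothetical helper, and `html.unescape` is assumed here as the Python 3 stand-in for HTMLParser's unescape().

```python
import html
from xml.sax.saxutils import escape

def normalize(value):
    # Hypothetical helper: decode every entity first, then re-escape
    # only the three characters &, < and > via saxutils escape().
    return escape(html.unescape(value))

raw = "Don't double unescape &amp; in metadata"

# escape() alone double-escapes the already valid entity.
escaped_only = escape(raw)

# Unescaping first leaves the valid entity exactly as it was.
round_trip = normalize(raw)

# Other entities become characters; tags and naked ampersands get escaped.
mixed = normalize("Caf&eacute; <b>R&D</b>")

print(escaped_only)
print(round_trip)
print(mixed)
```

With these inputs, `round_trip` comes back identical to `raw`, while `escaped_only` contains "&amp;amp;".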