Quote:
Originally Posted by Sergey Dubinets
To DiapDealer about double unescaping.
It is not as innocent as it may appear.
Of course, if the value doesn't contain any '&', additional unescaping would not do any harm.
The problem happens when the unescaped value contains a known entity. For example, suppose the title of the article is "Don't double unescape &amp; in metadata".
The escaped string would be "Don't double unescape &amp;amp; in metadata".
If you unescape it twice, or unescape the original string, you would get "Don't double unescape & in metadata", and this is not what the original title was.
In short: double unescaping is a bug and it results in "data loss".
|
I think you still might be mistaken. I'm not "double unescaping" anything. I'm first unescaping everything and then re-escaping only the three characters that must be escaped. Under the current code:
Code:
xmlescape(self.h.unescape(value))
HTMLParser.unescape() first takes: "Don't double unescape &amp; in metadata".
And makes it: "Don't double unescape & in metadata".
Then saxutils.escape() makes it: "Don't double unescape &amp; in metadata".
No data loss. And you can't create a mobi with kindlegen that preserves and displays the literal text "&amp;" in the title anyway.
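For anyone who wants to check that round trip, here is a minimal sketch. It assumes Python 3, where html.unescape() stands in for the HTMLParser unescape() method (which, as far as I know, is no longer available in recent Python 3 releases), and it assumes xmlescape is simply xml.sax.saxutils.escape. The normalize() name is made up for illustration and is not taken from the actual script.

```python
from html import unescape             # assumed stand-in for HTMLParser.unescape()
from xml.sax.saxutils import escape   # assumed to be what "xmlescape" refers to


def normalize(value):
    """Unescape every entity first, then re-escape only &, < and >."""
    return escape(unescape(value))


title = "Don't double unescape &amp; in metadata"

step1 = unescape(title)   # the entity becomes a bare ampersand
step2 = escape(step1)     # the bare ampersand is escaped exactly once

print(step1)  # Don't double unescape & in metadata
print(step2)  # Don't double unescape &amp; in metadata
```

The value that comes out is the same as the value that went in, which is the "no data loss" point.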
Your example is a perfect illustration of why I've chosen to do it the way I have. Without HTMLParser's initial unescape(), using the saxutils escape() method alone (which is required to handle any html tags or unescaped ampersands) would result in a valid "&amp;" being turned into "&amp;amp;". Just like you described.
The current method will preserve all pre-existing &lt; &gt; and &amp; entities while converting any other entities encountered to their character representations and properly escaping any html tags and naked ampersands.
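A rough side-by-side of the two approaches, again only as a sketch under the same assumptions (Python 3's html.unescape() in place of HTMLParser's unescape(), and xml.sax.saxutils.escape as the escaping function). The sample value is invented for illustration:

```python
from html import unescape             # assumed stand-in for HTMLParser.unescape()
from xml.sax.saxutils import escape   # escapes &, < and > by default

# Hypothetical metadata value mixing an existing &amp; entity, another
# named entity, an html tag and a naked ampersand.
value = "Caf&eacute; &amp; Bar <b>Q & A</b>"

escape_only = escape(value)                  # escape() with no prior unescape()
unescape_then_escape = escape(unescape(value))

print(escape_only)
# Caf&amp;eacute; &amp;amp; Bar &lt;b&gt;Q &amp; A&lt;/b&gt;
print(unescape_then_escape)
# Café &amp; Bar &lt;b&gt;Q &amp; A&lt;/b&gt;
```

With escape() alone the existing entities get escaped a second time. With the unescape-first approach the &amp; survives as-is, &eacute; becomes its character, and the tag and the naked ampersand are escaped once.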