01-01-2013, 11:10 PM   #469
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by Sergey Dubinets
To DiapDealer about double unescaping.
It is not as innocent as it may appear.
Of course, if the value doesn't contain any '&', additional unescaping would not do any harm.
The problem happens when the unescaped value contains a known entity. For example, suppose the title of the article is "Don't double unescape &amp; in metadata".
The escaped string would be "Don't double unescape &amp;amp; in metadata".
If you unescape it twice, or unescape the original string, you would get "Don't double unescape & in metadata" and this is not what the original title was.

In short: double unescaping is a bug and it results in "data loss".
I think you still might be mistaken. I'm not "double unescaping" anything. I'm first unescaping everything and then re-escaping only the three characters that must be escaped. Under the current code:

Code:
xmlescape(self.h.unescape(value))
HTMLParser.unescape() first takes: "Don't double unescape &amp; in metadata".

And makes it: "Don't double unescape & in metadata".

Then saxutils.escape() makes it: "Don't double unescape &amp; in metadata".
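
For anyone who wants to try those three steps, here is a minimal sketch. It assumes Python 3, where html.unescape() stands in for the HTMLParser.unescape() call used in the actual code; the normalize() name and the step1/step2 variables are only for illustration and are not part of the tool.

```python
from html import unescape            # assumption: Python 3 stand-in for HTMLParser.unescape()
from xml.sax.saxutils import escape  # escapes only &, < and >


def normalize(value):
    """Unescape every entity, then re-escape only &, < and >."""
    return escape(unescape(value))


title = "Don't double unescape &amp; in metadata"

step1 = unescape(title)  # the entity becomes a bare ampersand
step2 = escape(step1)    # the bare ampersand is escaped exactly once

print(step1)
print(step2)
print(normalize(title) == title)
```

The value that comes out of the second step should be identical to the value that went in, which is the point being made above.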

No data loss. And you can't create a mobi with kindlegen that preserves and displays the literal text "&amp;" in the title anyway.

Your example is a perfect illustration of why I've chosen to do it the way I have. Without HTMLParser's initial unescape(), using the saxutils escape() method alone (which is required to handle any html tags or unescaped ampersands) would result in a valid "&amp;" being turned into "&amp;amp;". Just like you described.
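
A rough way to see that failure mode, using only xml.sax.saxutils.escape() from the standard library (the variable names are made up for the example):

```python
from xml.sax.saxutils import escape

# A value that already contains a valid, escaped ampersand.
already_escaped = "Don't double unescape &amp; in metadata"

# escape() on its own cannot tell an existing entity from a naked
# ampersand, so the leading "&" of "&amp;" gets escaped a second time.
escape_only = escape(already_escaped)

print(escape_only)
```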

The current method will preserve all pre-existing &lt;, &gt; and &amp; entities while converting any other entities encountered to their character representations and properly escaping any html tags and naked ampersands.
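
The cases listed above can be checked with a short sketch like the following. It again assumes Python 3's html.unescape() in place of HTMLParser.unescape(), and the normalize() helper and sample strings are hypothetical, chosen only to cover each case.

```python
from html import unescape            # assumption: Python 3 stand-in for HTMLParser.unescape()
from xml.sax.saxutils import escape  # escapes only &, < and >


def normalize(value):
    """Unescape every entity, then re-escape only &, < and >."""
    return escape(unescape(value))


samples = [
    "Tom &amp; Jerry",          # pre-existing &amp; entity
    "a &lt; b &gt; c",          # pre-existing &lt; and &gt; entities
    "Caf&eacute; &copy; 2013",  # other named entities
    "<b>Bold</b> Q & A",        # html tags and a naked ampersand
]

results = {s: normalize(s) for s in samples}

for before, after in results.items():
    print(repr(before), "->", repr(after))
```

The first two samples should come back unchanged, the third should come back with real characters in place of the entities, and the last should come back with the tag brackets and the ampersand escaped.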

Last edited by DiapDealer; 01-01-2013 at 11:32 PM.