Hmm, I'm somewhat ambivalent. I suspect the reason Beautiful Soup does what it does is that browsers render &known_entity correctly even without the trailing semicolon. It's a tradeoff: are there more pages with unescaped ampersands than pages with entities that lack trailing semicolons?
The patch could be made more robust by detecting whether the text after the ampersand is a known entity name or a numeric entity, and leaving it alone if it is. You could add a keyword argument to index_to_soup to turn on this behavior, which recipe authors can use if they know their site has unescaped ampersands. That way, the current behavior does not change.
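To make the suggestion concrete, here is a rough sketch of the entity check, not the actual patch. The helper name escape_bare_ampersands is made up for illustration, and it uses the standard library's entity table rather than anything from Beautiful Soup. An ampersand is escaped only when what follows is neither a numeric entity nor a known entity name.

```python
import re
from html.entities import name2codepoint

# '&' optionally followed by something shaped like an entity body:
# decimal (#169), hex (#xA9) or a name (copy). No trailing ';' is required.
_AMP = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*)?')


def escape_bare_ampersands(text):
    """Hypothetical helper: escape '&' unless it starts a known entity."""
    def fix(match):
        body = match.group(1)
        # Numeric entities and known entity names are left untouched,
        # with or without a trailing semicolon.
        if body and (body.startswith('#') or body in name2codepoint):
            return match.group(0)
        # Anything else is treated as a stray ampersand.
        return '&amp;' + (body or '')
    return _AMP.sub(fix, text)
```

With this, 'AT&T' becomes 'AT&amp;T' while '&copy 2010' and '&amp;' pass through unchanged. The keyword argument would then just decide whether the raw HTML goes through a function like this before it is handed to the parser.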
I think you can set convertHTMLEntities=True and escapeUnrecognizedEntities=True in the soup parser to achieve this, though I haven't tested it.