MobileRead Forums - View Single Post

kovidgoyal · 03-28-2013, 12:05 AM

Hmm, I'm somewhat ambivalent. I suspect the reason beautiful soup does what it does is that browsers render &known_entity correctly even without the trailing semicolon. It's a tradeoff -- are there more pages that have unescaped ampersands than pages that have entities without trailing semicolons?

The patch could be made more robust by detecting if the text is a known entity name/numeric entity and ignoring it if it is. You could add a keyword argument to index_to_soup to turn on this behavior, which can be used by recipe authors if they know their site uses unescaped ampersands. That way, current behavior does not change.

I think you can set convertHTMLEntities=True and escapeUnrecognizedEntities=True in the soup parser to achieve this, though I haven't tested it.
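
For illustration, the entity check described above might look roughly like the sketch below. It is untested against calibre itself: the names escape_bare_ampersands, preprocess_markup and fix_ampersands are made up for the example, and it assumes Python 3's html.entities (on Python 2 the same table lives in htmlentitydefs). Anything after an & that is a known entity name or a numeric entity is left alone, with or without the semicolon; every other & is escaped.

```python
import re
from html.entities import name2codepoint

# '&' optionally followed by a numeric reference (decimal or hex)
# or by something that looks like an entity name.
_AMP = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*)?')


def escape_bare_ampersands(text):
    """Escape '&' unless it starts a known named or numeric entity."""
    def repl(match):
        body = match.group(1)
        if body and (body.startswith('#') or body in name2codepoint):
            # Known entity (semicolon or not): leave it untouched.
            return match.group(0)
        # Unescaped ampersand: escape it and keep whatever followed.
        return '&amp;' + (body or '')
    return _AMP.sub(repl, text)


def preprocess_markup(raw, fix_ampersands=False):
    """Hypothetical opt-in hook: behavior is unchanged unless the
    caller explicitly asks for ampersand fixing."""
    return escape_bare_ampersands(raw) if fix_ampersands else raw
```

A recipe author who knows the site uses unescaped ampersands would pass fix_ampersands=True; everyone else gets the existing behavior. Note the sketch only accepts exact entity names, so something like &ampfoo gets escaped even though a browser might read it as &amp followed by foo.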
