Literal ampersands in index_to_soup

nickredding · 03-27-2013, 05:24 PM

Kovid - There is a problem with literal ampersands encountered by index_to_soup in news.py.

If a phrase such as "the S&P 500 closed higher" is encountered by BeautifulSoup, it is changed to "the S&P; 500 closed higher" because the characters "&P" are interpreted as an unknown HTML entity and are replaced by "&P;".

My proposed fix for this is to replace lines 667-668 of index_to_soup with

Code:

        massage.append((re.compile(r'&([^;\s]*)(\s)'), lambda match:
            '&amp;'+match.group(1)+match.group(2)))
        massage.append((re.compile(r'&(\S+?);'), lambda match:
            entity_to_unicode(match, exceptions=['amp'], encoding=enc)))

This first identifies literal ampersands and replaces them with "&amp;", and then runs entity_to_unicode on all HTML entities except "&amp;".

I've found that exceptions=['amp'] is necessary. Otherwise, the erroneous substitution still occurs even after the literal ampersands are replaced by "&amp;". I'm not quite sure why--the BeautifulSoup code is rather opaque.

The result of this is a parsed index page with literal ampersands represented by "&amp;". I don't believe this will create an issue for target devices since "&amp;" is standard HTML for the literal ampersand.
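To show the two massage rules in isolation, here is a minimal sketch that runs outside calibre under Python 3. The names entity_to_unicode_sketch and apply_massage are made up for illustration. The first is only a stand-in for calibre's entity_to_unicode: it handles named entities and ignores the encoding argument, so it should not be read as the real implementation.

```python
import re
from html.entities import name2codepoint


def entity_to_unicode_sketch(match, exceptions=()):
    """Hypothetical stand-in for calibre's entity_to_unicode (named entities only)."""
    name = match.group(1)
    if name in exceptions or name not in name2codepoint:
        return match.group(0)  # leave the entity text untouched
    return chr(name2codepoint[name])


# The two rules from the proposed patch, applied in the same order.
MASSAGE = [
    # "&" followed by non-";" characters and then whitespace: treat as literal.
    (re.compile(r'&([^;\s]*)(\s)'),
     lambda m: '&amp;' + m.group(1) + m.group(2)),
    # Anything shaped like "&name;": convert, except for "&amp;".
    (re.compile(r'&(\S+?);'),
     lambda m: entity_to_unicode_sketch(m, exceptions=['amp'])),
]


def apply_massage(raw):
    for pattern, repl in MASSAGE:
        raw = pattern.sub(repl, raw)
    return raw


print(apply_massage('the S&P 500 closed higher'))  # the S&amp;P 500 closed higher
print(apply_massage('caf&eacute; &copy; 2013'))    # café © 2013
```

Real entities with a trailing semicolon pass through the first rule unchanged, because the character after the name is ";" rather than whitespace.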

The only exception I have found that needs to be dealt with is when a recipe extracts section names from an index page. The target device might not recognize HTML entities in section names (this is definitely true for Kindle), so when parsing the index, any "&amp;" in a section name needs to be replaced by a literal ampersand ("&").
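For the section-name case, something along these lines should be enough. The helper name plain_section_title is hypothetical, and it only undoes the "&amp;" escaping, nothing else.

```python
def plain_section_title(title):
    """Turn escaped ampersands back into literal ones for device-facing titles."""
    return title.replace('&amp;', '&')


print(plain_section_title('Business &amp; Markets'))  # Business & Markets
```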

I realize this is somewhat complicated, and seeing "S&P; 500" instead of "S&P 500" isn't very serious, so if you want to ignore this I'll understand.

kovidgoyal · 03-28-2013, 12:05 AM

Hmm, I'm somewhat ambivalent. I suspect the reason BeautifulSoup does what it does is that browsers render &known_entity correctly. It's a tradeoff -- are there more pages that have unescaped ampersands than those that have entities without trailing semi-colons?

The patch could be made more robust by detecting if the text is a known entity name/numeric entity and ignoring it if it is. You could add a keyword argument to index_to_soup to turn on this behavior, which can be used by recipe authors if they know their site uses unescaped ampersands. That way, current behavior does not change.
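A rough sketch of that idea, runnable on its own under Python 3. The function names and the escape_ampersands keyword are assumptions for illustration, not existing calibre API, and the entity list comes from Python's html.entities rather than from whatever table BeautifulSoup uses.

```python
import re
from html.entities import name2codepoint

# "&", optionally followed by something shaped like an entity body, then an optional ";".
_AMP = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*)?(;?)')


def escape_literal_ampersands(raw):
    def fix(m):
        body = m.group(1)
        # Known named entity or numeric entity: leave it alone, semicolon or not.
        if body is not None and (body.startswith('#') or body in name2codepoint):
            return m.group(0)
        # Anything else is treated as a literal ampersand.
        return '&amp;' + (body or '') + m.group(2)
    return _AMP.sub(fix, raw)


def preprocess_index(raw, escape_ampersands=False):
    """Hypothetical hook: off by default, so current behavior does not change."""
    return escape_literal_ampersands(raw) if escape_ampersands else raw


print(preprocess_index('the S&P 500, Q & A, &nbsp and &#8217;', escape_ampersands=True))
# the S&amp;P 500, Q &amp; A, &nbsp and &#8217;
```

Note that a literal ampersand followed by a word that happens to be an entity name (for example "Pi" or "Mu") would still be left alone, so this narrows the tradeoff rather than removing it.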

I think you can set convertHTMLEntities=True and escapeUnrecognizedEntities=True in the soup parser to achieve this, though I haven't tested it.

kovidgoyal · 03-28-2013, 12:07 AM

Also, what happens to unescaped ampersands in the article text? If those are also mangled, then you could make it a class level variable rather than a parameter to index_to_soup and have fetch/simple.py also use it when parsing the article pages.
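The class-level variant might look roughly like this. RecipeSketch, FinanceNews, the attribute name and both method names are invented for the sketch and do not correspond to real calibre classes. The escaping rule is also simplified: it escapes any "&" that does not start an "&name;" or "&#123;" sequence.

```python
import re

# Simplified rule: a bare "&" is one not followed by an entity body and ";".
_BARE_AMP = re.compile(r'&(?!#?\w+;)')


class RecipeSketch:
    # Hypothetical class-level switch; off by default.
    escape_literal_ampersands = False

    def _massage(self, raw):
        if self.escape_literal_ampersands:
            raw = _BARE_AMP.sub('&amp;', raw)
        return raw

    def index_markup(self, raw):
        # Stand-in for the index parsing path.
        return self._massage(raw)

    def article_markup(self, raw):
        # Stand-in for the article parsing path, which shares the same switch.
        return self._massage(raw)


class FinanceNews(RecipeSketch):
    # A recipe author who knows the site uses unescaped ampersands opts in here.
    escape_literal_ampersands = True


recipe = FinanceNews()
print(recipe.index_markup('S&P 500'))              # S&amp;P 500
print(recipe.article_markup('AT&T &amp; others'))  # AT&amp;T &amp; others
```

Keeping the switch on the class means both parsing paths see the same setting without threading an extra argument through every call.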

t3d · 02-24-2014, 07:23 PM

Quote:

Originally Posted by kovidgoyal

I think you can set convertHTMLEntities=True and escapeUnrecognizedEntities=True in the soup parser to achieve this, though I haven't tested it.

Where to set these?
I have a similar issue with quotation marks, and &nbsp; sometimes won't work.
Probably it could be solved by setting a convertEntities parameter during soup creation in the BasicNewsRecipe class? http://www.crummy.com/software/Beaut...ity Conversion

kovidgoyal · 02-24-2014, 09:25 PM

Simply re-implement index_to_soup in your recipe and pass whatever parameters you like to BeautifulSoup.
