MobileRead Forums - Literal ampersands in index_to_soup

nickredding · 03-27-2013, 06:24 PM

Kovid - There is a problem with literal ampersands encountered by index_to_soup in news.py.

If a phrase such as "the S&P 500 closed higher" is encountered by BeautifulSoup, it is changed to "the S&P; 500 closed higher" because the characters '&P' are interpreted as an unknown HTML entity and replaced by "&P;".

My proposed fix for this is to replace lines 667-668 of index_to_soup with

Code:

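        # Literal ampersand (no ';' before the next whitespace): escape it as &amp;
        massage.append((re.compile(r'&([^;\s]*)(\s)'), lambda match:
            '&amp;'+match.group(1)+match.group(2)))
        # Real entities: convert to unicode, but leave &amp; untouched
        massage.append((re.compile(r'&(\S+?);'), lambda match:
            entity_to_unicode(match, exceptions=['amp'], encoding=enc)))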

This first identifies literal ampersands and replaces them with "&amp;" and then does the entity_to_unicode on all HTML entities except "&amp;".
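To illustrate the first rule in isolation, here is a minimal sketch using only the standard `re` module. The helper name `escape_literal_ampersands` is made up for this example and is not part of calibre. The second rule is left out because it depends on calibre's entity_to_unicode.

```python
import re

# Same pattern as the first massage rule: an ampersand, a run of characters
# that are neither ';' nor whitespace, then one whitespace character.
LITERAL_AMP = re.compile(r'&([^;\s]*)(\s)')


def escape_literal_ampersands(html):
    """Hypothetical helper: escape ampersands that do not start an entity."""
    return LITERAL_AMP.sub(
        lambda match: '&amp;' + match.group(1) + match.group(2), html)


print(escape_literal_ampersands('the S&P 500 closed higher'))
# -> the S&amp;P 500 closed higher
print(escape_literal_ampersands('caf&eacute; and AT&T today'))
# -> caf&eacute; and AT&amp;T today
```

Text that is already a well-formed entity, such as "&eacute;" or "&amp;", is left alone because the ';' stops the match before any whitespace is reached. An ampersand that is not followed by whitespace anywhere before the end of the string is not matched by this pattern.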

I've found that exceptions=['amp'] is necessary. Otherwise, the erroneous substitution still occurs even after the literal ampersands are replaced by "&amp;". I'm not quite sure why; the BeautifulSoup code is rather opaque.

The result of this is a parsed index page with literal ampersands represented by "&amp;". I don't believe this will create an issue for target devices since "&amp;" is standard HTML for the literal ampersand.

The only exception I have found that needs to be dealt with is when a recipe extracts section names from an index page. The target device might not recognize HTML entities in section names (this is definitely true for Kindle), so when parsing the index, any "&amp;" in a section name needs to be replaced by a literal ampersand ("&").
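A recipe could handle the section-name case with a one-line cleanup. This is only a sketch: `clean_section_name` is a hypothetical helper, not an existing calibre function, and it assumes the section name has already been extracted as a plain string.

```python
def clean_section_name(name):
    """Hypothetical helper: turn the &amp; entity back into a literal ampersand."""
    # Only &amp; is undone here, since that is the one entity the fix introduces.
    return name.replace('&amp;', '&')


print(clean_section_name('Arts &amp; Entertainment'))
# -> Arts & Entertainment
```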

I realize this is somewhat complicated, and seeing "S&P; 500" instead of "S&P 500" isn't very serious, so if you want to ignore this I'll understand.
