Old 03-27-2013, 05:24 PM   #1
nickredding
onlinenewsreader.net
Literal ampersands in index_to_soup

Kovid - There is a problem with literal ampersands encountered by index_to_soup in news.py.

If a phrase such as "the S&P 500 closed higher" is encountered by BeautifulSoup, it is changed to "the S&P; 500 closed higher" because the characters "&P" are interpreted as an unknown HTML entity and are replaced by "&P;".

My proposed fix for this is to replace lines 667-668 of index_to_soup with
Code:
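        # Escape literal ampersands: an '&' that runs into whitespace before any ';'
        massage.append((re.compile(r'&([^;\s]*)(\s)'), lambda match:
            '&amp;'+match.group(1)+match.group(2)))
        # Convert every real entity to unicode, but leave '&amp;' untouched
        massage.append((re.compile(r'&(\S+?);'), lambda match:
            entity_to_unicode(match, exceptions=['amp'], encoding=enc)))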
This first identifies literal ampersands and replaces them with "&amp;", and then runs entity_to_unicode on all HTML entities except "&amp;".
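
To illustrate the first substitution in isolation, here is a rough standalone sketch. It is not the actual news.py code: the names LITERAL_AMP and escape_literal_ampersands are made up for this example, and only the regular expression is taken from the proposal above.

```python
import re

# Same pattern as the first massage rule: an '&', then any run of characters
# that are neither ';' nor whitespace, then one whitespace character.
LITERAL_AMP = re.compile(r'&([^;\s]*)(\s)')


def escape_literal_ampersands(text):
    """Hypothetical helper: rewrite literal ampersands as '&amp;'."""
    return LITERAL_AMP.sub(
        lambda match: '&amp;' + match.group(1) + match.group(2), text)


print(escape_literal_ampersands("the S&P 500 closed higher"))
# the S&amp;P 500 closed higher

# A real entity ends in ';' before any whitespace, so it is left alone.
print(escape_literal_ampersands("caf&eacute; is open"))
# caf&eacute; is open
```

One limitation of the pattern as written: it requires whitespace after the ampersand token, so a literal ampersand at the very end of the input (for example a string that is just "R&D") is not escaped.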

I've found that exceptions=['amp'] is necessary. Otherwise, the erroneous substitution still occurs even after the literal ampersands are replaced by "&amp;". I'm not quite sure why--the BeautifulSoup code is rather opaque.
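
One possible explanation, offered only as a guess and not checked against the calibre or BeautifulSoup source: the second rule's pattern also matches "&amp;", so without the exception the freshly escaped ampersand would be decoded straight back to "&" before BeautifulSoup sees it. The sketch below uses a simplified stand-in for entity_to_unicode (decode_entity is a hypothetical name, and its behaviour is an assumption) to show that effect.

```python
import re
from html.entities import name2codepoint

# Same pattern as the second massage rule.
ENTITY = re.compile(r'&(\S+?);')


def decode_entity(match, exceptions=()):
    """Simplified, hypothetical stand-in for calibre's entity_to_unicode."""
    name = match.group(1)
    if name in exceptions:
        return match.group(0)          # leave excepted entities as written
    if name.startswith('#'):
        try:
            if name[1:2] in ('x', 'X'):
                return chr(int(name[2:], 16))
            return chr(int(name[1:]))
        except (ValueError, OverflowError):
            return match.group(0)      # malformed numeric reference
    codepoint = name2codepoint.get(name)
    return chr(codepoint) if codepoint is not None else match.group(0)


def decode_entities(text, exceptions=()):
    return ENTITY.sub(lambda m: decode_entity(m, exceptions), text)


escaped = "S&amp;P 500 at caf&eacute;"
print(decode_entities(escaped))                      # S&P 500 at café
print(decode_entities(escaped, exceptions=('amp',))) # S&amp;P 500 at café
```

If this guess is right, the first call shows the problem: the "&amp;" produced by the first rule is undone, and the parser is again handed a bare "&P".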

The result of this is a parsed index page with literal ampersands represented by "&amp;". I don't believe this will create an issue for target devices since "&amp;" is standard HTML for the literal ampersand.

The only exception I have found that needs to be dealt with is when a recipe extracts section names from an index page. The target device might not recognize HTML entities in section names (this is definitely true for Kindle), so when parsing the index, any section name that contains "&amp;" needs to have it replaced by a literal ampersand ("&").
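
In a recipe this clean-up could be as small as the sketch below. The helper name clean_section_name is hypothetical, and where it gets called depends on how the individual recipe builds its section list.

```python
def clean_section_name(name):
    """Hypothetical helper: turn '&amp;' back into a literal '&' in a section title."""
    return name.replace('&amp;', '&')


print(clean_section_name("Business &amp; Finance"))
# Business & Finance
```

Only "&amp;" is handled here, since that is the one entity the proposed fix deliberately leaves in the parsed index.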

I realize this is somewhat complicated, and seeing "S&P; 500" instead of "S&P 500" isn't very serious, so if you want to ignore this I'll understand.