03-27-2013, 05:24 PM | #1 |
onlinenewsreader.net
Posts: 324
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
|
Literal ampersands in index_to_soup
Kovid - There is a problem with literal ampersands encountered by index_to_soup in news.py.
If a phrase such as "the S&P 500 closed higher" is encountered by BeautifulSoup, it is changed to "the S&P; 500 closed higher" because the characters "&P" are interpreted as an unknown HTML entity and replaced by "&P;". My proposed fix is to replace lines 667-668 of index_to_soup with
Code:
massage.append((re.compile(r'&([^;\s]*)(\s)'), lambda match: '&amp;'+match.group(1)+match.group(2)))
massage.append((re.compile(r'&(\S+?);'), lambda match: entity_to_unicode(match, exceptions=['amp'], encoding=enc)))
I've found that exceptions=['amp'] is necessary. Otherwise, the erroneous substitution still occurs even after the literal ampersands are replaced by "&amp;". I'm not quite sure why; the BeautifulSoup code is rather opaque.
The result of this is a parsed index page with literal ampersands represented by "&amp;". I don't believe this will create an issue for target devices, since "&amp;" is standard HTML for the literal ampersand. The only exception I have found is when a recipe extracts section names from an index page: the target device might not recognize HTML entities in section names (this is definitely true for Kindle), so when parsing the index, any "&amp;" in a section name needs to be replaced by a literal ampersand ("&").
I realize this is somewhat complicated, and seeing "S&P; 500" instead of "S&P 500" isn't very serious, so if you want to ignore this I'll understand. |
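For readers who want to try the idea outside calibre, here is a minimal standalone sketch of the first substitution from the post, plus the section-name cleanup it describes. It uses only the standard re module; the function names escape_literal_ampersands and unescape_section_name are hypothetical and are not part of calibre or BeautifulSoup.

```python
import re

# Same pattern as the first massage rule in the post: an ampersand followed
# by characters that are neither ';' nor whitespace, then whitespace.
# Such a run cannot be a semicolon-terminated entity, so it is treated as a
# literal ampersand.
LITERAL_AMP = re.compile(r'&([^;\s]*)(\s)')


def escape_literal_ampersands(html):
    """Rewrite '&P ' as '&amp;P ' before the markup reaches the parser.

    Hypothetical helper for illustration only.
    """
    return LITERAL_AMP.sub(
        lambda m: '&amp;' + m.group(1) + m.group(2), html)


def unescape_section_name(name):
    """Turn '&amp;' back into '&' for devices that show titles verbatim.

    Hypothetical helper for illustration only.
    """
    return name.replace('&amp;', '&')


if __name__ == '__main__':
    print(escape_literal_ampersands('the S&P 500 closed higher'))
    print(unescape_section_name('Books &amp; Arts'))
```

Note that this pattern only fires when whitespace follows, so an ampersand at the very end of the input is left alone, and an entity written without its trailing semicolon (such as "&copy 2013") would be escaped as well.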
03-28-2013, 12:05 AM | #2 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Hmm, I'm somewhat ambivalent. I suspect the reason beautiful soup does what it does is that browsers render &known_entity correctly. It's a tradeoff -- are there more pages that have unescaped ampersands than those that have entities without trailing semi-colons?
The patch could be made more robust by detecting if the text is a known entity name/numeric entity and ignoring it if it is. You could add a keyword argument to index_to_soup to turn on this behavior, which can be used by recipe authors if they know their site uses unescaped ampersands. That way, current behavior does not change. I think you can set convertHTMLEntities=True and escapeUnrecognizedEntities=True in the soup parser to achieve this, though I haven't tested it. |
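A rough sketch of the more robust variant suggested above: leave known entity names and numeric entities alone (with or without the trailing semicolon) and escape every other ampersand, behind an opt-in switch. This is an assumption-laden illustration, not calibre code: escape_unknown_ampersands and its enabled keyword are hypothetical names, and the list of known entities is taken from Python's html.entities.name2codepoint rather than from BeautifulSoup.

```python
import re
from html.entities import name2codepoint

# An ampersand followed by an optional '#' and any run of ASCII letters/digits.
AMP_CANDIDATE = re.compile(r'&(#?[A-Za-z0-9]*)')


def _is_entity(name):
    """Return True if name looks like a known named or numeric entity."""
    if name.startswith('#'):
        body = name[1:]
        if body[:1] in ('x', 'X'):
            # Hexadecimal character reference such as #x2019
            return re.fullmatch(r'[0-9a-fA-F]+', body[1:]) is not None
        # Decimal character reference such as #8217
        return re.fullmatch(r'[0-9]+', body) is not None
    return name in name2codepoint


def escape_unknown_ampersands(html, enabled=True):
    """Escape ampersands that do not start a known entity.

    The enabled flag is a hypothetical stand-in for the opt-in keyword
    argument discussed in the thread; when False the input is returned
    unchanged, so existing behavior is preserved.
    """
    if not enabled:
        return html

    def fix(match):
        name = match.group(1)
        if _is_entity(name):
            return match.group(0)  # keep real entities untouched
        return '&amp;' + name      # treat as a literal ampersand

    return AMP_CANDIDATE.sub(fix, html)


if __name__ == '__main__':
    print(escape_unknown_ampersands('the S&P 500, AT&amp;T, &copy 2013'))
```

Because the decision is made per ampersand, this version also handles a literal ampersand at the end of the input, which the whitespace-based pattern misses.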
03-28-2013, 12:07 AM | #3 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Also, what happens to unescaped ampersands in the article text? If those are also mangled, then you could make it a class level variable rather than a parameter to index_to_soup and have fetch/simple.py also use it when parsing the article pages.
|
02-24-2014, 07:23 PM | #4 | |
Enthusiast
Posts: 38
Karma: 10
Join Date: Nov 2009
Location: Poland
Device: kindle 1st gen, kindle dxg, kindle paperwhite2
|
Quote:
I have a similar issue with quotation marks and sometimes won't work. Probably it could be solved by setting a convertEntities parameter during soup creation in the BasicNewsRecipe class? http://www.crummy.com/software/Beaut...ity Conversion |
|
02-24-2014, 09:25 PM | #5 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Simply re-implement index_to_soup in your recipe and pass whatever parameters you like to BeautifulSoup.
|