Old 03-27-2013, 05:24 PM   #1
nickredding
onlinenewsreader.net
Posts: 324
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
Literal ampersands in index_to_soup

Kovid - There is a problem with literal ampersands encountered by index_to_soup in news.py.

If a phrase such as "the S&P 500 closed higher" is encountered by BeautifulSoup, it is changed to "the S&P; 500 closed higher", because the characters '&P' are interpreted as an unknown HTML entity and are replaced by "&P;".

My proposed fix for this is to replace lines 667-668 of index_to_soup with
Code:
        massage.append((re.compile(r'&([^;\s]*)(\s)'), lambda match:
            '&amp;'+match.group(1)+match.group(2)))
        massage.append((re.compile(r'&(\S+?);'), lambda match:
            entity_to_unicode(match, exceptions=['amp'], encoding=enc)))
This first identifies literal ampersands and replaces them with "&amp;", and then runs entity_to_unicode on all HTML entities except "&amp;".
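
The two-step substitution described above can be tried outside calibre with a minimal sketch. It assumes Python 3 and uses a simplified, hypothetical stand-in (`entity_to_unicode_sketch`) in place of calibre's `entity_to_unicode`; the names `BARE_AMP`, `ENTITY` and `massage_ampersands` are also made up for illustration. Only the two regular expressions are taken from the proposed patch.

```python
import re
from html.entities import name2codepoint


def entity_to_unicode_sketch(match, exceptions=()):
    """Simplified, hypothetical stand-in for calibre's entity_to_unicode."""
    ent = match.group(1)
    if ent in exceptions:
        return '&' + ent + ';'          # leave excepted entities untouched
    if ent.startswith('#'):             # numeric entity, decimal or hex
        try:
            if ent[1:2].lower() == 'x':
                return chr(int(ent[2:], 16))
            return chr(int(ent[1:]))
        except (ValueError, OverflowError):
            return '&' + ent + ';'
    if ent in name2codepoint:           # known named entity
        return chr(name2codepoint[ent])
    return '&' + ent + ';'              # unknown: leave as written


# The same two patterns as in the proposed patch.
BARE_AMP = re.compile(r'&([^;\s]*)(\s)')
ENTITY = re.compile(r'&(\S+?);')


def massage_ampersands(raw):
    # Step 1: escape ampersands whose token ends in whitespace, not ';'.
    raw = BARE_AMP.sub(lambda m: '&amp;' + m.group(1) + m.group(2), raw)
    # Step 2: convert entities to characters, except '&amp;'.
    return ENTITY.sub(
        lambda m: entity_to_unicode_sketch(m, exceptions=['amp']), raw)


print(massage_ampersands('the S&P 500 closed higher'))
```

Note that the first pattern only fires when whitespace follows the token, so an ampersand at the very end of the text is left alone.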

I've found that exceptions=['amp'] is necessary. Otherwise, the erroneous substitution still occurs even after the literal ampersands are replaced by "&amp;". I'm not quite sure why; the BeautifulSoup code is rather opaque.

The result of this is a parsed index page with literal ampersands represented by "&amp;". I don't believe this will create an issue for target devices, since "&amp;" is standard HTML for the literal ampersand.

The only exception I have found that needs to be dealt with arises when a recipe extracts section names from an index page: the target device might not recognize HTML entities in section names (this is definitely true for Kindle). So when parsing the index, any "&amp;" in a section name needs to be replaced by a literal ampersand ("&").
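
As a small illustration of that last step, a recipe could pass each extracted section name through a helper like the one below. The helper name `clean_section_name` is hypothetical and not part of calibre; the sketch assumes the section name has already been pulled out of the soup as a plain string.

```python
def clean_section_name(name):
    # Hypothetical helper: section titles are shown as plain text on the
    # device, so turn the escaped form back into a literal ampersand.
    return name.replace('&amp;', '&')


print(clean_section_name('Business &amp; Markets'))
```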

I realize this is somewhat complicated, and seeing "S&P; 500" instead of "S&P 500" isn't very serious, so if you want to ignore this I'll understand.
Old 03-28-2013, 12:05 AM   #2
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Hmm, I'm somewhat ambivalent. I suspect the reason BeautifulSoup does what it does is that browsers render &known_entity correctly. It's a tradeoff: are there more pages that have unescaped ampersands than pages that have entities without trailing semicolons?

The patch could be made more robust by detecting if the text is a known entity name/numeric entity and ignoring it if it is. You could add a keyword argument to index_to_soup to turn on this behavior, which can be used by recipe authors if they know their site uses unescaped ampersands. That way, current behavior does not change.
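
One way to read this suggestion is sketched below, assuming Python 3. The function name `escape_unknown_ampersands` and the keyword `escape_bare_ampersands` are hypothetical, and the check against `html.entities.name2codepoint` is an assumption about what "known entity" could mean; this is not calibre's implementation. Known named or numeric entities are left alone, with or without a trailing semicolon, and the behavior is off unless the caller opts in.

```python
import re
from html.entities import name2codepoint

# '&' followed by an optional candidate token (numeric or named) and
# an optional trailing semicolon.
AMP = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*)?(;?)')


def escape_unknown_ampersands(raw, escape_bare_ampersands=False):
    """Escape '&' unless it starts a known named or numeric entity.

    The keyword defaults to False so existing behavior does not change.
    """
    if not escape_bare_ampersands:
        return raw

    def fix(match):
        token, semi = match.group(1), match.group(2)
        if token and (token.startswith('#') or token in name2codepoint):
            return match.group(0)               # known entity: keep as is
        return '&amp;' + (token or '') + semi   # literal ampersand: escape

    return AMP.sub(fix, raw)


print(escape_unknown_ampersands('the S&P 500 &copy 2013',
                                escape_bare_ampersands=True))
```

With this check, "&copy" without a semicolon is still treated as an entity, while "&P" is escaped because "P" is not a known entity name.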

I think you can set convertHTMLEntities=True and escapeUnrecognizedEntities=True in the soup parser to achieve this, though I haven't tested it.
Old 03-28-2013, 12:07 AM   #3
kovidgoyal
creator of calibre
Also, what happens to unescaped ampersands in the article text? If those are also mangled, then you could make it a class-level variable rather than a parameter to index_to_soup, and have fetch/simple.py also use it when parsing the article pages.
Old 02-24-2014, 07:23 PM   #4
t3d
Enthusiast
Posts: 38
Karma: 10
Join Date: Nov 2009
Location: Poland
Device: kindle 1st gen, kindle dxg, kindle paperwhite2
Quote:
Originally Posted by kovidgoyal View Post
I think you can set convertHTMLEntities=True and escapeUnrecognizedEntities=True in the soup parser to achieve this, though I haven't tested it.
Where do I set these?
I have a similar issue: quotation marks and &nbsp; sometimes won't work.
Probably it could be solved by setting a convertEntities parameter during soup creation in the BasicNewsRecipe class? http://www.crummy.com/software/Beaut...ity Conversion
Old 02-24-2014, 09:25 PM   #5
kovidgoyal
creator of calibre
Simply re-implement index_to_soup in your recipe and pass whatever parameters you like to BeautifulSoup.