Quote:
Originally Posted by oecherprinte
Are there any parameters or other mechanisms to pass the markupMassage list to index_to_soup? I have noticed that the function does something with the markupMassage parameter when generating the BeautifulSoup object:
Code:
massage = list(BeautifulSoup.MARKUP_MASSAGE)
enc = 'cp1252' if callable(self.encoding) or self.encoding is None else self.encoding
massage.append((re.compile(r'&(\S+?);'), lambda match:
    entity_to_unicode(match, encoding=enc)))
return BeautifulSoup(_raw, markupMassage=massage)
So there must be some way of passing my personal markupMassage list to index_to_soup?
Thanks,
Jens
Look at news.py for the definition of index_to_soup.
I'll paste it here:
Spoiler:
Code:
def index_to_soup(self, url_or_raw, raw=False):
    '''
    Convenience method that takes an URL to the index page and returns
    a `BeautifulSoup <http://www.crummy.com/software/BeautifulSoup/documentation.html>`_
    of it.
    `url_or_raw`: Either a URL or the downloaded index page as a string
    '''
    if re.match(r'\w+://', url_or_raw):
        open_func = getattr(self.browser, 'open_novisit', self.browser.open)
        with closing(open_func(url_or_raw)) as f:
            _raw = f.read()
        if not _raw:
            raise RuntimeError('Could not fetch index from %s'%url_or_raw)
    else:
        _raw = url_or_raw
    if raw:
        return _raw
    if not isinstance(_raw, unicode) and self.encoding:
        if callable(self.encoding):
            _raw = self.encoding(_raw)
        else:
            _raw = _raw.decode(self.encoding, 'replace')
    massage = list(BeautifulSoup.MARKUP_MASSAGE)
    enc = 'cp1252' if callable(self.encoding) or self.encoding is None else self.encoding
    massage.append((re.compile(r'&(\S+?);'), lambda match:
        entity_to_unicode(match, encoding=enc)))
    return BeautifulSoup(_raw, markupMassage=massage)
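There is no parameter for it: as the signature shows, index_to_soup only takes url_or_raw and raw. You should be able to copy the method into your recipe as an override and modify the massage list there for whatever you need.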
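To make the idea concrete, here is a minimal sketch of how a massage list works: it is a list of (compiled regex, replacement function) pairs that get run over the raw markup, in order, before parsing. This is a stand-in that runs without calibre or BeautifulSoup. DEFAULT_MASSAGE, build_massage and apply_massage are names I made up for illustration, and the comment-stripping rule is only an example of a personal rule. In a recipe you would instead hand the finished list to BeautifulSoup(_raw, markupMassage=massage), as in the code above.

```python
import re

# Hypothetical stand-in for BeautifulSoup.MARKUP_MASSAGE: a list of
# (compiled regex, replacement function) pairs.
DEFAULT_MASSAGE = [
    # Put a space before the slash in self-closing tags: <br/> becomes <br />
    (re.compile(r'(<[^<>]*)/>'), lambda m: m.group(1) + ' />'),
]


def build_massage(extra_rules):
    """Return a copy of the defaults with the caller's rules appended."""
    massage = list(DEFAULT_MASSAGE)  # copy, so the defaults stay untouched
    massage.extend(extra_rules)
    return massage


def apply_massage(raw, massage):
    """Run every (pattern, replacement) pair over the markup, in order."""
    for pattern, repl in massage:
        raw = pattern.sub(repl, raw)
    return raw


# Example personal rule (an assumption, not from calibre): strip HTML comments.
my_rules = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]

massage = build_massage(my_rules)
cleaned = apply_massage('<p>Hello<!-- ad --><br/>world</p>', massage)
print(cleaned)
```

Copying the default list before appending matters, because otherwise every later soup would pick up your extra rules as well.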