11-04-2010, 12:28 PM   #1
oecherprinte
index_to_soup: how can I sanitize the HTML using markupMassage?

Hi,

I am writing a recipe for a web page whose HTML contains an error that completely confuses Beautiful Soup. The convenience function index_to_soup generates the soup from the page for me, but I would need to use Beautiful Soup's markupMassage feature to remove the error from the HTML before it is parsed:

http://www.crummy.com/software/Beaut...mentation.html

Are there any parameters or other mechanisms for passing a markupMassage list to index_to_soup? I have noticed that the function already builds its own markupMassage list when generating the soup:

Code:
    massage = list(BeautifulSoup.MARKUP_MASSAGE)
    enc = 'cp1252' if callable(self.encoding) or self.encoding is None else self.encoding
    massage.append((re.compile(r'&(\S+?);'), lambda match:
        entity_to_unicode(match, encoding=enc)))
    return BeautifulSoup(_raw, markupMassage=massage)
So is there some way of passing my own markupMassage list to index_to_soup?
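
To make clear what kind of clean-up I mean, here is a minimal sketch of how a massage list behaves, assuming the Beautiful Soup 3 convention that markupMassage is a list of (compiled regex, replacement function) pairs applied in order to the raw markup. The helper apply_massage, the list MY_MASSAGE, both patterns and the sample markup are made up for illustration; they are not part of calibre or Beautiful Soup, and the actual error on my page is different.

```python
import re


def apply_massage(raw, massage):
    """Apply each (pattern, replacement) pair to the raw markup in order.

    Hypothetical helper: it only mimics the way a massage list of
    (compiled regex, function) pairs is run over markup before parsing.
    """
    for pattern, replace in massage:
        raw = pattern.sub(replace, raw)
    return raw


# Hypothetical fixes, not the real error from the page in question.
MY_MASSAGE = [
    # Normalize a malformed self-closing break such as "<br / >".
    (re.compile(r'<br\s*/\s+>'), lambda match: '<br />'),
    # Drop bogus "<!...>" declarations, keeping comments and the doctype.
    (re.compile(r'<!(?!--|DOCTYPE)[^>]*>', re.IGNORECASE), lambda match: ''),
]

broken = '<p>one<br / >two<!bogus declaration></p>'
cleaned = apply_massage(broken, MY_MASSAGE)
print(cleaned)
```

Pairs of this shape could be appended to the massage list in the snippet above, or the cleaned string could be handed to BeautifulSoup directly. What I am missing is a supported way to get either of those through index_to_soup.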

Thanks,

Jens