11-04-2010, 12:28 PM | #1 |
Zealot
Posts: 115
Karma: 20
Join Date: Jul 2010
Device: Kindle3 3G, Kindle Paperwhite 2
|
index_to_soup: how can I sanitize the html using markupMassage
Hi,
I am writing a recipe for a web page whose HTML contains an error that completely confuses Beautiful Soup. With the convenience function index_to_soup I can generate Beautiful Soup from an HTML file. However, I would need to use the markupMassage feature of Beautiful Soup to remove some errors from the HTML before it is parsed: http://www.crummy.com/software/Beaut...mentation.html Are there any parameters or other mechanisms for passing a markupMassage list to index_to_soup? I have noticed that the function already does something with the markupMassage parameter when generating the soup: Code:
massage = list(BeautifulSoup.MARKUP_MASSAGE)
enc = 'cp1252' if callable(self.encoding) or self.encoding is None else self.encoding
massage.append((re.compile(r'&(\S+?);'), lambda match: entity_to_unicode(match, encoding=enc)))
return BeautifulSoup(_raw, markupMassage=massage)

Thanks,
Jens |
11-04-2010, 04:37 PM | #2 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
I'll paste it here: Spoiler:
You should be able to modify the massage = list for whatever you need. |
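For anyone unsure what modifying the list means in practice: as far as I understand it, each entry is a (compiled regex, replacement function) pair, and the pairs are applied to the raw markup one after another before parsing. Here is a rough, stdlib-only sketch of that idea. Note that apply_massage, default_massage and the <o:p> pattern are made up for illustration and are not part of BeautifulSoup or calibre:

```python
import re

def apply_massage(markup, massage):
    """Apply each (compiled_regex, replacement) pair to the markup, in order."""
    for pattern, replacement in massage:
        markup = pattern.sub(replacement, markup)
    return markup

# Hypothetical stand-in for BeautifulSoup.MARKUP_MASSAGE (not the real list).
default_massage = [
    # Turn "<br/>" into "<br />".
    (re.compile(r'(<[^<>]*)/>'), lambda m: m.group(1) + ' />'),
]

# Copy the defaults, then append your own clean-up rule.
massage = list(default_massage)
massage.append((re.compile(r'</?o:p>'), lambda m: ''))  # example rule only

cleaned = apply_massage('<p>Hi<o:p></o:p><br/></p>', massage)
print(cleaned)
```

Copying the list first (list(...)) means the shared defaults stay untouched and only your own copy gets the extra rule.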
|
11-04-2010, 05:26 PM | #3 |
Zealot
Posts: 115
Karma: 20
Join Date: Jul 2010
Device: Kindle3 3G, Kindle Paperwhite 2
|
@Starson17:
Thanks. Of course I could change the index_to_soup function itself. But I have the impression that I should be able to set the variable BeautifulSoup.MARKUP_MASSAGE externally, e.g. in the header of the recipe. Is that possible? |
11-05-2010, 06:07 AM | #4 |
Zealot
Posts: 115
Karma: 20
Join Date: Jul 2010
Device: Kindle3 3G, Kindle Paperwhite 2
|
It's a nightmare. My first contact with Python is really messy. What a terrible language to debug ...

Anyway, I tried setting a "BeautifulSoup.MARKUP_MASSAGE" variable in my recipe, which did not work. Then I copied the "index_to_soup" function into my recipe and renamed it to "my_index_to_soup". I also copied all the imports from the "calibre.web.feeds.news" file. Now I get the error message "ValueError: too many values to unpack" for the line "return BeautifulSoup(_raw, markupMassage=massage)", without having touched the code at all.

I am giving up for now. Isn't there an easy way to automatically remove erroneous HTML from a file before turning it into Beautiful Soup? Maybe the developer could help (by the way: I already donated via PayPal last week :-) ). I could imagine that many recipe writers face this problem.

Cheers,
Jens

P.S.: My specific problem is the line "<!#BeginList>" in the HTML file, which makes Beautiful Soup think that the remainder of the file is a single tag ... (sigh)

Last edited by oecherprinte; 11-05-2010 at 06:13 AM. |
11-05-2010, 07:34 AM | #5 |
Zealot
Posts: 115
Karma: 20
Join Date: Jul 2010
Device: Kindle3 3G, Kindle Paperwhite 2
|
Never mind. I just solved the problem:
As I said, I copied the index_to_soup function into my new recipe and renamed it to my_index_to_soup. Then I added the following lines just before the "return" statement: Code:
# remove erroneous strings from input file
massage.append((re.compile("<!#BeginList>"), lambda match: ''))
massage.append((re.compile("<!#EndList>"), lambda match: ''))
|
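The same clean-up can also be tried on its own, outside calibre, to check that the patterns really remove the markers. This is only a sketch, assuming you have the raw HTML as a string; strip_list_markers and the sample string are hypothetical names made up for the example:

```python
import re

# One pattern covering both "<!#BeginList>" and "<!#EndList>".
LIST_MARKERS = re.compile(r'<!#(?:Begin|End)List>')

def strip_list_markers(raw_html):
    """Return raw_html with the bogus list markers removed."""
    return LIST_MARKERS.sub('', raw_html)

sample = '<ul><!#BeginList><li>one</li><!#EndList></ul>'
print(strip_list_markers(sample))
```

Using one combined pattern instead of two separate entries is just a matter of taste; two entries in the massage list do the same job.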