#1
Zealot
Posts: 115
Karma: 20
Join Date: Jul 2010
Device: Kindle3 3G, Kindle Paperwhite 2
index_to_soup: how can I sanitize the html using markupMassage
Hi,
I am writing a recipe for a web page that contains an error which completely confuses Beautiful Soup. With the convenience function index_to_soup I can generate a soup from an HTML file. However, I need to use the markupMassage feature of Beautiful Soup to remove some errors from the HTML before it is parsed: http://www.crummy.com/software/Beaut...mentation.html
Are there any parameters or other mechanisms to pass a markupMassage list to index_to_soup? I have noticed that the function already does something with the markupMassage parameter when generating the soup: Code:
massage = list(BeautifulSoup.MARKUP_MASSAGE)
enc = 'cp1252' if callable(self.encoding) or self.encoding is None else self.encoding
massage.append((re.compile(r'&(\S+?);'), lambda match:
    entity_to_unicode(match, encoding=enc)))
return BeautifulSoup(_raw, markupMassage=massage)
Thanks, Jens
#2
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
I'll paste it here: Spoiler:
You should be able to extend the massage list (the line starting with "massage = list") with whatever you need.
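For anyone following along, here is a minimal sketch of how such a massage list works. It assumes, as in BeautifulSoup 3, that each entry is a (compiled regex, replacement function) pair applied to the raw markup in order before parsing. The helper name apply_massage and the sample patterns are made up for illustration and are not part of calibre or BeautifulSoup.

```python
import re


def apply_massage(raw, massage):
    """Run each (pattern, replacement) pair over the raw markup, in order."""
    for pattern, replacement in massage:
        raw = pattern.sub(replacement, raw)
    return raw


# Illustrative list in the spirit of MARKUP_MASSAGE (not the library's actual defaults).
massage = [
    # Put a space before the slash of self-closing tags: <br/> becomes <br />
    (re.compile(r'(<[^<>]*)/>'), lambda m: m.group(1) + ' />'),
]

# Extending the list with one more fix: drop HTML comments entirely.
massage.append((re.compile(r'<!--.*?-->', re.DOTALL), lambda m: ''))

cleaned = apply_massage('<p>one<br/>two<!-- ad --></p>', massage)
print(cleaned)  # <p>one<br />two</p>
```

Because the loop unpacks every entry into exactly two names, each item appended to the list has to be a two-element tuple; an entry of any other length raises a ValueError in this sketch.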
#3
Zealot
Posts: 115
Karma: 20
Join Date: Jul 2010
Device: Kindle3 3G, Kindle Paperwhite 2
@Starson17:
Thanks. Of course I could change the index_to_soup function. But I have the impression that I can set the variable BeautifulSoup.MARKUP_MASSAGE somewhere externally, e.g. in the header of the recipe?
#4
Zealot
Posts: 115
Karma: 20
Join Date: Jul 2010
Device: Kindle3 3G, Kindle Paperwhite 2
It's a nightmare. My first contact with Python is really messy. What a terrible language to debug ...
Anyway, I tried fumbling around with setting a "BeautifulSoup.MARKUP_MASSAGE" variable in my recipe, which did not work. Then I copied the "index_to_soup" function into my recipe and renamed it to "my_index_to_soup", along with all the imports from the "calibre.web.feeds.news" file. Now I get the error message "ValueError: too many values to unpack" for the line "return BeautifulSoup(_raw, markupMassage=massage)" without even touching the code.
I am giving up now. Isn't there an easy way to automatically remove erroneous HTML from a file before turning it into a soup? Maybe the developer could help (by the way: I already donated via PayPal last week :-) ). I imagine many recipe programmers are facing this problem.
Cheers, Jens
P.S.: My specific problem is the line "<!#BeginList>" in the HTML file, which makes Beautiful Soup think that the remainder of the file is a single tag ... (sigh)
Last edited by oecherprinte; 11-05-2010 at 07:13 AM.
#5
Zealot
Posts: 115
Karma: 20
Join Date: Jul 2010
Device: Kindle3 3G, Kindle Paperwhite 2
Never mind. I just solved the problem:
As I said, I copied the index_to_soup function into my new recipe and renamed it to my_index_to_soup. Then I added the following lines before the "return" statement: Code:
# Remove erroneous strings from the input file
massage.append((re.compile("<!#BeginList>"), lambda match: ''))
massage.append((re.compile("<!#EndList>"), lambda match: ''))
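The same cleanup can be tried outside calibre before touching a recipe. The following is only a sketch under assumptions: the helper name strip_list_markers and the sample page are hypothetical, and it folds both markers into a single regex instead of two massage entries.

```python
import re

# Matches the two malformed markers named in this thread.
LIST_MARKERS = re.compile(r'<!#(?:Begin|End)List>')


def strip_list_markers(raw):
    """Return the raw HTML with <!#BeginList> and <!#EndList> removed."""
    return LIST_MARKERS.sub('', raw)


# Made-up sample page, for illustration only.
sample = '<ul><!#BeginList><li>Article 1</li><li>Article 2</li><!#EndList></ul>'
cleaned = strip_list_markers(sample)
print(cleaned)  # <ul><li>Article 1</li><li>Article 2</li></ul>
```

The cleaned string could then be handed to the parser in place of _raw, as in the BeautifulSoup(_raw, markupMassage=massage) call quoted earlier in the thread. Whether index_to_soup itself will take raw HTML instead of a URL depends on the calibre version, so treat that as an assumption and check the source first.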