Old 11-04-2010, 01:28 PM   #1
oecherprinte
index_to_soup: how can I sanitize the html using markupMassage

Hi,

I am writing a recipe for a web page whose HTML contains an error that completely confuses Beautiful Soup. The convenience function index_to_soup lets me generate a soup from an HTML file, but I need to use Beautiful Soup's markupMassage feature to remove some errors from the HTML before it is parsed:

http://www.crummy.com/software/Beaut...mentation.html

Are there any parameters or other mechanisms to pass the markupMassage list to index_to_soup? I have noticed that the function does something with the markupMassage parameter when generating the soup:

Code:
        massage = list(BeautifulSoup.MARKUP_MASSAGE)
        enc = 'cp1252' if callable(self.encoding) or self.encoding is None else self.encoding
        massage.append((re.compile(r'&(\S+?);'), lambda match:
            entity_to_unicode(match, encoding=enc)))
        return BeautifulSoup(_raw, markupMassage=massage)
So there must be some way of passing my personal markupMassage list to index_to_soup?

Thanks,

Jens
Old 11-04-2010, 05:37 PM   #2
Starson17
Quote:
Originally Posted by oecherprinte View Post
Are there any parameters or other mechanisms to pass the markupMassage list to index_to_soup? I have noticed that the function does something with the markupMassage parameter when generating the soup:

Code:
        massage = list(BeautifulSoup.MARKUP_MASSAGE)
        enc = 'cp1252' if callable(self.encoding) or self.encoding is None else self.encoding
        massage.append((re.compile(r'&(\S+?);'), lambda match:
            entity_to_unicode(match, encoding=enc)))
        return BeautifulSoup(_raw, markupMassage=massage)
So there must be some way of passing my personal markupMassage list to index_to_soup?

Thanks,

Jens
Look at news.py for the definition of index_to_soup.
I'll paste it here:
Code:
    def index_to_soup(self, url_or_raw, raw=False):
        '''
        Convenience method that takes an URL to the index page and returns
        a `BeautifulSoup <http://www.crummy.com/software/BeautifulSoup/documentation.html>`_
        of it.

        `url_or_raw`: Either a URL or the downloaded index page as a string
        '''
        if re.match(r'\w+://', url_or_raw):
            open_func = getattr(self.browser, 'open_novisit', self.browser.open)
            with closing(open_func(url_or_raw)) as f:
                _raw = f.read()
            if not _raw:
                raise RuntimeError('Could not fetch index from %s'%url_or_raw)
        else:
            _raw = url_or_raw
        if raw:
            return _raw
        if not isinstance(_raw, unicode) and self.encoding:
            if callable(self.encoding):
                _raw = self.encoding(_raw)
            else:
                _raw = _raw.decode(self.encoding, 'replace')
        massage = list(BeautifulSoup.MARKUP_MASSAGE)
        enc = 'cp1252' if callable(self.encoding) or self.encoding is None else self.encoding
        massage.append((re.compile(r'&(\S+?);'), lambda match:
            entity_to_unicode(match, encoding=enc)))
        return BeautifulSoup(_raw, markupMassage=massage)

You should be able to modify the massage list (the line starting with "massage = list(...)") to add whatever extra patterns you need.
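For reference, here is a minimal sketch of how a massage list gets applied, assuming BeautifulSoup 3 behaviour: each entry is a (compiled regex, replacement function) pair, and each pair is run over the markup in turn with sub(). The helper apply_massage and the marker <!#BrokenMarker> are hypothetical and only for illustration; they are not part of BeautifulSoup or calibre.

```python
import re


def apply_massage(markup, massage):
    """Hypothetical stand-in: run every (regex, function) pair over the markup."""
    for pattern, replacement in massage:
        markup = pattern.sub(replacement, markup)
    return markup


# Each entry must be a 2-tuple: (compiled pattern, callable taking a match).
massage = [
    (re.compile(r'<!#BrokenMarker>'), lambda match: ''),
]

html = '<ul><!#BrokenMarker><li>one</li></ul>'
cleaned = apply_massage(html, massage)
print(cleaned)
```

Appending a pair like this to the massage list inside a copy of index_to_soup should strip the offending string before the parser ever sees it.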
Old 11-04-2010, 06:26 PM   #3
oecherprinte
@Starson17:

Thanks. Of course I could change the index_to_soup function.
But I have the impression that I can set the variable

BeautifulSoup.MARKUP_MASSAGE

somewhere externally, e.g. at the top of the recipe?
Old 11-05-2010, 07:07 AM   #4
oecherprinte
It's a nightmare. My first contact with Python is really messy. What a terrible language to debug ...

Anyway, I tried fumbling around with setting a "BeautifulSoup.MARKUP_MASSAGE" variable in my recipe which did not work. Then I tried to copy the "index_to_soup" function into my recipe and renamed it to "my_index_to_soup". I copied all the imports from the "calibre.web.feeds.news" file. Now I get the error message:

"ValueError: too many values to unpack"

for the line "return BeautifulSoup(_raw, markupMassage=massage)"

without even touching the code. I am giving up now.

Isn't there an easy way to automatically remove erroneous html code from a file before transferring it into beautiful soup? Maybe the developer could help (by the way: I already donated via paypal last week :-) ). I could imagine that many recipe programmers are facing this problem.

Cheers,

Jens

P.S.: My specific problem is the line "<!#BeginList>" in the html file which makes beautiful soup think that the remainder of the file is a single tag ... (sigh)

Last edited by oecherprinte; 11-05-2010 at 07:13 AM.
Old 11-05-2010, 08:34 AM   #5
oecherprinte
Never mind. I just solved the problem:

As I said, I copied the index_to_soup function into my new recipe and renamed it to my_index_to_soup. Then I added the following lines just before the "return" statement:
Code:
        # Remove erroneous strings from the input file
        massage.append((re.compile(r'<!#BeginList>'), lambda match: ''))
        massage.append((re.compile(r'<!#EndList>'), lambda match: ''))
and voila, the junk is removed ...
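
A possible alternative that avoids copying the function, sketched under assumptions: the index_to_soup source pasted above returns the raw page when called with raw=True and treats any argument that does not look like a URL as already-downloaded HTML. So the page could be fetched raw, cleaned, and handed back. The helper strip_list_markers is hypothetical, and the UTF-8 decoding of byte input is an assumption about the page, not something calibre prescribes.

```python
import re

# Matches the two stray markers reported in this thread.
MARKER_RE = re.compile(r'<!#(?:BeginList|EndList)>')


def strip_list_markers(raw):
    """Hypothetical helper: remove the stray markers from downloaded HTML."""
    if isinstance(raw, bytes):
        # Assumption: the page is UTF-8; adjust to the site's real encoding.
        raw = raw.decode('utf-8', 'replace')
    return MARKER_RE.sub('', raw)


# Inside a recipe method this might be used as follows (not run here):
#     raw = self.index_to_soup(url, raw=True)
#     soup = self.index_to_soup(strip_list_markers(raw))

sample = '<ul><!#BeginList><li>a</li><!#EndList></ul>'
print(strip_list_markers(sample))
```

This is untested against calibre itself, so treat it as a starting point rather than a drop-in fix.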