Old 04-07-2014, 03:35 PM   #634
Perkin
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
I suppose you could...
In strip_span_for_page(), add the line
Code:
# strip empty open/close pairs, with or without attributes
html_text = re.sub(r'<([\w:-]+)(?:\s[^>]*)?></\1>', '', html_text)
OR
Code:
# turn empty open/close pairs into self-closing tags, keeping any attributes
html_text = re.sub(r'<([\w:-]+)((?:\s[^>]*)?)></\1>', r'<\1\2/>', html_text)
before the line

Code:
            entities = re.split(r'(<.+?>)', html_text)
The first strips the empty tags completely; the second turns them into self-closing tags, which you could then catch later with your 'if equals...' check.
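
For what it's worth, here's a quick sketch of what the two substitutions do to a sample string. The sample HTML is made up, and the pattern is my own variation that captures the tag name separately from any attributes, so treat it as an illustration rather than something tested against the plugin.

```python
import re

# Hypothetical sample text; not taken from a real book.
html_text = '<p>One<span class="pagebreak"></span> two<b></b> three</p>'

# Group 1 is the tag name, group 2 the (possibly empty) attribute string.
EMPTY_PAIR = re.compile(r'<([\w:-]+)((?:\s[^>]*)?)></\1>')

# Option 1: drop the empty pairs entirely.
stripped = EMPTY_PAIR.sub('', html_text)

# Option 2: rewrite each empty pair as a self-closing tag.
self_closed = EMPTY_PAIR.sub(r'<\1\2/>', html_text)

print(stripped)
print(self_closed)

# The existing split then sees each self-closing tag as one entity.
entities = re.split(r'(<.+?>)', self_closed)
print(entities)
```

With the second option the self-closing tags come out of the split as single items, so a later comparison can pick them out.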

I'm trying to think whether there are any tags this would strip that you shouldn't strip.
Are there any?
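
If it turns out there are (an empty anchor used as a link target or an empty table cell might be candidates), one way to sidestep the question would be to strip only a short list of tag names. A rough sketch, where strip_empty() and the STRIPPABLE list are names I've made up for illustration and the list itself is just a guess:

```python
import re

# Assumed list of tags that are safe to drop when empty; adjust as needed.
STRIPPABLE = ('span', 'b', 'i', 'em', 'strong')

def strip_empty(html_text, names=STRIPPABLE):
    """Remove empty open/close pairs, but only for the listed tag names."""
    alternation = '|'.join(re.escape(n) for n in names)
    pattern = r'<(%s)(?:\s[^>]*)?></\1>' % alternation
    return re.sub(pattern, '', html_text)

# Hypothetical sample: the anchor and the table cell should survive.
sample = '<a id="ch1"></a><span class="x"></span><td></td><i></i>'
print(strip_empty(sample))
```

Anything not on the list is left alone, so the anchor and the table cell in the sample stay put.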