Quote:
Originally Posted by Perkin
The first will strip them completely, the second would turn them into self-closing tags, which you could then catch later, with your 'if equals...'
I'm trying to think if there's any tags which this would strip, that you shouldn't strip.
Are there any?
The first would strip out all self-closing tags, including images, which is bad. Removing every element without content could also cause problems: the HEAD element must contain a TITLE, but the TITLE can be empty. The second looks promising, but I'm leery of the back-referencing, and some of the matching seems to be off. I think this might be closer:
Code:
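html_text = re.sub(r'<(\S+)([^>]*?)></\1>', r'<\1\2/>', html_text)  # <tag attrs></tag> -> <tag attrs/>
html_text = re.sub(r'<([^>]*?)(\s+?)/>', r'<\1/>', html_text)  # drop whitespace right before '/>'
html_text = re.sub(r'<(b|i|u|a|span)/>', r'', html_text)  # remove attribute-free empty inline tags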
That is, "find a tag whose name is at least one non-space character, optionally followed by attributes, and which is immediately followed by its own closing tag" and replace it with the self-closing form. The second line should delete any whitespace immediately before the '/>', which makes it possible to find and remove attribute-free self-closed elements with the third line. All three lines would go right at the beginning of strip_span_for_page(), somewhere before the re.split line.
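For anyone who wants to try the three substitutions outside the plugin, here is a rough standalone sketch. The function name collapse_empty_tags and the sample string are made up for illustration; they are not part of strip_span_for_page() or any existing code.

```python
import re

# Tag names from the post that seem safe to drop when empty and attribute-free.
STRIPPABLE = r'b|i|u|a|span'

def collapse_empty_tags(html_text):
    """Hypothetical wrapper around the three substitutions from the post."""
    # 1. <tag attrs></tag>  ->  <tag attrs/>
    html_text = re.sub(r'<(\S+)([^>]*?)></\1>', r'<\1\2/>', html_text)
    # 2. Remove whitespace sitting directly before '/>'.
    html_text = re.sub(r'<([^>]*?)(\s+?)/>', r'<\1/>', html_text)
    # 3. Delete self-closed tags from the list that carry no attributes.
    html_text = re.sub(r'<(' + STRIPPABLE + r')/>', r'', html_text)
    return html_text

# Made-up sample: an empty <b>, an empty <span> with an attribute,
# an image, and an empty <i> with stray whitespace in the tag.
sample = '<p>One<b></b> two<span class="x"></span> <img src="a.png" /> <i ></i></p>'
result = collapse_empty_tags(sample)
print(result)
```

On this sample the empty <b> and <i> disappear, while the image and the <span> that carries a class attribute are kept in self-closed form. One thing I noticed while trying it: a single pass only collapses the innermost empty pair, so '<b><i></i></b>' comes out as '<b></b>'. If nested empties turn up in real files, the function would probably need to be run again until the text stops changing.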
I think <a/>, <b/>, <i/>, <u/>, and <span/> should cover it; can anyone think of any other empty tags that would need to be stripped out?