04-07-2014, 12:27 PM   #625
Perkin
Guru
Quote:
Originally Posted by Rev. Bob
I've made that change, and I've found another bug: if HR, BR, or IMG are coded as no-content containers rather than self-closing elements (stupid, but legal), the closing tags are removed but the opening tag is not converted to self-closing.

In other words, <hr></hr> is truncated to a bad <hr> instead of converted to a correct <hr/>.

The culprit seems to be the logic in lines 590-591 of the attached version's modify.py, in which those elements are always assumed to be self-closing:

Code:
elif entity[:3] == '<hr' or entity[:3] == '<br' or entity[:4] == '<img':
    this_entity.e_type = 3
To dodge that bug, I've simply commented that test out for now. Thus, those elements are tested like every other element, and the bad-but-okay form is preserved - but it would be nice if <foo a="x" b="y"></foo> could be converted to <foo a="x" b="y"/> across the board. I'm just not sure how to modify your code to do so.
After a quick test (and a bit of research), that last conversion, turning <foo a="x" b="y"></foo> into <foo a="x" b="y"/>, can be done quite simply with a regular expression...
Code:
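#!/usr/bin/env python

import re

# Group 1 is the element name, group 2 its attributes (may be empty, so
# <hr></hr> matches too); \1 makes sure the closing tag is the matching one.
result = re.sub(r'<([\w:-]+)((?:\s[^<>]*)?)></\1>', r'<\1\2/>', '<foo a="x" b="y"></foo>')
print(result)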
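If it helps, here is a rough sketch of how the same idea might be wrapped up to run over a whole chunk of markup instead of one test string. This is only an illustration: the helper name collapse_empty_elements, the only parameter and the sample markup are all made up here, and none of it is taken from modify.py. It also assumes that attribute values never contain a literal < or >.

```python
import re

# Hypothetical helper for illustration; not part of modify.py.
# Group 1: element name. Group 2: attributes (possibly empty). Attributes
# must start with whitespace, so "<foobar></foo>" is not treated as a pair.
EMPTY_PAIR = re.compile(r'<([\w:-]+)((?:\s[^<>]*?)?)\s*></\1\s*>')


def collapse_empty_elements(markup, only=None):
    """Rewrite <foo ...></foo> as <foo .../> where the pair has no content.

    If `only` is given (a set of lowercase element names), other empty
    pairs are left exactly as they were.
    """
    def repl(match):
        name = match.group(1)
        if only is not None and name.lower() not in only:
            return match.group(0)  # leave this element untouched
        return '<%s%s/>' % (name, match.group(2))

    return EMPTY_PAIR.sub(repl, markup)


sample = '<p>One<br></br>two</p><hr></hr><img src="a.png" alt=""></img><div></div>'
print(collapse_empty_elements(sample, only={'hr', 'br', 'img'}))
# prints: <p>One<br/>two</p><hr/><img src="a.png" alt=""/><div></div>
```

Calling it without only collapses every empty pair, which is the "across the board" behaviour. The restricted form is there because self-closed versions of elements like div or script can be misread by anything that parses the file as HTML rather than XHTML, so limiting it to hr, br and img may be the safer default.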