Hi,
I have been able to partially duplicate the "eating text" issue, but I had to do something really horrible to get Tidy to actually "eat text": I had to confuse it about what is text and what is a tag.
Code:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>
<p>Now is the time for all good men to come to the aid of the party 1.</p>
<p>Now is the time for all good men to come to the aid of the party 2.</p>
<pclass="bot" Now is the time for all good men to come to the aid of the party 3.</p>
<p>Now is the time for all good men to come to the aid of the party 4.</p>
</body>
</html>
Notice the missing ">" that should mark the end of the bad <pclass="bot" tag. Unmatched or missing ">" and "<" characters will always drive cleaning programs insane, since they can no longer tell what is text and what is a tag.
Here is what Tidy in Sigil did to this file on open:
Code:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>
<p>Now is the time for all good men to come to the aid of the party 1.</p>
<p>Now is the time for all good men to come to the aid of the party 2.</p>
<p>Now is the time for all good men to come to the aid of the party 4.</p>
</body>
</html>
A better approach when an unpaired "<" or ">" exists is to simply replace the unbalanced character with its HTML entity ("&lt;" or "&gt;") so that nothing is lost.
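That entity-substitution idea can be sketched in a few lines of Python. Everything below is hypothetical (the function name, the tag-matching regex) and is only a proof of concept, not how Tidy or Sigil actually work: it keeps any "<" that begins something shaped like a complete tag, and escapes every stray "<" or ">" so no text can be mistaken for markup and thrown away.

```python
import re

# Rough shape of a "complete tag": "<", then an element name (optionally
# closing), a comment, or a declaration/PI, then a matching ">".
TAG_RE = re.compile(r'<(/?[A-Za-z][^<>]*|!--.*?--|![^<>]*|\?[^<>]*)>', re.DOTALL)

def escape_stray_angles(markup):
    """Replace unbalanced '<' and '>' with entities; leave real tags alone."""
    out = []
    pos = 0
    while pos < len(markup):
        ch = markup[pos]
        if ch == '<':
            m = TAG_RE.match(markup, pos)
            if m:                       # looks like a complete tag: keep it
                out.append(m.group(0))
                pos = m.end()
                continue
            out.append('&lt;')          # unpaired "<": turn it into an entity
        elif ch == '>':
            out.append('&gt;')          # a ">" outside any tag was never opened
        else:
            out.append(ch)
        pos += 1
    return ''.join(out)
```

Run against the broken line from the example above, the text for party 3 survives as escaped text instead of being eaten: the bad open tag becomes literal "&lt;pclass=..." while the well-formed "</p>" is left alone.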
The problem is that Tidy itself is a real mess to try to fix or change. So I think the only long-term solution is to write a much simpler replacement for Tidy, one that errs on the side of producing too much "text" rather than deleting any.
If I get a free moment, I may take a stab at a Python prototype of a sort of "safe clean" plugin, to see whether it is actually doable and would help.
Parsing bad XHTML, especially with unmatched "<" and ">", is fraught with issues, as those characters can confuse the hell out of the parser.
A "safe clean" parser would create the following output for that example:
Code:
<pclass="bot" Now is the time for all good men to come to the aid of the party 3.<p></p>
Not great, but still far easier to fix after the fact, with no text lost.
Would this be a help?
KevinH