MobileRead Forums - View Single Post

JimmyG · 09-20-2014, 02:37 PM

Quote:

Originally Posted by KevinH

Hi,

I have been somewhat able to duplicate the "eating text" issue but I had to do something really horrible to get Tidy to actually "eat text". I had to confuse it as to what is text and what is tag.

Code:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>

<p>Now is the time for all good men to come to the aid of the party 1.</p>

<p>Now is the time for all good men to come to the aid of the party 2.</p>

<pclass="bot" Now is the time for all good men to come to the aid of the party 3.</p>

<p>Now is the time for all good men to come to the aid of the party 4.</p>

</body>
</html>

Notice the missing ">" to mark the end of the bad <pclass="bot" tag. Unmatched or missing ">" and "<" will always drive cleaning programs insane, as they can't tell what is text and what is tag.

Here is what Tidy in Sigil did to this file on open:

Code:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>

<p>Now is the time for all good men to come to the aid of the party 1.</p>

<p>Now is the time for all good men to come to the aid of the party 2.</p>

<p>Now is the time for all good men to come to the aid of the party 4.</p>

</body>
</html>

A better solution when an unpaired "<" or ">" exists is to simply replace the unbalanced one with its HTML entity code

Code:

 "&lt;"  or "&gt;"

to prevent anything from being lost.
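As a rough illustration of that entity-replacement idea, here is a minimal Python sketch. It is not Tidy's or Sigil's code: the function name escape_unpaired_brackets and the regular expression are assumptions made up for this example, and the idea of a "tag" is deliberately naive (a "<" followed by a letter, "/", "!" or "?" and closed by ">" before any other bracket). Comments, CDATA sections and ">" inside attribute values are not handled.

```python
import re

# Hypothetical, naive definition of a well-formed tag: "<", then a letter,
# "/", "!" or "?", then anything without brackets, then ">".
TAG_RE = re.compile(r"<[A-Za-z/!?][^<>]*>")


def _escape_text(text: str) -> str:
    # Any bracket that ends up in a text run is unpaired, so turn it
    # into its entity instead of letting a cleaner guess what it means.
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_unpaired_brackets(markup: str) -> str:
    """Keep recognizable tags as-is and escape stray "<" / ">" in between."""
    out = []
    pos = 0
    for match in TAG_RE.finditer(markup):
        out.append(_escape_text(markup[pos:match.start()]))  # text before tag
        out.append(match.group(0))                           # the tag itself
        pos = match.end()
    out.append(_escape_text(markup[pos:]))                   # trailing text
    return "".join(out)


if __name__ == "__main__":
    bad = ('<pclass="bot" Now is the time for all good men '
           'to come to the aid of the party 3.</p>')
    print(escape_unpaired_brackets(bad))
```

On the broken line from the example, the opening "<" becomes "&lt;" and the sentence survives as text. The sketch only escapes brackets; it does not try to rebalance tags, so the now unmatched </p> is left in place for a later fix.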

The problem is that Tidy is really a mess to try to fix or change. So I think the only solution in the long run is to write a much simpler replacement for Tidy that defaults to creating too much "text" rather than deleting any.

If I get a free moment, I may take a stab at a prototype for doing this in Python, as a sort of "safe clean" plugin, to see if it is actually doable and would help.

Parsing bad XHTML, especially with unmatched "<" and ">", is fraught with issues, as it can confuse the hell out of the parser.

A "safe clean" parser would create the following output for that example:

Code:

&lt;pclass="bot" Now is the time for all good men to come to the aid of the party 3.<p></p>

Not great but still easier to fix after the fact with no text lost.

Would this be a help?

KevinH

Does Tidy come into play even though I have it turned off?

Your example does not show the problem I have found. Every file I have ever lost, in 7.4 and 7.7, shows the same behavior. From some apparently arbitrary point (perhaps a mistake?) in the file, it removes everything from that point up to </body> and rewrites the ending as </body></html> without the line break.