09-20-2014, 02:37 PM   #28
JimmyG
Zealot
JimmyG solves Fermat’s last theorem while doing the crossword.
Posts: 119
Karma: 28454
Join Date: Apr 2011
Location: Yuma, AZ
Device: Kindle Touch, Voyage
Quote:
Originally Posted by KevinH
Hi,

I have been somewhat able to duplicate the "eating text" issue but I had to do something really horrible to get Tidy to actually "eat text". I had to confuse it as to what is text and what is tag.

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>

<p>Now is the time for all good men to come to the aid of the party 1.</p>

<p>Now is the time for all good men to come to the aid of the party 2.</p>

<pclass="bot" Now is the time for all good men to come to the aid of the party 3.</p>

<p>Now is the time for all good men to come to the aid of the party 4.</p>

</body>
</html>
Notice the missing ">" that should mark the end of the bad <pclass="bot" tag. An unmatched or missing ">" or "<" will always drive cleaning programs insane, as they can't tell what is text and what is tag.

Here is what Tidy in Sigil did to this file on open:

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>

<p>Now is the time for all good men to come to the aid of the party 1.</p>

<p>Now is the time for all good men to come to the aid of the party 2.</p>

<p>Now is the time for all good men to come to the aid of the party 4.</p>

</body>
</html>

A better solution when an unpaired "<" or ">" exists is simply to replace the unbalanced character with its HTML entity code
Code:
 "&lt;"  or "&gt;"
to prevent anything from being lost.
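
As a rough illustration of that idea (not Tidy's or Sigil's actual code), here is a minimal Python sketch. The function name escape_unpaired_brackets and the simple definition of a "tag" (a "<", then no other angle brackets, then a ">") are assumptions made for this example. Comments, CDATA sections, and attribute values that contain ">" are not handled.

```python
import re

# Assumption for this sketch: a "tag" is "<", no other angle brackets, then ">".
TAG_RE = re.compile(r"<[^<>]*>")


def _escape(text):
    """Turn stray angle brackets into entities so they survive as text."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_unpaired_brackets(markup):
    """Keep balanced tags untouched and escape every bracket outside them."""
    pieces = []
    pos = 0
    for match in TAG_RE.finditer(markup):
        # Anything between two balanced tags is treated as text.
        pieces.append(_escape(markup[pos:match.start()]))
        pieces.append(match.group(0))
        pos = match.end()
    pieces.append(_escape(markup[pos:]))
    return "".join(pieces)


if __name__ == "__main__":
    bad = '<pclass="bot" Now is the time 3.</p>'
    print(escape_unpaired_brackets(bad))
```

Run on the bad paragraph, this keeps all of the text and only turns the unbalanced "<" into an entity. The stray closing </p> is left in place and still has to be fixed by hand.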

The problem is that Tidy is a real mess to try to fix or change. So I think the only solution in the long run is to write a much simpler replacement for Tidy that defaults to keeping too much "text" rather than deleting any.

If I get a free moment, I may take a stab at a prototype for doing this in Python, as a sort of "safe clean" plugin, to see if it is actually doable and would help.

Parsing bad XHTML, especially with unmatched "<" and ">", is fraught with issues, as it can confuse the hell out of the parser.

A "safe clean" parser would create the following output for that example:

Code:
&lt;pclass="bot" Now is the time for all good men to come to the aid of the party 3.<p></p>
Not great, but still easier to fix after the fact, and no text is lost.

Would this be a help?

KevinH
Does Tidy come into play even though I have it turned off?

Your example does not show the problem I have found. Every file I have ever lost, in both 7.4 and 7.7, shows the same thing: from some apparently arbitrary point in the file (perhaps a mistake?), everything from that point up to </body> is removed, and the ending is rewritten as </body></html> without the line break.
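
For anyone who wants to check a file for this symptom, here is a minimal Python sketch that compares a backup copy with the saved copy. The helper names (visible_words, find_lost_words, has_joined_ending) are made up for this example and are not part of Sigil or Tidy. It assumes both versions are available as strings, and it strips only simple, balanced tags before comparing words.

```python
import re
from collections import Counter

# Assumption for this sketch: a "tag" is "<", no other angle brackets, then ">".
TAG_RE = re.compile(r"<[^<>]*>")


def visible_words(markup):
    """Drop balanced tags and return the remaining whitespace-separated words."""
    return TAG_RE.sub(" ", markup).split()


def find_lost_words(before, after):
    """Return the words from `before` that are missing from `after`, in order."""
    remaining = Counter(visible_words(after))
    lost = []
    for word in visible_words(before):
        if remaining[word] > 0:
            remaining[word] -= 1
        else:
            lost.append(word)
    return lost


def has_joined_ending(saved):
    """True if the file ends with </body></html> and no line break between them."""
    return saved.rstrip().endswith("</body></html>")


if __name__ == "__main__":
    before = "<body>\n<p>one two</p>\n<p>three four</p>\n</body>\n</html>"
    after = "<body>\n<p>one two</p></body></html>"
    print(find_lost_words(before, after), has_joined_ending(after))
```

A non-empty list of lost words together with a joined </body></html> ending would match the damage described above. Because the comparison is by word, a cleaner that rewrites entities or repairs a broken tag can also cause words to be reported, so treat the result as a hint rather than proof.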