MobileRead Forums - View Single Post

KevinH · 09-20-2014, 11:24 AM

Hi,

I have been able to partly duplicate the "eating text" issue, but I had to do something really horrible to get Tidy to actually "eat text": I had to confuse it as to what is text and what is tag.

Code:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>

<p>Now is the time for all good men to come to the aid of the party 1.</p>

<p>Now is the time for all good men to come to the aid of the party 2.</p>

<pclass="bot" Now is the time for all good men to come to the aid of the party 3.</p>

<p>Now is the time for all good men to come to the aid of the party 4.</p>

</body>
</html>

Notice the missing ">" that should mark the end of the bad <pclass="bot" tag. An unmatched or missing ">" or "<" will always drive cleaning programs insane, as they can't tell what is text and what is tag.
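A minimal sketch of how an unpaired "<" or ">" might be spotted before any cleaner runs. The function name find_unbalanced is hypothetical, and the scan is deliberately naive: it assumes no comments, CDATA sections, or attribute values containing angle brackets.

```python
def find_unbalanced(text):
    """Return (offset, char) pairs for every "<" or ">" that has no partner."""
    unbalanced = []
    open_pos = None  # offset of a "<" still waiting for its ">"
    for pos, ch in enumerate(text):
        if ch == "<":
            if open_pos is not None:
                # A second "<" arrived before any ">": the earlier one is unpaired.
                unbalanced.append((open_pos, "<"))
            open_pos = pos
        elif ch == ">":
            if open_pos is None:
                # A ">" with no pending "<" is unpaired.
                unbalanced.append((pos, ">"))
            open_pos = None
    if open_pos is not None:
        # Input ended while a "<" was still open.
        unbalanced.append((open_pos, "<"))
    return unbalanced


sample = '<p>party 2.</p>\n<pclass="bot" party 3.</p>\n<p>party 4.</p>'
for pos, ch in find_unbalanced(sample):
    line_no = sample.count("\n", 0, pos) + 1
    print(f"line {line_no}: unpaired {ch!r}")
```

On the shortened sample this flags the "<" that starts <pclass="bot" on the second line, which is the spot where text and tag become ambiguous.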

Here is what Tidy in Sigil did to this file on open (note that paragraph 3 is gone entirely):

Code:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>

<p>Now is the time for all good men to come to the aid of the party 1.</p>

<p>Now is the time for all good men to come to the aid of the party 2.</p>

<p>Now is the time for all good men to come to the aid of the party 4.</p>

</body>
</html>

A better solution when an unpaired "<" or ">" exists is to simply replace the unbalanced one with its HTML entity code

Code:

 "&lt;"  or "&gt;"

to prevent anything from being lost.
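A rough Python sketch of that replacement, under the same naive assumptions as any character-level scan (no comments, CDATA, or angle brackets inside attribute values). The name escape_unpaired is made up for illustration and is not part of Sigil or Tidy.

```python
def escape_unpaired(text):
    """Replace unpaired "<" / ">" with &lt; / &gt;, leaving paired ones alone."""
    out = []
    open_idx = None  # index in `out` of a "<" still waiting for its ">"
    for ch in text:
        if ch == "<":
            if open_idx is not None:
                # The earlier "<" was never closed: demote it to text.
                out[open_idx] = "&lt;"
            open_idx = len(out)
            out.append(ch)
        elif ch == ">":
            if open_idx is None:
                # A ">" with no opener: demote it to text.
                out.append("&gt;")
            else:
                out.append(ch)
                open_idx = None
        else:
            out.append(ch)
    if open_idx is not None:
        # Input ended inside an unclosed "<".
        out[open_idx] = "&lt;"
    return "".join(out)


bad = '<pclass="bot" Now is the time 3.</p>'
print(escape_unpaired(bad))
# prints: &lt;pclass="bot" Now is the time 3.</p>
```

Every input character survives, either as itself or as an entity, so the worst case is extra visible text rather than a missing paragraph.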

The problem is that Tidy is really a mess to try to fix or change. So I think the only solution in the long run is to write a much simpler replacement for Tidy that defaults to creating too much "text" rather than deleting any.

If I get a free moment, I may take a stab at a prototype for doing this in Python, as a sort of "safe clean" plugin, to see if it is actually doable and would help.

Parsing bad XHTML, especially with unmatched "<" and ">", is fraught with issues, as it can confuse the hell out of the parser.

A "safe clean" parser would create the following output for that example:

Code:

&lt;pclass="bot" Now is the time for all good men to come to the aid of the party 3.<p></p>

Not great, but still easier to fix after the fact, and no text is lost.
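One way the empty <p></p> in that output could come about is a second pass that gives every stray closing tag an opener instead of deleting it. The sketch below is only an assumption about how such a pass might work; balance_tags and the TAG pattern are hypothetical, and the input is taken to have its unpaired "<" already turned into an entity.

```python
import re

# Matches simple open, close, and self-closing tags. It skips <?xml ...?>
# and <!DOCTYPE ...> because a tag name must start with a letter.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>")


def balance_tags(text):
    """Add missing open/close tags; never remove anything from `text`."""
    out = []
    stack = []  # names of elements currently open
    last = 0
    for m in TAG.finditer(text):
        out.append(text[last:m.start()])
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
            else:
                # Stray close tag: supply an opener rather than dropping it.
                out.append(f"<{name}>")
        elif not self_closing:
            stack.append(name)
        out.append(m.group(0))
        last = m.end()
    out.append(text[last:])
    # Close anything still open, innermost first.
    out.extend(f"</{name}>" for name in reversed(stack))
    return "".join(out)


escaped = '&lt;pclass="bot" Now is the time 3.</p>'
print(balance_tags(escaped))
# prints: &lt;pclass="bot" Now is the time 3.<p></p>
```

The design choice is the same as in the post: when in doubt, add markup or text and leave the cleanup to the person editing, since an extra empty paragraph is far easier to spot than a silently dropped one.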

Would this be a help?

KevinH

Last edited by KevinH; 09-20-2014 at 11:46 AM.