09-20-2014, 11:24 AM   #27
KevinH
Sigil Developer
Hi,

I have been somewhat able to duplicate the "eating text" issue but I had to do something really horrible to get Tidy to actually "eat text". I had to confuse it as to what is text and what is tag.

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>

<p>Now is the time for all good men to come to the aid of the party 1.</p>

<p>Now is the time for all good men to come to the aid of the party 2.</p>

<pclass="bot" Now is the time for all good men to come to the aid of the party 3.</p>

<p>Now is the time for all good men to come to the aid of the party 4.</p>

</body>
</html>
Notice the missing ">" that should mark the end of the bad <pclass="bot" tag. An unmatched or missing ">" or "<" will always drive cleaning programs insane, as they can't tell what is text and what is tag.

Here is what Tidy in Sigil did to this file on open:

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link href="../Styles/Style0001.css" rel="stylesheet" type="text/css"/></head>
<body>
<h1>Test</h1>

<p>Now is the time for all good men to come to the aid of the party 1.</p>

<p>Now is the time for all good men to come to the aid of the party 2.</p>

<p>Now is the time for all good men to come to the aid of the party 4.</p>

</body>
</html>

A better solution when an unpaired "<" or ">" exists is to simply replace the unbalanced one with its HTML entity code
Code:
 "&lt;"  or "&gt;"
to prevent anything from being lost.
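
As a rough illustration of that idea, here is a minimal Python sketch. It is not Tidy's or Sigil's code, and the function name escape_unbalanced_brackets is made up for this example. It assumes a "<" only starts a tag if a ">" follows it before the next "<", and it makes no attempt to handle comments, CDATA, or a ">" inside an attribute value.

```python
def escape_unbalanced_brackets(markup: str) -> str:
    """Escape any "<" or ">" that is not part of a "<...>" pair.

    Assumption for this sketch: a "<" opens a tag only if a ">" appears
    before the next "<". Everything else is treated as text.
    """
    out = []
    i = 0
    n = len(markup)
    while i < n:
        ch = markup[i]
        if ch == "<":
            close = markup.find(">", i + 1)
            next_open = markup.find("<", i + 1)
            if close != -1 and (next_open == -1 or close < next_open):
                # Balanced pair: copy the whole tag through untouched.
                out.append(markup[i:close + 1])
                i = close + 1
                continue
            # No matching ">" before the next "<": keep it as text.
            out.append("&lt;")
        elif ch == ">":
            # A ">" seen outside any tag is a stray one.
            out.append("&gt;")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


bad = '<pclass="bot" Now is the time.</p>'
print(escape_unbalanced_brackets(bad))  # &lt;pclass="bot" Now is the time.</p>
```

The stray </p> is still left behind for a later pass to deal with, but the text itself survives.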

The problem is that Tidy is really a mess to try to fix or change. So I think the only solution in the long run is to write a much simpler replacement for Tidy that defaults to creating too much "text" as opposed to deleting any.

If I get a free moment, I may take a stab at a prototype for doing this in Python, as a sort of "safe clean" plugin, to see if it actually is doable and would help.

Parsing bad XHTML, especially with unmatched "<" and ">", is fraught with issues, as it can confuse the hell out of the parser.

A "safe clean" parser would create the following output for that example:

Code:
&lt;pclass="bot" Now is the time for all good men to come to the aid of the party 3.<p></p>
Not great but still easier to fix after the fact with no text lost.
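
The "no text lost" part can be checked mechanically. Here is a minimal Python sketch, assuming it is good enough to compare word counts before and after cleaning. The name lost_words is hypothetical, and because tag and attribute names are counted as words too, a cleaner that legitimately rewrites markup will produce some noise.

```python
import html
import re
from collections import Counter


def lost_words(before: str, after: str) -> Counter:
    """Return the words that occur more often in `before` than in `after`."""
    def words(markup: str) -> Counter:
        # Decode entities first so "&lt;pclass" counts the same as "<pclass".
        return Counter(re.findall(r"\w+", html.unescape(markup)))

    # Counter subtraction keeps only the positive differences.
    return words(before) - words(after)


before = '<p>one</p><pclass="bot" two</p>'
eaten = '<p>one</p>'                             # bad paragraph dropped
safe = '<p>one</p>&lt;pclass="bot" two<p></p>'   # bad tag kept as text

print(lost_words(before, eaten))  # reports "two" (plus the tag debris)
print(lost_words(before, safe))   # empty: nothing lost
```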

Would this be a help?

KevinH

Last edited by KevinH; 09-20-2014 at 11:46 AM.