Buglet?

Thasaidon · 10-06-2020, 02:40 AM

I have just noticed.

I was editing a book and had got near the end. I went to "remove unused stylesheet classes". This refused, saying that the html was not well formed.

I then ran the well-formed check epub (F7). This produced no error. On further checking I found I had a "<<" in one of the files.

Shouldn't the well-formed check epub (F7) pick up such an error like the "remove unused stylesheet classes"?

DNSB · 10-06-2020, 11:49 AM

The F7 is a very basic check. Its limitations are why I have epubcheck and FlightCrew installed (on epub2, FlightCrew saves me from having to do a separate check for unused files).

KevinH · 10-06-2020, 03:50 PM

It should detect it. Preview should also have detected it. Please copy the exact xhtml (with the error) and zip it up and post it. I will try to see why the well-formed sanity check did not detect it and fix it.

Thanks,

KevinH

feel free to change the actual letters to gibberish if needed.

Doitsu · 10-06-2020, 04:22 PM

Quote:

Originally Posted by KevinH

It should detect it.

I just added an additional angle bracket to a </p>> tag and F7 didn't complain about it.

Spoiler:

When I added it before <<p>, it also wasn't flagged.

JSWolf · 10-06-2020, 04:29 PM

Quote:

Originally Posted by Doitsu

I just added an additional angle bracket to a </p>> tag and F7 didn't complain about it.

Spoiler:

When I added it before <<p>, it also wasn't flagged.

epubcheck does not catch </p>>. Is </p>> really an error?

DiapDealer · 10-06-2020, 07:41 PM

Quote:

Originally Posted by JSWolf

epubcheck does not catch </p>>. Is </p>> really an error?

So long as gumbo has been allowed to change the extraneous > to an entity, then no, it's not (other than potential naked text outside of tags). Plus I have no idea if epubcheck concerns itself with (x)html(5) well-formedness strictures.

Without diving into it, my guess here is that gumbo is "fixing" the extra angle-bracket before the internal well-formed check is performed, whereas that's not happening with the "Remove Unused css Classes" feature. It's possible that something is (or isn't) getting flushed to disk before one or the other of those activities.

DiapDealer · 10-06-2020, 07:46 PM

Wow! That's weird. Preview doesn't bomb with </p>> but it does with <<p. Sumpin's up!

DiapDealer · 10-06-2020, 07:48 PM

It's being converted to an entity somewhere. When I Edit as Html with the inspector, I can see the entity..

Tex2002ans · 10-06-2020, 08:02 PM

If you test it in W3C's Validation Service:

https://validator.w3.org/#validate_by_input

And give it XHTML with a "</p>>":

Code:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title></title>
</head>

<body>
  <p>Test</p>
  <p>And here's an error.</p>>
</body>
</html>

you get a "character data is not allowed here" error.

If you feed it similar in HTML:

Code:

<!DOCTYPE html>
<html>
<head>
  <title></title>
</head>

<body>
  <p>Test</p>
  <p>And here's an error.</p>>
</body>

no such error. It thinks it's fine...

If you do "<<p>" instead, both the XHTML1.1 + HTML5 checkers ping it.

Must be something obscure/weird in the HTML spec. Reminds me of when I found that bug with the accidental <p">, and KevinH tracked it down. Turns out such a thing IS valid in HTML... but extremely poor practice.
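
For anyone who wants to reproduce the asymmetry locally, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. It is not what Sigil or the W3C validator runs; the helper name is_well_formed is made up for illustration. A strict XML parser accepts a stray ">" as ordinary text but rejects a stray "<", which suggests (an assumption, not something confirmed in this thread) that the validator's "character data is not allowed here" message is about where text may appear rather than about the bracket itself.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the stdlib XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# A stray ">" after a closing tag is parsed as plain character data.
extra_gt = "<body><p>Test</p>></body>"
# A stray "<" before an opening tag cannot be parsed at all.
extra_lt = "<body><<p>Test</p></body>"

print(is_well_formed(extra_gt))
print(is_well_formed(extra_lt))

# The stray ">" ends up as text following the <p> element.
print(repr(ET.fromstring(extra_gt)[0].tail))
```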

KevinH · 10-06-2020, 09:50 PM

Okay I checked the python3lib sanitycheck.py code and it will treat "<<p>" as a spurious text "<" followed by a tag. And it will treat "</p>>" or "<p>>" as a tag followed by a spurious text ">".

I could detect both cases by verifying that the text returned from parsing does not contain an illegal > or < char when not a child of a CDATA tag.

So making sanity check detect these cases is doable. I will look into doing that.
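
The idea can be sketched roughly as follows. This is not the actual sanitycheck.py code; the regex tokenizer and the names MARKUP, find_unescaped_brackets and _scan are hypothetical and heavily simplified (for instance, a ">" inside an attribute value would confuse it). It splits the source into markup and text, skips CDATA sections, and reports any bare "<" or ">" left in the text.

```python
import re

# Hypothetical, simplified tokenizer: comments, CDATA sections,
# processing instructions, doctype declarations, then ordinary tags.
MARKUP = re.compile(
    r"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>|<![^>]*>|</?[A-Za-z][^<>]*>",
    re.DOTALL,
)


def _scan(source, start, end, problems):
    """Record (line, char) for each bare bracket in source[start:end]."""
    for i in range(start, end):
        if source[i] in "<>":
            line = source.count("\n", 0, i) + 1
            problems.append((line, source[i]))


def find_unescaped_brackets(xhtml):
    """Return a list of (line_number, char) for brackets found in text."""
    problems = []
    pos = 0
    for match in MARKUP.finditer(xhtml):
        _scan(xhtml, pos, match.start(), problems)  # text before this markup
        pos = match.end()
    _scan(xhtml, pos, len(xhtml), problems)  # trailing text
    return problems


print(find_unescaped_brackets("<p>Test</p>>"))
print(find_unescaped_brackets("<<p>Test</p>"))
print(find_unescaped_brackets("<p><![CDATA[a < b]]></p>"))
```

Reporting a line number with each hit is a design choice made here so the offending spot is easy to find in the editor; the real check may report errors differently.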

FWIW, HTML5 parsing rules only require xml escaping a ">" in text if it would be considered to result in ambiguous parsing. Whereas the "<" character should always be xml escaped when used in attribute values and text. Under XHTML, both characters should always be xml escaped when used inside attribute values and text fields.
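
As a small illustration of the escaping described above, Python's standard xml.sax.saxutils module escapes both characters (and "&") for text and attribute values. The wrapper names escape_text and escape_attribute are invented for this sketch.

```python
from xml.sax.saxutils import escape, quoteattr


def escape_text(text):
    """Escape &, < and > so the string is safe as XHTML text content."""
    return escape(text)


def escape_attribute(value):
    """Escape and quote a string for use as an XHTML attribute value."""
    return quoteattr(value)


print(escape_text("1 < 2 > 0"))
print(escape_attribute("a < b"))
```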

KevinH · 10-07-2020, 11:38 AM

This is now fixed in master. Well-Formed Check (sanitycheck.py) will now look for and detect missing xml escaping on '>' and '<' chars in text fields. So it will detect all of the '<<p>', '<p>>', and '</p>>' cases (of course on any tag).

Thank you for the bug report and helping to improve Sigil!
