Sigil 2.6.2 - help with html error

Derell Licht · Yesterday, 12:42 PM

As I posted in a recent thread, I got an error message while loading an html file into Sigil. The message said "line 30: Expected '>' or '/', but got ':'."

I traced this back to a script block which displayed a license notice (shown below). When I deleted this script block, the file loaded into Sigil with no problems, and I was able to convert it to epub with no further issues.

However, this block does *not* appear to contain any colons?!?! Line 30 is at the @licend line... why was I getting this error??

Code:

<script  nonce="1103b75ea8a534e00bd01d677f2ea330" >
/* @licstart  The following is the entire license notice for the
 * JavaScript code in this page.
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.
 *
 * @licend  The above is the entire license notice
 * for the JavaScript code in this page.
 */
</script>

Derell Licht

KevinH · Yesterday, 02:10 PM

'/* ' is not an xhtml comment and urls are not self-closed:
'see <http://www.gnu.org/licenses/>'

All of this should have been inside a separate javascript file and not inlined. Or all of the < and > characters in that comment would have to be xml escaped to be included in xhtml.
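A minimal sketch of the escaping point above, using Python's standard library rather than Sigil's own parser (so the exact error text will differ from Sigil's). The one-line `LICENSE_LINE` and the helper name `is_well_formed` are made up for illustration. A strict XML parser reads `<http:` as the start of an element, while the escaped form is plain text:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Shortened stand-in for the license comment in the post (illustrative only).
LICENSE_LINE = "/* If not, see <http://www.gnu.org/licenses/>. */"


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Raw version: '<http:' looks like the start of a tag to an XML parser.
raw = f"<script>{LICENSE_LINE}</script>"

# Escaped version: '<' and '>' become '&lt;' and '&gt;'.
escaped = f"<script>{escape(LICENSE_LINE)}</script>"

print(is_well_formed(raw))      # strict parse of the raw script block
print(is_well_formed(escaped))  # strict parse after escaping
print(escaped)
```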

KevinH · Yesterday, 02:12 PM

Please do not open a new thread for every reply. Simply add a new comment to the existing thread.

Derell Licht · Yesterday, 02:27 PM

Quote:

Originally Posted by KevinH

Please do not open a new thread for every reply. Simply add a new comment to the existing thread.

Okay, sorry about that; I felt that this was a different question than the previous thread ('why did this block fail??', as opposed to 'what is causing that error?'), but the distinction is probably obtuse, and I could have merged them together.

Derell Licht · Yesterday, 02:29 PM

So you are saying that licstart/licend block is not valid?? It was inserted by Internet Archive, I would have expected that they knew how to make their headers... but it appears that isn't always the case...

I will consider this thread answered, and it can be closed.

KevinH · Yesterday, 03:06 PM

Note that html is much much more accepting of embedded < and > chars and has much less strict syntax but requires a very forgiving and specialized parser. So what the internet archive gets away with in spaghetti html is not always valid in an epub.

The epub spec uses much stricter xhtml/xml parsing rules and therefore can use a much faster and simpler parser than html.

Sigil's Mend is a forgiving html parser that can create valid xhtml, which is why it is recommended to enable Mend when opening html files.
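To illustrate the html-versus-xhtml difference described above, here is a rough sketch that feeds the same markup to Python's lenient `html.parser` and to its strict XML parser. Neither is what Sigil uses internally (Sigil's Mend is built on gumbo), and `DOC`, `Collector`, `html_view` and `xml_accepts` are names invented for this example:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tiny stand-in document with the problematic comment inside a script tag.
DOC = (
    "<html><body><script>"
    "/* see <http://www.gnu.org/licenses/>. */"
    "</script><p>text</p></body></html>"
)


class Collector(HTMLParser):
    """Record start tags and text as a lenient html parser sees them."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


def html_view(doc: str):
    """Parse with the lenient html parser; return (tags, joined text)."""
    collector = Collector()
    collector.feed(doc)
    collector.close()
    return collector.tags, "".join(collector.chunks)


def xml_accepts(doc: str) -> bool:
    """Return True if the strict XML parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


tags, text = html_view(DOC)
print(tags)              # script content is treated as raw text, not tags
print(xml_accepts(DOC))  # the strict parser rejects the same input
```

The html parser treats everything inside the script element as raw text, so the URL in angle brackets never becomes a tag; the XML parser has no such special case.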

Derell Licht · Yesterday, 05:25 PM

However, this *does* lead back to my other question...

Code:

Mend Not Well Formed HTML Source code On:

is *already* set for both Open and Save, but I am still getting this error...

KevinH · Yesterday, 06:39 PM

Probably because that open and close is referring to Sigil just on opening an epub and on saving an epub. It looks more like you are trying to import a standalone html file that is not well formed. Just edit the file and remove that script open and close tag and replace it with an xhtml comment if you want the license info. Otherwise remove that script tag and its contents completely.

Another option is to open that html file in a text editor and copy and paste it into a blank xhtml document inside Sigil then run Mend on it.

Probably easier just to remove that bloody script tag and its contents, though.
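For anyone who would rather script the "replace it with an xhtml comment" suggestion above than edit by hand, a rough sketch follows. It is not part of Sigil; the regex and the function name `script_to_comment` are assumptions for illustration, and a regex like this is only reasonable for simple files where script blocks do not contain the text `</script>`.

```python
import re
import xml.etree.ElementTree as ET

# Matches a whole script element, capturing its content (illustrative regex).
SCRIPT_RE = re.compile(
    r"<script\b[^>]*>(.*?)</script\s*>", re.DOTALL | re.IGNORECASE
)


def script_to_comment(markup: str) -> str:
    """Replace each script element with an XHTML comment holding its text."""

    def repl(match):
        text = match.group(1).strip()
        # Drop the JavaScript comment delimiters /* and */ if present.
        text = re.sub(r"^/\*+|\*+/$", "", text).strip()
        # '--' is not allowed inside an XML comment, so collapse hyphen runs.
        text = re.sub(r"-{2,}", "-", text)
        return f"<!-- {text} -->" if text else ""

    return SCRIPT_RE.sub(repl, markup)


sample = (
    '<body><script nonce="x">'
    "/* see <http://www.gnu.org/licenses/>. */"
    "</script><p>Hi</p></body>"
)
fixed = script_to_comment(sample)
print(fixed)
ET.fromstring(fixed)  # raises ParseError if the result is not well formed
```

Inside an XML comment the '<' and '>' characters are allowed as-is, which is why no escaping is needed once the text is moved out of the script element.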

KevinH · Yesterday, 06:56 PM

I just took a look at the code that uses AddExisting to load an html file. It should detect that the file is not well formed and alert you with that info; then, if you have mend on open set, it should repair the file and add it to your current epub.

If you have Sigil preferences set to clean/mend on open, you should see the fixed file in Sigil's BookBrowser after you dismiss the error alert.

So it sounds like this is not working. It could be a bug. Would you please zip up a copy of that html file you want and attach it to your reply to this post, so we can use it to try to recreate your error, and fix it if needed.

Are you sure you do not see the mended version of that file in Bookbrowser after you dismiss the error alert?

Derell Licht · Today, 09:11 AM

Quote:

Originally Posted by KevinH

So it sounds like this is not working. It could be a bug. Would you please zip up a copy of that html file you want and attach it to your reply to this post, so we can use it to try to recreate your error, and fix it if needed.

Are you sure you do not see the mended version of that file in Bookbrowser after you dismiss the error alert?

Okay, here is a zip file with the failing html document...
I just checked this morning, and Sigil definitely is not opening the file at all.

BTW, I *did* simply delete that entire script block, which allowed me to import the file and generate an epub from it, as I wished...

However, I decided to pursue these discussions here, so that I would have a better understanding of what the problems were, because I don't have much experience with xml or epub formats at all...

Thank you for all of your insights here!!

Later note:
I don't seem to be able to attach a file here, and "Go Advanced" isn't working for me, so I just dropped the file onto my website, you can download it from there:
https://derelllicht.42web.io/files/ldsv.testing.zip

KevinH · Today, 09:38 AM

I grabbed it, thank you. If I can recreate the error with it, then I should be able to track it down and fix it.

KevinH · Today, 09:54 AM

There is a bug in Sigil's ImportHTML module that occurs when there is no fix found.
I will hopefully get that fixed for the next release.

For the record, the correct fix is the following:

Code:

 * You should have received a copy of the GNU Affero General Public License
 * along with this program.  If not, see &lt;http://www.gnu.org/licenses/&gt;.

Notice that in the javascript comments I needed to xml escape the '<' and '>' characters to their xml entities &lt; and &gt; respectively, in that license statement (something about the 'http:' is preventing the fix from being applied).

Also, for what it is worth, that is a huge file. For performance's sake, especially on old e-readers, you should insert Sigil split markers at proper demarcation points and split this file into many separate chapters or sections or whatever. Some old epub2-only e-readers could only use 320k of xhtml/html before slowing down (or even crashing), so this has become a reasonable maximum file size for epub chapters.

Hope this explains things. I will try to track down and fix the import bug.
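As a rough aid for the file-size advice above, the sketch below splits body markup at chapter headings and reports which sections are still over the roughly 320k guideline. It does not insert Sigil split markers or touch an epub; the choice of `h2` as the chapter heading and the names `split_at_headings` and `oversized` are assumptions for illustration.

```python
import re

MAX_BYTES = 320 * 1024  # the ~320k guideline mentioned above


def split_at_headings(body: str, tag: str = "h2") -> list:
    """Split body markup into sections, each starting at an opening <tag>."""
    parts = re.split(rf"(?=<{tag}\b)", body)
    return [part for part in parts if part.strip()]


def oversized(sections: list, limit: int = MAX_BYTES) -> list:
    """Return indexes of sections whose UTF-8 size exceeds the limit."""
    return [
        index
        for index, section in enumerate(sections)
        if len(section.encode("utf-8")) > limit
    ]


body = (
    "<p>front matter</p>"
    "<h2>One</h2><p>short chapter</p>"
    "<h2>Two</h2><p>" + "x" * MAX_BYTES + "</p>"
)
sections = split_at_headings(body)
print(len(sections))        # number of sections found
print(oversized(sections))  # indexes of sections that still need splitting
```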

KevinH · Today, 11:54 AM

It turns out that

/* <http://www.gnu.org/licenses/> */

inside a script tag is enough to break even Google's gumbo parser and prevent it from "fixing" this file.

I am not sure if this is worth a "fix" or not, as using illegal '<' and '>' characters inside a javascript comment inside a script tag is not a good idea. Using a normal xhtml comment to include this information, or including it as a separate javascript file, is much better.

It is just something I have never seen before and not something Google tested in its huge testing effort on literally millions of websites.

An xhtml parser would have to be able to successfully parse all of javascript to even detect this is a comment and not code.

Making gumbo "change" javascript inside a script tag is not something we want to do as it is simply a bad idea.

I think instead we will chalk this up as "a really dumb thing to do in general" but something html parsing will accept but not something anyone would ever want in an epub as there is no way for a normal e-reader user to ever see this license.

We can revisit this in the future if ever needed.

Thank you for your interesting test case.

Derell Licht · Today, 12:07 PM

Quote:

Originally Posted by KevinH

We can revisit this in the future if ever needed.

Thank you for your interesting test case.

I agree that I don't think any code modification is worth the effort in this edge case... I am quite happy just to have all this insight into what is going on in this example!! Thank you very much!

Derell Licht · Today, 12:19 PM

In case you are curious about how this html file came to be, it is part of my new technique for converting PDF documents/books to epub...

In the past I would run the PDF through an OCR converter, which generated a text file (that usually required a ton of cleanup). The biggest headache of the resulting text file was that all the lines ended up with hard CR/LF characters at end of every line, which needed to be removed if I wanted to make the pages flow smoothly with changing screen dimensions.

But it recently occurred to me that if I just wrap basic html constructs around the text file (html, head, title, body), then the newline issue completely vanishes, because html and derivatives ignore those breaks!! So all I have to do then is walk through the file, deleting all hard-coded pagination lines, insert <p> at end of each paragraph, and I'm done; just import into Sigil to generate the epub, and I'm ready to publish...

My mistake here, was that I wanted to retain Internet Archive's signatures, so anyone looking at the code would know where I got it... so I took that header from some other file on the IA page (for this book) and imported into my document... but I didn't realize until now that I had some traps to look out for !!

I also wasn't aware of the issues with a large html file, which you pointed out to me here... I just went back and added page breaks at all the new-chapter points.
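The text-wrapping technique described above could be sketched roughly as follows. This is not the poster's actual script: it assumes paragraphs in the OCR text are separated by blank lines (the post describes marking paragraphs by hand), and the function name `text_to_xhtml` is invented. Escaping the text on the way in also avoids the '<' and '>' trap discussed earlier in the thread.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text_to_xhtml(text: str, title: str = "Untitled") -> str:
    """Wrap plain text in minimal XHTML, one <p> per blank-line paragraph."""
    # Normalize CR/LF line endings, then split on blank lines (assumption).
    normalized = text.replace("\r\n", "\n").strip()
    paragraphs = re.split(r"\n\s*\n", normalized)
    body = "\n".join(
        # Join hard-wrapped lines with single spaces and escape &, <, >.
        f"<p>{escape(' '.join(paragraph.split()))}</p>"
        for paragraph in paragraphs
        if paragraph.strip()
    )
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body>\n</html>"
    )


sample = (
    "First line\r\nof para one.\r\n\r\n"
    "See <http://www.gnu.org/licenses/> & more.\r\n"
)
page = text_to_xhtml(sample, "Test")
print(page)
ET.fromstring(page)  # raises ParseError if the result is not well formed
```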


