Sigil's Infamous "colon" Error on File Split

slowsmile · 10-25-2016, 03:59 AM

Well, this fault occurs relatively frequently in EpubCheck validation whenever you use the file splitter in Sigil. Here is the EpubCheck error message from the IDPF validator:

ERROR(RSC-005): Error while parsing file 'value of attribute "id" is invalid; must be an XML name without colons'.

Of course, there are no colons in the id line to speak of so that error message from the Validator is complete hogwash and no help at all.

I kept getting this annoying error on Sigil file splits. Then one day, as I was browsing the rules and regs for epubs on the IDPF site, I read something interesting which said this, more or less: if you use a 32 char hex id in an 8-4-4-4-12 configuration to denote xml structure ids in the epub, then the first hex character in the uid must be an alphabet character. To illustrate this more broadly (with emphasis):

This id will fail IDPF EpubCheck because it starts with a numeric digit:
7d0d5c28-5743-40c1-bafa-048c5bba8e6f

But this id will pass because it starts with an alphabet character:
ed0d5c28-5743-40c1-bafa-048c5bba8e6f

So if you get this EpubCheck error on file split, just check the file split idref in the opf spine and manifest and, if necessary, change the first character from a numeric digit to an alphabet character in the range of a to f (because it's hex). Do this for both ids in the spine and manifest and the problem will be resolved.
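
A quick way to spot the offending values before running EpubCheck is to scan the opf for id and idref attributes that start with a digit. The snippet below is only a minimal sketch: the function names `is_valid_id` and `find_bad_ids` are made up for illustration, and the pattern is a simplified ASCII-only approximation of the XML "name without colons" rule (real XML names allow many more characters).

```python
import re

# Simplified, ASCII-only approximation of an XML name without colons:
# the first character must be a letter or underscore; later characters
# may also be digits, hyphens or periods. (Assumption: non-ASCII name
# characters, which XML also allows, are ignored in this sketch.)
VALID_ID = re.compile(r'[A-Za-z_][A-Za-z0-9_.\-]*')


def is_valid_id(value):
    """Return True if value passes the simplified id check."""
    return VALID_ID.fullmatch(value) is not None


def find_bad_ids(opf_text):
    """Return the id/idref values in opf_text that fail the check."""
    values = re.findall(r'\bid(?:ref)?="([^"]*)"', opf_text)
    return sorted({v for v in values if not is_valid_id(v)})
```

Calling `find_bad_ids()` on the text of content.opf returns a sorted list of the values that would need changing, which is empty when every id already starts with a letter or underscore.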

And it would also be quite nice if this problem was fixed in Sigil since it has been with us for such a long time. Can someone please fix this problem?

By the way, the book id in the metadata section is a different can of beans because it isn't part of the epub structure -- so it doesn't matter if this uid starts with a numeric digit or an alpha character.

Doitsu · 10-25-2016, 04:35 AM

This is not a Sigil bug; it's a case of GIGO.

Except when generating a TOC, Sigil does not change/add id values.
It's up to Sigil users to ensure that ids in epub2 files start with a letter.
(You can use ids that start with a number in epub3 files.)

slowsmile · 10-25-2016, 04:52 AM

Sorry Doitsu, not sure what you mean by GIGO. I'm not talking about TOC generation, I'm talking about the uid that is generated the first time you do a file split in Sigil. That 32 char uid is automatically generated by Sigil. This fault is hit and miss since Sigil's uid generator will generate a uid that can start with either an alpha or a numeric character. This should really be fixed and changed so that Sigil's uid generator generates uids that start with only alpha characters on a file split. And that's why this problem is a Sigil problem (which isn't helped much by EpubCheck's crappy and misleading error messaging).

Doitsu · 10-25-2016, 05:26 AM

The epubcheck error message that you got was triggered by ids that start with a number in (X)HTML or NCX files.

Sigil-generated book ids that start with a number in the .opf or .ncx files won't trigger that message.

For example, I just generated a new epub2 book that was assigned a hex value that starts with a number, and it wasn't flagged.

content.opf

Spoiler:

toc.ncx

Spoiler:

If you still believe that Sigil-generated ids cause epubcheck error messages, please provide step-by-step instructions that allow the developers to reproduce this issue.

slowsmile · 10-25-2016, 05:40 AM

I'm not talking about TOCs or Book ids.

I'm talking about the 32 char uid that is generated when you do a file split. Please can you forget about TOCs and ebook uids. You're travelling down the wrong road.

Try this. Open an ebook of yours in Sigil. Then choose a file in the Book Browser and split that file anywhere you like using the File Splitter button in the Sigil Toolbar. After you have split the file, check the content.opf and you will see the rather large uid that has been automatically generated in the spine and in the manifest because of the file split. That's what I'm talking about. And if that large uid -- which is indeed automatically generated by Sigil -- starts with a numeric digit then it will fail IDPF Epubcheck validation online and will give you the "colon" error. Try it for yourself.

Doitsu · 10-25-2016, 06:01 AM

Quote:

Originally Posted by slowsmile

I'm not talking about TOCs or Book ids.

I'm talking about the 32 char uid that is generated when you do a file split. Please can you forget about TOCs and ebook uids. You're travelling down the wrong road.

You didn't explicitly mention that you used Split at Cursor instead of Split at Markers. (Most Sigil users use split markers.)

Split at Cursor does indeed generate item id attributes that might start with a number and will trigger epubcheck error messages.

This is indeed a bug.

As a workaround simply select Insert > Split Marker followed by Edit > Split at Markers or press CTRL+SHIFT+RETURN followed by F6. This will ensure that the id of the split file will start with the file name.

Notjohn · 10-25-2016, 06:15 AM

Is this what we're talking about?:

<dc:identifier opf:scheme="UUID" id="BookId">urn:uuid:3f219299-e69b-41b7-b163-17aeb2668e9b</dc:identifier>

It's an epub2, and it passes Epubcheck.

I always split by placing the cursor, left-clicking, then clicking on the file-split icon in the second menu line. If that's a bad idea, why is the option there and so easy to use?

slowsmile · 10-25-2016, 06:33 AM

No Notjohn, we aren't talking about book ids. We are talking about ids generated when you split the file at the cursor. This generates a uid in the opf. See below (the uuid-style id on the Section0001.xhtml item and its matching itemref):

Code:

  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="styles_css" href="Styles/stylesheet.css" media-type="text/css"/>
    <item id="cover" href="Text/cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="contents" href="Text/contents.xhtml" media-type="application/xhtml+xml"/>
    <item id="Section0002.xhtml" href="Text/Section0002.xhtml" media-type="application/xhtml+xml"/>
    <item id="cover.jpg" href="Images/cover.jpg" media-type="image/jpeg"/>
    <item id="body1" href="Text/Chapter_1.xhtml" media-type="application/xhtml+xml"/>
    <item id="body2" href="Text/Chapter_2.xhtml" media-type="application/xhtml+xml"/>
    <item id="body3" href="Text/Chapter_3.xhtml" media-type="application/xhtml+xml"/>
    <item id="body4" href="Text/Chapter_4.xhtml" media-type="application/xhtml+xml"/>
    <item id="body5" href="Text/Chapter_5.xhtml" media-type="application/xhtml+xml"/>
    <item id="body6" href="Text/Chapter_6.xhtml" media-type="application/xhtml+xml"/>
    <item id="body7" href="Text/Chapter_7.xhtml" media-type="application/xhtml+xml"/>
    <item id="body8" href="Text/Chapter_8.xhtml" media-type="application/xhtml+xml"/>
    <item id="body9" href="Text/Chapter_9.xhtml" media-type="application/xhtml+xml"/>
    <item id="body10" href="Text/Chapter_10.xhtml" media-type="application/xhtml+xml"/>
    <item id="imag25849" href="Images/image001.jpg" media-type="image/jpeg"/>
    <item id="imag70213" href="Images/image002.jpg" media-type="image/jpeg"/>
    <item id="Title.xhtml" href="Text/Title.xhtml" media-type="application/xhtml+xml"/>
    <item id="12c14ef3-e5d0-4c7f-af43-9a5fa134c2bf" href="Text/Section0001.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="cover"/>
    <itemref idref="contents"/>
    <itemref idref="Title.xhtml"/>
    <itemref idref="12c14ef3-e5d0-4c7f-af43-9a5fa134c2bf"/>
    <itemref idref="Section0002.xhtml"/>
    <itemref idref="body1"/>
    <itemref idref="body2"/>
    <itemref idref="body3"/>
    <itemref idref="body4"/>
    <itemref idref="body5"/>
    <itemref idref="body6"/>
    <itemref idref="body7"/>
    <itemref idref="body8"/>
    <itemref idref="body9"/>
    <itemref idref="body10"/>
  </spine>

...The generated idref above will fail EpubCheck because its first character is a numeric digit. The first character must be an alphabet character to pass EpubCheck.
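
For anyone who would rather not hand-edit the opf, the repair can be scripted. What follows is a rough sketch, not anything Sigil itself does: `fix_numeric_ids` and the `"x"` prefix are hypothetical choices, the approach is plain text substitution rather than real XML parsing, and it assumes the prefixed id does not collide with an id that already exists.

```python
import re


def fix_numeric_ids(opf_text, prefix="x"):
    """Prefix every manifest item id that starts with a digit and
    update the matching spine idref values so both stay in sync."""
    # Collect manifest ids that begin with a digit. "<item\b" does not
    # match "<itemref", so only manifest entries are inspected here.
    bad_ids = set(re.findall(r'<item\b[^>]*\bid="(\d[^"]*)"', opf_text))

    def rename(match):
        attr, value = match.group(1), match.group(2)
        if value in bad_ids:
            value = prefix + value
        return f'{attr}="{value}"'

    # Rewrite both id="..." and idref="..." attributes in one pass.
    return re.sub(r'\b(id|idref)="([^"]*)"', rename, opf_text)
```

Prepending a letter keeps the rest of the uid intact, so it is still easy to recognise which file came from the split. Other places that might reference a manifest id are not handled by this sketch.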

Toxaris · 10-25-2016, 08:09 AM

Quote:

Originally Posted by slowsmile

And it would also be quite nice if this problem was fixed in Sigil since it has been with us for such a long time. Can someone please fix this problem?

I find this funny. You kind of complain that this should have been fixed as it has been happening for a long time, yet this is the first time it has been reported AFAIK. If a bug is not reported, it cannot be fixed unless one of the developers happens to catch the bug by accident. Then again, in that case it would have been reported to themselves.

If you found this bug quite a while ago, you could have reported it back then. It would probably have been fixed by now.

And to call it 'infamous' is not correct in any way and quite harsh. It is not well known at all and it is not bad quality, just a silly prerequisite from the specs that has been corrected in ePUB3.

DiapDealer · 10-25-2016, 08:40 AM

Guys, I can't get Sigil to create a uuid manifest id (valid or otherwise) no matter how I split a file: at markers; at cursor (in Book View or Code View).

EDIT: never mind... I see it. It only seems to happen when there's only one manifested xhtml file in the epub. If more than one file exists, the new split file gets assigned the generated unique file name as its manifest id.

I almost never work with one-file epubs. Easy to see how this would escape detection. Especially if no one ever reports it.

In the meantime: add a blank html file, do your splits, and then delete the blank html file to work around the issue until such time as it gets resolved.

KevinH · 10-25-2016, 11:56 AM

Hi All,

Of course this would only be reported the day AFTER we close the tree to changes so that translators get a chance to update their translations for Sigil-0.9.7? It always seems to happen that way ;-)

If we can fix this without messing up the source line numbers used to key the translations too much, we will try to sneak this fix into the upcoming Sigil-0.9.7; otherwise it will be the first bug fixed for the follow-on release.

Thanks for the bug report.

KevinH

st_albert · 10-25-2016, 12:56 PM

Fascinating!

I've been splitting a single file containing multiple chapters (InDesign CS4 export) for years, and never got bitten by this bug.

Why? Because (1) I always use "split at markers"; and (2) I always rename the split files to something like "chapter002" instead of "Section0001_0001", which also fixes the ids in the opf.

I must be living right!

Albert

DiapDealer · 10-25-2016, 02:12 PM

Quote:

Originally Posted by DiapDealer

In the meantime: add a blank html file, do your splits, and then delete the blank html file to work around the issue until such time as it gets resolved.

Actually, that still doesn't work when splitting the Section0001.xhtml at the cursor. Sorry.

The easiest working workaround is: rename the single xhtml file to anything other than "Section0001.xhtml" before any Splitting at Cursor activity.

In short ... no files named "Section000?.*" and you won't get bit.

slowsmile · 10-25-2016, 06:56 PM

My thanks to KevinH and DiapDealer for recognizing this as a fault.

Some added info that might help: I'm using Sigil v0.9.6 on Windows. This fault occurs randomly when I use the File Splitter button (to the right of the Text/Html view button) on the Toolbar. I've always corrected this problem by just directly changing the first char in the uid to an alpha char within the opf file itself.

I've also written several python apps that deal with conversion to epub and, as a necessary consequence and precaution from this problem, I've also written several uid generators that work to only generate uids that always start with an alpha character. I don't know if this will be of any help to KevinH or DiapDealer but an example is given below:

Code:

from random import choice

#========================================================#
#
#     generates a 32 char uid with an alpha char start.
#
def getUID():
    """  Generates a 32 char uid in 8-4-4-4-12 grouping
         with an alpha char always as the start character.
         This is necessary to avoid epubcheck "colon" errors
         occurring from structural uids used within epub xml files.
    """

    # all sixteen hex digits, including '0'
    hex_digits = '0123456789abcdef'

    # build the 8-4-4-4-12 groups, picking each character
    # independently so that digits are allowed to repeat
    groups = [''.join(choice(hex_digits) for _ in range(n))
              for n in (8, 4, 4, 4, 12)]

    # replace the first char with a random alpha hex value
    groups[0] = choice('abcdef') + groups[0][1:]

    # build the uid
    return '-'.join(groups)

KevinH · 10-25-2016, 08:49 PM

Our bug is in ResourceObjects/OPFResource.cpp in GetUniqueID and is only hit when a preferred id already exists someplace (i.e. splitting at a cursor when the original file name is no longer enough to be unique).

Checking whether the first character is a digit is quite easy, and if so, prepending a non-digit will work just fine.

Code:

QString OPFResource::GetUniqueID(const QString &preferred_id, const OPFParser& p) const
{
    if (p.m_idpos.contains(preferred_id)) {
        return Utility::CreateUUID();
    }
    return preferred_id;
}
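
The approach described above (check the first character, prepend a non-digit) can be illustrated in Python to match the generator posted earlier in the thread. This is only a sketch of the idea, not the actual Sigil patch: `get_unique_id`, the `existing_ids` set and the `"x"` prefix are assumptions made for the example.

```python
import uuid


def get_unique_id(preferred_id, existing_ids):
    """Return preferred_id if it is unused; otherwise fall back to a
    UUID-based id that never starts with a digit."""
    if preferred_id not in existing_ids:
        return preferred_id

    new_id = str(uuid.uuid4())
    if new_id[0].isdigit():
        # Prepend a letter (hypothetical prefix) so the id is still a
        # valid XML name in an epub2 manifest.
        new_id = "x" + new_id
    return new_id
```

Prepending leaves the random part of the UUID untouched, whereas overwriting the first hex digit would discard a few bits of it; either way the result starts with a letter.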

