questions on self-closing tags and legal xhtml in epubs


KevinH
04-01-2012, 05:39 PM
Hi,

I have been playing around with html5lib and lxml in Python and libxml2 in C to write code to process epubs, and have run into difficulties parsing xhtml documents that contain the following self-closing tags. Are these legal in strict xhtml as used in epub 2? Are they still legal in epub 3?

<title />

<a id="blah" />

<div id="blah" />

<div id="blah" class="clearfix" />

When I parse xhtml containing these self-closing tags, the parsers get very confused (this must all tie back to libxml2, since I believe they are all front ends to that library). They either start replacing the < and > of tags with their html entities, or they assume the ending tag was never found and add a new ending tag much farther on, which can easily change the meaning, especially for the "clearfix" class approach to clearing floats.

Even modern browsers seem to have trouble dealing with these particular self-closing tags.

I know in pure xml almost any tag can be a self-closing tag, but I thought under strict XHTML for epubs only specific tags like <meta /> and <hr /> were allowed to be self-closing and that all others must be explicitly and separately closed to guarantee proper ebook viewing.
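
A minimal sketch of the "pure xml" point, using only Python's standard library xml.etree.ElementTree (not the lxml / html5lib / libxml2 setups discussed in this thread). The helper name shape is made up for illustration. To an XML parser, the self-closing form and the explicitly closed form of an empty div produce the same tree; the trouble described above shows up when the same markup goes through HTML parsing rules instead.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The same fragment written two ways: self-closing vs. explicitly closed.
self_closed = f'<body xmlns="{XHTML_NS}"><div id="blah" class="clearfix" /><p>text</p></body>'
explicit = f'<body xmlns="{XHTML_NS}"><div id="blah" class="clearfix"></div><p>text</p></body>'


def shape(element):
    """Hypothetical helper: nested tuple of tag, attributes, text, children."""
    return (
        element.tag,
        tuple(sorted(element.attrib.items())),
        (element.text or "").strip(),
        tuple(shape(child) for child in element),
    )


tree_a = shape(ET.fromstring(self_closed))
tree_b = shape(ET.fromstring(explicit))

# An XML parser sees no difference between the two spellings.
print(tree_a == tree_b)
```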

Does anyone know what the spec actually says? Having to work around these bugs is quite painful, and looking for and fixing all of these before parsing the xhtml makes things quite slow at times.
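
One possible form of the "look for and fix before parsing" workaround, as a rough sketch only. It assumes a regex pre-pass is acceptable; the names VOID_ELEMENTS, SELF_CLOSING and expand_self_closing are made up here and are not from any of the libraries mentioned. The void list is the set of elements declared EMPTY in XHTML 1.x. A regex like this does not understand comments, CDATA sections, or attribute values containing "/>", so it is not a substitute for a real parser.

```python
import re

# Elements declared EMPTY in XHTML 1.x; these are left self-closing.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}

# A tag name, optional attributes, then "/>".
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9:._-]*)((?:\s+[^<>]*?)?)\s*/>")


def expand_self_closing(markup):
    """Rewrite <x ... /> as <x ...></x> for every non-void element."""

    def replace(match):
        name = match.group(1)
        attrs = match.group(2).rstrip()
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # leave <br />, <hr />, <meta ... /> alone
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSING.sub(replace, markup)


print(expand_self_closing('<div id="blah" class="clearfix" />'))
print(expand_self_closing("<title />"))
print(expand_self_closing("<hr />"))
```

Running the pre-pass once per file and caching the result is one way to limit the slowdown mentioned above, though that is a design guess rather than something measured.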

Ideas anyone?

Thanks,

Kevin

DaleDe
04-01-2012, 09:11 PM
Self closing tags should be those things that don't have data. There is no reason to have a <div> that is self closing. It makes no sense at all as div is meant to enclose something. You can just assign the id to a different tag. There is generally no reason to use the a tag by itself any longer for the same reason. I wouldn't bother using it for title either.

KevinH
04-01-2012, 09:54 PM
Hi,

Thanks for your response. I don't want to use them. I am finding them in the wild inside epubs, and they are not displayed properly by some of the ebook readers I have access to (nor by all browsers). They are also not handled properly by lxml, html5lib and libxml2, which are often used to parse xhtml and are typically used inside ebook reading / handling software like kindlegen, calibre, sigil, etc.

I think the "clearfix" example is often used to fix bugs when using css to float an image right or left. This float behaviour often needs to be cleared. The div can carry a class that actually clears the float without containing anything else, since the following text needs to wrap around the floated image. The others are simply strange to me, but they do exist, even inside commercial epubs.

I was hoping that someone would have some idea whether they are actually legal xhtml or an artifact of xml processing software being improperly used to handle xhtml code.

KevinH



DaleDe
04-02-2012, 01:07 AM
Ahh. You can read about ePub in our wiki, and I am pretty sure these are not legal in xhtml. The wiki has links to the specs. While xhtml is modeled on xml, it is really designed to make html conform to the standards of xml, not to turn it into some arbitrary xml. Hope this helps. Certainly you are right that many ebook readers will not interpret these the way an xml parser would. They are all basically designed for html that conforms to xml, and this is what the spec says.

Dale

Jellby
04-02-2012, 01:53 PM
I've used self-closing divs sometimes. For scene breaks, where I want an empty div with a fixed height, writing <div class="break" /> seemed cleaner than <div class="break"></div>. It works fine in my reader (ADE-based) and didn't cause flightcrew to complain. Last time I checked, I came to the conclusion that it was valid.

user_none
04-23-2012, 10:12 PM
To fully answer this question: yes, self-closing tags (the a and div elements in this example) are perfectly valid according to the EPUB spec. The following example conforms to the EPUB 2 spec.


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>

<body>
<p>
<a id="blah" />
</p>

<div id="blah1" />

<div id="blah2" class="clearfix" />
</body>
</html>


The relevant sections of the EPUB 2 spec are 1.4.1.2 (http://idpf.org/epub/20/spec/OPS_2.0.1_draft.htm#Section1.4.1.2) and Appendix A (http://idpf.org/epub/20/spec/OPS_2.0.1_draft.htm#AppendixA). XHTML 1.1 does allow self-closing tags as used above. Specifically, Appendix A requires that the document validate against the XHTML 1.1 DTD (http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd).
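
As a rough companion to the example above, here is a sketch that checks a condensed copy of it for well-formedness with Python's standard library and lists the elements that end up empty. It does not validate against the XHTML 1.1 DTD (xml.etree.ElementTree is a non-validating parser and does not fetch the DTD), so it only confirms the XML side of the argument. The names SAMPLE and empty_elements are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Condensed copy of the sample document from the post above.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title/></head>
<body>
<p><a id="blah" /></p>
<div id="blah1" />
<div id="blah2" class="clearfix" />
</body>
</html>"""


def empty_elements(document):
    """Parse the document (raises ParseError if not well-formed) and
    return the local names of elements with no children and no text."""
    root = ET.fromstring(document)
    found = []
    for element in root.iter():
        has_text = bool((element.text or "").strip())
        if len(element) == 0 and not has_text:
            found.append(element.tag.replace(XHTML, ""))
    return found


print(empty_elements(SAMPLE))
```

Note that after parsing there is no way to tell whether an empty element was written as <div /> or <div></div>, so this finds candidates rather than the self-closing spelling itself.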