MobileRead Forums - View Single Post - questions on self-closing tags and legal xhtml in epubs

KevinH · 04-01-2012, 06:39 PM

Hi,

I have been playing around with html5lib and lxml in Python and libxml2 in C to write code to process epubs, and have run into difficulties parsing xhtml documents that contain the following self-closing tags. Are these legal in strict xhtml as used in epub 2? Are they still legal for epub 3?

<title />

<a id="blah" />

<div id="blah" />

<div id="blah" class="clearfix" />

When I parse xhtml containing these self-closing tags, the parsers (which I believe all tie back to libxml2, since they are all front ends to that library) get very confused. They either start replacing the < and > of the tags with their html entities, or they assume the closing tag was never found and add a new closing tag much, much farther on, which can easily change the meaning, especially for the float-clearing "clearfix" class approach.

Even modern browsers seem to have trouble dealing with these particular self-closing tags.

I know in pure xml almost any tag can be a self-closing tag, but I thought under strict XHTML for epubs only specific tags like <meta /> and <hr /> were allowed to be self-closing and that all others must be explicitly and separately closed to guarantee proper ebook viewing.
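To make the "pure xml" case concrete, here is a minimal sketch using only Python's standard-library XML parser (xml.etree.ElementTree, which is not one of the parsers named above). The sample markup and the helper name child_summary are made up for illustration. It only shows how an XML parser reads the self-closed tags; it says nothing about what the epub specs allow.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample: two self-closed non-empty-model elements, then a paragraph.
SAMPLE = (
    '<body xmlns="http://www.w3.org/1999/xhtml">'
    '<a id="blah" />'
    '<div id="blah2" class="clearfix" />'
    '<p>after the clearfix div</p>'
    '</body>'
)

def child_summary(xml_text):
    """Return (local tag name, number of children) for each child of the root."""
    root = ET.fromstring(xml_text)
    prefix = "{" + XHTML_NS + "}"
    return [(child.tag.replace(prefix, ""), len(child)) for child in root]

summary = child_summary(SAMPLE)
print(summary)  # [('a', 0), ('div', 0), ('p', 0)]
```

Under XML rules the a and div elements are closed immediately, so the paragraph is a sibling of the clearfix div rather than a child of it.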

Does anyone know the exact spec? Having to work around these bugs is quite painful, and looking for and fixing all of these before parsing the xhtml makes things quite slow at times.
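A minimal sketch of the kind of "find and fix before parsing" pass described above, assuming a regex-based approach is acceptable. The function name expand_self_closed and the VOID_TAGS set are hypothetical; the set lists the elements declared EMPTY in the XHTML 1.x DTDs. It does not skip comments or CDATA sections, so treat it as a starting point only.

```python
import re

# Elements declared EMPTY in the XHTML 1.x DTDs; these stay self-closed.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img",
             "input", "link", "meta", "param"}

# A start tag ending in "/>": a tag name, then zero or more quoted attributes.
SELF_CLOSED = re.compile(
    r"""<([A-Za-z][\w:.-]*)((?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*/>"""
)

def expand_self_closed(xhtml):
    """Rewrite <tag ... /> as <tag ...></tag> for every non-void element."""
    def repl(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_TAGS:
            return match.group(0)  # leave <hr />, <meta ... /> and friends alone
        return "<{0}{1}></{0}>".format(name, attrs)
    return SELF_CLOSED.sub(repl, xhtml)

print(expand_self_closed('<div id="blah" class="clearfix" /><hr />'))
# <div id="blah" class="clearfix"></div><hr />
```

Compiling the pattern once and running a single sub() over each file keeps the cost to one linear scan per document, which may help with the slowness mentioned above.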

Ideas anyone?

Thanks,

Kevin
