Confused! XHTML, HTML, HTML5, EPUB2, EPUB3???


carlosbcg
02-19-2013, 08:58 PM
Hi everyone,

This is my first post. I normally try and figure things out myself as that is faster than trying to get help somewhere but after spending countless hours reading and trying to figure this out I thought it wouldn't hurt for me to pick some brains on this :).

Here is what I think I know so far...(I am working on my first book and wanting to create it as an EPUB3).

EPUB2 uses CSS2 and HTML inside. But, and here is where it gets a bit confusing right off the bat, the HTML must be in XHTML format but having an extension of...well...HTML?

EPUB3 uses CSS3 and HTML5 but it too must be in XHTML format?? If so what brand of XHTML? And the extension of the HTML5 (or is it XHTML files) must be...hmm...HTML again?

Can one create an EPUB3 file with plain ol HTML5 and CSS3 without the XHTML business?

What is the difference between XHTML and HTML5?

Is HTML5 just the latest and greatest HTML while XHTML is...I dunno :smack:

Among the hats I have worn in the past is that of a web developer so I am very familiar with CSS and HTML (even PHP) but hardly at all with XHTML and HTML5.

Any insight on any of this gobbledygook anyone might care to share with me would be most appreciated.

Thanks!

Carlos

carlosbcg
02-19-2013, 09:24 PM
Here is more confusing goodness :smack:

I just read this at IBM developerWorks...

"if you're migrating from an EPUB 2 to EPUB 3 workflow, consider starting by converting from existing NCX documents. Because both input and output documents are XML, this is a perfect application for XSLT."

What in the world is XML and XSLT??

So now we have HTML, HTML5, XHTML, EPUB2, EPUB3, XML, and XSLT!

Sigh.

Carlos

Turtle91
02-20-2013, 12:35 AM
Welcome to the Forum!

It can be confusing at first. But I'll give a very basic overview and then point you to where you can get some good info.

XHTML is just HTML with a stricter adherence to some rules. For example - you are required to have proper closing tags on ALL of your elements, and attribute names need to be lower case. As long as you follow the rules you can name it .html and no one is the wiser. The .xhtml is/was just a way to tell the difference for software programs that cared.
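To make the "stricter rules" concrete, here is a minimal Python sketch that runs markup through a standard XML parser. The sample strings and the helper name `is_well_formed_xml` are made up for illustration; the only claim is that XHTML has to survive an XML parser, while ordinary loose HTML often does not.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(markup: str) -> bool:
    """Return True if the markup parses as XML, which XHTML must do."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Typical lenient HTML: <br> and <P> are never closed, tag case is mixed.
loose_html = "<div><P>First line<br>Second line</div>"

# The same content written to XHTML rules: lowercase, everything closed.
strict_xhtml = "<div><p>First line<br/>Second line</p></div>"

print(is_well_formed_xml(loose_html))     # False
print(is_well_formed_xml(strict_xhtml))   # True

# XML is case-sensitive, so an open tag and close tag must match exactly.
print(is_well_formed_xml("<P>text</p>"))  # False
```

A browser in HTML mode would happily render all three strings; only the XML parser is picky, which is exactly the difference being described.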

CSS2 is the list of attributes that you can use to describe html elements. CSS3 is the new list with extended capabilities/attributes. Depending on the particular reader App/Device they can support any, none, or all of the CSS3 list. If you want the widest possible support of your epub, you should use ePub2 and stick to the CSS2 list.

HTML5 is just the next generation of HTML. It was 4.01, but they added some new capabilities/tags.

ePub2 is what almost all ebooks are written in now - because the reader Apps/Devices don't yet support the advanced functionality in the new version ePub3. This is changing - slowly. Kindle, iBooks, Kobo, are among the few who are STARTING to support the functionality - but they are using their own hybrid...so that's a mess.

The ncx file is what ePub 2 uses to navigate through the documents - think Table of Contents. ePub 3 uses more of an HTML based document for its navigation.

XML - don't know...I avoid it! :D

XSLT - don't know about that either, but from the context I would gather it is an interpretation program from one format to another??

OK...pretty quick and dirty.

The W3Schools website has some great tutorials that can explain all of this much better, and they have reference pages as well to show you what all the different tags mean and how they are used.

http://www.w3schools.com/html/default.asp

I hope that helps!

Cheers,

dgatwood
02-20-2013, 12:36 AM
From the top:

SGML stands for Standard Generalized Markup Language. There are many SGMLs. Crudely put, an SGML is any markup language that is characterized by any arbitrary set of tags surrounded by angle braces, with certain bits at the beginning to tell you what type of file it is, and there are probably a few other rules.

XML is a strict subset of SGMLs in which, among other things, all tags must be matched with a close tag, and a few other details. There are many dialects of XML (an XML dialect is basically just a specific set of allowed tags that can be nested in specific ways), including DocBook, XHTML, property lists, and so on.

HTML is an example of an SGML. HTML has a specific set of tags that are considered valid. HTML is not, however, based on XML, because some tags do not have to be closed at all (hr, script when a URL is provided, and so on), and some tags auto-close at the right time (p, li, etc.). [Edit: And, as Turtle91 pointed out, HTML specifies case-insensitive tag and attribute parsing, whereas XML specifies case-sensitive tag and attribute parsing, which, in the case of XHTML's built-in tags and attributes, translates to "all lowercase".]

HTML5 is a specific version of HTML. Like all HTMLs, it is an SGML, but HTML5 files are not (necessarily) proper XML.

XHTML is a special form of HTML that has been modified slightly so that every XHTML file is a proper XML file that conforms to the stricter XML standards. This requires a few tiny tweaks around the fringes, but it mostly looks like HTML with some extra close tags or self-closing tags.

XSLT is another XML dialect. An XSLT stylesheet provides a set of rules for transforming from one XML dialect to another (typically, though in practice, it can be used to translate a specified XML dialect into pretty much anything, up to and including LaTeX commands).

EPUB2 and EPUB3 are versions of EPUB. EPUB2 uses XHTML under the hood. EPUB3 uses HTML5, but it must be parseable as XML. So it must be a polyglot XML/HTML5 document. This polyglot is called XHTML5, but is defined as part of the HTML5 standard rather than in a separate standard as previous XHTML versions were.

Clear as mud?

Toxaris
02-20-2013, 02:44 AM
That sums it about up. You can consider the XSLT as the CSS for XML files.

Be aware that there are some more differences with regards to XHTML and HTML, but not earth-shaking. The extension is not important, it is the first line in the document that tells the renderers (e.g. browsers) how to interpret the document.
An ePUB file must be XHTML. I would advise that if you don't need javascript/audio/video and stuff like that, to create an ePUB2 instead of ePUB3.

mrmikel
02-20-2013, 06:02 AM
To add to your misery, NO device fully supports any of the above.

So start with something simple, even a few lines. View it in a reader for whatever device you are aiming at and go from there. Epubs containing mostly text are going to have the fewest problems, so that might be a place to start. Sigil is a great program that handles many of the details for you, if you have an html file to feed it, or you can type in directly or paste in text. Many start with Word, export as filtered html, or use a macro by Toxaris available here, then open in Sigil to finish off things.

There is a library here which has thousands of books. You can open the files that make up the epubs with calibre or sigil, or unzip them and display them in any text editor you like. Just make sure you do not re-zip them yourself without knowing a few further arcane rules.
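The best known of those re-zipping rules is that the file named mimetype has to be the very first entry in the archive, stored without compression, and contain exactly `application/epub+zip`. A minimal Python sketch of a packer that respects this follows; the function name `build_epub`, the directory layout, and the placeholder file contents in the demo are assumptions for illustration, and a real book also needs a valid META-INF/container.xml and package file.

```python
import tempfile
import zipfile
from pathlib import Path

MIMETYPE = "application/epub+zip"

def build_epub(source_dir, epub_path):
    """Zip an unpacked book directory into an EPUB container."""
    src = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as zf:
        # "mimetype" goes in first, uncompressed, with no trailing newline.
        zf.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE,
                    compress_type=zipfile.ZIP_STORED)
        # Everything else may be deflated as usual.
        for path in sorted(src.rglob("*")):
            rel = path.relative_to(src).as_posix()
            if path.is_file() and rel != "mimetype":
                zf.write(path, rel, compress_type=zipfile.ZIP_DEFLATED)

# Demo with placeholder files (not a valid book, just enough to zip).
with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp) / "book"
    (book / "META-INF").mkdir(parents=True)
    (book / "META-INF" / "container.xml").write_text("<container/>", encoding="utf-8")
    (book / "chapter1.xhtml").write_text("<html/>", encoding="utf-8")
    out = Path(tmp) / "book.epub"
    build_epub(book, out)
    with zipfile.ZipFile(out) as zf:
        first = zf.infolist()[0]
        names = zf.namelist()
        mimetype_bytes = zf.read("mimetype")
    print(first.filename, first.compress_type == zipfile.ZIP_STORED)
```

Dragging a folder into a generic zip tool usually gets the ordering or the compression of that first entry wrong, which is why hand-zipped EPUBs often fail validation.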

Jellby
02-20-2013, 06:26 AM
Regarding CSS, neither ePub 2 nor ePub 3 supports CSS2 or CSS3 in full. They both support some subset of them with some additional properties.

But "support" here simply means that compliant readers are required to know some properties and values (not always to do anything useful with them). Unfortunately, no reader actually supports everything it is required to. And then any reader is allowed to support additional properties.

Since CSS is designed to ignore unknown properties, this means you should be able to use whatever you want in the CSS (as long as it's syntactically correct), but can never be sure of the effect it will have on a reader without trying. I mean, the resulting ePub will be valid, but may not work exactly as intended.

DiapDealer
02-20-2013, 07:22 AM
And for the record, I don't think the extension of the (x)html(5) files really matters at all--as long as the content is compliant and they're manifested properly.
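"Manifested properly" refers to the package (OPF) file, where every content file is listed with a media-type. A rough Python sketch of that idea is below; the sample manifest and the helper name `content_documents` are invented for illustration. It picks out content documents by their declared media-type and ignores the file extension entirely.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
XHTML_TYPE = "application/xhtml+xml"

# Hypothetical manifest: one chapter is named .html, the other .xhtml.
sample_opf = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <manifest>
    <item id="ch1" href="chapter1.html" media-type="application/xhtml+xml"/>
    <item id="ch2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
</package>"""

def content_documents(opf_text):
    """Return hrefs of manifest items declared as XHTML content."""
    root = ET.fromstring(opf_text)
    items = root.findall("opf:manifest/opf:item", OPF_NS)
    return [i.get("href") for i in items if i.get("media-type") == XHTML_TYPE]

print(content_documents(sample_opf))  # ['chapter1.html', 'chapter2.xhtml']
```

Both chapters count as XHTML content documents because the manifest says so, regardless of what the filenames end in.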

DaleDe
02-20-2013, 01:40 PM
You can also read about all of these things in our wiki. And as DiapDealer said, don't be fooled by a file extension. The content is what matters, not the extension, particularly for all the HTML and XML variations.

Dale

Toxaris
02-20-2013, 02:30 PM
I am not quite sure he is less confused now...

Turtle91
02-20-2013, 02:41 PM
I am not quite sure he is less confused now...

HE might not be less confused...but I certainly am. I know for a FACT that I'm crazy to get involved with this stuff! :)

twobits
02-20-2013, 04:26 PM
From the top:

SGML stands for Standard Generalized Markup Language. There are many SGMLs. Crudely put, an SGML is any markup language that is characterized by any arbitrary set of tags surrounded by angle braces, with certain bits at the beginning to tell you what type of file it is, and there are probably a few other rules.


There is only one SGML actually. It is an ISO standard now and descended from GML. Overall though this was a pretty good summary, except you missed one key piece.

DTD, or Document Type Definition. This defines what tags and rules for them make up a valid document for that document type.


XML is a strict subset of SGMLs in which, among other things, all tags must be matched with a close tag, and a few other details. There are many dialects of XML (an XML dialect is basically just a specific set of allowed tags that can be nested in specific ways), including DocBook, XHTML, property lists, and so on.


It is not a dialect of XML but a DTD for XML.


HTML is an example of an SGML. HTML has a specific set of tags that are considered valid. HTML is not, however, based on XML, because some tags do not have to be closed at all (hr, script when a URL is provided, and so on), and some tags auto-close at the right time (p, li, etc.). [Edit: And, as Turtle91 pointed out, HTML specifies case-insensitive tag and attribute parsing, whereas XML specifies case-sensitive tag and attribute parsing, which, in the case of XHTML's built-in tags and attributes, translates to "all lowercase".]


At first HTML was only modeled on SGML, but was more ad hoc than SGML allowed. It was not until later (4.0 or 3.2, can't recall which offhand) that it was given a formal DTD that made it true SGML.


HTML5 is a specific version of HTML. Like all HTMLs, it is an SGML, but HTML5 files are not (necessarily) proper XML.

XHTML is a special form of HTML that has been modified slightly so that every XHTML file is a proper XML file that conforms to the stricter XML standards. This requires a few tiny tweaks around the fringes, but it mostly looks like HTML with some extra close tags or self-closing tags.


Right about XHTML, but it is probably worth noting that XHTML is simply a DTD for XML.


XSLT is another XML dialect. An XSLT stylesheet provides a set of rules for transforming from one XML dialect to another (typically, though in practice, it can be used to translate a specified XML dialect into pretty much anything, up to and including LaTeX commands).


Actually XSLT is a Turing complete language. To use it you usually also need to learn XQuery and XPath.


EPUB2 and EPUB3 are versions of EPUB. EPUB2 uses XHTML under the hood. EPUB3 uses HTML5, but it must be parseable as XML. So it must be a polyglot XML/HTML5 document. This polyglot is called XHTML5, but is defined as part of the HTML5 standard rather than in a separate standard as previous XHTML versions were.

Clear as mud?

Damn alphabet soup ! I hate XML! lol

twobits
02-20-2013, 04:29 PM
That sums it about up. You can consider the XSLT as the CSS for XML files.


Not properly. CSS defines how to render/display the document. XSLT defines how to transform it from one DTD to another DTD or type.

carlosbcg
02-20-2013, 10:13 PM
Welcome to the Forum!

It can be confusing at first. But I'll give a very basic overview and then point you to where you can get some good info.


Thanks very much for taking the time to post that Turtle (or maybe Dion?).

The only part that seemed a tad contradictory to me is where you said you don't know what XML is and don't do anything with it, given that I have discovered that both EPUB2 and EPUB3 use XML for a couple of crucial files.

But other than that what you said made sense.

Carlos

carlosbcg
02-20-2013, 10:24 PM
Clear as mud?


Ahhh...well...mostly LOL.

You did an excellent job of explaining things.

There were however a couple of spots where things are still a wee bit confusing, if you or someone else could expand and explain a bit more.

Specifically...


HTML5 is a specific version of HTML. Like all HTMLs, it is an SGML, but HTML5 files are not (necessarily) proper XML.


Hmm...but...but...don't EPUB3 internal files holding the actual content of an ebook as HTML5 files (albeit with an extension of HTML) have to be what is termed "serialized XHTML" (not altogether sure what that means but I think it means pretty much XHTML)?

In other words don't EPUB3 content internals HAVE to be the XHTML variant of the HTML5?

I am creating my ebook in EPUB3 by the way, following the lead of O'Reilly. I figure if it's good enough for them it's good enough for me.


XSLT is another XML dialect. An XSLT stylesheet provides a set of rules for transforming from one XML dialect to another (typically, though in practice, it can be used to translate a specified XML dialect into pretty much anything, up to and including LaTeX commands).


Hmm...interesting. I take it then that XSLT is completely unnecessary to creation of an EPUB?

But just out of curiosity...how exactly does an XSLT file with XML commands in it get executed to do its conversion work? Does a browser execute the XSLT commands or something?


EPUB2 and EPUB3 are versions of EPUB. EPUB2 uses XHTML under the hood. EPUB3 uses HTML5, but it must be parseable as XML. So it must be a polyglot XML/HTML5 document. This polyglot is called XHTML5, but is defined as part of the HTML5 standard rather than in a separate standard as previous XHTML versions were.


That's quite the deep sentence there.

What is a polyglot document? Do you mean a document which has both XML and HTML5?

So are XHTML5 and HTML5 the same thing? I mean if XHTML5 is defined as part of the HTML5 standard and not separately like in the past?

So can I refer to the 5 thing as either XHTML5 OR HTML5?

Any further clarification from you or anyone else would be appreciated.

I think I am finally beginning to make heads or tails of this.

Carlos

carlosbcg
02-20-2013, 10:31 PM
I would advise that if you don't need javascript/audio/video and stuff like that, to create an ePUB2 instead of ePUB3.

Hmm...I'll have to think about that. I like EPUB3 for reasons that don't have anything to do with that stuff (at least not yet for me).

It's clearer and less confusing to me than EPUB2 which seems more of a glued together standard and less cohesive. I suppose you could rightly say it's a more mature EPUB.

The documentation for EPUB3 that I have seen is far superior to that which I have seen for EPUB2 (which again is all over the place...I like definite and don't like somewhat nebulous and open ended).

O'Reilly has chosen to create their books in EPUB3 though they do take steps to be backwards compatible.

The DOCTYPE is real easy to grasp LOL. DOCTYPE html. That's it. Though Sigil messes it up since it doesn't do EPUB3 yet (but then again I am writing scripts to spit out this and other EPUBs of mine in the future so I won't be using Sigil).

More flexibility in how the TOC is done.

More CSS goodness available.

That's about it for now I think.

Carlos

carlosbcg
02-20-2013, 10:41 PM
To add to your misery, NO device fully supports any of the above.


I'm very familiar with that type of thing. It's exactly like it was when web browsers first came out and weren't all on board with respect to abiding by standards either.


Sigil is a great program that handles many of the details for you, if you have an html file to feed it, or you can type in directly or paste in text.


It IS most definitely quite the program but I have noticed lately that it does some things which make it unlikely that I will use it. It messes up my DOCTYPE and uses a structure that is neither recommended by the standard nor lines up with the epub directory structure which I am using.

Plus it creates some files like the mimetype behind the scenes such that I can't see what it is doing until after it saves my epub and I can look inside it.

I am a detail person who likes control over every aspect of what I am doing if I can conveniently achieve that and don't generally like it when things are created or happen without my having any idea of what happened or where things went or came from.

If one uses Sigil it would appear that one must then stick to using Sigil to have things go as smoothly as possible.

I create content in markup (with some hand coded HTML), use pandoc to convert to whatever, and then place the resultant produced files (whether HTML for a web page, XHTML for epub, PDF, or whatever) in various directories. A script I am creating will Zip up the files into an EPUB so I won't even need something like Sigil (though I understand there is a script here that does that too...most likely for Windows though...I use Linux).

Carlos

carlosbcg
02-20-2013, 10:43 PM
I am not quite sure he is less confused now...

I am much less confused than I was when I posted the OP. Seriously.

Your explanations have been great!

Carlos

dgatwood
02-21-2013, 01:52 AM
HTML5 is a specific version of HTML. Like all HTMLs, it is an SGML, but HTML5 files are not (necessarily) proper XML.


Hmm...but...but...don't EPUB3 internal files holding the actual content of an ebook as HTML5 files (albeit with an extension of HTML) have to be what is termed "serialized XHTML" (not altogether sure what that means but I think it means pretty much XHTML)?

In other words don't EPUB3 content internals HAVE to be the XHTML variant of the HTML5?


Yes, AFAIK. But not all HTML5 content is in an EPUB. You can use HTML5 on the web, you know. :D




Hmm...interesting. I take it then that XSLT is completely unnecessary to creation of an EPUB?


Yes. It does make a convenient way to transform DocBook (or other XML dialects) into (X)HTML, but it is certainly not the only way (or even necessarily the best way).



But just out of curiosity...how exactly does an XSLT file with XML commands in it get executed to do its conversion work? Does a browser execute the XSLT commands or something?


You use an XSLT processor. That's a tool that takes an XSLT file and applies it to an XML source file.



What is a polyglot document? Do you mean a document which has both XML and HTML5?


A polyglot is a file that is simultaneously interpretable according to the rules of two different formats or languages. In this case, I mean a file that is HTML5, but is fully compliant with XML. In other words an XML serialization of HTML.
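As a small illustration of the polyglot idea, the Python sketch below feeds one hypothetical XHTML5 chapter to two different parsers. Python's built-in `html.parser` is only a lenient stand-in for a real HTML5 parser, and the sample document is made up, so treat this as a demonstration of the concept rather than a conformance test.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical XHTML5 chapter: HTML5 doctype, XML-style closing everywhere.
xhtml5_doc = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head><meta charset="utf-8"/><title>Chapter 1</title></head>\n'
    '<body><p>First line<br/>Second line</p></body>\n'
    '</html>'
)

def tags_via_xml(text):
    """Element names in document order, as a strict XML parser sees them."""
    root = ET.fromstring(text)
    return [el.tag.split("}")[-1] for el in root.iter()]

class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser reports."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tags_via_html(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags

xml_tags = tags_via_xml(xhtml5_doc)
html_tags = tags_via_html(xhtml5_doc)
print(xml_tags == html_tags)  # True
```

Both parsers see the same element sequence, which is the practical meaning of "simultaneously interpretable" here. Drop the slash from `<br/>` and the HTML side still works while the XML side raises a parse error.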



So are XHTML5 and HTML5 the same thing? I mean if XHTML5 is defined as part of the HTML5 standard and not separately like in the past?


No, and yes. If you are talking about something that must be XML-compatible, then it is a good idea to call it XHTML5, because HTML5 is not necessarily valid XML.

dgatwood
02-21-2013, 01:58 AM
There is only one SGML actually. It is an ISO standard now and descended from GML.

Potaytoe, potahtoe. SGML is an ISO standard technology for defining markup languages. HTML is an SGML-based language. I guess I should have used the word "dialect" to be pedantic, though.



It is not a dialect of XML but a DTD for XML.


I've never heard of anyone who didn't use those terms interchangeably. A dialect of XML generally means a markup language based on XML that conforms to a particular DTD.




At first HTML was only modeled on SGML, but was more ad hoc than SGML allowed. It was not until later (4.0 or 3.2, can't recall which offhand) that it was given a formal DTD that made it true SGML.


True. Early HTML was... a big pile of hurt.




Actually XSLT is a Turing complete language. To use it you usually also need to learn XQuery and XPath.


Worse, it's a Turing complete template language. To use it, you have to wrap your head around the concept of template-based languages, which inherently have no real notion of state. It is enough to cause brain damage in programmers, in much the same way that LaTeX does, and for precisely the same reason. :D

Put another way, even though I've modified XSLT for transforming XML to other output formats many times over the years, when I'm asked to write such a tool from scratch, I invariably end up writing it in Perl or C or some other actual programming language rather than a template language like XSLT. (Or, occasionally, Bourne shell scripts, if I want to cause people nightmares that they never wake up from. :D)
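For a flavor of the "actual programming language" route, here is a rough Python sketch of the NCX to EPUB 3 navigation conversion mentioned earlier in the thread. The sample NCX and the function names are invented for illustration, and the output is only the `<nav>` fragment, not a complete navigation document.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Hypothetical EPUB 2 table of contents with one nested entry.
sample_ncx = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="n1" playOrder="1">
      <navLabel><text>Chapter 1</text></navLabel>
      <content src="chapter1.xhtml"/>
      <navPoint id="n2" playOrder="2">
        <navLabel><text>Section 1.1</text></navLabel>
        <content src="chapter1.xhtml#s1"/>
      </navPoint>
    </navPoint>
    <navPoint id="n3" playOrder="3">
      <navLabel><text>Chapter 2</text></navLabel>
      <content src="chapter2.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""

def nav_points_to_ol(parent):
    """Turn the navPoint children of `parent` into a nested <ol> list."""
    ol = ET.Element("ol")
    for point in parent.findall("ncx:navPoint", NCX_NS):
        li = ET.SubElement(ol, "li")
        link = ET.SubElement(li, "a")
        link.set("href", point.find("ncx:content", NCX_NS).get("src"))
        link.text = point.findtext("ncx:navLabel/ncx:text", namespaces=NCX_NS)
        if point.find("ncx:navPoint", NCX_NS) is not None:
            li.append(nav_points_to_ol(point))  # recurse for sub-entries
    return ol

def ncx_to_nav(ncx_text):
    """Return an EPUB 3 style <nav epub:type="toc"> fragment as a string."""
    nav_map = ET.fromstring(ncx_text).find("ncx:navMap", NCX_NS)
    nav = ET.Element("nav", {
        "xmlns:epub": "http://www.idpf.org/2007/ops",
        "epub:type": "toc",
    })
    nav.append(nav_points_to_ol(nav_map))
    return ET.tostring(nav, encoding="unicode")

nav_fragment = ncx_to_nav(sample_ncx)
print(nav_fragment)
```

The recursion carries the nesting explicitly, which is the kind of bookkeeping an XSLT template would express declaratively instead.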

twobits
02-21-2013, 02:51 AM
Potaytoe, potahtoe. SGML is an ISO standard technology for defining markup languages. HTML is an SGML-based language. I guess I should have used the word "dialect" to be pedantic, though.

I've never heard of anyone who didn't use those terms interchangeably. A dialect of XML generally means a markup language based on XML that conforms to a particular DTD.


Hmm.. could be a regional difference or maybe a temporal one. I have not had to deal with this stuff professionally now for at least six years. At the time though we would consider a DTD to be an instance not a dialect. Never heard dialect used at all.


Worse, it's a Turing complete template language. To use it, you have to wrap your head around the concept of template-based languages, which inherently have no real notion of state. It is enough to cause brain damage in programmers, in much the same way that LaTeX does, and for precisely the same reason. :D

Put another way, even though I've modified XSLT for transforming XML to other output formats many times over the years, when I'm asked to write such a tool from scratch, I invariably end up writing it in Perl or C or some other actual programming language rather than a template language like XSLT. (Or, occasionally, Bourne shell scripts, if I want to cause people nightmares that they never wake up from. :D)

I never really minded TeX and most of LaTeX, I guess because it beat the other ways I knew of to get decent laser printed output. I too would use Perl or C when I had the choice. I never did use Bourne for it, as I had to create things that often would work on a few OSes (VMS/OS2/Solaris/Windows), so setting them to run under Perl was the best bet, as Bourne shell clones for VMS never did handle its distinct path specification system well.

Guess I am going way :offtopic:

Toxaris
02-21-2013, 01:40 PM
I'm very familiar with that type of thing. It's exactly like it was when web browsers first came out and weren't all on board with respect to abiding by standards either.


Not quite the same. Browsers could be fixed with an update. That will not happen with readers, as updates are scarce.

What kind of CSS3 things do you need for your book? If you can do it, use ePUB2. It is a fixed format that is supported much better. Unless you want to alienate your readers...

dgatwood
02-21-2013, 08:52 PM
Not quite the same. Browsers could be fixed with an update. That will not happen with readers, as updates are scarce.


To be fair, part of what makes web development so problematic is the sheer number of people who have to keep using an old version of a browser (IE comes to mind) because some business-critical system doesn't work with the newer version, so you had (and still have) a fair bit of that problem in the browser world, too.

The bigger difference is that even in the early days of the web, there were only a handful of browsers that you really had to care about—Netscape, Internet Explorer, maybe Mosaic, maybe Lynx, and that was about it. You might have to care about a couple of versions of a couple of browsers from each vendor (Communicator vs. Navigator, for example), but either way, it was pretty much bounded at a single-digit number of browsers.

These days, many companies manufacture multiple readers that don't use the same reader software (Amazon, I'm looking at you in particular), each with a different set of bugs. The resulting fragmentation in the eBook reader space today makes web development in the 90s seem positively tame by comparison.

But in concept, it's very much the same sort of situation, just turned up to 11. :D

carlosbcg
02-21-2013, 09:56 PM
What kind of CSS3 things do you need for your book? If you can do it, use ePUB2. It is a fixed format that is supported much better. Unless you want to alienate your readers...

Maybe I will indeed switch back. I just like the best and latest and some aspects of EPUB3 are nice (not including all the fancy stuff).

Definitely do not want to alienate readers :).

Carlos

ghostyjack
02-22-2013, 12:26 PM
Maybe I will indeed switch back. I just like the best and latest and some aspects of EPUB3 are nice (not including all the fancy stuff).

Definitely do not want to alienate readers :).

Carlos

Best and latest are not always the same thing.

At present, the best would be Epub2 due to the massive number of reading devices (software and hardware) that can interpret this format. I'm ignoring any technical improvements in Epub3 due to the fact that almost nothing can take advantage of them.

The latest would be Epub3, but as mentioned, readers that can handle the format are a bit thin on the ground.

carlosbcg
02-22-2013, 08:23 PM
Best and latest are not always the same thing.


I agree.


At present, the best would be Epub2 due to the massive number of reading devices (software and hardware) that can interpret this format.


It is difficult for me to wrap my head around the lack of EPUB3 reader compatibility.

I mean it's just HTML5 (XHTML5 actually). Most all web browsers of any note can display that just fine.

So what's the big deal?

If they can display that then there should be readers galore that are able to do the same thing using whatever major browser rendering engine they care to use.

I am rather surprised by the lack of readers for EPUB3.

I mean the only other thing besides being able to render HTML5 (XHTML5) are the fancy keys that one can use to navigate what one is seeing and the uncompression and deciphering of the EPUB (again no big deal).

I just don't see what the big deal is about creating readers that can read HTML5 and display it in a nice interface. I'd create one myself if I was inclined to spend the time to create one but I have other things on my plate just now.

But I have made the switch back to EPUB2 and updated my scripts to reflect that change yesterday.

Carlos

DaleDe
02-22-2013, 09:07 PM
Carlos: A browser does not an eBook reader make. Yes, most browsers have some support for HTML 5 but it is not nearly as homogeneous as you make it sound. All are missing some features of HTML5. If you look at our wiki page for HTML5 you will see a link at the bottom that lets you test your browser for compliance with HTML5. This will not only test your browser but show you the conditions of various browsers. The PC ones are the most compliant while mobile browsers have a long way to go. Similarly, for eBook readers using ePub3 the PC versions are ahead while the mobile versions tend to lag, but for eBook reading the mobile versions are the most used. If you look in our wiki under ePub 3 you will find a list of eBook readers that have some support specific to ePub 3, but again not all features are present.

In addition there are some copyright details to be worked out. Audio and video formats are not all in the public domain and some of the most popular are patented, so it is not clear if browsers like Mosaic-based ones and eBook readers will ever support all the available formats. MP3 I believe will expire its patents soon and will thus become eligible, but video patents have more time to run.

Dale

carlosbcg
02-23-2013, 07:01 PM
Carlos: A browser does not an eBook reader make. Yes, most browsers have some support for HTML 5 but it is not nearly as homogeneous as you make it sound.


I am beginning to realize that.


All are missing some features of HTML5. If you look at our wiki page for HTML5 you will see a link at the bottom that lets you test your browser for compliance with HTML5.


I guess I am kinda spoiled as Chromium (my browser of choice) scores 448 out of 500 which is plenty good enough.

But other browsers start going downhill from there.

Carlos

DiapDealer
02-23-2013, 07:31 PM
Woohoo! Dolphin Browser (w/Dolphin Jetpack) scores a 481 on my Android device! ;)

carlosbcg
02-23-2013, 07:32 PM
Woohoo! Dolphin Browser (w/Dolphin Jetpack) scores a 481 on my Android device! ;)

Nice!

Carlos