When a script invokes the postMessage(message, targetOrigin, ports) method (with three arguments) on
a Window object, the user agent must follow these
steps:
If the value of the targetOrigin argument
is neither a single U+002A ASTERISK character (*), a single U+002F
SOLIDUS character (/), nor an absolute URL with a
<host-specific>
component that is either empty or a single U+002F SOLIDUS
character (/), then throw a SYNTAX_ERR exception and
abort the overall set of steps.
Let message clone be the result of obtaining a structured clone of the message argument. If this throws an exception, then throw that exception and abort these steps.
If the ports argument is empty, then act as if the method had just been called with two arguments, message and targetOrigin.
If any of the entries in ports are null, if
any MessagePort object is listed in ports more than once, or if any of the
MessagePort objects listed in ports have already been cloned once before, then
throw an INVALID_STATE_ERR exception.
Let new ports be an empty array.
For each port in ports in turn,
obtain a new port by cloning the
port with the Window object on which the method was
invoked as the owner of the clone, and append the clone to the
new ports array.
Return from the postMessage() method, but
asynchronously continue running these steps.
If the targetOrigin argument is a single
literal U+002F SOLIDUS character (/), and the
Document of the Window object on which
the method was invoked does not have the same origin
as the entry script's document, then abort these steps silently.
Otherwise, if the targetOrigin argument is
an absolute URL, and the Document of the
Window object on which the method was invoked does
not have the same origin as targetOrigin, then abort these steps silently.
Otherwise, the targetOrigin argument is a single literal U+002A ASTERISK character (*), and no origin check is made.
Create an event that uses the MessageEvent
interface, with the event name message, which does not bubble, is
not cancelable, and has no default action. The data attribute must be set to
the value of message clone, the origin attribute must be
set to the Unicode serialization of the origin of
the script that invoked the method, and the source attribute must be
set to the script's global object's
WindowProxy object.
Let the ports attribute
of the event be the new ports array.
Queue a task to dispatch the event created in the
previous step at the Window object on which the
method was invoked. The task source for this task is the posted message task
source.
These steps, with the exception of the third, fourth, and fifth steps and the penultimate step, are identical to those in the previous section.
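The targetOrigin check in the first step can be sketched in script as follows. This is a non-normative illustration only: the function name isValidTargetOrigin is hypothetical, and the sketch assumes that the host-specific component corresponds to the path, query, and fragment as exposed by the URL object.

```javascript
// Hypothetical helper, not part of any API: returns true if targetOrigin
// would pass the first step of the algorithm above.
function isValidTargetOrigin(targetOrigin) {
  if (targetOrigin === '*' || targetOrigin === '/') {
    return true;
  }
  var url;
  try {
    // The URL constructor throws a TypeError for relative or malformed input,
    // which is used here as the "is an absolute URL" test.
    url = new URL(targetOrigin);
  } catch (e) {
    return false;
  }
  // Assumption: "host-specific" means everything after the authority.
  var pathOk = url.pathname === '' || url.pathname === '/';
  return pathOk && url.search === '' && url.hash === '';
}
```

A caller that emulates the algorithm would throw a SYNTAX_ERR exception when this function returns false.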
This section is non-normative.
To enable independent pieces of code (e.g. running in different browsing contexts) to communicate directly, authors can use channel messaging.
Communication channels in this mechanism are implemented as two-way pipes, with a port at each end. Messages sent in one port are delivered at the other port, and vice versa. Messages are asynchronous, and delivered as DOM events.
To create a connection (two "entangled" ports), the MessageChannel() constructor is called:
var channel = new MessageChannel();
One of the ports is kept as the local port, and the other port is
sent to the remote code, e.g. using postMessage():
otherWindow.postMessage('hello', 'http://example.com', [channel.port2]);
To send messages, the postMessage() method on
the port is used:
channel.port1.postMessage('hello');
To receive messages, one listens to message events:
channel.port1.onmessage = handleMessage;
function handleMessage(event) {
// message is in event.data
// ...
}
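Putting the fragments above together, the following is a minimal sketch of a complete round trip in which both ports stay in the same script. The function name roundTrip is hypothetical; in practice one of the ports would be sent to other code rather than kept locally.

```javascript
// Sends `message` through a fresh channel and resolves with the echoed reply.
function roundTrip(message) {
  return new Promise(function (resolve) {
    var channel = new MessageChannel();

    // Stand-in for the "remote" side: echo whatever arrives on port2.
    channel.port2.onmessage = function (event) {
      channel.port2.postMessage(event.data);
    };

    // Local side: receive the echo, then close the ports so that their
    // resources can be released.
    channel.port1.onmessage = function (event) {
      channel.port1.close();
      channel.port2.close();
      resolve(event.data);
    };

    channel.port1.postMessage(message);
  });
}
```

Setting onmessage is what starts delivery on each port here; no explicit call to start() is needed in this sketch.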
[Constructor]
interface MessageChannel {
readonly attribute MessagePort port1;
readonly attribute MessagePort port2;
};
MessageChannel(): Returns a new MessageChannel object with two new MessagePort objects.
port1: Returns the first MessagePort object.
port2: Returns the second MessagePort object.
When the MessageChannel()
constructor is called, it must run the following algorithm:
Create a new MessagePort object
owned by the script's global object, and let port1 be that object.
Create a new MessagePort object
owned by the script's global object, and let port2 be that object.
Entangle the port1 and port2 objects.
Instantiate a new MessageChannel object, and
let channel be that object.
Let the port1
attribute of the channel object be port1.
Let the port2
attribute of the channel object be port2.
Return channel.
This constructor must be visible when the script's global
object is either a Window object or an object
implementing the WorkerUtils interface.
The port1 and
port2 attributes
must return the values they were assigned when the
MessageChannel object was created.
Each channel has two message ports. Data sent through one port is received by the other port, and vice versa.
typedef sequence<MessagePort> MessagePortArray;
interface MessagePort {
void postMessage(in any message, in optional MessagePortArray ports);
void start();
void close();
// event handlers
attribute Function onmessage;
};
MessagePort implements EventTarget;
postMessage(message [, ports]): Posts a message through the channel, optionally with the given ports.
Throws an INVALID_STATE_ERR if the ports array is not null and it contains either null
entries, duplicate ports, or the source or target port.
start(): Begins dispatching messages received on the port.
close(): Disconnects the port, so that it is no longer active.
Each MessagePort object can be entangled with
another (a symmetric relationship). Each MessagePort
object also has a task source called the port
message queue, initially empty. A port message
queue can be enabled or disabled, and is initially
disabled. Once enabled, a port can never be disabled again (though
messages in the queue can get moved to another queue or removed
altogether, which has much the same effect).
When the user agent is to create a new
MessagePort object owned by a script's
global object owner, it must
instantiate a new MessagePort object, and let its owner
be owner.
When the user agent is to entangle two
MessagePort objects, it must run the following
steps:
If one of the ports is already entangled, then disentangle it and the port that it was entangled with.
If those two previously entangled ports were the
two ports of a MessageChannel object, then that
MessageChannel object no longer represents an actual
channel: the two ports in that object are no longer entangled.
Associate the two ports to be entangled, so that they form
the two parts of a new channel. (There is no
MessageChannel object that represents this
channel.)
When the user agent is to clone a port original port, with the clone being owned by owner, it must run the following steps, which return
a new MessagePort object. These steps must be run
atomically.
Create a new MessagePort object
owned by owner, and let new
port be that object.
Move all the events in the port message queue of original port to the port message queue of new port, if any, leaving the new port's port message queue in its initial disabled state.
If the original port is entangled with another port, then run these substeps:
Let the remote port be the port with which the original port is entangled.
Entangle the remote port and new port objects. The original port object will be disentangled by this process.
Return new port. It is the clone.
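The entangle and clone steps above can be modelled with plain objects, as in the following non-normative sketch. ModelPort, disentangle, entangle, and clonePort are hypothetical names used only for illustration; they are not real APIs, and the atomicity requirement is trivially met because the model is single-threaded.

```javascript
// Hypothetical model of a port: an owner, an optional partner, and a queue.
class ModelPort {
  constructor(owner) {
    this.owner = owner;
    this.entangledWith = null; // the port at the other end, if any
    this.queue = [];           // stands in for the port message queue
    this.enabled = false;      // queues start out disabled
  }
}

function disentangle(port) {
  const other = port.entangledWith;
  if (other !== null) {
    other.entangledWith = null;
    port.entangledWith = null;
  }
}

function entangle(a, b) {
  // Break any existing pairing that involves either port.
  disentangle(a);
  disentangle(b);
  // Associate the two ports as the two ends of a new channel.
  a.entangledWith = b;
  b.entangledWith = a;
}

function clonePort(original, owner) {
  const clone = new ModelPort(owner);
  // Move (not copy) the queued events; the clone's queue stays disabled.
  clone.queue = original.queue.splice(0);
  if (original.entangledWith !== null) {
    const remote = original.entangledWith;
    entangle(remote, clone); // disentangles `original` as a side effect
  }
  return clone;
}
```

After clonePort() returns in this model, the original port has an empty queue and no partner, which mirrors why a port cannot usefully be cloned twice.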
The postMessage()
method, when called on a port source port, must
cause the user agent to run the following steps:
Let target port be the port with which source port is entangled, if any.
If the method was called with a second argument ports and that argument isn't null, then, if any of
the entries in ports are null, if any
MessagePort object is listed in ports more than once, if any of the
MessagePort objects listed in ports have already been cloned once before, or if
any of the entries in ports are either the source port or the target port
(if any), then throw an INVALID_STATE_ERR
exception.
If there is no target port (i.e. if source port is not entangled), then abort these steps.
Create an event that uses the MessageEvent
interface, with the name message, which does not bubble, is not
cancelable, and has no default action.
Let message be the method's first argument.
Let message clone be the result of obtaining a structured clone of message. If this throws an exception, then throw that exception and abort these steps.
Let the data
attribute of the event have the value of message
clone.
If the method was called with a second argument ports and that argument isn't null, then run the following substeps:
Let new ports be an empty array.
For each port in ports in turn, obtain a new port by cloning the port with the owner of the target port as the owner of the clone, and append the clone to the new ports array.
If the original ports array was empty, then the new ports array will also be empty.
Let the ports
attribute of the event be the new ports
array.
Add the event to the port message queue of target port.
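The validation of the ports argument in the second step can be sketched as below. This is an illustration under stated assumptions: checkPorts is a hypothetical name, ports are compared by identity, the exception is approximated by an Error whose name is set to INVALID_STATE_ERR, and the "already cloned once before" condition is not modelled because script has no way to observe it.

```javascript
// Hypothetical helper: throws if `ports` contains a null entry, a duplicate,
// the source port, or the target port. A null or missing `ports` is accepted.
function checkPorts(ports, sourcePort, targetPort) {
  if (ports === null || ports === undefined) {
    return; // the method was called without a usable second argument
  }
  const seen = new Set();
  for (const port of ports) {
    const invalid =
      port === null ||
      port === undefined ||
      seen.has(port) ||
      port === sourcePort ||
      (targetPort !== null && port === targetPort);
    if (invalid) {
      const error = new Error('invalid entry in ports');
      error.name = 'INVALID_STATE_ERR'; // stand-in for the DOM exception
      throw error;
    }
    seen.add(port);
  }
}
```

Note that the check runs before the test for a missing target port, so an invalid ports array throws even when the source port is not entangled.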
The start()
method must enable its port's port message queue, if it
is not already enabled.
When a port's port message queue is enabled, the event loop must use it as one of its task sources.
If the Document of the port's event
listeners' global object
is not fully active, then the messages are lost.
The close()
method, when called on a port local port that is
entangled with another port, must cause the user agent to
disentangle the two ports. If the method is called on a port that is
not entangled, then the method must do nothing.
The following are the event handlers (and their
corresponding event handler
event types) that must be supported, as IDL attributes, by
all objects implementing the MessagePort interface:
| Event handler | Event handler event type |
|---|---|
| onmessage | message |
The first time a MessagePort object's onmessage IDL attribute
is set, the port's port message queue must be enabled,
as if the start() method
had been called.
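The interaction between a disabled queue, start(), and the onmessage attribute can be illustrated with a small model. PortQueueModel is a hypothetical class, not a real API, and it delivers events synchronously for simplicity, whereas a user agent dispatches them as tasks on the event loop.

```javascript
// Hypothetical model of a port message queue that buffers until enabled.
class PortQueueModel {
  constructor() {
    this.enabled = false; // queues are initially disabled
    this.pending = [];    // events waiting to be dispatched
    this._onmessage = null;
  }
  add(event) {
    this.pending.push(event);
    this._drain();
  }
  start() {
    this.enabled = true; // once enabled, never disabled again
    this._drain();
  }
  get onmessage() {
    return this._onmessage;
  }
  set onmessage(handler) {
    this._onmessage = handler;
    this.start(); // has no further effect if the queue is already enabled
  }
  _drain() {
    if (!this.enabled) {
      return;
    }
    while (this.pending.length > 0) {
      const event = this.pending.shift();
      // An event dispatched while no handler is set is simply lost.
      if (this._onmessage !== null) {
        this._onmessage(event);
      }
    }
  }
}
```

In this model, messages that arrive before a handler is attached are held in the queue and delivered in order as soon as onmessage is set.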
When a MessagePort object o is
entangled, user agents must either act as if o's
entangled MessagePort object has a strong reference to
o, or as if o's owner has a
strong reference to o.
Thus, a message port can be received, given an event listener, and then forgotten, and so long as that event listener could receive a message, the channel will be maintained.
Of course, if this were to occur on both sides of the channel, then both ports could be garbage collected, since they would not be reachable from live code, despite having a strong reference to each other.
Furthermore, a MessagePort object must not be
garbage collected while there exists a message in a task
queue that is to be dispatched on that
MessagePort object, or while the
MessagePort object's port message queue is
enabled and there exists a message
event in that queue.
Authors are strongly encouraged to explicitly close
MessagePort objects to disentangle them, so that their
resources can be recollected. Creating many MessagePort
objects and discarding them without closing them can lead to high
memory usage.
This section is non-normative.
This specification introduces two related mechanisms, similar to HTTP session cookies, for storing structured data on the client side. [COOKIES]
The first is designed for scenarios where the user is carrying out a single transaction, but could be carrying out multiple transactions in different windows at the same time.
Cookies don't really handle this case well. For example, a user could be buying plane tickets in two different windows, using the same site. If the site used cookies to keep track of which ticket the user was buying, then as the user clicked from page to page in both windows, the ticket currently being purchased would "leak" from one window to the other, potentially causing the user to buy two tickets for the same flight without really noticing.
To address this, this specification introduces the sessionStorage IDL attribute.
Sites can add data to the session storage, and it will be accessible
to any page from the same site opened in that window.
For example, a page could have a checkbox that the user ticks to indicate that he wants insurance:
<label> <input type="checkbox" onchange="sessionStorage.insurance = checked"> I want insurance on this trip. </label>
A later page could then check, from script, whether the user had checked the checkbox or not:
if (sessionStorage.insurance) { ... }
If the user had multiple windows opened on the site, each one would have its own individual copy of the session storage object.
The second storage mechanism is designed for storage that spans multiple windows, and lasts beyond the current session. In particular, Web applications may wish to store megabytes of user data, such as entire user-authored documents or a user's mailbox, on the client side for performance reasons.
Again, cookies do not handle this case well, because they are transmitted with every request.
The localStorage IDL
attribute is used to access a page's local storage area.
The site at example.com can display a count of how many times the user has loaded its page by putting the following at the bottom of its page:
<p>
You have viewed this page
<span id="count">an untold number of</span>
time(s).
</p>
<script>
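// Coerce to a number first, in case the stored value is handed back as a
// string; adding 1 to a string would concatenate instead of incrementing.
var count = Number(localStorage.pageLoadCount) || 0;
localStorage.pageLoadCount = count + 1;
document.getElementById('count').textContent = localStorage.pageLoadCount;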
</script>
Each site has its own separate storage area.
Storage interface
interface Storage {
readonly attribute unsigned long length;
getter DOMString key(in unsigned long index);
getter any getItem(in DOMString key);
setter creator void setItem(in DOMString key, in any data);
deleter void removeItem(in DOMString key);
void clear();
};
Each Storage object provides access to a list of
key/value pairs, which are sometimes called items. Keys are
strings. Any string (including the empty string) is a valid
key. Values can be any data type supported by the structured
clone algorithm.
Each Storage object is associated with a list of
key/value pairs when it is created, as defined in the sections on
the sessionStorage and localStorage attributes. Multiple
separate objects implementing the Storage interface can
all be associated with the same list of key/value pairs
simultaneously.
The object's indices of the supported indexed properties are the numbers in the range zero to one less than the number of key/value pairs currently present in the list associated with the object. If the list is empty, then there are no supported indexed properties.
The length
attribute must return the number of key/value pairs currently
present in the list associated with the object.
The key(n) method must return the name of the
nth key in the list. The order of keys is
user-agent defined, but must be consistent within an object so long
as the number of keys doesn't change. (Thus, adding or removing a key may change the
order of the keys, but merely changing the value of an existing key
must not.) If n is greater than or equal to the number of key/value pairs
in the object, then this method must return null.
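Because the key order is stable as long as no key is added or removed, the length attribute and the key() method can be combined to enumerate a storage area. The following sketch assumes only that its argument exposes length and key() as described above; listKeys is a hypothetical helper name.

```javascript
// Returns an array of all keys currently in `storage`.
// The loop must not add or remove keys, since that may reorder them.
function listKeys(storage) {
  var keys = [];
  for (var i = 0; i < storage.length; i++) {
    keys.push(storage.key(i));
  }
  return keys;
}
```

In a page, this could be called as listKeys(sessionStorage) or listKeys(localStorage).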
The names of the supported named properties on a
Storage object are the keys of each key/value pair
currently present in the list associated with the object.
The getItem(key) method must return a
structured clone of the current value associated with
the given key. If the given key does not exist in the list associated with the
object then this method must return null.
The setItem(key, value) method
must first create a structured clone of the given value. If this raises an exception, then the
exception must be thrown and the list associated with the object is
left unchanged. If constructing the structured clone would involve
constructing a new ImageData object, then throw a
NOT_SUPPORTED_ERR exception instead.
Otherwise, the user agent must then check if a key/value pair with the given key already exists in the list associated with the object.
If it does not, then a new key/value pair must be added to the list, with the given key and with its value set to the newly obtained clone of value.
If the given key does exist in the list, then it must have its value updated to the newly obtained clone of value.
If it couldn't set the new value, the method must raise a
QUOTA_EXCEEDED_ERR exception. (Setting could fail if,
e.g., the user has disabled storage for the site, or if the quota
has been exceeded.)
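Since setItem() can throw, authors may want to guard calls that are not essential to the page. The following is a minimal sketch; trySetItem is a hypothetical helper, and it deliberately treats every exception (quota, cloning failure, or anything else) the same way.

```javascript
// Attempts to store a value; returns true on success and false if the
// storage object threw for any reason.
function trySetItem(storage, key, value) {
  try {
    storage.setItem(key, value);
    return true;
  } catch (error) {
    return false;
  }
}
```

A caller could use the return value to fall back to keeping the data in memory, for example `if (!trySetItem(localStorage, 'draft', text)) { /* keep in memory */ }`.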
The removeItem(key) method must cause the key/value
pair with the given key to be removed from the
list associated with the object, if it exists. If no item with that
key exists, the method must do nothing.
The setItem() and removeItem() methods must be
atomic with respect to failure. In the case of failure, the method
does nothing. That is, changes to the data storage area must either
be successful, or the data storage area must not be changed at
all.
The clear()
method must atomically cause the list associated with the object to
be emptied of all key/value pairs, if there are any. If there are
none, then the method must do nothing.
When the setItem(), removeItem(), and clear() methods are invoked, events
are fired on other HTMLDocument objects that can access
the newly stored or removed data, as defined in the sections on the
sessionStorage and localStorage attributes.
This specification does not require that the above methods wait until the data has been physically written to disk. Only consistency in what different scripts accessing the same underlying list of key/value pairs see is required.
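The method semantics described in this section can be summarised with an in-memory model. MemoryStorage is a hypothetical class for illustration, not an implementation of the interface: it stores values by reference rather than as structured clones, enforces no quota, and fires no events.

```javascript
// Hypothetical in-memory model of the list of key/value pairs.
class MemoryStorage {
  constructor() {
    // A Map keeps insertion order, and updating an existing key does not
    // reorder it, which is one permitted key ordering.
    this._items = new Map();
  }
  get length() {
    return this._items.size;
  }
  key(n) {
    const keys = Array.from(this._items.keys());
    return n < keys.length ? keys[n] : null;
  }
  getItem(key) {
    const name = String(key);
    return this._items.has(name) ? this._items.get(name) : null;
  }
  setItem(key, value) {
    // Adds a new pair or updates the existing one.
    this._items.set(String(key), value);
  }
  removeItem(key) {
    // Does nothing if the key is absent.
    this._items.delete(String(key));
  }
  clear() {
    this._items.clear();
  }
}
```

Such a model can also serve as a stand-in when exercising code that expects a storage-like object outside a browsing context.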
sessionStorage attribute
[Supplemental, NoInterfaceObject]
interface WindowSessionStorage {
readonly attribute Storage sessionStorage;
};
Window implements WindowSessionStorage;
The sessionStorage
attribute represents the set of storage areas specific to the
current top-level browsing context.
Each top-level browsing context has a unique set of session storage areas, one for each origin.
User agents should not expire data from a browsing context's session storage areas, but may do so when the user requests that such data be deleted, or when the UA detects that it has limited storage space, or for security reasons. User agents should always avoid deleting data while a script that could access that data is running. When a top-level browsing context is destroyed (and therefore permanently inaccessible to the user) the data stored in its session storage areas can be discarded with it, as the API described in this specification provides no way for that data to ever be subsequently retrieved.
The lifetime of a browsing context can be unrelated to the lifetime of the actual user agent process itself, as the user agent may support resuming sessions after a restart.
When a new HTMLDocument is created, the user agent
must check to see if the document's top-level browsing
context has allocated a session storage area for that
document's origin. If it has not, a new storage area
for that document's origin must be created.
The sessionStorage
attribute must return the Storage object associated
with that session storage area. Each Document object
must have a separate object for its Window's sessionStorage attribute.
When a new top-level browsing context is created by cloning an existing browsing context, the new browsing context must start with the same session storage areas as the original, but the two sets must from that point on be considered separate, not affecting each other in any way.
When a new top-level browsing context is created by
a script in an existing
browsing context, or by the user following a link in an
existing browsing context, or in some other way related to a
specific HTMLDocument, then the session storage area of
the origin of that HTMLDocument must be
copied into the new browsing context when it is created. From that
point on, however, the two session storage areas must be considered
separate, not affecting each other in any way.
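The two copying rules above can be modelled as follows. This is a sketch under stated assumptions: a browsing context's session storage areas are represented as a Map from origin string to a Map of key/value pairs, values are copied shallowly, and both function names are hypothetical.

```javascript
// Cloning an existing browsing context: every origin's area is copied.
function cloneSessionStorageAreas(areasByOrigin) {
  const copy = new Map();
  for (const [origin, items] of areasByOrigin) {
    copy.set(origin, new Map(items)); // independent list of pairs
  }
  return copy;
}

// Context created from a specific document: only that origin's area is copied.
function copyAreaForOrigin(areasByOrigin, origin) {
  const copy = new Map();
  if (areasByOrigin.has(origin)) {
    copy.set(origin, new Map(areasByOrigin.get(origin)));
  }
  return copy;
}
```

In both cases later changes to the copy leave the original untouched, and vice versa.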
When the setItem(), removeItem(), and clear() methods are called on a
Storage object x that is associated
with a session storage area, if the methods did something, then in
every HTMLDocument object whose Window
object's sessionStorage
attribute's Storage object is associated with the same
storage area, other than x, a storage event must be fired, as described below.
localStorage attribute
[Supplemental, NoInterfaceObject]
interface WindowLocalStorage {
readonly attribute Storage localStorage;
};
Window implements WindowLocalStorage;
The localStorage
object provides a Storage object for an
origin.
User agents must have a set of local storage areas, one for each origin.
User agents should expire data from the local storage areas only for security reasons or when requested to do so by the user. User agents should always avoid deleting data while a script that could access that data is running.
When the localStorage
attribute is accessed, the user agent must run the following steps:
The user agent may throw a SECURITY_ERR
exception instead of returning a Storage object if the
request violates a policy decision (e.g. if the user agent is
configured to not allow the page to persist data).
If the Document's effective script
origin is not the same origin as the
Document's origin, then throw a
SECURITY_ERR exception and abort these steps.
If the Document's origin is not a
scheme/host/port tuple, then throw a SECURITY_ERR
exception and abort these steps.
Check to see if the user agent has allocated a local storage
area for the origin of the Document of
the Window object on which the attribute was accessed. If
it has not, create a new storage area for that
origin.
Return the Storage object associated with that
origin's local storage area. Each Document object must
have a separate object for its Window's localStorage attribute.
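Because merely reading the attribute can throw a SECURITY_ERR exception, author code that has to run under restrictive policies may want to guard the access itself. The following is a minimal sketch; getLocalStorageOrNull is a hypothetical helper that takes the Window object as a parameter.

```javascript
// Returns the window's local storage object, or null if accessing the
// attribute throws (for example because of a policy decision).
function getLocalStorageOrNull(win) {
  try {
    return win.localStorage;
  } catch (error) {
    return null;
  }
}
```

A page would typically call getLocalStorageOrNull(window) once and skip its persistence features when the result is null.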
When the setItem(), removeItem(), and clear() methods are called on a
Storage object x that is associated
with a local storage area, if the methods did something, then in
every HTMLDocument object whose Window
object's localStorage
attribute's Storage object is associated with the same
storage area, other than x, a storage event must be fired, as described below.
Whenever the properties of a localStorage attribute's
Storage object are to be examined, returned, set, or
deleted, whether as part of a direct property access, when checking
for the presence of a property, during property enumeration, when
determining the number of properties present, or as part of the
execution of any of the methods or attributes defined on the
Storage interface, the user agent must first
obtain the storage mutex.
storage event
The storage event
is fired when a storage area changes, as described in the previous
two sections (for session
storage, for local
storage).
When this happens, the user agent must queue a task
to fire an event with the name storage, which does not
bubble and is not cancelable, and which uses the
StorageEvent interface, at each Window
object whose Document object has a Storage
object that is affected.
This includes Document objects that are
not fully active, but events fired on those are ignored
by the event loop until the Document
becomes fully active again.
The task source for this task is the DOM manipulation task source.
If the event is being fired due to an invocation of the setItem() or removeItem() methods, the
event must have its key
attribute set to the name of the key in question, its oldValue attribute set to a
structured clone of the old value of the key in
question, or null if the key is newly added, and its newValue attribute set to a
structured clone of the new value of the key in
question, or null if the key was removed.
Otherwise, if the event is being fired due to an invocation of
the clear() method, the event
must have its key, oldValue, and newValue attributes set to
null.
In addition, the event must have its url attribute set to the address of the document
whose Storage object was affected; and its storageArea attribute
set to the Storage object from the Window
object of the target Document that represents the same
kind of Storage area as was affected (i.e. session or
local).
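From the receiving side, the attribute values described above are enough to tell the kinds of change apart. The following sketch assumes an object with key, oldValue, and newValue set as described; describeStorageEvent is a hypothetical helper. It cannot distinguish a stored value of null from an absent one, which is an inherent ambiguity of these attributes.

```javascript
// Classifies a storage event by inspecting its key, oldValue, and newValue.
function describeStorageEvent(event) {
  if (event.key === null) {
    return 'cleared';                 // clear() sets all three to null
  }
  if (event.oldValue === null) {
    return 'added ' + event.key;      // key is newly added
  }
  if (event.newValue === null) {
    return 'removed ' + event.key;    // key was removed
  }
  return 'changed ' + event.key;      // existing key got a new value
}
```

In a page, it could be attached with `window.addEventListener('storage', function (e) { console.log(describeStorageEvent(e)); }, false);`.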
interface StorageEvent : Event {
readonly attribute DOMString key;
readonly attribute any oldValue;
readonly attribute any newValue;
readonly attribute DOMString url;
readonly attribute Storage storageArea;
void initStorageEvent(in DOMString typeArg, in boolean canBubbleArg, in boolean cancelableArg, in DOMString keyArg, in any oldValueArg, in any newValueArg, in DOMString urlArg, in Storage storageAreaArg);
};
The initStorageEvent()
method must initialize the event in a manner analogous to the
similarly-named method in the DOM Events interfaces. [DOMEVENTS]
The key
attribute represents the key being changed.
The oldValue
attribute represents the old value of the key being changed.
The newValue
attribute represents the new value of the key being changed.
The url
attribute represents the address of the document whose key
changed.
The storageArea
attribute represents the Storage object that was
affected.
Because of the use of the storage mutex, multiple browsing contexts will be able to access the local storage areas simultaneously in such a manner that scripts cannot detect any concurrent script execution.
Thus, the length
attribute of a Storage object, and the value of the
various properties of that object, cannot change while a script is
executing, other than in a way that is predictable by the script
itself.
User agents should limit the total amount of space allowed for storage areas.
User agents should guard against sites storing data under the origins of other affiliated sites, e.g. storing up to the limit in a1.example.com, a2.example.com, a3.example.com, etc, circumventing the main example.com storage limit.
User agents may prompt the user when quotas are reached, allowing the user to grant a site more space. This enables sites to store many user-created documents on the user's computer, for instance.
User agents should allow users to see how much space each domain is using.
A mostly arbitrary limit of five megabytes per origin is recommended. Implementation feedback is welcome and will be used to update this suggestion in the future.
A third-party advertiser (or any entity capable of getting content distributed to multiple sites) could use a unique identifier stored in its local storage area to track a user across multiple sessions, building a profile of the user's interests to allow for highly targeted advertising. In conjunction with a site that is aware of the user's real identity (for example an e-commerce site that requires authenticated credentials), this could allow oppressive groups to target individuals with greater accuracy than in a world with purely anonymous Web usage.
There are a number of techniques that can be used to mitigate the risk of user tracking:
User agents may restrict access to
the localStorage objects
to scripts originating at the domain of the top-level document of
the browsing context, for instance denying access to
the API for pages from other domains running in
iframes.
User agents may, if so configured by the user, automatically delete stored data after a period of time.
For example, a user agent could be configured to treat third-party local storage areas as session-only storage, deleting the data once the user had closed all the browsing contexts that could access it.
This can restrict the ability of a site to track a user, as the site would then only be able to track the user across multiple sessions when he authenticates with the site itself (e.g. by making a purchase or logging in to a service).
However, this also reduces the usefulness of the API as a long-term storage mechanism. It can also put the user's data at risk, if the user does not fully understand the implications of data expiration.
If users attempt to protect their privacy by clearing cookies without also clearing data stored in the local storage area, sites can defeat those attempts by using the two features as redundant backup for each other. User agents should present the interfaces for clearing these in a way that helps users to understand this possibility and enables them to delete data in all persistent storage features simultaneously. [COOKIES]
User agents may allow sites to access session storage areas in an unrestricted manner, but require the user to authorize access to local storage areas.
User agents may record the origins of sites that contained content from third-party origins that caused data to be stored.
If this information is then used to present the view of data currently in persistent storage, it would allow the user to make informed decisions about which parts of the persistent storage to prune. Combined with a blacklist ("delete this data and prevent this domain from ever storing data again"), the user can restrict the use of persistent storage to sites that he trusts.
User agents may allow users to share their persistent storage domain blacklists.
This would allow communities to act together to protect their privacy.
While these suggestions prevent trivial use of this API for user tracking, they do not block it altogether. Within a single domain, a site can continue to track the user during a session, and can then pass all this information to the third party along with any identifying information (names, credit card numbers, addresses) obtained by the site. If a third party cooperates with multiple sites to obtain such information, a profile can still be created.
However, user tracking is to some extent possible even with no cooperation from the user agent whatsoever, for instance by using session identifiers in URLs, a technique already commonly used for innocuous purposes but easily repurposed for user tracking (even retroactively). This information can then be shared with other sites, using visitors' IP addresses and other user-specific data (e.g. user-agent headers and configuration settings) to combine separate sessions into coherent user profiles.
User agents should treat persistently stored data as potentially sensitive; it's quite possible for e-mails, calendar appointments, health records, or other confidential documents to be stored in this mechanism.
To this end, user agents should ensure that when deleting data, it is promptly deleted from the underlying storage.
Because of the potential for DNS spoofing attacks, one cannot guarantee that a host claiming to be in a certain domain really is from that domain. To mitigate this, pages can use SSL. Pages using SSL can be sure that only pages using SSL that have certificates identifying them as being from the same domain can access their storage areas.
Different authors sharing one host name, for example users
hosting content on geocities.com, all share one
local storage object.
There is no feature to restrict the access by pathname. Authors on
shared hosts are therefore recommended to avoid using these
features, as it would be trivial for other authors to read the data
and overwrite it.
Even if a path-restriction feature was made available, the usual DOM scripting security model would make it trivial to bypass this protection and access the data from any path.
The two primary risks when implementing these persistent storage features are letting hostile sites read information from other domains, and letting hostile sites write information that is then read from other domains.
Letting third-party sites read data that is not supposed to be read from their domain causes information leakage. For example, a user's shopping wishlist on one domain could be used by another domain for targeted advertising; or a user's work-in-progress confidential documents stored by a word-processing site could be examined by the site of a competing company.
Letting third-party sites write data to the persistent storage of other domains can result in information spoofing, which is equally dangerous. For example, a hostile site could add items to a user's wishlist; or a hostile site could set a user's session identifier to a known ID that the hostile site can then use to track the user's actions on the victim site.
Thus, strictly following the origin model described in this specification is important for user security.
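As a rough illustration of why a path-based restriction would not help, the sketch below compares origins as reported by the standard URL API. The helper name sameOrigin and the example URLs are hypothetical, and comparing URL.origin values is only an approximation of the origin model this specification defines.

```javascript
// Hypothetical helper: two URLs share a storage area only if their
// origins (scheme, host, port) match. The path plays no part.
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

// Two authors on one shared host: same origin, so one shared storage object.
console.log(sameOrigin("http://shared.example/~alice/", "http://shared.example/~bob/")); // true
// A different scheme gives a different origin.
console.log(sameOrigin("http://shared.example/", "https://shared.example/")); // false
```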
This section only describes the rules for resources labeled with an HTML MIME type. Rules for XML resources are discussed in the section below entitled "The XHTML syntax".
This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").
Documents must consist of the following parts, in the given order:
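1. Optionally, a single U+FEFF BYTE ORDER MARK (BOM) character.
2. Any number of comments and space characters.
3. A DOCTYPE.
4. Any number of comments and space characters.
5. The root element, in the form of an html element.
6. Any number of comments and space characters.
The various types of content mentioned above are described in the next few sections.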
In addition, there are some restrictions on how character encoding declarations are to be serialized, as discussed in the section on that topic.
Space characters before the root html element, and
space characters at the start of the html element and
before the head element, will be dropped when the
document is parsed; space characters after the root
html element will be parsed as if they were at the end
of the body element. Thus, space characters around the
root element do not round-trip.
It is suggested that newlines be inserted after the DOCTYPE,
after any comments that are before the root element, after the
html element's start tag (if it is not omitted), and after any comments
that are inside the html element but before the
head element.
Many strings in the HTML syntax (e.g. the names of elements and their attributes) are case-insensitive, but only for characters in the ranges U+0041 to U+005A (LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) and U+0061 to U+007A (LATIN SMALL LETTER A to LATIN SMALL LETTER Z). For convenience, in this section this is just referred to as "case-insensitive".
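The restriction to the two ASCII letter ranges can be sketched as follows. The function names are illustrative only; the point is that characters outside U+0041 to U+005A are never folded, unlike with a general Unicode lowercase operation.

```javascript
// Lowercase only U+0041..U+005A; every other character is left untouched.
function asciiLowercase(s) {
  return s.replace(/[A-Z]/g, (c) => String.fromCharCode(c.charCodeAt(0) + 0x20));
}

function asciiCaseInsensitiveMatch(a, b) {
  return asciiLowercase(a) === asciiLowercase(b);
}

console.log(asciiCaseInsensitiveMatch("DIV", "div"));  // true
// U+212A KELVIN SIGN lowercases to "k" under full Unicode rules,
// but it lies outside the ASCII ranges, so it does not match here.
console.log(asciiCaseInsensitiveMatch("\u212A", "k")); // false
console.log("\u212A".toLowerCase() === "k");           // true
```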
A DOCTYPE is a mostly useless, but required, header.
DOCTYPEs are required for legacy reasons. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.
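A DOCTYPE must consist of the following characters, in this order:
1. A string that is an ASCII case-insensitive match for the string "<!DOCTYPE".
2. One or more space characters.
3. A string that is an ASCII case-insensitive match for the string "HTML".
4. Optionally, a DOCTYPE legacy string or an obsolete permitted DOCTYPE string (defined below).
5. Zero or more space characters.
6. A U+003E GREATER-THAN SIGN character (>).
In other words, <!DOCTYPE HTML>, case-insensitively.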
For the purposes of HTML generators that cannot output HTML
markup with the short DOCTYPE "<!DOCTYPE
HTML>", a DOCTYPE legacy string may be inserted
into the DOCTYPE (in the position defined above). This string must
consist of:
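1. One or more space characters.
2. A string that is an ASCII case-insensitive match for the string "SYSTEM".
3. One or more space characters.
4. A U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (the quote mark).
5. The literal string "about:legacy-compat".
6. A matching U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (i.e. the same character as in the earlier step labeled quote mark).
In other words, <!DOCTYPE HTML SYSTEM "about:legacy-compat"> or <!DOCTYPE HTML SYSTEM 'about:legacy-compat'>, case-insensitively except for the bit in single or double quotes.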
The DOCTYPE legacy string should not be used unless the document is generated from a system that cannot output the shorter string.
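A generator or linter might check for the short DOCTYPE and the DOCTYPE legacy string along the following lines. This is a minimal sketch, not a conformance checker: the function name is hypothetical, the exact placement of space characters is an assumption of the sketch, and obsolete permitted DOCTYPE strings are deliberately not recognized.

```javascript
// Returns true for "<!DOCTYPE HTML>" and for the DOCTYPE legacy string form.
// Everything is matched case-insensitively except the quoted part.
function isShortOrLegacyDoctype(s) {
  const ws = "[ \\t\\n\\f\\r]";
  const pattern = new RegExp(
    "^<!doctype" + ws + "+html" +
    "(?:" + ws + "+system" + ws + "+([\"'])(.*?)\\1)?" +
    ws + "*>$",
    "i"
  );
  const m = pattern.exec(s);
  if (!m) return false;
  // No SYSTEM part at all, or the quoted string matches exactly.
  return m[1] === undefined || m[2] === "about:legacy-compat";
}

console.log(isShortOrLegacyDoctype("<!doctype html>"));                              // true
console.log(isShortOrLegacyDoctype("<!DOCTYPE HTML SYSTEM 'about:legacy-compat'>")); // true
console.log(isShortOrLegacyDoctype('<!DOCTYPE HTML SYSTEM "ABOUT:LEGACY-COMPAT">')); // false
```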
To help authors transition from HTML4 and XHTML1, an obsolete permitted DOCTYPE string can be inserted into the DOCTYPE (in the position defined above). This string must consist of:
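1. One or more space characters.
2. A string that is an ASCII case-insensitive match for the string "PUBLIC".
3. One or more space characters.
4. A U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (the first quote mark).
5. The string from one of the cells in the first column of the table below. The row to which this cell belongs is the selected row.
6. A matching U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (i.e. the same character as in the earlier step labeled first quote mark).
7. If the system identifier cell of the selected row is not blank, one or more space characters.
8. If the system identifier cell of the selected row is not blank, a U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (the third quote mark).
9. If the system identifier cell of the selected row is not blank, the string from the cell in the second column of the selected row.
10. If the system identifier cell of the selected row is not blank, a matching U+0022 QUOTATION MARK or U+0027 APOSTROPHE character (i.e. the same character as in the earlier step labeled third quote mark).

| Public identifier | System identifier |
|---|---|
| -//W3C//DTD HTML 4.0//EN | |
| -//W3C//DTD HTML 4.0//EN | http://www.w3.org/TR/REC-html40/strict.dtd |
| -//W3C//DTD HTML 4.01//EN | |
| -//W3C//DTD HTML 4.01//EN | http://www.w3.org/TR/html4/strict.dtd |
| -//W3C//DTD XHTML 1.0 Strict//EN | http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd |
| -//W3C//DTD XHTML 1.1//EN | http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd |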
A DOCTYPE containing an obsolete permitted DOCTYPE string is an obsolete permitted DOCTYPE. Authors should not use obsolete permitted DOCTYPEs, as they are unnecessarily long.
There are five different kinds of elements: void elements, raw text elements, RCDATA elements, foreign elements, and normal elements.
Void elements: area, base, br, col, command, embed, hr, img, input, keygen, link, meta, param, source
Raw text elements: script, style
RCDATA elements: textarea, title
Foreign elements: Elements from the MathML namespace and the SVG namespace.
Normal elements: All other allowed HTML elements are normal elements.
Tags are used to delimit the start and end of elements in the markup. Raw text, RCDATA, and normal elements have a start tag to indicate where they begin, and an end tag to indicate where they end. The start and end tags of certain normal elements can be omitted, as described later. Those that cannot be omitted must not be omitted. Void elements only have a start tag; end tags must not be specified for void elements. Foreign elements must either have a start tag and an end tag, or a start tag that is marked as self-closing, in which case they must not have an end tag.
The contents of the element must be placed between just after the start tag (which might be implied, in certain cases) and just before the end tag (which again, might be implied in certain cases). The exact allowed contents of each individual element depend on the content model of that element, as described earlier in this specification. Elements must not contain content that their content model disallows. In addition to the restrictions placed on the contents by those content models, however, the five types of elements have additional syntactic requirements.
Void elements can't have any contents (since there's no end tag, no content can be put between the start tag and the end tag).
Raw text elements can have text, though it has restrictions described below.
RCDATA elements can have text and character references, but the text must not contain an ambiguous ampersand. There are also further restrictions described below.
Foreign elements whose start tag is marked as self-closing can't have any contents (since, again, as there's no end tag, no content can be put between the start tag and the end tag). Foreign elements whose start tag is not marked as self-closing can have text, character references, CDATA sections, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand.
The HTML syntax does not support namespace declarations, even in foreign elements.
For instance, consider the following HTML fragment:
<p>
 <svg>
  <metadata>
   <!-- this is invalid -->
   <cdr:license xmlns:cdr="http://www.example.com/cdr/metadata" name="MIT"/>
  </metadata>
 </svg>
</p>
The innermost element, cdr:license, is
actually in the SVG namespace, as the "xmlns:cdr" attribute has no effect (unlike in
XML). In fact, as the comment in the fragment above says, the
fragment is actually non-conforming. This is because the SVG
specification does not define any elements called "cdr:license" in the SVG namespace.
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
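The five kinds of elements can be summarized with a small classifier. This is only a sketch: the grouping of tag names into void, raw text, and RCDATA elements is read off the list earlier in this section, and the isForeign and selfClosing parameters are hypothetical flags standing in for namespace and syntax information a real serializer would already have.

```javascript
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "command", "embed", "hr",
  "img", "input", "keygen", "link", "meta", "param", "source",
]);
const RAW_TEXT_ELEMENTS = new Set(["script", "style"]);
const RCDATA_ELEMENTS = new Set(["textarea", "title"]);

// isForeign: hypothetical flag, true for MathML or SVG elements.
function elementKind(tagName, isForeign = false) {
  if (isForeign) return "foreign";
  // Tag names are case-insensitive for ASCII letters only.
  const name = tagName.replace(/[A-Z]/g, (c) => c.toLowerCase());
  if (VOID_ELEMENTS.has(name)) return "void";
  if (RAW_TEXT_ELEMENTS.has(name)) return "raw text";
  if (RCDATA_ELEMENTS.has(name)) return "RCDATA";
  return "normal";
}

// Void elements, and foreign elements with a self-closing start tag,
// must not be given an end tag (and so cannot have contents).
function mustNotHaveEndTag(kind, selfClosing = false) {
  return kind === "void" || (kind === "foreign" && selfClosing);
}

console.log(elementKind("BR"));        // "void"
console.log(elementKind("textarea"));  // "RCDATA"
console.log(elementKind("svg", true)); // "foreign"
```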
Tags contain a tag name, giving the element's name. HTML elements all have names that only use characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, and U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z. In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
Start tags must have the following format:
End tags must have the following format:
Attributes for an element are expressed inside the element's start tag.
Attributes have a name and a value. Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name.
Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax: Just the attribute name. The value is implicitly the empty string.
In the following example, the disabled attribute is given with
the empty attribute syntax:
<input disabled>
If an attribute using the empty attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Unquoted attribute value syntax: The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.
In the following example, the value attribute is given
with the unquoted attribute value syntax:
<input value=yes>
If an attribute using the unquoted attribute syntax is to be followed by another attribute or by the optional U+002F SOLIDUS character (/) allowed in step 6 of the start tag syntax above, then there must be a space character separating the two.
Single-quoted attribute value syntax: The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0027 APOSTROPHE character ('), followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0027 APOSTROPHE characters ('), and finally followed by a second single U+0027 APOSTROPHE character (').
In the following example, the type attribute is given with the
single-quoted attribute value syntax:
<input type='checkbox'>
If an attribute using the single-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Double-quoted attribute value syntax: The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single U+0022 QUOTATION MARK character ("), followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK characters ("), and finally followed by a second single U+0022 QUOTATION MARK character (").
In the following example, the name attribute is given with the
double-quoted attribute value syntax:
<input name="be evil">
If an attribute using the double-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other.
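A markup generator could choose among the four attribute syntaxes roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, every ampersand is written as the character reference &amp; so that no ambiguous ampersand can arise, and attribute name validation is left out.

```javascript
// Pick the shortest syntax that the value's characters permit.
function serializeAttribute(name, value) {
  if (value === "") return name; // empty attribute syntax
  const escaped = value.replace(/&/g, "&amp;");
  // Unquoted: no space characters and none of " ' = < > `
  if (!/[ \t\n\f\r"'=<>`]/.test(escaped)) return name + "=" + escaped;
  // Double-quoted: the value must not contain a literal quotation mark.
  if (!escaped.includes('"')) return name + '="' + escaped + '"';
  // Single-quoted: the value must not contain a literal apostrophe.
  if (!escaped.includes("'")) return name + "='" + escaped + "'";
  // Both quote characters present: escape the quotation marks.
  return name + '="' + escaped.replace(/"/g, "&quot;") + '"';
}

console.log(serializeAttribute("disabled", ""));     // disabled
console.log(serializeAttribute("value", "yes"));     // value=yes
console.log(serializeAttribute("name", "be evil"));  // name="be evil"
```

When several attributes are joined into one start tag, separating them with a single space satisfies the spacing requirements stated for each syntax.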
When a foreign element has one of the namespaced attributes given by the local name and namespace of the first and second cells of a row from the following table, it must be written using the name given by the third cell from the same row.
| Local name | Namespace | Attribute name |
|---|---|---|
| actuate | XLink namespace | xlink:actuate |
| arcrole | XLink namespace | xlink:arcrole |
| href | XLink namespace | xlink:href |
| role | XLink namespace | xlink:role |
| show | XLink namespace | xlink:show |
| title | XLink namespace | xlink:title |
| type | XLink namespace | xlink:type |
| base | XML namespace | xml:base |
| lang | XML namespace | xml:lang |
| space | XML namespace | xml:space |
| xmlns | XMLNS namespace | xmlns |
| xlink | XMLNS namespace | xmlns:xlink |
No other namespaced attribute can be expressed in the HTML syntax.
Certain tags can be omitted.
Omitting an element's start tag does not mean the element
is not present; it is implied, but it is still there. An HTML
document always has a root html element, even if the
string <html> doesn't appear anywhere in
the markup.
An html element's start tag may be omitted if the
first thing inside the html element is not a comment.
An html element's end
tag may be omitted if the html element is not
immediately followed by a comment.
A head element's start tag may be omitted if the
element is empty, or if the first thing inside the
head element is an element.
A head element's end
tag may be omitted if the head element is not
immediately followed by a space character or a comment.
A body element's start tag may be omitted if the
element is empty, or if the first thing inside the body
element is not a space character or a comment, except if the first thing
inside the body element is a script or
style element.
A body element's end
tag may be omitted if the body element is not
immediately followed by a comment.
A li element's end
tag may be omitted if the li element is
immediately followed by another li element or if there
is no more content in the parent element.
A dt element's end
tag may be omitted if the dt element is
immediately followed by another dt element or a
dd element.
A dd element's end
tag may be omitted if the dd element is
immediately followed by another dd element or a
dt element, or if there is no more content in the
parent element.
A p element's end
tag may be omitted if the p element is
immediately followed by an address,
article, aside, blockquote,
dir,
div, dl, fieldset,
footer, form, h1,
h2, h3, h4, h5,
h6, header, hgroup,
hr, menu, nav,
ol, p, pre,
section, table, or ul
element, or if there is no more content in the parent element and
the parent element is not an a element.
An rt element's end
tag may be omitted if the rt element is
immediately followed by an rt or rp
element, or if there is no more content in the parent element.
An rp element's end
tag may be omitted if the rp element is
immediately followed by an rt or rp
element, or if there is no more content in the parent element.
An optgroup element's end tag may be omitted if the
optgroup element is immediately followed by
another optgroup element, or if there is no
more content in the parent element.
An option element's end
tag may be omitted if the option element is
immediately followed by another option element, or if
it is immediately followed by an optgroup element, or
if there is no more content in the parent element.
A colgroup element's start tag may be omitted if the
first thing inside the colgroup element is a
col element, and if the element is not immediately
preceded by another colgroup element whose end tag has been omitted. (It can't be
omitted if the element is empty.)
A colgroup element's end tag may be omitted if the
colgroup element is not immediately followed by a
space character or a comment.
A thead element's end
tag may be omitted if the thead element is
immediately followed by a tbody or tfoot
element.
A tbody element's start tag may be omitted if the
first thing inside the tbody element is a
tr element, and if the element is not immediately
preceded by a tbody, thead, or
tfoot element whose end
tag has been omitted. (It can't be omitted if the element is
empty.)
A tbody element's end
tag may be omitted if the tbody element is
immediately followed by a tbody or tfoot
element, or if there is no more content in the parent element.
A tfoot element's end
tag may be omitted if the tfoot element is
immediately followed by a tbody element, or if there is
no more content in the parent element.
A tr element's end
tag may be omitted if the tr element is
immediately followed by another tr element, or if there
is no more content in the parent element.
A td element's end
tag may be omitted if the td element is
immediately followed by a td or th
element, or if there is no more content in the parent element.
A th element's end
tag may be omitted if the th element is
immediately followed by a td or th
element, or if there is no more content in the parent element.
However, a start tag must never be omitted if it has any attributes.
For historical reasons, certain elements have extra restrictions beyond even the restrictions given by their content model.
A table element must not contain tr
elements, even though these elements are technically allowed inside
table elements according to the content models
described in this specification. (If a tr element is
put inside a table in the markup, it will in fact imply
a tbody start tag before it.)
A single newline may be
placed immediately after the start
tag of pre and textarea
elements. This does not affect the processing of the element. The
otherwise optional newline
must be included if the element's contents themselves start
with a newline (because
otherwise the leading newline in the contents would be treated like
the optional newline, and ignored).
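A serializer can honor this rule by emitting the optional newline whenever the contents begin with one. The sketch below uses a hypothetical function name and assumes the contents have already been escaped as required for a pre element.

```javascript
// escapedContents: text already escaped for use inside a pre element.
function serializePre(escapedContents) {
  // If the contents start with a newline, add the otherwise optional
  // newline so the parser drops that one and keeps the real one.
  const guard = /^[\r\n]/.test(escapedContents) ? "\n" : "";
  return "<pre>" + guard + escapedContents + "</pre>";
}

console.log(JSON.stringify(serializePre("\nfoo"))); // "<pre>\n\nfoo</pre>"
console.log(JSON.stringify(serializePre("foo")));   // "<pre>foo</pre>"
```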
The text in raw text and
RCDATA elements must not
contain any occurrences of the string "</"
(U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that
case-insensitively match the tag name of the element followed by one
of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM
FEED (FF), U+000D CARRIAGE RETURN (CR), U+0020 SPACE, U+003E
GREATER-THAN SIGN (>), or U+002F SOLIDUS (/).
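The restriction can be checked with a single regular expression. This is a minimal sketch with a hypothetical function name; it assumes the tag name consists only of ASCII letters and digits, which holds for script, style, textarea, and title.

```javascript
// True if text contains "</" + tagName followed by one of the
// delimiter characters, i.e. something that would end the element early.
function hasEndTagLookalike(text, tagName) {
  if (!/^[A-Za-z0-9]+$/.test(tagName)) {
    throw new RangeError("unexpected tag name: " + tagName);
  }
  const pattern = new RegExp("</" + tagName + "[\\t\\n\\f\\r />]", "i");
  return pattern.test(text);
}

console.log(hasEndTagLookalike("var s = '</script>';", "script")); // true
console.log(hasEndTagLookalike("var s = '</scripts';", "script")); // false
```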
Text is allowed inside elements, attributes, and comments. Text must consist of Unicode characters. Text must not contain U+0000 characters. Text must not contain permanently undefined Unicode characters (noncharacters). Text must not contain control characters other than space characters. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections.
Newlines in HTML may be represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF) characters in that order.
In certain cases described in other sections, text may be mixed with character references. These can be used to escape characters that couldn't otherwise legally be included in text.
Character references must start with a U+0026 AMPERSAND character (&). Following this, there are three possible kinds of character references:
The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, permanently undefined Unicode characters (noncharacters), and control characters other than space characters.
An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by some text other than a space character, a U+003C LESS-THAN SIGN character (<), or another U+0026 AMPERSAND character (&).
CDATA sections must start with
the character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION
MARK, U+005B LEFT SQUARE BRACKET, U+0043 LATIN CAPITAL LETTER C,
U+0044 LATIN CAPITAL LETTER D, U+0041 LATIN CAPITAL LETTER A, U+0054
LATIN CAPITAL LETTER T, U+0041 LATIN CAPITAL LETTER A, U+005B LEFT
SQUARE BRACKET (<![CDATA[). Following this
sequence, the CDATA section may have text, with the additional restriction
that the text must not contain the three character sequence U+005D
RIGHT SQUARE BRACKET, U+005D RIGHT SQUARE BRACKET, U+003E
GREATER-THAN SIGN (]]>). Finally, the CDATA
section must be ended by the three character sequence U+005D RIGHT
SQUARE BRACKET, U+005D RIGHT SQUARE BRACKET, U+003E GREATER-THAN
SIGN (]]>).
CDATA sections can only be used in foreign content (MathML or
SVG). In this example, a CDATA section is used to escape the
contents of an ms element:
<p>You can add a string to a number, but this stringifies the number:</p>
<math>
 <ms><![CDATA[x<y]]></ms>
 <mo>+</mo>
 <mn>3</mn>
 <mo>=</mo>
 <ms><![CDATA[x<y3]]></ms>
</math>
Comments must start with the
four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION
MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may
have text, with the additional
restriction that the text must not start with a single U+003E
GREATER-THAN SIGN character (>), nor start with a U+002D
HYPHEN-MINUS character (-) followed by a U+003E GREATER-THAN SIGN
(>) character, nor contain two consecutive U+002D HYPHEN-MINUS
characters (--), nor end with a U+002D
HYPHEN-MINUS character (-). Finally, the comment must be ended by
the three character sequence U+002D HYPHEN-MINUS, U+002D
HYPHEN-MINUS, U+003E GREATER-THAN SIGN (-->).
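The four restrictions on comment text translate directly into a predicate. The function name is illustrative, and the sketch checks only the restrictions listed in this paragraph, not the general requirements on text.

```javascript
// text is what would appear between "<!--" and "-->".
function isValidCommentText(text) {
  return (
    !text.startsWith(">") &&
    !text.startsWith("->") &&
    !text.includes("--") &&
    !text.endsWith("-")
  );
}

console.log(isValidCommentText(" this is fine "));  // true
console.log(isValidCommentText(" a -- b "));        // false
console.log(isValidCommentText("trailing hyphen-")); // false
```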
This section only applies to user agents, data mining tools, and conformance checkers.
The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XHTML syntax".
For HTML documents, user agents must use the parsing rules described in this section to generate the DOM trees. Together, these rules define what is referred to as the HTML parser.
While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed Web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.
Authors interested in using SGML tools in their authoring pipeline are encouraged to use XML tools and the XML serialization of HTML.
This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined: user agents must either act as described below when encountering such problems, or must abort processing at the first error that they encounter for which they do not wish to apply the rules described below.
Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error condition exists in the document. Conformance checkers are not required to recover from parse errors.
Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.
For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax, then it is an HTML document.
The input to the HTML parsing process consists of a stream of
Unicode characters, which is passed through a
tokenization stage followed by a tree
construction stage. The output is a Document
object.
Implementations that do not
support scripting do not have to actually create a DOM
Document object, but the DOM tree in such cases is
still used as the model for the rest of the specification.
In the common case, the data handled by the tokenization stage
comes from the network, but it can also come from script, e.g. using the document.write() API.

There is only one set of states for the tokenizer stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.
In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" start tag token:
...
<script>
document.write('<p>');
</script>
...
To handle these cases, parsers have a script nesting level, which must be initially set to zero, and a parser pause flag, which must be initially set to false.
The stream of Unicode characters that comprises the input to the tokenization stage will be initially seen by the user agent as a stream of bytes (typically coming over the network or from the local file system). The bytes encode the actual characters according to a particular character encoding, which the user agent must use to decode the bytes into characters.
For XML documents, the algorithm user agents must use to determine the character encoding is given by the XML specification. This section does not apply to XML documents. [XML]
In some cases, it might be impractical to unambiguously determine the encoding before parsing the document. Because of this, this specification provides for a two-pass mechanism with an optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing algorithm to whatever bytes they have available before beginning to parse the document. Then, the real parser is started, using a tentative encoding derived from this pre-parse and other out-of-band metadata. If, while the document is being loaded, the user agent discovers an encoding declaration that conflicts with this information, then the parser can get reinvoked to perform a parse of the document with the real encoding.
User agents must use the following algorithm (the encoding sniffing algorithm) to determine the character encoding to use when decoding a document in the first pass. This algorithm takes as input any out-of-band metadata available to the user agent (e.g. the Content-Type metadata of the document) and all the bytes available so far, and returns an encoding and a confidence. The confidence is either tentative, certain, or irrelevant. The encoding used, and whether the confidence in that encoding is tentative or certain, is used during the parsing to determine whether to change the encoding. If no encoding is necessary, e.g. because the parser is operating on a stream of Unicode characters and doesn't have to use an encoding at all, then the confidence is irrelevant.
If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.
The user agent may wait for more bytes of the resource to be available, either in this step or at any later step in this algorithm. For instance, a user agent might wait 500ms or 512 bytes, whichever came first. In general preparsing the source to find the encoding improves performance, as it reduces the need to throw away the data structures used when parsing upon finding the encoding information. However, if the user agent delays too long to obtain data to determine the encoding, then the cost of the delay could outweigh any performance improvements from the preparse.
For each of the rows in the following table, starting with the first one and going down, if there are as many or more bytes available than the number of bytes in the first column, and the first bytes of the file match the bytes given in the first column, then return the encoding given in the cell in the second column of that row, with the confidence certain, and abort these steps:
| Bytes in Hexadecimal | Encoding |
|---|---|
| FE FF | UTF-16BE |
| FF FE | UTF-16LE |
| EF BB BF | UTF-8 |
This step looks for Unicode Byte Order Marks (BOMs).
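The table lookup in this step might be sketched as follows. The names BOM_TABLE and sniffBom, and the shape of the returned object, are assumptions made for illustration.

```javascript
// Rows in the same order as the table above.
const BOM_TABLE = [
  { bytes: [0xfe, 0xff], encoding: "UTF-16BE" },
  { bytes: [0xff, 0xfe], encoding: "UTF-16LE" },
  { bytes: [0xef, 0xbb, 0xbf], encoding: "UTF-8" },
];

// input: a Uint8Array holding the bytes available so far.
// Returns null when no row matches, so the caller can go on to the next step.
function sniffBom(input) {
  for (const row of BOM_TABLE) {
    const enough = input.length >= row.bytes.length;
    if (enough && row.bytes.every((b, i) => input[i] === b)) {
      return { encoding: row.encoding, confidence: "certain" };
    }
  }
  return null;
}

console.log(sniffBom(Uint8Array.of(0xef, 0xbb, 0xbf, 0x3c))); // { encoding: 'UTF-8', confidence: 'certain' }
console.log(sniffBom(Uint8Array.of(0x3c, 0x21)));             // null
```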
Otherwise, the user agent will have to search for explicit character encoding information in the file itself. This should proceed as follows:
Let position be a pointer to a byte in the input stream, initially pointing at the first byte. If at any point during these substeps the user agent either runs out of bytes or decides that scanning further bytes would not be efficient, then skip to the next step of the overall character encoding detection algorithm. User agents may decide that scanning any bytes is not efficient, in which case these substeps are entirely skipped.
Now, repeat the following "two" steps until the algorithm aborts (either because the user agent aborts, as described above, or because a character encoding is found):
If position points to:
Advance the position pointer so that it points at the first 0x3E byte which is preceded by two 0x2D bytes (i.e. at the end of an ASCII '-->' sequence) and comes after the 0x3C byte that was found. (The two 0x2D bytes can be the same as those in the '<!--' sequence.)
Advance the position pointer so that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or 0x2F byte (the one in the sequence of characters matched above).
Get an attribute and its value. If no attribute was sniffed, then skip this inner set of steps, and jump to the second step in the overall "two step" algorithm.
If the attribute's name is neither "charset" nor "content",
then return to step 2 in these inner steps.
If the attribute's name is "charset", let charset be
the attribute's value, interpreted as a character
encoding.
Otherwise, the attribute's name is "content": apply the algorithm for
extracting an encoding from a Content-Type, giving the
attribute's value as the string to parse. If an encoding is
returned, let charset be that
encoding. Otherwise, return to step 2 in these inner
steps.
If charset is a UTF-16 encoding, change the value of charset to UTF-8.
If charset is a supported character encoding, then return the given encoding, with confidence tentative, and abort all these steps.
Otherwise, return to step 2 in these inner steps.
Advance the position pointer so that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII >) byte.
Repeatedly get an attribute until no further attributes can be found, then jump to the second step in the overall "two step" algorithm.
Advance the position pointer so that it points at the first 0x3E byte (ASCII >) that comes after the 0x3C byte that was found.
Do nothing with that byte.
When the above "two step" algorithm says to get an attribute, it means doing this:
If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x2F (ASCII /) then advance position to the next byte and redo this substep.
If the byte at position is 0x3E (ASCII >), then abort the "get an attribute" algorithm. There isn't one.
Otherwise, the byte at position is the start of the attribute name. Let attribute name and attribute value be the empty string.
Attribute name: Process the byte at position as follows:
Advance position to the next byte and return to the previous step.
Spaces: If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space), then advance position to the next byte and repeat this step.
If the byte at position is not 0x3D (ASCII =), abort the "get an attribute" algorithm. The attribute's name is the value of attribute name, its value is the empty string.
Advance position past the 0x3D (ASCII =) byte.
Value: If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space), then advance position to the next byte and repeat this step.
Process the byte at position as follows:
Process the byte at position as follows:
Advance position to the next byte and return to the previous step.
For the sake of interoperability, user agents should not use a pre-scan algorithm that returns different results than the one described above. (But, if you do, please at least let us know, so that we can improve this algorithm and benefit everyone...)
If the user agent has information on the likely encoding for this page, e.g. based on the encoding of the page when it was last visited, then return that encoding, with the confidence tentative, and abort these steps.
The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream. Such algorithms may use information about the resource other than the resource's contents, including the address of the resource. If autodetection succeeds in determining a character encoding, then return that encoding, with the confidence tentative, and abort these steps. [UNIVCHARDET]
The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. User-agents are therefore encouraged to search for this common encoding. [PPUTF8] [UTF8DET]
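As a rough illustration of the note above, the following sketch treats a byte buffer as probably UTF-8 when it contains at least one byte above 0x7F and every byte sequence in it is well-formed UTF-8. The function name is hypothetical, and a real detector would also have to cope with a buffer that ends in the middle of a multi-byte sequence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: non-ASCII bytes are present and the whole buffer is valid UTF-8."""
    if not any(byte > 0x7F for byte in data):
        # Pure ASCII is valid UTF-8 but gives no evidence for it.
        return False
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True
```

Bytes produced by a single-byte legacy encoding rarely form valid UTF-8 sequences, which is why the strict decode is a useful signal.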
Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative.
In controlled environments or in environments where the
encoding of documents can be prescribed (for example, for user
agents intended for dedicated use in new networks), the
comprehensive UTF-8 encoding is
suggested.
In other environments, the default encoding is typically dependent on the user's locale (an approximation of the languages, and thus often encodings, of the pages that the user is likely to frequent). The following table gives suggested defaults based on the user's locale, for compatibility with legacy content. Locales are identified by BCP 47 language codes. [BCP47]
| Locale language | Suggested default encoding |
|---|---|
| ar | UTF-8 |
| be | ISO-8859-5 |
| bg | windows-1251 |
| cs | ISO-8859-2 |
| cy | UTF-8 |
| fa | UTF-8 |
| he | windows-1255 |
| hr | UTF-8 |
| hu | ISO-8859-2 |
| ja | Windows-31J |
| kk | UTF-8 |
| ko | windows-949 |
| ku | windows-1254 |
| lt | windows-1257 |
| lv | ISO-8859-13 |
| mk | UTF-8 |
| or | UTF-8 |
| pl | ISO-8859-2 |
| ro | UTF-8 |
| ru | windows-1251 |
| sk | windows-1250 |
| sl | ISO-8859-2 |
| sr | UTF-8 |
| th | windows-874 |
| tr | windows-1254 |
| uk | windows-1251 |
| vi | UTF-8 |
| zh-CN | GB18030 |
| zh-TW | Big5 |
| All other locales | windows-1252 |
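The table above can be read as a simple lookup. The sketch below is one possible encoding of it; the matching rule (try the full language tag first, then fall back to the primary language subtag) is an assumption made for illustration and is not stated in the table, and the function name is hypothetical.

```python
# Suggested defaults from the table, grouped by encoding for brevity.
_LOCALES_BY_ENCODING = {
    "UTF-8": "ar cy fa hr kk mk or ro sr vi",
    "ISO-8859-5": "be",
    "windows-1251": "bg ru uk",
    "ISO-8859-2": "cs hu pl sl",
    "windows-1255": "he",
    "Windows-31J": "ja",
    "windows-949": "ko",
    "windows-1254": "ku tr",
    "windows-1257": "lt",
    "ISO-8859-13": "lv",
    "windows-1250": "sk",
    "windows-874": "th",
    "GB18030": "zh-cn",
    "Big5": "zh-tw",
}
SUGGESTED_DEFAULTS = {
    locale: encoding
    for encoding, locales in _LOCALES_BY_ENCODING.items()
    for locale in locales.split()
}


def suggested_default_encoding(locale: str) -> str:
    """Return the suggested default encoding for a BCP 47 language tag."""
    tag = locale.strip().lower()
    if tag in SUGGESTED_DEFAULTS:          # e.g. "zh-TW"
        return SUGGESTED_DEFAULTS[tag]
    primary = tag.split("-", 1)[0]         # assumed fallback: "ru-RU" -> "ru"
    return SUGGESTED_DEFAULTS.get(primary, "windows-1252")
```

Under these assumptions a bare "zh" tag falls through to the "all other locales" row, since the table only lists zh-CN and zh-TW.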
The document's character encoding must immediately be set to the value returned from this algorithm, at the same time as the user agent uses the returned value to select the decoder to use for the input stream.
This algorithm is a willful violation of the HTTP specification, which requires that the encoding be assumed to be ISO-8859-1 in the absence of a character encoding declaration to the contrary, and of RFC 2046, which requires that the encoding be assumed to be US-ASCII in the absence of a character encoding declaration to the contrary. This specification's third approach is motivated by a desire to be maximally compatible with legacy content. [HTTP] [RFC2046]
User agents must at a minimum support the UTF-8 and Windows-1252 encodings, but may support more.
It is not unusual for Web browsers to support dozens if not upwards of a hundred distinct character encodings.
User agents must support the preferred MIME name of every character encoding they support, and should support all the IANA-registered names and aliases of every character encoding they support. [IANACHARSET]
When comparing a string specifying a character encoding with the name or alias of a character encoding to determine if they are equal, user agents must remove any leading or trailing space characters in both names, and then perform the comparison in an ASCII case-insensitive manner.
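A minimal sketch of this comparison follows. It assumes that "space characters" means U+0009, U+000A, U+000C, U+000D, and U+0020 (the same set of bytes the pre-scan algorithm skips), and it lowercases only A to Z so that non-ASCII characters are never folded. The names are hypothetical.

```python
SPACE_CHARACTERS = " \t\n\f\r"  # assumed set of space characters
_ASCII_LOWER = {c: c + 0x20 for c in range(0x41, 0x5B)}  # A-Z -> a-z only


def encoding_names_equal(a: str, b: str) -> bool:
    """Compare two encoding labels, ignoring outer spaces and ASCII case."""
    def normalize(label: str) -> str:
        return label.strip(SPACE_CHARACTERS).translate(_ASCII_LOWER)

    return normalize(a) == normalize(b)
```

Using str.lower() instead would also fold characters such as U+212A KELVIN SIGN to "k", which is more than an ASCII case-insensitive comparison allows.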
When a user agent would otherwise use an encoding given in the first column of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row. When a byte or sequence of bytes is treated differently due to this encoding aliasing, it is said to have been misinterpreted for compatibility.
| Input encoding | Replacement encoding | References |
|---|---|---|
| EUC-KR | windows-949 | [EUCKR] [WIN949] |
| GB2312 | GBK | [RFC1345] [GBK] |
| GB_2312-80 | GBK | [RFC1345] [GBK] |
| ISO-8859-1 | windows-1252 | [RFC1345] [WIN1252] |
| ISO-8859-9 | windows-1254 | [RFC1345] [WIN1254] |
| ISO-8859-11 | windows-874 | [ISO885911] [WIN874] |
| KS_C_5601-1987 | windows-949 | [RFC1345] [WIN949] |
| Shift_JIS | Windows-31J | [SHIFTJIS] [WIN31J] |
| TIS-620 | windows-874 | [TIS620] [WIN874] |
| US-ASCII | windows-1252 | [RFC1345] [WIN1252] |
The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification, motivated by a desire for compatibility with legacy content. [CHARMOD]
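The replacement table can be sketched as a dictionary lookup applied before a decoder or encoder is chosen. The sketch below only recognizes the names that appear in the first column; resolving IANA aliases to those names is left out, and the function name is hypothetical.

```python
_ASCII_LOWER = {c: c + 0x20 for c in range(0x41, 0x5B)}  # A-Z -> a-z only

# First column (lowercased) -> second column of the table above.
REPLACEMENT_ENCODINGS = {
    "euc-kr": "windows-949",
    "gb2312": "GBK",
    "gb_2312-80": "GBK",
    "iso-8859-1": "windows-1252",
    "iso-8859-9": "windows-1254",
    "iso-8859-11": "windows-874",
    "ks_c_5601-1987": "windows-949",
    "shift_jis": "Windows-31J",
    "tis-620": "windows-874",
    "us-ascii": "windows-1252",
}


def effective_encoding(name: str) -> str:
    """Return the encoding to use in place of `name`, or `name` itself."""
    label = name.strip(" \t\n\f\r")
    return REPLACEMENT_ENCODINGS.get(label.translate(_ASCII_LOWER), label)
```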
When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE.
The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781]
User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings. [CESU8] [UTF7] [BOCU1] [SCSU]
Support for encodings based on EBCDIC is not recommended. These encodings are rarely used for publicly-facing Web content.
Support for UTF-32 is not recommended. This encoding is rarely used, and frequently implemented incorrectly.
This specification does not make any attempt to support EBCDIC-based encodings and UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior in implementations of this specification.
Given an encoding, the bytes in the input stream must be converted to Unicode characters for the tokenizer, as described by the rules for that encoding, except that the leading U+FEFF BYTE ORDER MARK character, if any, must not be stripped by the encoding layer (it is stripped by the rule below).
Bytes or sequences of bytes in the original byte stream that could not be converted to Unicode code points must be converted to U+FFFD REPLACEMENT CHARACTERs.
Bytes or sequences of bytes in the original byte stream that did not conform to the encoding specification (e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are errors that conformance checkers are expected to report.
Any byte or sequence of bytes in the original byte stream that is misinterpreted for compatibility is a parse error.
One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.
The requirement to strip a U+FEFF BYTE ORDER MARK character regardless of whether that character was used to determine the byte order is a willful violation of Unicode, motivated by a desire to increase the resilience of user agents in the face of naïve transcoders.
All U+0000 NULL characters and code points in the range U+D800 to U+DFFF in the input must be replaced by U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters and code points are parse errors.
Any occurrences of any characters in the ranges U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).
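The two preceding rules can be sketched as a pair of predicates over code points, plus the replacement step. The noncharacters at the end of each plane share the low sixteen bits FFFE or FFFF, which the sketch uses instead of enumerating them. Function names are hypothetical.

```python
def must_be_replaced(cp: int) -> bool:
    """U+0000 and surrogates are replaced by U+FFFD."""
    return cp == 0x0000 or 0xD800 <= cp <= 0xDFFF


def is_parse_error_code_point(cp: int) -> bool:
    """True for code points whose occurrence in the input is a parse error."""
    if must_be_replaced(cp):
        return True
    if (0x0001 <= cp <= 0x0008 or 0x000E <= cp <= 0x001F
            or 0x007F <= cp <= 0x009F or 0xFDD0 <= cp <= 0xFDEF):
        return True
    if cp == 0x000B:
        return True
    # U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def replace_forbidden(text: str) -> str:
    """Apply the U+FFFD replacement to a decoded string."""
    return "".join("\uFFFD" if must_be_replaced(ord(ch)) else ch for ch in text)
```

Note that TAB, LF, FF, and CR are deliberately absent from the parse-error ranges.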
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF) characters are treated specially. Any CR characters that are followed by LF characters must be removed, and any CR characters not followed by LF characters must be converted to LF characters. Thus, newlines in HTML DOMs are represented by LF characters, and there are never any CR characters in the input to the tokenization stage.
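The newline rule above, together with the earlier rule that one leading U+FEFF is ignored, might be sketched as two small string transformations. This is an illustration on an already-decoded string rather than on a stream, and the function names are hypothetical.

```python
def strip_leading_bom(text: str) -> str:
    """Ignore exactly one leading U+FEFF BYTE ORDER MARK, if present."""
    return text[1:] if text.startswith("\uFEFF") else text


def normalize_newlines(text: str) -> str:
    """Remove CR before LF, then turn every remaining CR into LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

After normalize_newlines() the string contains no CR characters, matching the statement that none reach the tokenization stage.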
The next input character is the first character in the input stream that has not yet been consumed. Initially, the next input character is the first character in the input. The current input character is the last character to have been consumed.
The insertion point is the position (just before a
character or just before the end of the input stream) where content
inserted using document.write() is actually
inserted. The insertion point is relative to the position of the
character immediately after it; it is not an absolute offset into
the input stream. Initially, the insertion point is
undefined.
The "EOF" character in the tables below is a conceptual character
representing the end of the input stream. If the parser
is a script-created parser, then the end of the
input stream is reached when an explicit "EOF"
character (inserted by the document.close() method) is
consumed. Otherwise, the "EOF" character is not a real character in
the stream, but rather the lack of any further characters.
When the parser requires the user agent to change the encoding, it must run the following steps. This might happen if the encoding sniffing algorithm described above failed to find an encoding, or if it found an encoding that was not the actual encoding of the file.
The insertion mode is a state variable that controls the primary operation of the tree construction stage.
Initially, the insertion mode is "initial". It can change to "before html", "before head", "in head", "in head noscript", "after head", "in body", "text", "in table", "in table text", "in caption", "in column group", "in table body", "in row", "in cell", "in select", "in select in table", "in foreign content", "after body", "in frameset", "after frameset", "after after body", and "after after frameset" during the course of the parsing, as described in the tree construction stage. The insertion mode affects how tokens are processed and whether CDATA sections are supported.
Seven of these modes, namely "in head", "in body", "in table", "in table body", "in row", "in cell", and "in select", are special, in that the other modes defer to them at various times. When the algorithm below says that the user agent is to do something "using the rules for the m insertion mode", where m is one of these modes, the user agent must use the rules described under the m insertion mode's section, but must leave the insertion mode unchanged unless the rules in m themselves switch the insertion mode to a new value.
When the insertion mode is switched to "text" or "in table text", the original insertion mode is also set. This is the insertion mode to which the tree construction stage will return.
When the insertion mode is switched to "in foreign content", the secondary insertion mode is also set. This secondary mode is used within the rules for the "in foreign content" mode to handle HTML (i.e. not foreign) content.
When the steps below require the UA to reset the insertion mode appropriately, it means the UA must follow these steps:
If node is a select element, then switch the insertion mode to "in select" and jump to the step labeled end. (fragment case)
If node is a td or th element and last is false, then switch the insertion mode to "in cell" and jump to the step labeled end.
If node is a tr element, then switch the insertion mode to "in row" and jump to the step labeled end.
If node is a tbody, thead, or tfoot element, then switch the insertion mode to "in table body" and jump to the step labeled end.
If node is a caption element, then switch the insertion mode to "in caption" and jump to the step labeled end.
If node is a colgroup element, then switch the insertion mode to "in column group" and jump to the step labeled end. (fragment case)
If node is a table element, then switch the insertion mode to "in table" and jump to the step labeled end.
If node is a head element, then switch the insertion mode to "in body" ("in body"! not "in head"!) and jump to the step labeled end. (fragment case)
If node is a body element, then switch the insertion mode to "in body" and jump to the step labeled end.
If node is a frameset element, then switch the insertion mode to "in frameset" and jump to the step labeled end. (fragment case)
If node is an html element, then switch the insertion mode to "before head". Then, jump to the step labeled end. (fragment case)
Initially, the stack of open elements is empty. The stack grows downwards; the topmost node on the stack is the first one added to the stack, and the bottommost node of the stack is the most recently added node in the stack (notwithstanding when the stack is manipulated in a random access fashion as part of the handling for misnested tags).
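The element-to-mode pairs in the reset steps above amount to a lookup table. The sketch below covers only that per-node decision; how node and last are chosen while walking the stack is not shown in this section, so the function simply returns None when no listed step applies. The function and table names are hypothetical.

```python
from typing import Optional

_MODE_FOR_ELEMENT = {
    "select": "in select",
    "tr": "in row",
    "tbody": "in table body",
    "thead": "in table body",
    "tfoot": "in table body",
    "caption": "in caption",
    "colgroup": "in column group",
    "table": "in table",
    "head": "in body",        # "in body", not "in head"
    "body": "in body",
    "frameset": "in frameset",
    "html": "before head",
}


def mode_for_node(name: str, last: bool) -> Optional[str]:
    """Return the insertion mode selected by `node`, or None if no step matches."""
    if name in ("td", "th"):
        # Table cells only select "in cell" when last is false.
        return None if last else "in cell"
    return _MODE_FOR_ELEMENT.get(name)
```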
The "before
html" insertion mode creates the
html root element node, which is then added to the
stack.
In the fragment case, the stack of open
elements is initialized to contain an html
element that is created as part of that algorithm. (The fragment
case skips the "before html" insertion mode.)
The html node, however it is created, is the topmost
node of the stack. It only gets popped off the stack when the parser
finishes.
The current node is the bottommost node in this stack.
The current table is the last table
element in the stack of open elements, if there is
one. If there is no table element in the stack of
open elements (fragment case), then the
current table is the first element in the stack
of open elements (the html element).
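The terminology above (topmost, bottommost, current node, current table) maps naturally onto a list whose first entry is the html element. The following sketch is one way to model it; the Element and StackOfOpenElements classes are hypothetical and ignore namespaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass(eq=False)  # compare elements by identity, not by name
class Element:
    name: str


class StackOfOpenElements:
    def __init__(self) -> None:
        self._nodes: List[Element] = []  # index 0 is the topmost node

    def push(self, node: Element) -> None:
        self._nodes.append(node)  # the stack grows downwards

    def pop(self) -> Element:
        return self._nodes.pop()

    @property
    def current_node(self) -> Element:
        """The bottommost (most recently added) node."""
        return self._nodes[-1]

    @property
    def current_table(self) -> Element:
        """The last table element, or the first element in the fragment case."""
        for node in reversed(self._nodes):
            if node.name == "table":
                return node
        return self._nodes[0]
```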
Elements in the stack fall into the following categories:
The following HTML elements have varying levels of special
parsing rules: address, area,
article, aside, base,
basefont, bgsound,
blockquote, body, br,
center, col, colgroup,
command,
dd, details, dir,
div, dl, dt,
embed, fieldset, figure,
footer, form, frame,
frameset, h1, h2,
h3, h4, h5, h6,
head, header, hgroup,
hr, iframe, img,
input, isindex, li,
link, listing, menu,
meta, nav, noembed,
noframes, noscript, ol,
p, param, plaintext,
pre, script, section,
select, style, tbody,
textarea, tfoot, thead,
title, tr, ul,
wbr, and xmp.
The following HTML elements introduce new scopes for various parts of the
parsing: applet, button,
caption, html, marquee,
object, table, td,
th, and SVG's foreignObject.
The following HTML elements are those that end up in the
list of active formatting elements: a,
b, big, code,
em, font, i,
nobr, s, small,
strike, strong, tt, and
u.
All other elements found while parsing an HTML document.
The stack of open elements is said to have an element in a specific scope consisting of a list of element types list when the following algorithm terminates in a match state:
Initialize node to be the current node (the bottommost node of the stack).
If node is the target node, terminate in a match state.
Otherwise, if node is one of the element types in list, terminate in a failure state.
Otherwise, set node to the previous
entry in the stack of open elements and return to step
2. (This will never fail, since the loop will always terminate in
the previous step if the top of the stack — an
html element — is reached.)
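The scope algorithm above is a walk from the current node towards the top of the stack. In the sketch below the stack is a list of element names with the html element first, the target is identified by name, and namespaces are ignored; all three are simplifying assumptions. IN_SCOPE_TYPES holds the element types named for "has an element in scope".

```python
from typing import Collection, Sequence

IN_SCOPE_TYPES = frozenset({
    "applet", "caption", "html", "table", "td", "th",
    "button", "marquee", "object", "foreignObject",
})


def has_element_in_specific_scope(
    stack: Sequence[str], target: str, scope_types: Collection[str]
) -> bool:
    """Walk from the bottommost node upwards until a match or a scope boundary."""
    for node in reversed(stack):
        if node == target:
            return True       # match state
        if node in scope_types:
            return False      # failure state
    # Not reached when scope_types contains "html" and the stack starts with it.
    return False
```

For example, a p element inside a table cell is in scope, while the body element above that cell is not, because td is a scope boundary.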
The stack of open elements is said to have an element in scope when it has an element in the specific scope consisting of the following element types:
applet in the HTML namespace
caption in the HTML namespace
html in the HTML namespace
table in the HTML namespace
td in the HTML namespace
th in the HTML namespace
button in the HTML namespace
marquee in the HTML namespace
object in the HTML namespace
foreignObject in the SVG namespace
The stack of open elements is said to have an element in list item scope when it has an element in the specific scope consisting of the following element types:
ol in the HTML namespace
ul in the HTML namespace
The stack of open elements is said to have an element in table scope when it has an element in the specific scope consisting of the following element types:
Nothing happens if at any time any of the elements in the
stack of open elements are moved to a new location in,
or removed from, the Document tree. In particular, the
stack is not changed in this situation. This can cause, amongst
other strange effects, content to be appended to nodes that are no
longer in the DOM.
In some cases (namely, when closing misnested formatting elements), the stack is manipulated in a random-access fashion.
Initially, the list of active formatting elements is empty. It is used to handle mis-nested formatting element tags.
The list contains elements in the formatting
category, and scope markers. The scope markers are inserted when
entering applet elements, buttons, object
elements, marquees, table cells, and table captions, and are used to
prevent formatting from "leaking" into applet
elements, buttons, object elements, marquees, and
tables.
The scope markers are unrelated to the concept of an element being in scope.
In addition, each element in the list of active formatting elements is associated with the token for which it was created, so that further elements can be created for that token if necessary.
When the steps below require the UA to reconstruct the active formatting elements, the UA must perform the following steps:
This has the effect of reopening all the formatting elements that were opened in the current body, cell, or caption (whichever is youngest) that haven't been explicitly closed.
The way this specification is written, the list of active formatting elements always consists of elements in chronological order with the least recently added element first and the most recently added element last (except for while steps 8 to 11 of the above algorithm are being executed, of course).
When the steps below require the UA to clear the list of active formatting elements up to the last marker, the UA must perform the following steps:
Initially, the head element
pointer and the form element
pointer are both null.
Once a head element has been parsed (whether
implicitly or explicitly) the head
element pointer gets set to point to this node.
The form element pointer
points to the last form element that was opened and
whose end tag has not yet been seen. It is used to make form
controls associate with forms in the face of dramatically bad
markup, for historical reasons.
The scripting flag is set to "enabled" if scripting was enabled for the
Document with which the parser is associated when the
parser was created, and "disabled" otherwise.
The scripting flag can be enabled even
when the parser was originally created for the HTML fragment
parsing algorithm, even though script elements
don't execute in that case.
The frameset-ok flag is set to "ok" when the parser is created. It is set to "not ok" after certain tokens are seen.
Implementations must act as if they used the following state machine to tokenize HTML. The state machine must start in the data state. Most states consume a single character, which may have various side-effects, and either switch the state machine to a new state to reconsume the same character, or switch it to a new state (to consume the next character), or repeat the same state (to consume the next character). Some states have more complicated behavior and can consume several characters before switching to another state. In some cases, the tokenizer state is also changed by the tree construction stage.
The exact behavior of certain states depends on the insertion mode and the stack of open elements. Certain states also use a temporary buffer to track progress.
The output of the tokenization step is a series of zero or more of the following tokens: DOCTYPE, start tag, end tag, comment, character, end-of-file. DOCTYPE tokens have a name, a public identifier, a system identifier, and a force-quirks flag. When a DOCTYPE token is created, its name, public identifier, and system identifier must be marked as missing (which is a distinct state from the empty string), and the force-quirks flag must be set to off (its other state is on). Start and end tag tokens have a tag name, a self-closing flag, and a list of attributes, each of which has a name and a value. When a start or end tag token is created, its self-closing flag must be unset (its other state is that it be set), and its attributes list must be empty. Comment and character tokens have data.
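The token shapes described above can be sketched as plain data classes. The class and field names are hypothetical; None stands for "missing", which keeps it distinct from the empty string, and the defaults mirror the required initial values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    name: str
    value: str = ""


@dataclass
class DoctypeToken:
    name: Optional[str] = None               # None means "missing"
    public_identifier: Optional[str] = None
    system_identifier: Optional[str] = None
    force_quirks: bool = False                # off


@dataclass
class TagToken:
    tag_name: str = ""
    self_closing: bool = False                # unset
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class StartTagToken(TagToken):
    pass


@dataclass
class EndTagToken(TagToken):
    pass


@dataclass
class CommentToken:
    data: str = ""


@dataclass
class CharacterToken:
    data: str = ""


@dataclass
class EndOfFileToken:
    pass
```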
When a token is emitted, it must immediately be handled by the
tree construction stage. The tree construction stage
can affect the state of the tokenization stage, and can insert
additional characters into the stream. (For example, the
script element can result in scripts executing and
using the dynamic markup insertion APIs to insert
characters into the stream being tokenized.)
When a start tag token is emitted with its self-closing flag set, if the flag is not acknowledged when it is processed by the tree construction stage, that is a parse error.
When an end tag token is emitted with attributes, that is a parse error.
When an end tag token is emitted with its self-closing flag set, that is a parse error.
An appropriate end tag token is an end tag token whose tag name matches the tag name of the last start tag to have been emitted from this tokenizer, if any. If no start tag has been emitted from this tokenizer, then no end tag token is appropriate.
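As a small sketch of this definition, a tokenizer could remember the tag name of the last start tag it emitted and compare against it. The function and parameter names are hypothetical, with None standing for "no start tag has been emitted yet".

```python
from typing import Optional


def is_appropriate_end_tag(end_tag_name: str, last_start_tag_name: Optional[str]) -> bool:
    """True if the end tag matches the last start tag emitted by the tokenizer."""
    if last_start_tag_name is None:
        return False  # no start tag emitted: no end tag is appropriate
    return end_tag_name == last_start_tag_name
```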
Before each step of the tokenizer, the user agent must first check the parser pause flag. If it is true, then the tokenizer must abort the processing of any nested invocations of the tokenizer, yielding control back to the caller.
The tokenizer state machine consists of the states defined in the following subsections.
Consume the next input character:
Attempt to consume a character reference, with no additional allowed character.
If nothing is returned, emit a U+0026 AMPERSAND character token.
Otherwise, emit the character token that was returned.
Finally, switch to the data state.
Consume the next input character:
Attempt to consume a character reference, with no additional allowed character.
If nothing is returned, emit a U+0026 AMPERSAND character token.
Otherwise, emit the character token that was returned.
Finally, switch to the RCDATA state.
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Switch to the script data escaped less-than sign state.
Consume the next input character:
Switch to the script data escaped less-than sign state.
Consume the next input character:
Switch to the script data escaped less-than sign state.
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
If the temporary buffer is the string "script", then switch to the script data double escaped state. Otherwise, switch to the script data escaped state.
Consume the next input character:
Emit a U+003C LESS-THAN SIGN character token. Switch to the script data double escaped less-than sign state.
Consume the next input character:
Emit a U+003C LESS-THAN SIGN character token. Switch to the script data double escaped less-than sign state.
Consume the next input character:
Emit a U+003C LESS-THAN SIGN character token. Switch to the script data double escaped less-than sign state.
Consume the next input character:
Consume the next input character:
If the temporary buffer is the string "script", then switch to the script data escaped state. Otherwise, switch to the script data double escaped state.
Consume the next input character:
Consume the next input character:
When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be dropped, along with the value that gets associated with it (if any).
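The duplicate-attribute rule can be sketched as a guarded append. For brevity the sketch receives the name and value together, whereas the rule above makes the decision as soon as the name is complete; the function name and the tuple representation of attributes are hypothetical.

```python
from typing import List, Tuple


def add_attribute(attributes: List[Tuple[str, str]], name: str, value: str) -> bool:
    """Append the attribute unless its name is already present.

    Returns True when a parse error occurred (the new attribute was dropped).
    """
    if any(existing_name == name for existing_name, _ in attributes):
        return True  # parse error: the first attribute with this name wins
    attributes.append((name, value))
    return False
```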
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Attempt to consume a character reference.
If nothing is returned, append a U+0026 AMPERSAND character to the current attribute's value.
Otherwise, append the returned character token to the current attribute's value.
Finally, switch back to the attribute value state that you were in when you were switched into this state.
Consume the next input character:
Consume the next input character:
Consume every character up to and including the first U+003E GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever comes first. Emit a comment token whose data is the concatenation of all the characters starting from and including the character that caused the state machine to switch into the bogus comment state, up to and including the character immediately before the last consumed character (i.e. up to the character just before the U+003E or EOF character). (If the comment was started by the end of the file (EOF), the token is empty.)
Switch to the data state.
If the end of the file was reached, reconsume the EOF character.
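The bogus comment state above can be sketched over an in-memory string. Here start is assumed to be the index of the character that caused the switch into this state, and the end of the string plays the role of EOF; the function name is hypothetical.

```python
from typing import Tuple


def bogus_comment(stream: str, start: int) -> Tuple[str, int]:
    """Return (comment data, position of the next unconsumed character)."""
    end = stream.find(">", start)
    if end == -1:
        # EOF came first: the comment runs to the end of the input.
        return stream[start:], len(stream)
    # The ">" is consumed but is not part of the comment data.
    return stream[start:end], end + 1
```

When start equals len(stream), the comment was started by EOF and the returned data is empty.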
If the next two characters are both U+002D HYPHEN-MINUS characters (-), consume those two characters, create a comment token whose data is the empty string, and switch to the comment start state.
Otherwise, if the next seven characters are an ASCII case-insensitive match for the word "DOCTYPE", then consume those characters and switch to the DOCTYPE state.
Otherwise, if the insertion mode is "in foreign content" and the current node is not an element in the HTML namespace and the next seven characters are a case-sensitive match for the string "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before and after), then consume those characters and switch to the CDATA section state.
Otherwise, this is a parse error. Switch to the bogus comment state. The next character that is consumed, if any, is the first character that will be in the comment.
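The four alternatives above can be sketched as a function that inspects the characters at the current position and reports the next state. The two boolean parameters stand in for the insertion mode and current node checks and are hypothetical, as is the return shape of (state name, new position, parse error flag).

```python
from typing import Tuple

_ASCII_UPPER = {c: c - 0x20 for c in range(0x61, 0x7B)}  # a-z -> A-Z only


def markup_declaration_open(
    stream: str,
    pos: int,
    in_foreign_content: bool = False,
    current_node_is_html: bool = True,
) -> Tuple[str, int, bool]:
    """Decide which state follows "<!" at stream[pos:]."""
    if stream.startswith("--", pos):
        return "comment start", pos + 2, False
    if stream[pos:pos + 7].translate(_ASCII_UPPER) == "DOCTYPE":
        return "DOCTYPE", pos + 7, False
    if (in_foreign_content and not current_node_is_html
            and stream.startswith("[CDATA[", pos)):  # case-sensitive match
        return "CDATA section", pos + 7, False
    # Nothing is consumed here; the next character starts the bogus comment.
    return "bogus comment", pos, True
```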
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
If the six characters starting from the current input character are an ASCII case-insensitive match for the word "PUBLIC", then consume those characters and switch to the after DOCTYPE public keyword state.
Otherwise, if the six characters starting from the current input character are an ASCII case-insensitive match for the word "SYSTEM", then consume those characters and switch to the after DOCTYPE system keyword state.
Otherwise, this is a parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume the next input character:
Consume every character up to the next occurrence of the three
character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE
BRACKET U+003E GREATER-THAN SIGN (]]>), or the
end of the file (EOF), whichever comes first. Emit a series of
character tokens consisting of all the characters consumed except
the matching three character sequence at the end (if one was found
before the end of the file).
Switch to the data state.
If the end of the file was reached, reconsume the EOF character.
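A minimal sketch of the CDATA section state over an in-memory string follows. The returned text stands for the series of character tokens to emit, and the end of the string plays the role of EOF; the function name is hypothetical.

```python
from typing import Tuple


def cdata_section(stream: str, pos: int) -> Tuple[str, int]:
    """Return (characters to emit, position of the next unconsumed character)."""
    end = stream.find("]]>", pos)
    if end == -1:
        return stream[pos:], len(stream)   # EOF came first
    return stream[pos:end], end + 3        # skip the closing "]]>"
```

A lone "]]" inside the section is emitted as ordinary characters, since only the full three-character sequence ends the section.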
This section defines how to consume a character reference. This definition is used when parsing character references in text and in attributes.
The behavior depends on the identity of the next character (the one immediately after the U+0026 AMPERSAND character):
Consume the U+0023 NUMBER SIGN.
The behavior further depends on the character after the U+0023 NUMBER SIGN:
Consume the X.
Follow the steps below, but using the range of characters U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F (in other words, 0-9, A-F, a-f).
When it comes to interpreting the number, interpret it as a hexadecimal number.
Follow the steps below, but using the range of characters U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9).
When it comes to interpreting the number, interpret it as a decimal number.
Consume as many characters as match the range of characters given above.
If no characters match the range, then don't consume any characters (and unconsume the U+0023 NUMBER SIGN character and, if appropriate, the X character). This is a parse error; nothing is returned.
Otherwise, if the next character is a U+003B SEMICOLON, consume that too. If it isn't, there is a parse error.
If one or more characters match the range, then take them all and interpret the string of characters as a number (either hexadecimal or decimal as appropriate).
If that number is one of the numbers in the first column of the following table, then this is a parse error. Find the row with that number in the first column, and return a character token for the Unicode character given in the second column of that row.
| Number | Unicode character | Character name |
|---|---|---|
| 0x00 | U+FFFD | REPLACEMENT CHARACTER |
| 0x0D | U+000A | LINE FEED (LF) |
| 0x80 | U+20AC | EURO SIGN (€) |
| 0x81 | U+0081 | <control> |
| 0x82 | U+201A | SINGLE LOW-9 QUOTATION MARK (‚) |
| 0x83 | U+0192 | LATIN SMALL LETTER F WITH HOOK (ƒ) |
| 0x84 | U+201E | DOUBLE LOW-9 QUOTATION MARK („) |
| 0x85 | U+2026 | HORIZONTAL ELLIPSIS (…) |
| 0x86 | U+2020 | DAGGER (†) |
| 0x87 | U+2021 | DOUBLE DAGGER (‡) |
| 0x88 | U+02C6 | MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ) |
| 0x89 | U+2030 | PER MILLE SIGN (‰) |
| 0x8A | U+0160 | LATIN CAPITAL LETTER S WITH CARON (Š) |
| 0x8B | U+2039 | SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹) |
| 0x8C | U+0152 | LATIN CAPITAL LIGATURE OE (Œ) |
| 0x8D | U+008D | <control> |
| 0x8E | U+017D | LATIN CAPITAL LETTER Z WITH CARON (Ž) |
| 0x8F | U+008F | <control> |
| 0x90 | U+0090 | <control> |
| 0x91 | U+2018 | LEFT SINGLE QUOTATION MARK (‘) |
| 0x92 | U+2019 | RIGHT SINGLE QUOTATION MARK (’) |
| 0x93 | U+201C | LEFT DOUBLE QUOTATION MARK (“) |
| 0x94 | U+201D | RIGHT DOUBLE QUOTATION MARK (”) |
| 0x95 | U+2022 | BULLET (•) |
| 0x96 | U+2013 | EN DASH (–) |
| 0x97 | U+2014 | EM DASH (—) |
| 0x98 | U+02DC | SMALL TILDE (˜) |
| 0x99 | U+2122 | TRADE MARK SIGN (™) |
| 0x9A | U+0161 | LATIN SMALL LETTER S WITH CARON (š) |
| 0x9B | U+203A | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (›) |
| 0x9C | U+0153 | LATIN SMALL LIGATURE OE (œ) |
| 0x9D | U+009D | <control> |
| 0x9E | U+017E | LATIN SMALL LETTER Z WITH CARON (ž) |
| 0x9F | U+0178 | LATIN CAPITAL LETTER Y WITH DIAERESIS (Ÿ) |
Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER.
Otherwise, return a character token for the Unicode character whose code point is that number. If the number is in the range 0x0001 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.
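The numeric branch described above might be sketched as follows. The function expects pos to point at the U+0023 NUMBER SIGN just after the ampersand and returns a hypothetical triple of (character or None, new position, parse error flag); on failure the position is left at the number sign, which corresponds to unconsuming it.

```python
from typing import Optional, Tuple

# Number -> replacement code point, from the table above.
_REPLACEMENTS = {
    0x00: 0xFFFD, 0x0D: 0x000A, 0x80: 0x20AC, 0x81: 0x0081, 0x82: 0x201A,
    0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021,
    0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x8D: 0x008D, 0x8E: 0x017D, 0x8F: 0x008F, 0x90: 0x0090, 0x91: 0x2018,
    0x92: 0x2019, 0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013,
    0x97: 0x2014, 0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A,
    0x9C: 0x0153, 0x9D: 0x009D, 0x9E: 0x017E, 0x9F: 0x0178,
}


def _is_disallowed(n: int) -> bool:
    """Control characters and noncharacters that are returned but flagged."""
    return (0x0001 <= n <= 0x0008 or 0x000E <= n <= 0x001F
            or 0x007F <= n <= 0x009F or 0xFDD0 <= n <= 0xFDEF
            or n == 0x000B or (n & 0xFFFE) == 0xFFFE)


def consume_numeric_reference(s: str, pos: int) -> Tuple[Optional[str], int, bool]:
    start = pos
    pos += 1                                     # consume "#"
    if pos < len(s) and s[pos] in "xX":
        pos += 1                                 # consume the X
        digits, base = "0123456789abcdefABCDEF", 16
    else:
        digits, base = "0123456789", 10
    first_digit = pos
    while pos < len(s) and s[pos] in digits:
        pos += 1
    if pos == first_digit:
        return None, start, True                 # nothing consumed
    number = int(s[first_digit:pos], base)
    error = False
    if pos < len(s) and s[pos] == ";":
        pos += 1
    else:
        error = True                             # missing semicolon
    if number in _REPLACEMENTS:
        return chr(_REPLACEMENTS[number]), pos, True
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return "\uFFFD", pos, True
    return chr(number), pos, error or _is_disallowed(number)
```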
Consume the maximum number of characters possible, with the consumed characters matching one of the identifiers in the first column of the named character references table (in a case-sensitive manner).
If no match can be made, then this is a parse error. No characters are consumed, and nothing is returned.
If the last character matched is not a U+003B SEMICOLON character (;), there is a parse error.
If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned.
Otherwise, return a character token for the character corresponding to the character reference name (as given by the second column of the named character references table).
If the markup contains I'm &notit; I tell
you, the character reference is parsed as "not", as in,
I'm ¬it; I tell you. But if the markup
was I'm &notin; I tell you, the
character reference would be parsed as "notin;", resulting in
I'm ∉ I tell you.
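The longest-match behavior in this example can be sketched with the entity table shipped in Python's standard library. Using html.entities.html5 as a stand-in for the named character references table is an assumption; its contents may differ in detail from the table this section refers to. The function name and return shape (characters or None, new position, parse error flag) are hypothetical, and pos is the index just after the ampersand.

```python
import string
from html.entities import html5
from typing import Optional, Tuple

_MAX_NAME_LENGTH = max(len(name) for name in html5)
_ALPHANUMERIC = set(string.ascii_letters + string.digits)


def consume_named_reference(
    s: str, pos: int, in_attribute: bool = False
) -> Tuple[Optional[str], int, bool]:
    # Try the longest candidate first so that "notin;" wins over "not".
    for length in range(min(_MAX_NAME_LENGTH, len(s) - pos), 0, -1):
        name = s[pos:pos + length]
        if name in html5:          # dictionary keys are case-sensitive
            break
    else:
        return None, pos, True     # no match: parse error, nothing consumed
    end = pos + length
    error = not name.endswith(";")
    if (in_attribute and error
            and end < len(s) and s[end] in _ALPHANUMERIC):
        return None, pos, error    # historical rule: unconsume everything
    return html5[name], end, error
```

With the markup from the example, "notit; I tell you" yields the NOT SIGN and leaves "it; I tell you" unconsumed, while "notin; I tell you" consumes the whole reference.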