This section only applies to user agents, data mining tools, and conformance checkers.
The rules for parsing XML documents (and thus XHTML documents) into DOM trees are covered by the XML and Namespaces in XML specifications, and are out of scope of this specification. [XML] [XMLNS]
For HTML documents, user agents must use the parsing rules described in this section to generate the DOM trees. Together, these rules define what is referred to as the HTML parser.
While the HTML form of HTML5 bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed Web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.
Authors interested in using SGML tools in their authoring pipeline are encouraged to use XML tools and the XML serialisation of HTML5.
This specification defines the parsing rules for HTML documents, whether they are syntactically valid or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined: user agents must either act as described below when encountering such problems, or must abort processing at the first error that they encounter for which they do not wish to apply the rules described below.
Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error conditions exist in the document. Conformance checkers are not required to recover from parse errors.
Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.
The input to the HTML parsing process consists of a stream of Unicode
characters, which is passed through a tokenisation stage (lexical analysis) followed
by a tree construction stage (semantic
analysis). The output is a Document object.
Implementations that do not support
scripting do not have to actually create a DOM Document
object, but the DOM tree in such cases is still used as the model for the
rest of the specification.
In the common case, the data handled by the tokenisation stage comes
from the network, but it can also come from script, e.g. using the document.write() API.
There is only one set of state for the tokeniser stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokeniser might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.
In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" start tag token:
...
<script>
document.write('<p>');
</script>
...
The stream of Unicode characters that consists the input to the tokenisation stage will be initially seen by the user agent as a stream of bytes (typically coming over the network or from the local file system). The bytes encode the actual characters according to a particular character encoding, which the user agent must use to decode the bytes into characters.
For XML documents, the algorithm user agents must use to determine the character encoding is given by the XML specification. This section does not apply to XML documents. [XML]
In some cases, it might be impractical to unambiguously determine the encoding before parsing the document. Because of this, this specification provides for a two-pass mechanism with an optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing algorithm to whatever bytes they have available before beginning to parse the document. Then, the real parser is started, using a tentative encoding derived from this pre-parse and other out-of-band metadata. If, while the document is being loaded, the user agent discovers an encoding declaration that conflicts with this information, then the parser can get reinvoked to perform a parse of the document with the real encoding.
User agents must use the following algorithm (the encoding sniffing algorithm) to determine the character encoding to use when decoding a document in the first pass. This algorithm takes as input any out-of-band metadata available to the user agent (e.g. the Content-Type metadata of the document) and all the bytes available so far, and returns an encoding and a confidence. The confidence is either tentative or certain. The encoding used, and whether the confidence in that encoding is tentative or confident, is used during the parsing to determine whether to change the encoding.
If the transport layer specifies an encoding, return that encoding with the confidence certain, and abort these steps.
The user agent may wait for more bytes of the resource to be available, either in this step or at any later step in this algorithm. For instance, a user agent might wait 500ms or 512 bytes, whichever came first. In general preparsing the source to find the encoding improves performance, as it reduces the need to throw away the data structures used when parsing upon finding the encoding information. However, if the user agent delays too long to obtain data to determine the encoding, then the cost of the delay could outweigh any performance improvements from the preparse.
For each of the rows in the following table, starting with the first one and going down, if there are as many or more bytes available than the number of bytes in the first column, and the first bytes of the file match the bytes given in the first column, then return the encoding given in the cell in the second column of that row, with the confidence certain, and abort these steps:
| Bytes in Hexadecimal | Description |
|---|---|
| FE FF | UTF-16BE BOM |
| FF FE | UTF-16LE BOM |
| EF BB BF | UTF-8 BOM |
Otherwise, the user agent will have to search for explicit character encoding information in the file itself. This should proceed as follows:
Let position be a pointer to a byte in the input stream, initially pointing at the first byte. If at any point during these substeps the user agent either runs out of bytes or decides that scanning further bytes would not be efficient, then skip to the next step of the overall character encoding detection algorithm. User agents may decide that scanning any bytes is not efficient, in which case these substeps are entirely skipped.
Now, repeat the following "two" steps until the algorithm aborts (either because user agent aborts, as described above, or because a character encoding is found):
If position points to:
Advance the position pointer so that it points at the first 0x3E byte which is preceeded by two 0x2D bytes (i.e. at the end of an ASCII '-->' sequence) and comes after the 0x3C byte that was found. (The two 0x2D bytes can be the same as the those in the '<!--' sequence.)
Advance the position pointer so that it points at the next 0x09, 0x0A, 0x0B, 0x0C, 0x0D, or 0x20 byte (the one in sequence of characters matched above).
Get an attribute and its value. If no attribute was sniffed, then skip this inner set of steps, and jump to the second step in the overall "two step" algorithm.
Examine the attribute's name:
If the attribute's value is a supported character encoding, then return the given encoding, with confidence tentative, and abort all these steps. Otherwise, do nothing with this attribute, and continue looking for other attributes.
The attribute's value is now parsed.
Do nothing with that attribute.
Return to step 1 in these inner steps.
Advance the position pointer so that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0B (ASCII VT), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), 0x3E (ASCII '>'), 0x3C (ASCII '<') byte.
If the pointer points to a 0x3C (ASCII '<') byte, then return to the first step in the overall "two step" algorithm.
Repeatedly get an attribute until no further attributes can be found, then jump to the second step in the overall "two step" algorithm.
Advance the position pointer so that it points at the first 0x3E byte (ASCII '>') that comes after the 0x3C byte that was found.
Do nothing with that byte.
When the above "two step" algorithm says to get an attribute, it means doing this:
If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0B (ASCII VT), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x2F (ASCII '/') then advance position to the next byte and start over.
If the byte at position is 0x3C (ASCII '<'), then move position back to the previous byte, and stop looking for an attribute. There isn't one.
If the byte at position is 0x3E (ASCII '>'), then stop looking for an attribute. There isn't one.
Otherwise, the byte at position is the start of the attribute name. Let attribute name and attribute value be the empty string.
Attribute name: Process the byte at position as follows:
Advance position to the next byte and return to the previous step.
Spaces. If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0B (ASCII VT), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then advance position to the next byte, then, repeat this step.
If the byte at position is not 0x3D (ASCII '='), stop looking for an attribute. Move position back to the previous byte. The attribute's name is the value of attribute name, its value is the empty string.
Advance position past the 0x3D (ASCII '=') byte.
Value. If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0B (ASCII VT), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then advance position to the next byte, then, repeat this step.
Process the byte at position as follows:
Process the byte at position as follows:
Advance position to the next byte and return to the previous step.
For the sake of interoperability, user agents should not use a pre-scan algorithm that returns different results than the one described above. (But, if you do, please at least let us know, so that we can improve this algorithm and benefit everyone...)
If the user agent has information on the likely encoding for this page, e.g. based on the encoding of the page when it was last visited, then return that encoding, with the confidence tentative, and abort these steps.
The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream. If autodetection succeeds in determining a character encoding, then return that encoding, with the confidence tentative, and abort these steps. [UNIVCHARDET]
Otherwise, return an implementation-defined or user-specified default
character encoding, with the confidence tentative. Due
to its use in legacy content, windows-1252 is
recommended as a default in predominantly Western demographics. In
non-legacy environments, the more comprehensive UTF-8 encoding is recommended instead. Since these
encodings can in many cases be distinguished by inspection, a user agent
may heuristically decide which to use as a default.
User agents must at a minimum support the UTF-8 and Windows-1252 encodings, but may support more.
It is not unusual for Web browsers to support dozens if not upwards of a hundred distinct character encodings.
User agents must support the preferred MIME name of every character encoding they support that has a preferred MIME name, and should support all the IANA-registered aliases. [IANACHARSET]
When a user agent would otherwise use the ISO-8859-1 encoding, it must instead use the Windows-1252 encoding. User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings. [CESU8] [UTF7] [BOCU1] [SCSU]
Support for UTF-32 is not recommended. This encoding is rarely used, and frequently misimplemented.
Given an encoding, the bytes in the input stream must be converted to Unicode characters for the tokeniser, as described by the rules for that encoding, except that leading U+FEFF BYTE ORDER MARK characters must not be stripped by the encoding layer.
Bytes or sequences of bytes in the original byte stream that could not be converted to Unicode characters must be converted to U+FFFD REPLACEMENT CHARACTER code points.
One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.
All U+0000 NULL characters in the input must be replaced by U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse error.
U+000D CARRIAGE RETURN (CR) characters, and U+000A LINE FEED (LF) characters, are treated specially. Any CR characters that are followed by LF characters must be removed, and any CR characters not followed by LF characters must be converted to LF characters. Thus, newlines in HTML DOMs are represented by LF characters, and there are never any CR characters in the input to the tokenisation stage.
The next input character is the first character in the input stream that has not yet been consumed. Initially, the next input character is the first character in the input.
The insertion point is the position (just before
a character or just before the end of the input stream) where content
inserted using document.write() is actually
inserted. The insertion point is relative to the position of the character
immediately after it, it is not an absolute offset into the input stream.
Initially, the insertion point is uninitialised.
The "EOF" character in the tables below is a conceptual character
representing the end of the input stream. If the
parser is a script-created parser, then the
end of the input stream is reached when an explicit "EOF" character (inserted by the document.close()
method) is consumed. Otherwise, the "EOF" character is not a real
character in the stream, but rather the lack of any further characters.
When the parser requires the user agent to change the encoding, it must run the following steps. This might happen if the encoding sniffing algorithm described above failed to find an encoding, or if it found an encoding that was not the actual encoding of the file.
While the invocation of this algorithm is not a parse error, it is still indicative of non-conforming content.