This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").
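Documents must consist of the following parts, in the given order:
1. Any number of comments and space characters.
2. A DOCTYPE.
3. Any number of comments and space characters.
4. The root element, in the form of an html element.
5. Any number of comments and space characters.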
The various types of content mentioned above are described in the next few sections.
In addition, there are some restrictions on how character encoding declarations are to be serialised, as discussed in the section on that topic.
The U+0000 NULL character must not appear anywhere in a document.
Space characters before the root html element will be dropped when the document is
parsed; space characters after the root html element will be parsed as if they were at the
end of the html element. Thus, space
characters around the root element do not round-trip. It is suggested that
newlines be inserted after the DOCTYPE and any comments that aren't in the
root element.
A DOCTYPE is a mostly useless, but required, header.
DOCTYPEs are required for legacy reasons. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.
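A DOCTYPE must consist of the following characters, in this order:
1. A U+003C LESS-THAN SIGN (<) character.
2. A U+0021 EXCLAMATION MARK (!) character.
3. The string "DOCTYPE", case-insensitively.
4. One or more space characters.
5. The string "HTML", case-insensitively.
6. Zero or more space characters.
7. A U+003E GREATER-THAN SIGN (>) character.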
In other words, <!DOCTYPE HTML>,
case-insensitively.
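As a non-normative illustration, the DOCTYPE syntax can be checked with a regular expression. This is a minimal sketch: the name `is_conforming_doctype` is hypothetical, and the sketch assumes that "space characters" means tab, LF, VT, FF, and space, with at least one space character between "DOCTYPE" and "HTML" and optional space characters before the closing ">".

```python
import re

# Assumption: "space characters" are tab, LF, VT, FF, and space.
_SPACE = "\t\n\x0b\x0c "

# re.ASCII keeps the case-insensitive comparison limited to ASCII letters.
_DOCTYPE = re.compile(
    "<!DOCTYPE[" + _SPACE + "]+HTML[" + _SPACE + "]*>",
    re.IGNORECASE | re.ASCII,
)


def is_conforming_doctype(markup: str) -> bool:
    """Return True if markup is exactly one DOCTYPE in the form sketched above."""
    return _DOCTYPE.fullmatch(markup) is not None
```

Legacy DOCTYPEs with public identifiers are rejected by this sketch, since only the short form is described here.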
There are four different kinds of elements: void elements, CDATA elements, RCDATA elements, and normal elements.
Void elements: base, link, meta, hr, br, img, embed, param, area, col, input
CDATA elements: style, script
RCDATA elements: title, textarea
Normal elements: all other allowed HTML elements
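The four kinds of elements can be modelled as a simple lookup, sketched below for illustration only. The function name `element_kind` is hypothetical. The grouping assumes that the element list above reads as void elements (base through input), CDATA elements (style, script), and RCDATA elements (title, textarea), with every other element treated as normal.

```python
# Assumption: this grouping is inferred from the order of the list above.
VOID_ELEMENTS = frozenset({
    "base", "link", "meta", "hr", "br", "img",
    "embed", "param", "area", "col", "input",
})
CDATA_ELEMENTS = frozenset({"style", "script"})
RCDATA_ELEMENTS = frozenset({"title", "textarea"})


def element_kind(tag_name: str) -> str:
    """Classify a tag name as 'void', 'cdata', 'rcdata', or 'normal'."""
    name = tag_name.lower()  # tag names are case-insensitive
    if name in VOID_ELEMENTS:
        return "void"
    if name in CDATA_ELEMENTS:
        return "cdata"
    if name in RCDATA_ELEMENTS:
        return "rcdata"
    return "normal"
```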
Tags are used to delimit the start and end of elements in the markup. CDATA, RCDATA, and normal elements have a start tag to indicate where they begin, and an end tag to indicate where they end. The start and end tags of certain normal elements can be omitted, as described later. Those that cannot be omitted must not be omitted. Void elements only have a start tag; end tags must not be specified for void elements.
The contents of the element must be placed between the start tag (which might be implied, in certain cases) and the end tag (which, again, might be implied in certain cases). The exact allowed contents of each individual element depend on the content model of that element, as described earlier in this specification. Elements must not contain content that their content model disallows. In addition to the restrictions placed on the contents by those content models, however, the four types of elements have additional syntactic requirements.
Void elements can't have any contents (since there's no end tag, no content can be put between the start tag and the end tag).
CDATA elements can have text, though it has restrictions described below.
RCDATA elements can have text and character entity references, but the text must not contain an ambiguous ampersand. There are also further restrictions described below.
Normal elements can have text, character entity references,
other elements, and comments, but the text must
not contain the character U+003C LESS-THAN SIGN (<) or an
ambiguous
ampersand. Some normal elements also have yet more restrictions on what content
they are allowed to hold, beyond the restrictions imposed by the content
model and those described in this paragraph. Those restrictions are
described below.
Tags contain a tag name,
giving the element's name. HTML elements all have names that only use
characters in the range U+0030 DIGIT ZERO .. U+0039 DIGIT NINE, U+0061
LATIN SMALL LETTER A .. U+007A LATIN SMALL LETTER Z, or, in uppercase,
U+0041 LATIN CAPITAL LETTER A .. U+005A LATIN CAPITAL LETTER Z, and
U+002D HYPHEN-MINUS (-). In the HTML
syntax, tag names may be written with any mix of lower- and uppercase
letters that, when converted to all-lowercase, matches the element's tag
name; tag names are case-insensitive.
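Start tags must have the following format:
1. The first character of a start tag must be a U+003C LESS-THAN SIGN (<).
2. The next few characters of a start tag must be the element's tag name.
3. If there are to be any attributes in the next step, there must first be one or more space characters.
4. Then, the start tag may have a number of attributes, the syntax for which is described below. Attributes may be separated from each other by one or more space characters.
5. After the attributes, there may be one or more space characters. (Some attributes are required to be followed by a space. See the attributes section below.)
6. Then, if the element is one of the void elements, there may be a single U+002F SOLIDUS (/) character. This character has no effect except to appease the markup gods. As this character is therefore just a symbol of faith, atheists should omit it.
7. Finally, start tags must be closed by a U+003E GREATER-THAN SIGN (>) character.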
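End tags must have the following format:
1. The first character of an end tag must be a U+003C LESS-THAN SIGN (<).
2. The second character of an end tag must be a U+002F SOLIDUS (/).
3. The next few characters of an end tag must be the element's tag name.
4. After the tag name, there may be one or more space characters.
5. Finally, end tags must be closed by a U+003E GREATER-THAN SIGN (>) character.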
Attributes for an element are expressed inside the element's start tag.
Attributes have a name and a value. Attribute names must consist of one character other than the space characters, U+003E GREATER-THAN SIGN (>), and U+002F SOLIDUS (/), followed by zero or more characters other than the space characters, U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=). In the HTML syntax, attribute names may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the attribute's name; attribute names are case-insensitive.
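The attribute name rule maps directly onto a pair of negated character classes. The sketch below is illustrative only; `is_conforming_attribute_name` is a hypothetical name, and the set of "space characters" (tab, LF, VT, FF, space) is an assumption.

```python
import re

# Assumption: "space characters" are tab, LF, VT, FF, and space.
# First character: anything except space characters, ">" and "/".
# Remaining characters: additionally exclude "=".
_ATTRIBUTE_NAME = re.compile("[^\t\n\x0b\x0c >/][^\t\n\x0b\x0c >/=]*")


def is_conforming_attribute_name(name: str) -> bool:
    """Return True if name satisfies the syntactic rule for attribute names."""
    return _ATTRIBUTE_NAME.fullmatch(name) is not None
```

Note that, as the rule is written, U+003D EQUALS SIGN (=) is excluded only from the second character onwards, so the sketch accepts a name that begins with "=".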
Attribute values are a mixture of text and character entity references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax: Just the attribute name.
In the following example, the disabled attribute is given with the
empty attribute syntax:
<input disabled>
If an attribute using the empty attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Unquoted attribute value syntax: The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN (=) character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters or U+003E GREATER-THAN SIGN (>) characters, and must not, furthermore, start with either a literal U+0022 QUOTATION MARK (") character or a literal U+0027 APOSTROPHE (') character.
In the following example, the value attribute is given with the
unquoted attribute value syntax:
<input value=yes>
If an attribute using the unquoted attribute syntax is to be followed
by another attribute or by one of the optional U+002F SOLIDUS
(/) characters allowed in step 6 of the start tag syntax above, then there must be
a space character separating the two.
Single-quoted attribute value syntax: The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN (=) character, followed by zero or more space characters, followed by a single U+0027 APOSTROPHE (') character, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0027 APOSTROPHE (') characters, and finally followed by a second single U+0027 APOSTROPHE (') character.
In the following example, the type
attribute is given with the single-quoted attribute value syntax:
<input type='checkbox'>
Double-quoted attribute value syntax: The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN (=) character, followed by zero or more space characters, followed by a single U+0022 QUOTATION MARK (") character, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK (") characters, and finally followed by a second single U+0022 QUOTATION MARK (") character.
In the following example, the name
attribute is given with the double-quoted attribute value syntax:
<input name="be evil">
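A markup generator has to pick one of the four syntaxes for each attribute. The following sketch shows one possible selection strategy; it is not the only conforming one. The name `serialize_attribute` is hypothetical, a value of `None` stands for "no value", the set of "space characters" is an assumption, and the value is assumed to have had any ambiguous ampersands escaped already.

```python
from typing import Optional

# Assumption: "space characters" are tab, LF, VT, FF, and space.
SPACE_CHARACTERS = "\t\n\x0b\x0c "


def serialize_attribute(name: str, value: Optional[str]) -> str:
    """Return one attribute in the simplest syntax that can carry its value."""
    if value is None:
        # Empty attribute syntax: just the attribute name.
        return name
    unquoted_ok = (
        value != ""  # conservative choice: never emit a bare "name="
        and not any(ch in SPACE_CHARACTERS or ch == ">" for ch in value)
        and value[0] not in "\"'"
    )
    if unquoted_ok:
        return f"{name}={value}"
    if '"' not in value:
        return f'{name}="{value}"'
    if "'" not in value:
        return f"{name}='{value}'"
    # Both quote characters occur: escape the double quotes with a
    # character entity reference and use the double-quoted syntax.
    return '{}="{}"'.format(name, value.replace('"', "&quot;"))
```

The caller remains responsible for the separating space character that is required after an empty or unquoted attribute when another attribute or a U+002F SOLIDUS (/) follows.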
Certain tags can be omitted.
An html element's start tag may be omitted if the first thing
inside the html element is not a space character or a comment.
An html element's end tag may be omitted if the html element is not immediately followed by a space character or a comment.
A head element's start tag may be omitted if the first thing
inside the head element is an element.
A head element's end tag may be omitted if the head element is not immediately followed by a space character or a comment.
A body element's start tag may be omitted if the first thing
inside the body element is not a space character or a comment, except if the first thing inside the
body element is a script or style
element.
A body element's end tag may be omitted if the body element is not immediately followed by a space character or a comment.
An li element's end tag may be omitted if the li element is immediately followed by another
li element or if there is no more content
in the parent element.
A dt element's end tag may be omitted if the dt element is immediately followed by another
dt element or a dd element.
A dd element's end tag may be omitted if the dd element is immediately followed by another
dd element or a dt element, or if there is no more content in the
parent element.
A p element's end tag may be omitted if the p element is immediately followed by an address, blockquote, dl, fieldset, form,
h1, h2,
h3, h4,
h5, h6,
hr, menu,
ol, p,
pre, table, or ul
element, or if there is no more content in the parent element.
An optgroup element's end
tag may be omitted if the optgroup element is
immediately followed by another optgroup element, or if there
is no more content in the parent element.
An option element's end
tag may be omitted if the option element is
immediately followed by another option element, or if there
is no more content in the parent element.
A colgroup element's start tag may be omitted if the first thing
inside the colgroup element is a
col element, and if the element is not
immediately preceded by another colgroup element whose end tag has been omitted.
A colgroup element's end tag may be omitted if the colgroup element is not immediately followed
by a space character or a comment.
A thead element's end tag may be omitted if the thead element is immediately followed by a
tbody or tfoot element.
A tbody element's start tag may be omitted if the first thing
inside the tbody element is a tr element, and if the element is not immediately
preceded by a tbody, thead, or tfoot element whose end tag has been omitted.
A tbody element's end tag may be omitted if the tbody element is immediately followed by a
tbody or tfoot element, or if there is no more content in
the parent element.
A tfoot element's end tag may be omitted if the tfoot element is immediately followed by a
tbody element, or if there is no more
content in the parent element.
A tr element's end tag may be omitted if the tr element is immediately followed by another
tr element, or if there is no more content
in the parent element.
A td element's end tag may be omitted if the td element is immediately followed by a td or th element, or
if there is no more content in the parent element.
A th element's end tag may be omitted if the th element is immediately followed by a td or th element, or
if there is no more content in the parent element.
However, a start tag must never be omitted if it has any attributes.
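Each omission rule above is a predicate over what immediately follows (or begins) the element. As an illustration, the rule for the p element's end tag could be sketched as below. The name `can_omit_p_end_tag` and the convention for its argument are hypothetical: `None` means there is no more content in the parent element, and any other value is the name of the node that immediately follows, with non-element nodes represented by a marker such as "#text".

```python
from typing import Optional

# Elements that, when they immediately follow a p element, allow its
# end tag to be omitted (taken from the rule for p above).
P_END_TAG_OMISSIBLE_BEFORE = frozenset({
    "address", "blockquote", "dl", "fieldset", "form",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "hr", "menu", "ol", "p", "pre", "table", "ul",
})


def can_omit_p_end_tag(following: Optional[str]) -> bool:
    """Return True if a p element's end tag may be omitted.

    following is None when there is no more content in the parent element;
    otherwise it names the node immediately after the p element.
    """
    if following is None:
        return True
    return following.lower() in P_END_TAG_OMISSIBLE_BEFORE
```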
For historical reasons, certain elements have extra restrictions beyond even the restrictions given by their content model.
A p element must not contain blockquote, dl, menu, ol, pre, table, or ul
elements, even though these elements are technically allowed inside
p elements according to the content models
described in this specification. (In fact, if one of those elements is put
inside a p element in the markup, it will
instead imply a p element end tag before
it.)
An optgroup element must not contain optgroup
elements, even though these elements are technically allowed to be nested
according to the content models described in this specification. (If an
optgroup element is put inside another in the markup, it will
in fact imply an optgroup end tag before it.)
A table element must not contain
tr elements, even though these elements are
technically allowed inside table
elements according to the content models described in this specification.
(If a tr element is put inside a table in the markup, it will in fact imply a
tbody start tag before it.)
A single U+000A LINE FEED (LF) character may be placed immediately after
the start tag of pre and textarea elements. This does
not affect the processing of the element. The otherwise optional U+000A
LINE FEED (LF) character must be included if the element's
contents start with that character (because otherwise the leading newline
in the contents would be treated like the optional newline, and ignored).
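For a serialiser, this rule amounts to doubling a leading newline. The sketch below illustrates the idea; the names `protect_leading_newline` and `serialize_pre` are hypothetical, and the contents are assumed to be otherwise conforming text that needs no further escaping.

```python
def protect_leading_newline(contents: str) -> str:
    """Prefix the optional LF when the contents themselves start with LF.

    Without the extra LF, the first newline of the contents would be taken
    for the optional one and ignored.
    """
    if contents.startswith("\n"):
        return "\n" + contents
    return contents


def serialize_pre(contents: str) -> str:
    """Wrap already-conforming text in a pre element."""
    return "<pre>" + protect_leading_newline(contents) + "</pre>"
```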
The text in CDATA and RCDATA elements must not contain any occurrences of
the string "</" (U+003C LESS-THAN SIGN, U+002F
SOLIDUS) followed by characters that case-insensitively match
the tag name of the element followed by one of U+0009 CHARACTER
TABULATION, U+000A LINE FEED (LF), U+000B LINE TABULATION, U+000C FORM
FEED (FF), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or U+002F SOLIDUS
(/), unless that string is part of an escaping text span.
An escaping text span is a span of text (in CDATA and RCDATA elements) and character entity references (in RCDATA elements) that starts with an escaping text span start that is not itself in an escaping text span, and ends at the next escaping text span end.
An escaping text span start is a part of text that consists of the four-character sequence "<!--" (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS).
An escaping text span end is a part of text that consists of the three-character sequence "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
An escaping text span start may share its U+002D HYPHEN-MINUS characters with its corresponding escaping text span end.
The text in CDATA and RCDATA elements must not have an escaping text span start that is not followed by an escaping text span end.
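The two requirements on text in CDATA and RCDATA elements can be checked with a single left-to-right scan, sketched below as a non-normative illustration. The name `cdata_text_is_conforming` is hypothetical. Because a span start may share its hyphens with its end, the search for "-->" begins two characters after the start of "<!--".

```python
import re


def cdata_text_is_conforming(text: str, tag_name: str) -> bool:
    """Check the end-tag and escaping-text-span rules for CDATA/RCDATA text."""
    # "</" + tag name + one of TAB, LF, VT, FF, SPACE, ">" or "/".
    end_tag = re.compile(
        "</" + re.escape(tag_name) + "[\t\n\x0b\x0c >/]",
        re.IGNORECASE | re.ASCII,
    )
    pos = 0
    while True:
        start = text.find("<!--", pos)
        outside_end = len(text) if start == -1 else start
        # Outside an escaping text span the end-tag string is forbidden.
        if end_tag.search(text, pos, outside_end):
            return False
        if start == -1:
            return True
        # The span end may reuse the two hyphens of the span start.
        close = text.find("-->", start + 2)
        if close == -1:
            return False  # span start with no following span end
        pos = close + 3
```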
Text is allowed inside elements, attributes, and comments. Text must consist of valid Unicode characters other than U+0000. Text should not contain control characters other than space characters. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections.
Newlines in HTML may be represented as U+000D CARRIAGE RETURN (CR) characters, as U+000A LINE FEED (LF) characters, or as pairs of U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF) characters in that order.
In certain cases described in other sections, text may be mixed with character entity references. These can be used to escape characters that couldn't otherwise legally be included in text.
Character entity references must start with a U+0026 AMPERSAND (&). Following this, there are three possible kinds of character entity references:
Named entities: The ampersand must be followed by the name of an entity, terminated by a U+003B SEMICOLON (;) character.
Decimal numeric entities: The ampersand must be followed by a U+0023 NUMBER SIGN (#) character, followed by one or more digits in the range U+0030 DIGIT ZERO .. U+0039 DIGIT NINE, representing a base-ten integer that itself is a valid Unicode code point that is not U+0000, U+000D, in the range U+0080 .. U+009F, or in the range 0xD800 .. 0xDFFF (surrogates). The digits must then be followed by a U+003B SEMICOLON character (;).
Hexadecimal numeric entities: The ampersand must be followed by a U+0023 NUMBER SIGN (#) character, which must be followed by either a U+0078 LATIN SMALL LETTER X or a U+0058 LATIN CAPITAL LETTER X character, which must then be followed by one or more digits in the range U+0030 DIGIT ZERO .. U+0039 DIGIT NINE, U+0061 LATIN SMALL LETTER A .. U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A .. U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that itself is a valid Unicode code point that is not U+0000, U+000D, in the range U+0080 .. U+009F, or in the range 0xD800 .. 0xDFFF (surrogates). The digits must then be followed by a U+003B SEMICOLON character (;).
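The decimal and hexadecimal forms can be validated mechanically. The sketch below is illustrative only; `is_conforming_numeric_reference` is a hypothetical name, and "valid Unicode code point" is assumed to mean a value no greater than 0x10FFFF.

```python
import re

# "&#" + decimal digits + ";"  or  "&#x" / "&#X" + hexadecimal digits + ";"
_NUMERIC_REFERENCE = re.compile(r"&#(?:([0-9]+)|[xX]([0-9A-Fa-f]+));")


def is_conforming_numeric_reference(ref: str) -> bool:
    """Return True if ref is a decimal or hexadecimal reference that is allowed."""
    match = _NUMERIC_REFERENCE.fullmatch(ref)
    if match is None:
        return False
    decimal, hexadecimal = match.groups()
    if decimal is not None:
        code_point = int(decimal, 10)
    else:
        code_point = int(hexadecimal, 16)
    if code_point > 0x10FFFF:  # assumption: upper bound of Unicode
        return False
    if code_point in (0x0000, 0x000D):
        return False
    if 0x0080 <= code_point <= 0x009F:
        return False
    if 0xD800 <= code_point <= 0xDFFF:  # surrogates
        return False
    return True
```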
An ambiguous
ampersand is a U+0026 AMPERSAND (&) character that
is not the last character in the file, that is not followed by a space character, that is not followed by a start tag
that has not been omitted, and that is not followed by another U+0026
AMPERSAND (&) character.
Comments must start with
the four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION
MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may have text, with the additional restriction
that the text must not contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a U+002D HYPHEN-MINUS (-) character. Finally, the comment must be ended by the
three character sequence U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E
GREATER-THAN SIGN (-->).
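The comment rules translate into a short check, sketched here for illustration. The name `is_conforming_comment` is hypothetical. The length test ensures that the opening and closing sequences do not overlap, so "<!-->" is not accepted as a comment.

```python
def is_conforming_comment(markup: str) -> bool:
    """Return True if markup is a single comment that follows the rules above."""
    # "<!--" (4 characters) and "-->" (3 characters) must not overlap.
    if len(markup) < 7:
        return False
    if not (markup.startswith("<!--") and markup.endswith("-->")):
        return False
    text = markup[4:-3]
    if "--" in text:  # no two consecutive hyphens
        return False
    if text.endswith("-"):  # must not end with a hyphen
        return False
    return True
```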