This is a snapshot of an early working draft and has therefore been superseded by the HTML standard.

This document will not be further updated.

HTML 5

Call For Comments — 27 October 2007

4.9. Determining the type of a new resource in a browsing context

It is imperative that the rules in this section be followed exactly. When two user agents use different heuristics for content type detection, security problems can occur. For example, if a server believes a contributed file to be an image (and thus benign), but a Web browser believes the content to be HTML (and thus capable of executing script), the end user can be exposed to malicious content, making the user vulnerable to cookie theft attacks and other cross-site scripting attacks.

The sniffed type of a resource must be found as follows:

  1. If the resource was fetched over an HTTP protocol, and there is no HTTP Content-Encoding header, but there is an HTTP Content-Type header and it has a value whose bytes exactly match one of the following three lines:

    Bytes in Hexadecimal | Textual representation
    74 65 78 74 2f 70 6c 61 69 6e | text/plain
    74 65 78 74 2f 70 6c 61 69 6e 3b 20 63 68 61 72 73 65 74 3d 49 53 4f 2d 38 38 35 39 2d 31 | text/plain; charset=ISO-8859-1
    74 65 78 74 2f 70 6c 61 69 6e 3b 20 63 68 61 72 73 65 74 3d 69 73 6f 2d 38 38 35 39 2d 31 | text/plain; charset=iso-8859-1

    ...then jump to the text or binary section below.

  2. Let official type be the type given by the Content-Type metadata for the resource (in lowercase, ignoring any parameters). If there is no such type, jump to the unknown type step below.

  3. If official type is "unknown/unknown" or "application/unknown", jump to the unknown type step below.

  4. If official type ends in "+xml", or if it is either "text/xml" or "application/xml", then the sniffed type of the resource is official type; return that and abort these steps.

  5. If official type is an image type supported by the user agent (e.g. "image/png", "image/gif", or "image/jpeg"), then jump to the images section below.

  6. If official type is "text/html", then jump to the feed or HTML section below.

  7. Otherwise, the sniffed type of the resource is official type.
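The steps above amount to a small dispatcher. The following Python sketch is one possible reading, not a reference implementation: the function name `choose_route`, its parameters, the `("sniffed", type)` / `("jump", section)` return convention, and the `SUPPORTED_IMAGE_TYPES` set are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# The three header values from step 1, compared byte for byte.
PLAIN_TEXT_VALUES = {
    b"text/plain",
    b"text/plain; charset=ISO-8859-1",
    b"text/plain; charset=iso-8859-1",
}

# Hypothetical stand-in for "image types supported by the user agent".
SUPPORTED_IMAGE_TYPES = {"image/png", "image/gif", "image/jpeg", "image/bmp"}


def choose_route(content_type: Optional[bytes],
                 over_http: bool = True,
                 has_content_encoding: bool = False) -> Tuple[str, str]:
    """Return ("sniffed", type) or ("jump", name of the section to run next)."""
    # Step 1: exact byte match on the raw Content-Type header value.
    if (over_http and not has_content_encoding
            and content_type in PLAIN_TEXT_VALUES):
        return ("jump", "text or binary")
    # Step 2: official type is lowercased, with parameters ignored.
    if content_type is None:
        return ("jump", "unknown type")
    official = content_type.split(b";", 1)[0].strip().lower().decode("latin-1")
    if "/" not in official:  # Assumption: uninterpretable value means no type.
        return ("jump", "unknown type")
    # Step 3
    if official in ("unknown/unknown", "application/unknown"):
        return ("jump", "unknown type")
    # Step 4
    if official.endswith("+xml") or official in ("text/xml", "application/xml"):
        return ("sniffed", official)
    # Step 5
    if official in SUPPORTED_IMAGE_TYPES:
        return ("jump", "images")
    # Step 6
    if official == "text/html":
        return ("jump", "feed or HTML")
    # Step 7
    return ("sniffed", official)
```

Note that a value such as `text/plain; charset=UTF-8` does not match any of the three byte strings in step 1, so in this sketch it falls through to step 7 and is reported as "text/plain" without further sniffing.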

4.9.1. Content-Type sniffing: text or binary

  1. The user agent may wait for 512 or more bytes of the resource to be available.

  2. Let n be the smaller of either 512 or the number of bytes already available.

  3. If n is 4 or more, and the first bytes of the file match one of the following byte sets:

    Bytes in Hexadecimal | Description
    FE FF | UTF-16BE BOM
    FF FE | UTF-16LE BOM or UTF-32LE BOM
    00 00 FE FF | UTF-32BE BOM
    EF BB BF | UTF-8 BOM

    ...then the sniffed type of the resource is "text/plain".

Should we remove UTF-32 from the above?
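As a rough illustration of steps 2 and 3, the byte-order-mark check can be sketched in Python as follows. The function name `sniff_bom` is hypothetical, and returning `None` when the step does not decide the type is an assumption of this sketch.

```python
from typing import Optional

# Byte sequences from the table in step 3.
BOMS = (b"\xFE\xFF", b"\xFF\xFE", b"\x00\x00\xFE\xFF", b"\xEF\xBB\xBF")


def sniff_bom(available: bytes) -> Optional[str]:
    # Step 2: n is the smaller of 512 and the number of bytes available.
    n = min(512, len(available))
    # Step 3: only applies when at least four bytes are available.
    if n >= 4 and available.startswith(BOMS):
        return "text/plain"
    return None  # Not decided by this step.
```

For example, `sniff_bom(b"\xEF\xBB\xBFhello")` yields "text/plain", while a two-byte resource consisting only of `FF FE` is not decided here because n is less than 4.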

4.9.2. Content-Type sniffing: unknown type

  1. The user agent may wait for 512 or more bytes of the resource to be available.

  2. Let stream length be the smaller of either 512 or the number of bytes already available.

  3. For each row in the table below:

    If the row has no "WS" bytes:
    1. Let pattern length be the length of the pattern (number of bytes described by the cell in the second column of the row).
    2. If stream length is smaller than pattern length then skip this row.
    3. Apply the "and" operator to the first pattern length bytes of the resource and the given mask (the bytes in the cell of the first column of that row), and let the result be the data.
    4. If the bytes of the data match the given pattern bytes exactly, then the sniffed type of the resource is the type given in the cell of the third column in that row; abort these steps.
    If the row has a "WS" byte:
    1. Let indexpattern be an index into the mask and pattern byte strings of the row.

    2. Let indexstream be an index into the byte stream being examined.

    3. Loop: If indexstream points beyond the end of the byte stream, then this row doesn't match, skip this row.

    4. Examine the indexstreamth byte of the byte stream as follows:

      If the indexpatternth byte of the pattern is a normal hexadecimal byte and not a "WS" byte:

      If the "and" operator, applied to the indexstreamth byte of the stream and the indexpatternth byte of the mask, yields a value different from the indexpatternth byte of the pattern, then skip this row.

      Otherwise, increment indexpattern to the next byte in the mask and pattern and indexstream to the next byte in the byte stream.

      Otherwise, if the indexpatternth byte of the pattern is a "WS" byte:

      "WS" means "whitespace", and allows insignificant whitespace to be skipped when sniffing for a type signature.

      If the indexstreamth byte of the stream is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0B (ASCII VT), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space), then increment only the indexstream to the next byte in the byte stream.

      Otherwise, increment only the indexpattern to the next byte in the mask and pattern.

    5. If indexpattern does not point beyond the end of the mask and pattern byte strings, then jump back to the loop step in this algorithm.

    6. Otherwise, the sniffed type of the resource is the type given in the cell of the third column in that row; abort these steps.

  4. As a last-ditch effort, jump to the text or binary section.

Mask (bytes in hexadecimal) | Pattern (bytes in hexadecimal) | Sniffed type | Comment
FF FF DF DF DF DF DF DF DF FF DF DF DF DF | 3C 21 44 4F 43 54 59 50 45 20 48 54 4D 4C | text/html | The string "<!DOCTYPE HTML" in US-ASCII or compatible encodings, case-insensitively.
FF FF DF DF DF DF | WS 3C 48 54 4D 4C | text/html | The string "<HTML" in US-ASCII or compatible encodings, case-insensitively, possibly with leading spaces.
FF FF DF DF DF DF | WS 3C 48 45 41 44 | text/html | The string "<HEAD" in US-ASCII or compatible encodings, case-insensitively, possibly with leading spaces.
FF FF DF DF DF DF DF DF | WS 3C 53 43 52 49 50 54 | text/html | The string "<SCRIPT" in US-ASCII or compatible encodings, case-insensitively, possibly with leading spaces.
FF FF FF FF FF | 25 50 44 46 2D | application/pdf | The string "%PDF-", the PDF signature.
FF FF FF FF FF FF FF FF FF FF FF | 25 21 50 53 2D 41 64 6F 62 65 2D | application/postscript | The string "%!PS-Adobe-", the PostScript signature.
FF FF FF FF FF FF | 47 49 46 38 37 61 | image/gif | The string "GIF87a", a GIF signature.
FF FF FF FF FF FF | 47 49 46 38 39 61 | image/gif | The string "GIF89a", a GIF signature.
FF FF FF FF FF FF FF FF | 89 50 4E 47 0D 0A 1A 0A | image/png | The PNG signature.
FF FF FF | FF D8 FF | image/jpeg | A JPEG SOI marker followed by the first byte of another marker.
FF FF | 42 4D | image/bmp | The string "BM", a BMP signature.

User agents may support further types if desired, by implicitly adding to the above table. However, user agents should not use any other patterns for types already mentioned in the table above, as this could then be used for privilege escalation (where, e.g., a server uses the above table to determine that content is not HTML and thus safe from XSS attacks, but then a user agent detects it as HTML anyway and allows script to execute).
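The mask-and-pattern matching described in this section, including the handling of "WS" entries, might be sketched in Python as below. This is an illustration under stated assumptions rather than a definitive implementation: the names `row`, `matches`, `sniff_unknown`, and `TABLE` are hypothetical, "WS" is represented as `None`, only three rows of the table are included, and a row without "WS" is skipped when fewer bytes are available than the pattern needs. Returning `None` stands for the last-ditch jump to the text or binary section.

```python
from typing import List, Optional, Tuple

WS = None  # Hypothetical marker for a "WS" entry in a pattern.
WHITESPACE = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20}

Row = Tuple[bytes, List[Optional[int]], str]


def row(mask_hex: str, pattern_hex: str, sniffed: str) -> Row:
    """Build a table row from the hexadecimal notation used in the table."""
    mask = bytes.fromhex(mask_hex)
    pattern = [WS if tok == "WS" else int(tok, 16) for tok in pattern_hex.split()]
    return mask, pattern, sniffed


# Only a few rows of the table, to keep the sketch short.
TABLE = [
    row("FF FF DF DF DF DF", "WS 3C 48 54 4D 4C", "text/html"),  # "<HTML"
    row("FF FF FF FF FF", "25 50 44 46 2D", "application/pdf"),
    row("FF FF FF FF FF FF FF FF", "89 50 4E 47 0D 0A 1A 0A", "image/png"),
]


def matches(stream: bytes, mask: bytes, pattern: List[Optional[int]]) -> bool:
    if WS not in pattern:
        # Row without "WS": mask the first bytes and compare exactly.
        if len(stream) < len(pattern):
            return False
        data = bytes(b & m for b, m in zip(stream, mask))
        return data == bytes(pattern)
    # Row with "WS": walk the pattern and the stream with separate indexes.
    index_pattern = index_stream = 0
    while index_pattern < len(pattern):
        if index_stream >= len(stream):
            return False  # Ran out of bytes: this row does not match.
        if pattern[index_pattern] is WS:
            if stream[index_stream] in WHITESPACE:
                index_stream += 1   # Skip one whitespace byte.
            else:
                index_pattern += 1  # Whitespace run is over.
        else:
            masked = stream[index_stream] & mask[index_pattern]
            if masked != pattern[index_pattern]:
                return False
            index_pattern += 1
            index_stream += 1
    return True


def sniff_unknown(available: bytes) -> Optional[str]:
    stream = available[:512]  # At most the first 512 bytes are examined.
    for mask, pattern, sniffed in TABLE:
        if matches(stream, mask, pattern):
            return sniffed
    return None  # Caller falls back to the text or binary section.
```

The DF mask clears bit 0x20, which is what makes the letter comparison case-insensitive: both "h" (0x68) and "H" (0x48) become 0x48 after masking.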

4.9.3. Content-Type sniffing: image

If the first bytes of the file match one of the byte sequences in the first column of the following table, then the sniffed type of the resource is the type given in the corresponding cell in the second column on the same row:

Bytes in Hexadecimal | Sniffed type | Comment
47 49 46 38 37 61 | image/gif | The string "GIF87a", a GIF signature.
47 49 46 38 39 61 | image/gif | The string "GIF89a", a GIF signature.
89 50 4E 47 0D 0A 1A 0A | image/png | The PNG signature.
FF D8 FF | image/jpeg | A JPEG SOI marker followed by the first byte of another marker.
42 4D | image/bmp | The string "BM", a BMP signature.

User agents must ignore any rows for image types that they do not support.

Otherwise, the sniffed type of the resource is the same as its official type.
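A minimal Python sketch of the image rules follows. The function name `sniff_image` and the `supported` parameter (a stand-in for the set of image types the user agent supports) are assumptions made for illustration.

```python
# Signatures from the table above, in table order.
IMAGE_SIGNATURES = (
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xFF\xD8\xFF", "image/jpeg"),
    (b"BM", "image/bmp"),
)

ALL_TYPES = frozenset(sniffed for _, sniffed in IMAGE_SIGNATURES)


def sniff_image(first_bytes: bytes, official_type: str,
                supported: frozenset = ALL_TYPES) -> str:
    for signature, sniffed in IMAGE_SIGNATURES:
        # Rows for unsupported image types are ignored.
        if sniffed in supported and first_bytes.startswith(signature):
            return sniffed
    # No row matched: keep the official type.
    return official_type
```

With this sketch, a resource labelled "image/png" whose first bytes are "GIF89a" is reported as "image/gif", and a resource with no recognised signature keeps its official type.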

4.9.4. Content-Type sniffing: feed or HTML

  1. The user agent may wait for 512 or more bytes of the resource to be available.

  2. Let s be the stream of bytes, and let s[i] represent the byte in s with position i, treating s as zero-indexed (so the first byte is at i=0).

  3. If at any point this algorithm requires the user agent to determine the value of a byte in s which is not yet available, or which is past the first 512 bytes of the resource, or which is beyond the end of the resource, the user agent must stop this algorithm, and assume that the sniffed type of the resource is "text/html".

    User agents are allowed, by the first step of this algorithm, to wait until the first 512 bytes of the resource are available.

  4. Initialise pos to 0.

  5. Examine s[pos].

    If it is 0x09 (ASCII tab), 0x20 (ASCII space), 0x0A (ASCII LF), or 0x0D (ASCII CR)
      Increase pos by 1 and repeat this step.
    If it is 0x3C (ASCII "<")
      Increase pos by 1 and go to the next step.
    If it is anything else
      The sniffed type of the resource is "text/html". Abort these steps.
  6. If the bytes with positions pos to pos+2 in s are exactly equal to 0x21, 0x2D, 0x2D respectively (ASCII for "!--"), then:

    1. Increase pos by 3.
    2. If the bytes with positions pos to pos+2 in s are exactly equal to 0x2D, 0x2D, 0x3E respectively (ASCII for "-->"), then increase pos by 3 and jump back to the previous step (step 5) in the overall algorithm in this section.
    3. Otherwise, increase pos by 1.
    4. Return to step 2 in these substeps.
  7. If s[pos] is 0x21 (ASCII "!"):

    1. Increase pos by 1.
    2. If s[pos] equals 0x3E, then increase pos by 1 and jump back to step 5 in the overall algorithm in this section.
    3. Otherwise, return to step 1 in these substeps.
  8. If s[pos] is 0x3F (ASCII "?"):

    1. Increase pos by 1.
    2. If s[pos] and s[pos+1] equal 0x3F and 0x3E respectively, then increase pos by 2 (so that pos moves past the closing "?>") and jump back to step 5 in the overall algorithm in this section.
    3. Otherwise, return to step 1 in these substeps.
  9. Otherwise, if the bytes in s starting at pos match any of the sequences of bytes in the first column of the following table, then the user agent must follow the steps given in the corresponding cell in the second column of the same row.

    Bytes in Hexadecimal | Requirement | Comment
    72 73 73 | The sniffed type of the resource is "application/rss+xml"; abort these steps | The three ASCII characters "rss"
    66 65 65 64 | The sniffed type of the resource is "application/atom+xml"; abort these steps | The four ASCII characters "feed"
    72 64 66 3A 52 44 46 | Continue to the next step in this algorithm | The ASCII characters "rdf:RDF"

    If none of the byte sequences above match the bytes in s starting at pos, then the sniffed type of the resource is "text/html". Abort these steps.

  10. If, before the next ">", you find two xmlns* attributes with http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://purl.org/rss/1.0/ as the namespaces, then the sniffed type of the resource is "application/rss+xml"; abort these steps. (maybe we only need to check for http://purl.org/rss/1.0/ actually)

  11. Otherwise, the sniffed type of the resource is "text/html".

For efficiency reasons, implementations may wish to implement this algorithm and the algorithm for detecting the character encoding of HTML documents in parallel.
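The feed or HTML algorithm can be sketched in Python roughly as follows. This is one reading of the steps, with several assumptions labelled: the names `sniff_feed_or_html` and `_OutOfBytes` are hypothetical; the sketch advances past both bytes of a closing "?>" in step 8, which is assumed to be the intent; and step 10 is approximated by a plain substring search for the two namespace URLs before the next ">" instead of real attribute parsing.

```python
RDF_NS = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = b"http://purl.org/rss/1.0/"


class _OutOfBytes(Exception):
    """A byte was needed that step 3 treats as unavailable."""


def sniff_feed_or_html(available: bytes) -> str:
    s = available[:512]  # Bytes past the first 512 are never examined.

    def at(i: int) -> int:
        if i >= len(s):
            raise _OutOfBytes
        return s[i]

    def equals(i: int, expected: bytes) -> bool:
        # Compares byte by byte and stops at the first mismatch.
        return all(at(i + k) == b for k, b in enumerate(expected))

    try:
        pos = 0
        while True:
            # Step 5: skip whitespace, then require "<".
            while at(pos) in (0x09, 0x20, 0x0A, 0x0D):
                pos += 1
            if at(pos) != 0x3C:
                return "text/html"
            pos += 1
            if equals(pos, b"!--"):        # Step 6: skip a comment.
                pos += 3
                while not equals(pos, b"-->"):
                    pos += 1
                pos += 3
            elif at(pos) == 0x21:          # Step 7: skip "<!...>".
                pos += 1
                while at(pos) != 0x3E:
                    pos += 1
                pos += 1
            elif at(pos) == 0x3F:          # Step 8: skip "<?...?>".
                pos += 1
                while not equals(pos, b"?>"):
                    pos += 1
                pos += 2                   # Assumption: move past "?>".
            else:
                break
        # Step 9: look at the element name.
        if equals(pos, b"rss"):
            return "application/rss+xml"
        if equals(pos, b"feed"):
            return "application/atom+xml"
        if not equals(pos, b"rdf:RDF"):
            return "text/html"
        # Step 10 (simplified): both namespaces appear before the next ">".
        pos += len(b"rdf:RDF")
        end = pos
        while at(end) != 0x3E:
            end += 1
        tag = s[pos:end]
        if RDF_NS in tag and RSS_NS in tag:
            return "application/rss+xml"
        return "text/html"                 # Step 11
    except _OutOfBytes:
        return "text/html"                 # Step 3
```

Raising an exception from the byte accessor is simply a convenient way to honour step 3 everywhere: any attempt to read an unavailable byte ends the algorithm with "text/html".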

4.9.5. Content-Type metadata

What explicit Content-Type metadata is associated with the resource (the resource's type information) depends on the protocol that was used to fetch the resource.

For HTTP resources, only the Content-Type HTTP header contributes any data; the explicit type of the resource is then the value of that header, interpreted as described by the HTTP specifications. If the Content-Type HTTP header is present but it cannot be interpreted as described by the HTTP specifications (e.g. because its value doesn't contain a U+002F SOLIDUS ('/') character), then the resource has no type information. [HTTP]

For resources fetched from the filesystem, user agents should use platform-specific conventions, e.g. operating system extension/type mappings.

Extensions must not be used for determining resource types for resources fetched over HTTP.

For resources fetched over most other protocols, e.g. FTP, there is no type information.

The algorithm for extracting an encoding from a Content-Type, given a string s, is as follows. It either returns an encoding or nothing.

  1. Skip characters in s up to and including the first U+003B SEMICOLON (;) character.

  2. Skip any U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020 characters (i.e. spaces) that immediately follow the semicolon.

  3. If the next seven characters are not 'charset', return nothing.

  4. Skip any U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020 characters that immediately follow the word 'charset' (there might not be any).

  5. If the next character is not a U+003D EQUALS SIGN ('='), return nothing.

  6. Skip any U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020 characters that immediately follow the equals sign (there might not be any).

  7. Process the next character as follows:

    If it is a U+0022 QUOTATION MARK ('"') and there is a later U+0022 QUOTATION MARK ('"') in s
      Return the string between the two quotation marks.

    If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027 APOSTROPHE ("'") in s
      Return the string between the two apostrophes.

    If it is an unmatched U+0022 QUOTATION MARK ('"')
    If it is an unmatched U+0027 APOSTROPHE ("'")
      Return nothing.

    Otherwise
      Return the string from this character to the first U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020 character or the end of s, whichever comes first.
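The encoding extraction steps can be sketched in Python as follows. The function name `extract_encoding` is hypothetical, "nothing" is represented as `None`, and the match on 'charset' is assumed to be case-sensitive because the steps do not say otherwise.

```python
from typing import Optional

SPACES = "\t\n\x0b\x0c\r "  # U+0009 to U+000D, and U+0020


def extract_encoding(s: str) -> Optional[str]:
    def skip_spaces(i: int) -> int:
        while i < len(s) and s[i] in SPACES:
            i += 1
        return i

    # Step 1: skip up to and including the first semicolon.
    semicolon = s.find(";")
    if semicolon == -1:
        return None  # Nothing left to examine, so 'charset' cannot follow.
    # Step 2
    i = skip_spaces(semicolon + 1)
    # Step 3: the word 'charset' is seven characters long.
    if s[i:i + 7] != "charset":
        return None
    # Step 4
    i = skip_spaces(i + 7)
    # Step 5
    if s[i:i + 1] != "=":
        return None
    # Step 6
    i = skip_spaces(i + 1)
    # Step 7: quoted value, unmatched quote, or bare token.
    quote = s[i:i + 1]
    if quote in ('"', "'"):
        close = s.find(quote, i + 1)
        if close == -1:
            return None  # Unmatched quotation mark or apostrophe.
        return s[i + 1:close]
    end = i
    while end < len(s) and s[end] not in SPACES:
        end += 1
    return s[i:end]  # May be the empty string if nothing follows "=".
```

For instance, `extract_encoding('text/html; charset="ISO-8859-1"')` gives "ISO-8859-1", while a value with an opening quotation mark and no closing one gives `None`.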