Varol Cagdas Tok

Personal notes and articles.

XSS Defense: Output Encoding

Output encoding converts characters with structural meaning in the output context into sequences that parsers treat as data. The goal is to ensure that user-supplied content cannot change the structure of the surrounding document, regardless of what characters it contains.


Encoding Is Context-Dependent

The correct encoding depends on where the data is placed in the output. HTML, JavaScript, URLs, and CSS each have different parsers with different sets of special characters. Applying the wrong encoding for the context either fails to prevent injection or corrupts the data.

HTML body context: data inserted as element content. The characters that must be encoded are those that start or modify HTML structure: < becomes &lt;, > becomes &gt;, & becomes &amp;, " becomes &quot;, ' becomes &#x27;. Encoding these prevents the parser from treating the data as the start of a new tag or attribute.
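The replacements above can be sketched in a few lines. This is a minimal illustration, assuming Python and its standard-library html.escape function; the helper name encode_html_body is hypothetical and not part of any framework.

```python
import html


def encode_html_body(value: str) -> str:
    """Encode text for insertion as HTML element content.

    html.escape replaces & first, then < and >, and with quote=True
    also replaces both quote characters.
    """
    return html.escape(value, quote=True)


payload = "<script>alert('x')</script>"
print(encode_html_body(payload))
# prints: &lt;script&gt;alert(&#x27;x&#x27;)&lt;/script&gt;
```

The browser displays the encoded payload as visible text instead of parsing it as a script element.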

HTML attribute context: data inserted as the value of an HTML attribute. The attribute must be quoted. Within a quoted attribute, the quote character that delimits the value must be encoded, along with &: for double-quoted attributes, " becomes &quot; and & becomes &amp;. An unquoted attribute value ends at whitespace or >, so far more characters can break out of it; unquoted attributes are difficult to encode correctly and should be avoided entirely.
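A short sketch of the quoted-attribute rule, again assuming Python's html.escape. The function name render_title_attribute and the choice of the title attribute are illustrative assumptions only.

```python
import html


def render_title_attribute(user_value: str) -> str:
    """Build a <span> whose title attribute holds user_value.

    The attribute is always double-quoted, and the value is entity-encoded
    so that an embedded quote cannot end the attribute early.
    """
    encoded = html.escape(user_value, quote=True)
    return f'<span title="{encoded}">hover</span>'


print(render_title_attribute('" onmouseover="alert(1)'))
# prints: <span title="&quot; onmouseover=&quot;alert(1)">hover</span>
```

The only literal double quotes left in the output are the two that delimit the attribute, so the injected onmouseover text stays inside the title value.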

JavaScript string context: data inserted into a JavaScript string literal in a <script> block. HTML entity encoding is not sufficient here because the HTML parser does not decode character references inside a <script> element: the JavaScript engine receives the text exactly as written, so entities end up as literal characters in the data while \ and line breaks pass through unescaped. Characters that terminate or escape from the string must be JavaScript-escaped: ' becomes \', " becomes \", and \ becomes \\. Line terminators must also be escaped since they terminate string literals. A safer approach is to use JSON encoding and additionally escape < (for example as \u003c), or at least verify that the output does not contain </script>, because the HTML parser closes the script block at that sequence even when it appears inside a JavaScript string.
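The JSON-based approach might look like the following sketch. It assumes Python's standard json module; encode_js_literal is a hypothetical helper name. Escaping >, & in addition to < is a defensive choice made here, not something the text above requires.

```python
import json


def encode_js_literal(value: str) -> str:
    """Serialize value as a string literal for use inside an inline <script>.

    json.dumps adds the surrounding quotes and escapes ", \\ and control
    characters. With ensure_ascii=True, every non-ASCII character (including
    the line separators U+2028 and U+2029) is emitted as a \\uXXXX escape.
    """
    literal = json.dumps(value, ensure_ascii=True)
    # Hide characters the HTML parser could act on, such as the "<" that
    # begins "</script>". The \\uXXXX forms decode back to the same characters.
    return (
        literal.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


payload = "</script><script>alert(1)</script>"
encoded = encode_js_literal(payload)
print(f"<script>var comment = {encoded};</script>")
```

The encoded literal contains no < at all, so the HTML parser cannot find a closing script tag inside it, while the JavaScript engine still reads back the original string.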

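URL context: data inserted into a URL query parameter or path segment. Percent-encoding with encodeURIComponent() encodes every character except letters, digits, and - _ . ! ~ * ' ( ), so characters with structural meaning in a URL (such as /, ?, &, =, and #) cannot add parameters or change the path. This confines the data to a single component. It does not make a fully user-supplied URL safe: when the whole URL comes from the user, for example in an href, the scheme must be checked against an allowlist to reject javascript: URLs. Because . is left unencoded, percent-encoding also does not neutralize .. segments on its own.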
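A sketch of component encoding on the server side, assuming Python's urllib.parse. Note that quote() is not identical to encodeURIComponent(): with safe="" it leaves only letters, digits, and _ . - ~ unencoded. The base URL, the function names, and the scheme allowlist are all assumptions for illustration; the allowlist check covers the separate case where the entire URL is user-supplied.

```python
from urllib.parse import quote, urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web links are wanted


def build_search_url(term: str) -> str:
    """Place term in the q parameter of a fixed, trusted base URL."""
    # safe="" makes quote() encode "/" too, which it leaves alone by default.
    return "https://example.com/search?q=" + quote(term, safe="")


def is_allowed_link(url: str) -> bool:
    """Accept a fully user-supplied URL only if its scheme is allowlisted.

    urlsplit lowercases the scheme; relative URLs have an empty scheme
    and are rejected by this deliberately strict check.
    """
    return urlsplit(url).scheme in ALLOWED_SCHEMES


print(build_search_url("a&b=c/d e"))
# prints: https://example.com/search?q=a%26b%3Dc%2Fd%20e
print(is_allowed_link("javascript:alert(1)"))
# prints: False
```

The two functions address different situations: the first encodes untrusted data placed inside a trusted URL, the second decides whether an untrusted URL may be used at all.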


Input Validation Is Not a Substitute

Input validation blocks known-bad characters at the point of entry. It is incomplete as an XSS defense because: the set of characters needed to inject varies by output context; validation logic may not match what each renderer requires; data may be used in multiple contexts; and validation applied at one layer may be bypassed or stripped at another. Encoding at the point of output is required because that is where the parser processes the data. Validation can reduce the attack surface but does not replace encoding.
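To make the first point concrete, here is a small sketch in Python. The filter naive_validate is a hypothetical entry-point check that rejects angle brackets, which is enough for the HTML body context but not for an attribute.

```python
import html
import re


def naive_validate(value: str) -> bool:
    """Hypothetical entry-point filter: reject values with angle brackets."""
    return re.search(r"[<>]", value) is None


# Contains no < or >, so it passes validation.
payload = '" autofocus onfocus="alert(1)'
assert naive_validate(payload)

# Inserted into an attribute without encoding, the quote ends the value
# and the rest of the payload becomes new attributes.
unsafe = f'<input value="{payload}">'

# Encoded at the point of output, the payload stays inside the value.
safe = f'<input value="{html.escape(payload, quote=True)}">'

print(unsafe)
print(safe)
```

The validator and the attribute context disagree about which characters are dangerous; encoding at output removes the need for them to agree.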


Template Engines and Auto-Escaping

Most server-side template engines provide auto-escaping: variables inserted into templates are HTML-encoded by default. This works for the HTML body context. It does not automatically handle JavaScript string context, URL context, or CSS context. Disabling escaping (through mechanisms like Jinja2's |safe filter or Twig's raw filter) to render HTML content should be avoided for user-supplied data, or the data must be pre-sanitized with a library that understands HTML structure (such as DOMPurify).
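The idea behind auto-escaping with an explicit opt-out can be modeled in a few lines. This is a toy sketch built on Python's string.Template, not the implementation used by Jinja2 or Twig; the Safe marker class and the render function are hypothetical names.

```python
import html
from string import Template


class Safe(str):
    """Marker for markup that the caller asserts is already sanitized."""


def render(template: str, **variables: str) -> str:
    """Substitute $name placeholders, HTML-encoding every value by default.

    Only values wrapped in Safe skip encoding, which mirrors the role of
    an opt-out such as a safe or raw filter.
    """
    prepared = {
        name: value if isinstance(value, Safe) else html.escape(str(value), quote=True)
        for name, value in variables.items()
    }
    return Template(template).substitute(prepared)


print(render("<p>$comment</p>", comment="<img src=x onerror=alert(1)>"))
# prints: <p>&lt;img src=x onerror=alert(1)&gt;</p>
print(render("<p>$comment</p>", comment=Safe("<b>reviewed</b>")))
# prints: <p><b>reviewed</b></p>
```

The sketch also shows the limitation described above: the default encoding is HTML encoding, so a placeholder placed inside a script block or a URL would still need the encoding for that context.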