Matrix Calculus: Gradients with Vectors and Matrices
Motivation
Machine learning optimization requires computing derivatives of scalar loss functions with respect to vector or matrix parameters. Matrix calculus provides notation and rules for these computations.
Scalar-by-Vector Derivative
Given a scalar function \(f: \mathbb{R}^{n} \to \mathbb{R}\) and vector \(x \in \mathbb{R}^{n}\), the gradient is:

\[\nabla_x f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix}^T\]
The gradient is a column vector in \(\mathbb{R}^{n}\).
Linear Function

For \(f(x) = a^T x\) with a constant vector \(a \in \mathbb{R}^{n}\):

\[\nabla_x f = a\]

Quadratic Form

\[f(x) = x^T A x\]

where \(A \in \mathbb{R}^{n \times n}\) is symmetric.
Derivation

Writing the quadratic form elementwise, \(f(x) = \sum_i \sum_j a_{ij} x_i x_j\), the partial derivative with respect to \(x_k\) collects the terms where \(i = k\) or \(j = k\):

\[\frac{\partial f}{\partial x_k} = \sum_j a_{kj} x_j + \sum_i a_{ik} x_i\]

Since \(A\) is symmetric (\(a_{ij} = a_{ji}\)):

\[\frac{\partial f}{\partial x_k} = 2 \sum_j a_{kj} x_j = 2(Ax)_k\]

Thus \(\nabla_x f = 2Ax\).
Example

With \(A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\) and \(x = [x_1, x_2]^T\):

\[f(x) = 2x_1^2 + 2x_1 x_2 + 3x_2^2, \qquad \nabla_x f = \begin{bmatrix} 4x_1 + 2x_2 \\ 2x_1 + 6x_2 \end{bmatrix} = 2Ax\]
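As a quick sanity check (a sketch, not from the original notes), the \(2Ax\) formula for the example above can be verified against a central-difference approximation in numpy:

```python
import numpy as np

# Numerical check of ∇_x (xᵀAx) = 2Ax for the symmetric A from the example.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
x = np.array([0.5, -1.0])

f = lambda v: v @ A @ v
analytic = 2 * A @ x

# Central-difference approximation of each partial derivative.
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(analytic, numeric))
```

This finite-difference pattern is a standard way to catch sign and transpose errors in hand-derived gradients.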
Affine Function Squared Norm

\[f(x) = \|Ax - b\|_2^2\]

Expand:

\[f(x) = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 b^T A x + b^T b\]

so

\[\nabla_x f = 2 A^T A x - 2 A^T b = 2 A^T (Ax - b)\]

This is used in linear regression: \(f(w) = \|Xw - y\|_2^2\) has gradient \(2X^T(Xw - y)\).
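The least-squares gradient \(2A^T(Ax - b)\) can likewise be spot-checked numerically (a minimal sketch with arbitrary random data):

```python
import numpy as np

# Check ∇_x ||Ax - b||² = 2 Aᵀ(Ax - b) on random data.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda v: np.sum((A @ v - b) ** 2)
analytic = 2 * A.T @ (A @ x - b)

# Central differences along each coordinate direction.
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-5))
```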
Scalar-by-Matrix Derivative
Given \(f: \mathbb{R}^{m \times n} \to \mathbb{R}\) and matrix \(X \in \mathbb{R}^{m \times n}\), the gradient is:

\[\left( \frac{\partial f}{\partial X} \right)_{ij} = \frac{\partial f}{\partial X_{ij}}\]
This is an \(m \times n\) matrix where each entry is the partial derivative with respect to that element.
Example: Frobenius Norm Squared

\[f(X) = \|X\|_F^2 = \sum_{i,j} X_{ij}^2, \qquad \frac{\partial f}{\partial X} = 2X\]
Example: Trace

For square \(X \in \mathbb{R}^{n \times n}\), \(f(X) = \text{tr}(X) = \sum_i X_{ii}\) gives:

\[\frac{\partial f}{\partial X} = I\]
Matrix-Matrix Product

For \(f(X) = \text{tr}(AXB)\):

\[\frac{\partial f}{\partial X} = A^T B^T\]

Special case: \(f(X) = \text{tr}(AX)\) (taking \(B = I\)) gives \(\frac{\partial f}{\partial X} = A^T\).
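The \(\text{tr}(AX)\) identity can be confirmed entrywise with finite differences (a sketch; the perturbation matrices are just coordinate directions in matrix space):

```python
import numpy as np

# Check ∂ tr(AX)/∂X = Aᵀ, one matrix entry at a time.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

f = lambda M: np.trace(A @ M)
eps = 1e-6
numeric = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3))
        E[i, j] = eps                      # perturb a single entry of X
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(numeric, A.T))
```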
Chain Rule for Vectors

If \(y = g(x)\) and \(z = f(y)\), then:

\[\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^T \nabla_y z\]

where \(\frac{\partial y}{\partial x}\) is the Jacobian matrix:

\[\left( \frac{\partial y}{\partial x} \right)_{ij} = \frac{\partial y_i}{\partial x_j}\]
Example: Composition

Let \(y = Ax\) (linear transformation) and \(z = \|y\|_2^2 = y^T y\). Then \(\nabla_y z = 2y\) and \(\frac{\partial y}{\partial x} = A\), so:

\[\nabla_x z = A^T (2y) = 2A^T A x\]
Verify directly: \(z = x^TA^TAx\), so \(\nabla_x z = 2A^TAx\) ✓
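The two routes to \(\nabla_x z\), chain rule versus direct expansion, can be compared numerically (a small sketch with arbitrary random data):

```python
import numpy as np

# Chain rule check for z = ||Ax||²: Jᵀ ∇_y z should equal 2 AᵀAx.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

y = A @ x
grad_y = 2 * y               # ∇_y (yᵀy)
jacobian = A                 # ∂y/∂x for y = Ax
chain = jacobian.T @ grad_y  # chain rule: Jᵀ ∇_y z
direct = 2 * A.T @ A @ x     # direct expansion of z = xᵀAᵀAx

print(np.allclose(chain, direct))
```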
Denominator Layout vs. Numerator Layout
There are two conventions:
Numerator layout: \(\frac{\partial f}{\partial x}\) is a row vector if \(x\) is a column vector.

Denominator layout (used here): \(\frac{\partial f}{\partial x}\) is a column vector if \(x\) is a column vector, matching the gradient \(\nabla_x f\).

Always verify which convention a source uses. We use denominator layout (the gradient convention) throughout.
Common Derivatives
| Function \(f(x)\) | Gradient \(\nabla_x f\) |
|---|---|
| \(a^Tx\) | \(a\) |
| \(x^TAx\) (symmetric \(A\)) | \(2Ax\) |
| \(x^Tx\) | \(2x\) |
| \(\|x\|_2\) | \(x/\|x\|_2\) |
| \(\|Ax - b\|_2^2\) | \(2A^T(Ax - b)\) |
| \(\log(\exp(x)^T \mathbf{1})\) | \(\text{softmax}(x)\) |
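The last table row, the log-sum-exp gradient, is easy to get wrong by hand, so here is a quick numerical spot-check (a sketch, not part of the original table):

```python
import numpy as np

# Spot-check: ∇_x log(Σ_i exp(x_i)) = softmax(x).
x = np.array([1.0, -0.5, 2.0])

f = lambda v: np.log(np.sum(np.exp(v)))
softmax = np.exp(x) / np.sum(np.exp(x))

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, softmax, atol=1e-8))
```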
Softmax Function
For \(x \in \mathbb{R}^{n}\), the softmax is:

\[s_i = \text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)}\]

The Jacobian is:

\[\frac{\partial s_i}{\partial x_j} = s_i (\delta_{ij} - s_j)\]

In matrix form:

\[\frac{\partial s}{\partial x} = \text{diag}(s) - s s^T\]

where \(s = \text{softmax}(x)\).
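The \(\text{diag}(s) - s s^T\) form of the Jacobian can be verified column by column with finite differences (a sketch; the max-shift inside `softmax` is the usual numerical-stability trick):

```python
import numpy as np

# Check the softmax Jacobian diag(s) - s sᵀ against finite differences.
def softmax(v):
    e = np.exp(v - np.max(v))  # shift for numerical stability
    return e / e.sum()

x = np.array([0.2, -1.0, 0.7])
s = softmax(x)
analytic = np.diag(s) - np.outer(s, s)

# Column j holds ∂s/∂x_j, approximated by central differences.
eps = 1e-6
numeric = np.column_stack([
    (softmax(x + eps * e) - softmax(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(analytic, numeric, atol=1e-8))
```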
Logistic Loss

\[L(w) = \log\left(1 + \exp(-y\, w^T x)\right)\]

where \(y \in \{-1, 1\}\) is the label. The gradient is:

\[\nabla_w L = -y\, \sigma(-y\, w^T x)\, x\]

where \(\sigma(z) = \frac{1}{1 + \exp(-z)}\) is the sigmoid function.
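A numerical check of the logistic-loss gradient (a sketch with an arbitrary data point and label):

```python
import numpy as np

# Check ∇_w log(1 + exp(-y wᵀx)) = -y σ(-y wᵀx) x.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(3)
w = rng.standard_normal(4)
x = rng.standard_normal(4)
y = -1.0  # label in {-1, +1}

loss = lambda v: np.log(1 + np.exp(-y * (v @ x)))
analytic = -y * sigmoid(-y * (w @ x)) * x

eps = 1e-6
numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(analytic, numeric, atol=1e-6))
```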
Cross-Entropy Loss

For classification with softmax:

\[L = -\sum_i y_i \log \hat{y}_i\]

where \(y\) is one-hot encoded and \(\hat{y} = \text{softmax}(z)\) with \(z = W^T x\). The gradient with respect to the logits is:

\[\nabla_z L = \hat{y} - y\]
This clean derivative is why softmax + cross-entropy is used together.
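That clean \(\hat{y} - y\) form can be confirmed directly (a sketch with arbitrary logits and a one-hot label):

```python
import numpy as np

# Check ∇_z [-Σ_i y_i log softmax(z)_i] = softmax(z) - y for one-hot y.
def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])  # one-hot label

loss = lambda v: -np.sum(y * np.log(softmax(v)))
analytic = softmax(z) - y

eps = 1e-6
numeric = np.array([(loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-6))
```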
Backpropagation
In a neural network with layers \(h_1 = \sigma_1(W_1 x)\), \(h_2 = \sigma_2(W_2 h_1)\), ..., the gradient of loss \(L\) with respect to \(W_1\) is computed via the chain rule:

\[\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}\]
Backpropagation efficiently computes these gradients by caching intermediate results.
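A minimal backpropagation sketch, assuming a toy network \(h = \tanh(W_1 x)\), \(L = \|W_2 h - t\|^2\) not taken from the text: the forward pass caches intermediates, the backward pass applies the chain rule layer by layer, and the result is checked against finite differences.

```python
import numpy as np

# Toy network (assumed for illustration): h = tanh(W1 x), L = ||W2 h - t||².
rng = np.random.default_rng(4)
W1 = rng.standard_normal((3, 2))
W2 = rng.standard_normal((2, 3))
x = rng.standard_normal(2)
t = rng.standard_normal(2)

def loss(W1_):
    h = np.tanh(W1_ @ x)
    return np.sum((W2 @ h - t) ** 2)

# Forward pass with cached intermediates, then backward pass.
a = W1 @ x
h = np.tanh(a)
r = W2 @ h - t
grad_h = 2 * W2.T @ r             # chain through the output layer
grad_a = grad_h * (1 - h ** 2)    # chain through tanh (σ'(a) = 1 - tanh²(a))
grad_W1 = np.outer(grad_a, x)     # chain through W1 x

# Finite-difference check, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        E = np.zeros_like(W1)
        E[i, j] = eps
        numeric[i, j] = (loss(W1 + E) - loss(W1 - E)) / (2 * eps)

print(np.allclose(grad_W1, numeric, atol=1e-5))
```

Note how the backward pass reuses `h` and `r` from the forward pass; that caching is exactly what makes backpropagation efficient.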
Hessian Matrix

The Hessian of \(f: \mathbb{R}^{n} \to \mathbb{R}\) is the matrix of second derivatives:

\[H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\]
If f is twice differentiable, \(H\) is symmetric.
Example

For \(f(x) = x^T A x\) with symmetric \(A\), we have \(\nabla_x f = 2Ax\) and therefore \(H = 2A\).
Positive Definite Hessian
If \(H\) is positive definite at a critical point (\(\nabla f = \mathbf{0}\)), the point is a local minimum.
Newton's method uses the Hessian to find minima:

\[x \leftarrow x - H^{-1} \nabla f(x)\]
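On a quadratic objective the Hessian is constant and a single Newton step lands exactly on the minimizer, which makes a nice sanity check (a sketch using the symmetric positive definite \(A\) from the earlier example):

```python
import numpy as np

# Newton's method x ← x - H⁻¹∇f on f(x) = xᵀAx - bᵀx:
# ∇f = 2Ax - b and H = 2A, so one step solves the problem exactly.
A = np.array([[2.0, 1.0], [1.0, 3.0]])  # symmetric positive definite
b = np.array([1.0, -2.0])

x = np.array([5.0, 5.0])                # arbitrary starting point
grad = 2 * A @ x - b
H = 2 * A
x_new = x - np.linalg.solve(H, grad)    # Newton step (solve, don't invert)

# The true minimizer satisfies ∇f = 0, i.e. 2Ax = b.
x_star = np.linalg.solve(2 * A, b)
print(np.allclose(x_new, x_star))
```

Using `np.linalg.solve` instead of forming \(H^{-1}\) explicitly is the standard numerically preferable choice.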
Relevance for Machine Learning
Gradient Descent: Updates parameters via \(w \leftarrow w - \alpha \nabla_w L\). Computing \(\nabla_w L\) requires matrix calculus.
Backpropagation: Efficiently computes gradients in neural networks using the chain rule.
Second-Order Optimization: Methods like Newton's method and L-BFGS use Hessian information for faster convergence.
Regularization Gradients: Ridge (\(L_2\)) adds \(\lambda \|w\|_2^2\) to the loss, contributing \(2\lambda w\) to the gradient. Lasso (\(L_1\)) adds \(\lambda \|w\|_1\), contributing \(\lambda \text{sign}(w)\) (subgradient).
Batch Normalization: Gradients through normalization layers involve matrix derivatives.
Attention Mechanisms: Computing gradients of attention scores (softmax of \(QK^T\)) uses matrix calculus.
Variational Inference: Reparameterization trick requires gradients through sampling operations.
Adversarial Training: Computing adversarial perturbations uses gradients: \(\delta = \epsilon \text{sign}(\nabla_x L)\).
Jacobian Regularization: Penalizing \(\|J\|_F^2\), the squared Frobenius norm of the network's input-output Jacobian, encourages smooth functions and is used in robust models.
Eigenvalue Gradient: Differentiating eigenvalues/eigenvectors (e.g., for spectral normalization) uses advanced matrix calculus.