Singular Value Decomposition (SVD)
Definition
The Singular Value Decomposition (SVD) of a matrix \(A \in \mathbb{R}^{m \times n}\) is:
where:
- \(U \in \mathbb{R}^{m \times m}\) is orthogonal (columns are orthonormal left singular vectors)
- \(\Sigma \in \mathbb{R}^{m \times n}\) is diagonal (entries \(\sigma_1 \geq \sigma_2 \geq ... \geq \sigma_r \geq 0\) are singular values)
- \(V \in \mathbb{R}^{n \times n}\) is orthogonal (columns are orthonormal right singular vectors)
Unlike eigendecomposition, SVD applies to any matrix (not just square).
Singular Values
The singular values \(\sigma_i\) are the square roots of the eigenvalues of \(A^TA\) (or \(AA^T\)):
The number of non-zero singular values equals rank(\(A\)).
Computing SVD
To find \(U\), \(\Sigma\), \(V\):
- Compute \(A^TA \in \mathbb{R}^{n \times n}\) (symmetric, positive semi-definite)
- Find eigenvalues and eigenvectors of \(A^TA\)
- Singular values: \(\sigma_i = \sqrt{\lambda_i}\)
- Right singular vectors: columns of \(V\) are eigenvectors of \(A^TA\)
- Left singular vectors: \(u_i = (1/\sigma_i)Av_i\) for \(i = 1, \ldots, r\)
- \(U_r \in \mathbb{R}^{m \times r}\) (first r columns of \(U\))
- \(\Sigma_r \in \mathbb{R}^{r \times r}\) (r \(\times\) r diagonal matrix)
- \(V_r \in \mathbb{R}^{n \times r}\) (first r columns of \(V\))
- rank(\(A\)) = number of non-zero singular values
- \(\|A\|_2 = \sigma_{1}\) (largest singular value)
- \(\|A\|_F = \sqrt{\sigma_{1}^2 + \sigma_{2}^2 + ... + \sigma_r^2}\)
- \(|\det(A)| = \prod \sigma_{i}\) (for square A)
- \(A^TA = V \Sigma^T \Sigma V^T\)
- \(AA^T = U \Sigma \Sigma^T U^T\)
- Singular values are absolute values of eigenvalues: \(\sigma_{i} = |\lambda_{i}|\)
- Singular vectors are eigenvectors (up to sign)
- SVD: \(A = U \Sigma U^T\) (since \(U = V\) for symmetric matrices)
- \(U\): word-topic relationships
- \(\Sigma\): topic strengths
- \(V\): document-topic relationships
Example
A = [3 1]
[1 3]
[0 0]
\(A\) is \(3 \times 2\).
Step 1: Compute \(A^TA\):
AᵀA = [3 1 0][3 1] [9+1+0 3+3+0] [10 6]
[1 3 0][1 3] = [3+3+0 1+9+0] = [6 10]
[0 0]
Step 2: Find eigenvalues of \(A^TA\):
det(AᵀA - λI) = det([10-λ 6 ])
[6 10-λ]
= (10-λ)² - 36
= λ² - 20λ + 100 - 36
= λ² - 20λ + 64
= (λ - 16)(λ - 4)
Eigenvalues: \(\lambda_{1} = 16, \lambda_{2} = 4\)
Step 3: Singular values:
\(\sigma_1 = \sqrt{16} = 4, \sigma_2 = \sqrt{4} = 2\)
Σ = [4 0]
[0 2]
[0 0]
Step 4: Find eigenvectors of \(A^TA\):
For \(\lambda_{1} = 16\):
(AᵀA - 16I)v = [-6 6][v₁] [0]
[6 -6][v₂] = [0]
-v₁ + v₂ = 0 → v₁ = v₂
Normalized: \(v_{1} = [1/\sqrt{2}, 1/\sqrt{2}]^T\)
For \(\lambda_{2} = 4\):
(AᵀA - 4I)v = [6 6][v₁] [0]
[6 6][v₂] = [0]
v₁ + v₂ = 0 → v₂ = -v₁
Normalized: \(v_{2} = [1/\sqrt{2}, -1/\sqrt{2}]^T\)
V = [1/√2 1/√2]
[1/√2 -1/√2]
Step 5: Compute left singular vectors:
u₁ = (1/σ₁)Av₁ = (1/4)[3 1][1/√2] (1/4)[4/√2] [1/√2]
[1 3][1/√2] = (1/4)[4/√2] = [1/√2]
[0 0] (1/4)[0 ] [0 ]
u₂ = (1/σ₂)Av₂ = (1/2)[3 1][ 1/√2] (1/2)[2/√2] [1/√2 ]
[1 3][-1/√2] = (1/2)[-2/√2] = [-1/√2]
[0 0] (1/2)[0 ] [0 ]
Need a third vector orthogonal to \(u_1\) and \(u_2\):
\(u_3 = [0, 0, 1]^T\) (perpendicular to span\(\{u_1, u_2\}\))
U = [1/√2 1/√2 0]
[1/√2 -1/√2 0]
[0 0 1]
Verify:
UΣVᵀ = [1/√2 1/√2 0][4 0][1/√2 1/√2]
[1/√2 -1/√2 0][0 2][1/√2 -1/√2]
[0 0 1][0 0]
= [1/√2 1/√2 0][4/√2 4/√2]
[1/√2 -1/√2 0][2/√2 -2/√2]
[0 0 1][0 0 ]
= [4/2+2/2 4/2-2/2] [3 1]
[4/2-2/2 4/2+2/2] = [1 3] ✓
[0 0 ] [0 0]
Reduced SVD
For \(A \in \mathbb{R}^{m \times n}\) with rank r \(\leq\) min(m, n), the reduced SVD uses only the r non-zero singular values:
where:
This is more memory-efficient for low-rank matrices.
Outer Product Form
SVD can be written as a sum of rank-1 matrices:
Each term \(\sigma_i u_{i} v_{i}^T\) captures a "mode" or "pattern" in the data.
Low-Rank Approximation
The best rank-k approximation to \(A\) (minimizing Frobenius norm) is:
where \(U_k\), \(\Sigma_k\), \(V_k\) use only the first k singular values/vectors.
The approximation error is:
Moore-Penrose Pseudo-Inverse
The pseudo-inverse of \(A\) is:
where \(\Sigma^+\) has \(1/\sigma_{i}\) on the diagonal for non-zero \(\sigma_{i}\), and 0 elsewhere.
For overdetermined systems \(Ax = b\) (m > n), the least squares solution is:
Properties of SVD
Relationship to Eigendecomposition
For symmetric \(A = A^T\):
Relevance for Machine Learning
Principal Component Analysis (PCA): Instead of eigendecomposing the covariance matrix, apply SVD directly to the centered data matrix \(X\):
Principal components: columns of \(V\)
Variance explained: \(\sigma_{i}^2 /(m-1)\)
Reduced data: \(XV_{k}\) (project onto top k components)
Latent Semantic Analysis (LSA): In text mining, create a term-document matrix \(A\) (rows = words, columns = documents). SVD decomposes \(A\) into:
Recommender Systems: Matrix factorization for collaborative filtering. User-item matrix \(R \approx U \Sigma V^T\), where \(U\) contains user embeddings and \(V\) contains item embeddings.
Image Compression: Represent an image matrix \(A\) with low-rank approximation \(A_k\). Keep only k largest singular values/vectors, reducing storage from mn to k(m+n+1).
Denoising: If signal is low-rank and noise is full-rank, truncated SVD removes noise by discarding small singular values.
Least Squares Regression: Solve \(Xw = y\) using pseudo-inverse: \(w = X^+y\). More numerically stable than \((X^TX)^{-1}X^Ty\).
Condition Number: \(\text{cond}(A) = \sigma_{1}/\sigma_r\) quantifies numerical stability. Large condition numbers indicate ill-conditioned problems requiring regularization.
Tensor Decomposition: Higher-order SVD (HOSVD) extends SVD to tensors, used in multiway data analysis and deep learning.