Matrix Decompositions: Breaking Matrices into Simpler Pieces

Master SVD, LU, QR, Cholesky, and eigendecomposition — the factorizations that power dimensionality reduction, compression, and numerical algorithms.

Linear Algebra March 6, 2026 10 min read

Introduction

Matrix decompositions factor a matrix into a product of simpler, structured matrices. Just as prime factorization reveals the building blocks of an integer, matrix factorizations reveal the geometric and algebraic structure hidden inside a matrix.

In ML, decompositions are not just theoretical — they are computational workhorses. SVD powers PCA and recommendation systems. QR solves least squares stably. Cholesky enables efficient sampling from Gaussians. This article brings together ideas from eigenvalues and orthogonality to present the most important factorizations.

Overview of Key Decompositions

DecompositionFormRequirementsPrimary Use
EigendecompositionA=PDP1\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}Square, diagonalizableSpectral analysis, matrix powers
SVDA=UΣVT\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^TAny matrixPCA, compression, pseudoinverse
LUA=LU\mathbf{A} = \mathbf{L}\mathbf{U}SquareSolving linear systems
QRA=QR\mathbf{A} = \mathbf{Q}\mathbf{R}Any matrixLeast squares, eigenvalue algorithms
CholeskyA=LLT\mathbf{A} = \mathbf{L}\mathbf{L}^TSymmetric positive definiteEfficient solving, sampling

Singular Value Decomposition (SVD)

The singular value decomposition is the most important and versatile matrix decomposition. Every matrix — any shape, any rank — has an SVD.

Definition

For any ARm×n\mathbf{A} \in \mathbb{R}^{m \times n}:

A=UΣVT\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T

where:

  • URm×m\mathbf{U} \in \mathbb{R}^{m \times m} — orthogonal matrix of left singular vectors
  • ΣRm×n\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n} — diagonal matrix of singular values σ1σ2σr>0\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0
  • VRn×n\mathbf{V} \in \mathbb{R}^{n \times n} — orthogonal matrix of right singular vectors
  • r=rank(A)r = \text{rank}(\mathbf{A}) is the number of nonzero singular values
Σ=[σ10000σ20000σr00000]\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & \sigma_r & 0 \\ 0 & 0 & \cdots & 0 & 0 \\ \vdots & & & & \vdots \end{bmatrix}

Geometric Interpretation

The SVD says that every linear transformation can be decomposed into three steps:

  1. Rotate (by VT\mathbf{V}^T) — align with the input basis
  2. Scale (by Σ\boldsymbol{\Sigma}) — stretch/compress along each axis
  3. Rotate (by U\mathbf{U}) — align with the output basis

Any linear transformation, no matter how complex, is just a rotation, a scaling, and another rotation.

Key insight: The singular values σi\sigma_i are the “importance” of each dimension. Large singular values indicate directions with high variance or strong signal; small ones indicate noise or redundancy. This is why truncating small singular values gives compression and denoising.

Connection to Eigenvalues

The SVD is intimately related to eigendecomposition:

  • The singular values of A\mathbf{A} are the square roots of the eigenvalues of ATA\mathbf{A}^T\mathbf{A} (or AAT\mathbf{A}\mathbf{A}^T):
σi=λi(ATA)\sigma_i = \sqrt{\lambda_i(\mathbf{A}^T\mathbf{A})}
  • The right singular vectors V\mathbf{V} are eigenvectors of ATA\mathbf{A}^T\mathbf{A}
  • The left singular vectors U\mathbf{U} are eigenvectors of AAT\mathbf{A}\mathbf{A}^T

For a symmetric matrix, SVD and eigendecomposition coincide (up to signs), with singular values being the absolute values of eigenvalues.

Compact and Truncated SVD

The compact SVD keeps only the rr nonzero singular values:

A=UrΣrVrT\mathbf{A} = \mathbf{U}_r \boldsymbol{\Sigma}_r \mathbf{V}_r^T

where UrRm×r\mathbf{U}_r \in \mathbb{R}^{m \times r}, ΣrRr×r\boldsymbol{\Sigma}_r \in \mathbb{R}^{r \times r}, VrRn×r\mathbf{V}_r \in \mathbb{R}^{n \times r}.

The truncated SVD keeps only the top k<rk < r singular values:

Ak=UkΣkVkT\mathbf{A}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^T

This Ak\mathbf{A}_k is the best rank-kk approximation to A\mathbf{A} in both the Frobenius norm and the spectral norm. This is the Eckart-Young-Mirsky theorem.

The approximation error is:

AAkF=σk+12++σr2\|\mathbf{A} - \mathbf{A}_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}

Key insight: This is why PCA works. The data matrix (centered) has an SVD. Keeping the top kk singular values/vectors gives the best kk-dimensional representation of the data — the one that preserves the most variance.

SVD in Practice

import numpy as np

# Random data matrix: 100 samples, 5 features
np.random.seed(42)
X = np.random.randn(100, 5)

# Full SVD
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

print(f"U shape: {U.shape}")       # (100, 5)
print(f"Sigma: {sigma}")           # 5 singular values
print(f"Vt shape: {Vt.shape}")     # (5, 5)

# Reconstruct with top 3 components
k = 3
X_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
error = np.linalg.norm(X - X_approx, 'fro')
total = np.linalg.norm(X, 'fro')
print(f"Relative error: {error/total:.4f}")

The Pseudoinverse

The SVD gives us the Moore-Penrose pseudoinverse A+\mathbf{A}^+, which generalizes the inverse to non-square and singular matrices:

A+=VΣ+UT\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^T

where Σ+\boldsymbol{\Sigma}^+ is formed by reciprocating the nonzero singular values:

σi+={1/σiif σi>00if σi=0\sigma_i^+ = \begin{cases} 1/\sigma_i & \text{if } \sigma_i > 0 \\ 0 & \text{if } \sigma_i = 0 \end{cases}

The pseudoinverse gives the least squares solution: x^=A+b\hat{\mathbf{x}} = \mathbf{A}^+\mathbf{b} minimizes Axb2\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2, and among all minimizers, it has the smallest x2\|\mathbf{x}\|_2.

LU Decomposition

LU decomposition factors a square matrix into lower and upper triangular factors:

A=LU\mathbf{A} = \mathbf{L}\mathbf{U}

In practice, row permutations are often needed for numerical stability:

PA=LU\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}

where P\mathbf{P} is a permutation matrix (partial pivoting).

When to Use LU

LU is the standard method for solving square linear systems Ax=b\mathbf{A}\mathbf{x} = \mathbf{b}:

  1. Factor A=LU\mathbf{A} = \mathbf{L}\mathbf{U} once (O(n3)O(n^3))
  2. Solve Ly=b\mathbf{L}\mathbf{y} = \mathbf{b} via forward substitution (O(n2)O(n^2))
  3. Solve Ux=y\mathbf{U}\mathbf{x} = \mathbf{y} via back substitution (O(n2)O(n^2))

If you need to solve for many different b\mathbf{b} vectors, the factorization is computed once and reused.

from scipy.linalg import lu_factor, lu_solve

A = np.array([[2, 1, 1],
              [4, 3, 3],
              [8, 7, 9]], dtype=float)

# Factor once
lu, piv = lu_factor(A)

# Solve for multiple right-hand sides
b1 = np.array([4, 10, 24])
b2 = np.array([1, 2, 3])

x1 = lu_solve((lu, piv), b1)
x2 = lu_solve((lu, piv), b2)

QR Decomposition

QR decomposition factors any m×nm \times n matrix (mnm \geq n) into:

A=QR\mathbf{A} = \mathbf{Q}\mathbf{R}

where QRm×n\mathbf{Q} \in \mathbb{R}^{m \times n} has orthonormal columns and RRn×n\mathbf{R} \in \mathbb{R}^{n \times n} is upper triangular.

QR for Least Squares

QR provides the most numerically stable method for solving least squares:

Axb    QRx=b    Rx=QTb\mathbf{A}\mathbf{x} \approx \mathbf{b} \implies \mathbf{Q}\mathbf{R}\mathbf{x} = \mathbf{b} \implies \mathbf{R}\mathbf{x} = \mathbf{Q}^T\mathbf{b}

The last equation is a triangular system — easy and stable to solve. No need to form ATA\mathbf{A}^T\mathbf{A} (which squares the condition number).

QR Methods

MethodCostStabilityNotes
Classical Gram-SchmidtO(mn2)O(mn^2)PoorLoses orthogonality for ill-conditioned A\mathbf{A}
Modified Gram-SchmidtO(mn2)O(mn^2)GoodRecomputes projections at each step
Householder reflectionsO(2mn22n3/3)O(2mn^2 - 2n^3/3)ExcellentStandard in LAPACK; most commonly used
Givens rotationsO(mn2)O(mn^2)ExcellentGood for sparse matrices
Q, R = np.linalg.qr(X)

# Solve least squares via QR
b = np.random.randn(100)
x_ls = np.linalg.solve(R, Q.T @ b)

Cholesky Decomposition

For a symmetric positive definite matrix A\mathbf{A}, the Cholesky decomposition gives:

A=LLT\mathbf{A} = \mathbf{L}\mathbf{L}^T

where L\mathbf{L} is lower triangular with positive diagonal entries. This is unique.

Why Cholesky Is Special

  • Twice as fast as LU (exploits symmetry): O(n3/3)O(n^3/3) vs. O(2n3/3)O(2n^3/3)
  • Numerically stable without pivoting
  • Existence test: If the decomposition fails, A\mathbf{A} is not positive definite

Applications in ML

Cholesky is the workhorse for symmetric positive definite systems:

  • Normal equation: (XTX)w=XTy(\mathbf{X}^T\mathbf{X})\mathbf{w} = \mathbf{X}^T\mathbf{y} — the matrix XTX\mathbf{X}^T\mathbf{X} is SPD (assuming full rank)
  • Gaussian processes: Kernel matrix K\mathbf{K} is SPD; Cholesky solves the prediction equations
  • Sampling from multivariate Gaussians: To sample xN(μ,Σ)\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), compute L\mathbf{L} where Σ=LLT\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T, then x=μ+Lz\mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z} where zN(0,I)\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
# Sampling from a multivariate Gaussian
mean = np.array([1, 2])
cov = np.array([[2, 1],
                [1, 3]])

L = np.linalg.cholesky(cov)
z = np.random.randn(2)
sample = mean + L @ z

Key insight: Cholesky is the preferred method whenever you have an SPD matrix. In Gaussian processes, it is the dominant computational cost — factoring an n×nn \times n kernel matrix costs O(n3)O(n^3), which limits GP scalability.

Eigendecomposition vs. SVD

FeatureEigendecompositionSVD
Applies toSquare matricesAny matrix
FactorsPDP1\mathbf{P}\mathbf{D}\mathbf{P}^{-1}UΣVT\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T
Orthogonal factors?Only if symmetricAlways
Values can be negative?Yes (eigenvalues)No (singular values 0\geq 0)
ExistenceNot always (non-diagonalizable)Always exists
Primary ML usePCA (via covariance), spectral methodsPCA (direct), compression, pseudoinverse

For PCA, you can either:

  1. Compute the covariance matrix and eigendecompose it
  2. Compute the SVD of the centered data matrix directly

Option 2 is more numerically stable and avoids forming XTX\mathbf{X}^T\mathbf{X}.

Low-Rank Approximation

Many real-world matrices are approximately low-rank — a few singular values dominate while the rest are small. The truncated SVD exploits this:

AAk=i=1kσiuiviT\mathbf{A} \approx \mathbf{A}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T

This outer product form shows that a rank-kk matrix is a sum of kk rank-1 matrices, each weighted by a singular value.

Storage savings: Instead of storing mnmn entries, a rank-kk approximation needs only (m+n)k+k(m + n)k + k entries.

Applications

  • Image compression: Keep the top kk singular values of the pixel matrix
  • Recommendation systems: Approximate the user-item matrix as low-rank (matrix factorization)
  • Latent Semantic Analysis: Truncated SVD of the term-document matrix reveals latent topics
  • Denoising: Small singular values often correspond to noise; truncating them cleans the signal
from sklearn.decomposition import TruncatedSVD

# Dimensionality reduction via truncated SVD
svd = TruncatedSVD(n_components=50)
X_reduced = svd.fit_transform(X_sparse)

# Variance explained
print(f"Explained variance: {svd.explained_variance_ratio_.sum():.2%}")

Worked Example: PCA via SVD

Perform PCA on 2D data using SVD.

import numpy as np

# Generate correlated 2D data
np.random.seed(0)
X = np.random.randn(200, 2) @ np.array([[2, 1], [1, 3]])

# Center the data
X_centered = X - X.mean(axis=0)

# SVD
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Principal components are rows of Vt
print("PC1 direction:", Vt[0])
print("PC2 direction:", Vt[1])

# Variance explained by each component
variance_explained = sigma**2 / (sigma**2).sum()
print(f"PC1 explains {variance_explained[0]:.1%} of variance")
print(f"PC2 explains {variance_explained[1]:.1%} of variance")

# Project onto first principal component
X_1d = X_centered @ Vt[0]

Choosing the Right Decomposition

SituationBest DecompositionReason
Solve Ax=b\mathbf{A}\mathbf{x} = \mathbf{b} (square)LUFast, general
Solve Ax=b\mathbf{A}\mathbf{x} = \mathbf{b} (SPD)Cholesky2x faster, stable
Least squares Axb\mathbf{A}\mathbf{x} \approx \mathbf{b}QRNumerically stable
Dimensionality reductionSVD / EigendecompositionOptimal low-rank approximation
Spectral analysisEigendecompositionReveals intrinsic structure
Matrix compressionTruncated SVDBest rank-kk approximation
Sampling from GaussianCholeskyEfficient and stable
PseudoinverseSVDHandles any matrix

Summary

  • SVD (A=UΣVT\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T) works for any matrix and reveals its geometric structure as rotation-scaling-rotation.
  • Truncated SVD gives the optimal low-rank approximation, powering PCA, compression, and denoising.
  • LU factors a square matrix for efficient linear system solving.
  • QR provides numerically stable least squares via orthogonal factorization.
  • Cholesky (A=LLT\mathbf{A} = \mathbf{L}\mathbf{L}^T) is the fastest decomposition for symmetric positive definite matrices.
  • The pseudoinverse A+=VΣ+UT\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^T generalizes the inverse via SVD.
  • In ML, the right decomposition choice determines both correctness and computational efficiency.
  • Next, we bring everything together in linear algebra in machine learning, connecting all these tools to real ML algorithms.

References

  • Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. math.mit.edu/~gs/linearalgebra
  • Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.
  • Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 2. MIT Press. deeplearningbook.org

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay