Matrix Decompositions: Breaking Matrices into Simpler Pieces

Linear Algebra Series 8 / 13

Introduction

Matrix decompositions factor a matrix into a product of simpler, structured matrices. Just as prime factorization reveals the building blocks of an integer, matrix factorizations reveal the geometric and algebraic structure hidden inside a matrix.

In ML, decompositions are not just theoretical — they are computational workhorses. SVD powers PCA and recommendation systems. QR solves least squares stably. Cholesky enables efficient sampling from Gaussians. This article brings together ideas from eigenvalues and orthogonality to present the most important factorizations.

Overview of Key Decompositions

Decomposition	Form	Requirements	Primary Use
Eigendecomposition	$\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}$	Square, diagonalizable	Spectral analysis, matrix powers
SVD	$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$	Any matrix	PCA, compression, pseudoinverse
LU	$\mathbf{A} = \mathbf{L}\mathbf{U}$	Square	Solving linear systems
QR	$\mathbf{A} = \mathbf{Q}\mathbf{R}$	Any matrix	Least squares, eigenvalue algorithms
Cholesky	$\mathbf{A} = \mathbf{L}\mathbf{L}^T$	Symmetric positive definite	Efficient solving, sampling

Singular Value Decomposition (SVD)

The singular value decomposition is the most important and versatile matrix decomposition. Every matrix — any shape, any rank — has an SVD.

Definition

For any $\mathbf{A} \in \mathbb{R}^{m \times n}$ :

\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T

where:

$\mathbf{U} \in \mathbb{R}^{m \times m}$ — orthogonal matrix of left singular vectors
$\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}$ — diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$
$\mathbf{V} \in \mathbb{R}^{n \times n}$ — orthogonal matrix of right singular vectors
$r = \text{rank}(\mathbf{A})$ is the number of nonzero singular values

\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & \sigma_r & 0 \\ 0 & 0 & \cdots & 0 & 0 \\ \vdots & & & & \vdots \end{bmatrix}

Geometric Interpretation

The SVD says that every linear transformation can be decomposed into three steps:

Rotate (by $\mathbf{V}^T$ ) — align with the input basis
Scale (by $\boldsymbol{\Sigma}$ ) — stretch/compress along each axis
Rotate (by $\mathbf{U}$ ) — align with the output basis

Any linear transformation, no matter how complex, is just a rotation, a scaling, and another rotation.

Key insight: The singular values $\sigma_i$ are the “importance” of each dimension. Large singular values indicate directions with high variance or strong signal; small ones indicate noise or redundancy. This is why truncating small singular values gives compression and denoising.

Connection to Eigenvalues

The SVD is intimately related to eigendecomposition:

The singular values of $\mathbf{A}$ are the square roots of the eigenvalues of $\mathbf{A}^T\mathbf{A}$ (or $\mathbf{A}\mathbf{A}^T$ ):

\sigma_i = \sqrt{\lambda_i(\mathbf{A}^T\mathbf{A})}

The right singular vectors $\mathbf{V}$ are eigenvectors of $\mathbf{A}^T\mathbf{A}$
The left singular vectors $\mathbf{U}$ are eigenvectors of $\mathbf{A}\mathbf{A}^T$

For a symmetric matrix, SVD and eigendecomposition coincide (up to signs), with singular values being the absolute values of eigenvalues.

Compact and Truncated SVD

The compact SVD keeps only the $r$ nonzero singular values:

\mathbf{A} = \mathbf{U}_r \boldsymbol{\Sigma}_r \mathbf{V}_r^T

where $\mathbf{U}_r \in \mathbb{R}^{m \times r}$ , $\boldsymbol{\Sigma}_r \in \mathbb{R}^{r \times r}$ , $\mathbf{V}_r \in \mathbb{R}^{n \times r}$ .

The truncated SVD keeps only the top $k < r$ singular values:

\mathbf{A}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^T

This $\mathbf{A}_k$ is the best rank- $k$ approximation to $\mathbf{A}$ in both the Frobenius norm and the spectral norm. This is the Eckart-Young-Mirsky theorem.

The approximation error is:

\|\mathbf{A} - \mathbf{A}_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}

Key insight: This is why PCA works. The data matrix (centered) has an SVD. Keeping the top $k$ singular values/vectors gives the best $k$ -dimensional representation of the data — the one that preserves the most variance.

SVD in Practice

import numpy as np

# Random data matrix: 100 samples, 5 features
np.random.seed(42)
X = np.random.randn(100, 5)

# Full SVD
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

print(f"U shape: {U.shape}")       # (100, 5)
print(f"Sigma: {sigma}")           # 5 singular values
print(f"Vt shape: {Vt.shape}")     # (5, 5)

# Reconstruct with top 3 components
k = 3
X_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
error = np.linalg.norm(X - X_approx, 'fro')
total = np.linalg.norm(X, 'fro')
print(f"Relative error: {error/total:.4f}")

The Pseudoinverse

The SVD gives us the Moore-Penrose pseudoinverse $\mathbf{A}^+$ , which generalizes the inverse to non-square and singular matrices:

\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^T

where $\boldsymbol{\Sigma}^+$ is formed by reciprocating the nonzero singular values:

\sigma_i^+ = \begin{cases} 1/\sigma_i & \text{if } \sigma_i > 0 \\ 0 & \text{if } \sigma_i = 0 \end{cases}

The pseudoinverse gives the least squares solution: $\hat{\mathbf{x}} = \mathbf{A}^+\mathbf{b}$ minimizes $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2$ , and among all minimizers, it has the smallest $\|\mathbf{x}\|_2$ .

LU Decomposition

LU decomposition factors a square matrix into lower and upper triangular factors:

\mathbf{A} = \mathbf{L}\mathbf{U}

In practice, row permutations are often needed for numerical stability:

\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}

where $\mathbf{P}$ is a permutation matrix (partial pivoting).

When to Use LU

LU is the standard method for solving square linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$ :

Factor $\mathbf{A} = \mathbf{L}\mathbf{U}$ once ( $O(n^3)$ )
Solve $\mathbf{L}\mathbf{y} = \mathbf{b}$ via forward substitution ( $O(n^2)$ )
Solve $\mathbf{U}\mathbf{x} = \mathbf{y}$ via back substitution ( $O(n^2)$ )

If you need to solve for many different $\mathbf{b}$ vectors, the factorization is computed once and reused.

from scipy.linalg import lu_factor, lu_solve

A = np.array([[2, 1, 1],
              [4, 3, 3],
              [8, 7, 9]], dtype=float)

# Factor once
lu, piv = lu_factor(A)

# Solve for multiple right-hand sides
b1 = np.array([4, 10, 24])
b2 = np.array([1, 2, 3])

x1 = lu_solve((lu, piv), b1)
x2 = lu_solve((lu, piv), b2)

QR Decomposition

QR decomposition factors any $m \times n$ matrix ( $m \geq n$ ) into:

\mathbf{A} = \mathbf{Q}\mathbf{R}

where $\mathbf{Q} \in \mathbb{R}^{m \times n}$ has orthonormal columns and $\mathbf{R} \in \mathbb{R}^{n \times n}$ is upper triangular.

QR for Least Squares

QR provides the most numerically stable method for solving least squares:

\mathbf{A}\mathbf{x} \approx \mathbf{b} \implies \mathbf{Q}\mathbf{R}\mathbf{x} = \mathbf{b} \implies \mathbf{R}\mathbf{x} = \mathbf{Q}^T\mathbf{b}

The last equation is a triangular system — easy and stable to solve. No need to form $\mathbf{A}^T\mathbf{A}$ (which squares the condition number).

QR Methods

Method	Cost	Stability	Notes
Classical Gram-Schmidt	$O(mn^2)$	Poor	Loses orthogonality for ill-conditioned $\mathbf{A}$
Modified Gram-Schmidt	$O(mn^2)$	Good	Recomputes projections at each step
Householder reflections	$O(2mn^2 - 2n^3/3)$	Excellent	Standard in LAPACK; most commonly used
Givens rotations	$O(mn^2)$	Excellent	Good for sparse matrices

Q, R = np.linalg.qr(X)

# Solve least squares via QR
b = np.random.randn(100)
x_ls = np.linalg.solve(R, Q.T @ b)

Cholesky Decomposition

For a symmetric positive definite matrix $\mathbf{A}$ , the Cholesky decomposition gives:

\mathbf{A} = \mathbf{L}\mathbf{L}^T

where $\mathbf{L}$ is lower triangular with positive diagonal entries. This is unique.

Why Cholesky Is Special

Twice as fast as LU (exploits symmetry): $O(n^3/3)$ vs. $O(2n^3/3)$
Numerically stable without pivoting
Existence test: If the decomposition fails, $\mathbf{A}$ is not positive definite

Applications in ML

Cholesky is the workhorse for symmetric positive definite systems:

Normal equation: $(\mathbf{X}^T\mathbf{X})\mathbf{w} = \mathbf{X}^T\mathbf{y}$ — the matrix $\mathbf{X}^T\mathbf{X}$ is SPD (assuming full rank)
Gaussian processes: Kernel matrix $\mathbf{K}$ is SPD; Cholesky solves the prediction equations
Sampling from multivariate Gaussians: To sample $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ , compute $\mathbf{L}$ where $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T$ , then $\mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}$ where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$

# Sampling from a multivariate Gaussian
mean = np.array([1, 2])
cov = np.array([[2, 1],
                [1, 3]])

L = np.linalg.cholesky(cov)
z = np.random.randn(2)
sample = mean + L @ z

Key insight: Cholesky is the preferred method whenever you have an SPD matrix. In Gaussian processes, it is the dominant computational cost — factoring an $n \times n$ kernel matrix costs $O(n^3)$ , which limits GP scalability.

Eigendecomposition vs. SVD

Feature	Eigendecomposition	SVD
Applies to	Square matrices	Any matrix
Factors	$\mathbf{P}\mathbf{D}\mathbf{P}^{-1}$	$\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$
Orthogonal factors?	Only if symmetric	Always
Values can be negative?	Yes (eigenvalues)	No (singular values $\geq 0$ )
Existence	Not always (non-diagonalizable)	Always exists
Primary ML use	PCA (via covariance), spectral methods	PCA (direct), compression, pseudoinverse

For PCA, you can either:

Compute the covariance matrix and eigendecompose it
Compute the SVD of the centered data matrix directly

Option 2 is more numerically stable and avoids forming $\mathbf{X}^T\mathbf{X}$ .

Low-Rank Approximation

Many real-world matrices are approximately low-rank — a few singular values dominate while the rest are small. The truncated SVD exploits this:

\mathbf{A} \approx \mathbf{A}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T

This outer product form shows that a rank- $k$ matrix is a sum of $k$ rank-1 matrices, each weighted by a singular value.

Storage savings: Instead of storing $mn$ entries, a rank- $k$ approximation needs only $(m + n)k + k$ entries.

Applications

Image compression: Keep the top $k$ singular values of the pixel matrix
Recommendation systems: Approximate the user-item matrix as low-rank (matrix factorization)
Latent Semantic Analysis: Truncated SVD of the term-document matrix reveals latent topics
Denoising: Small singular values often correspond to noise; truncating them cleans the signal

from sklearn.decomposition import TruncatedSVD

# Dimensionality reduction via truncated SVD
svd = TruncatedSVD(n_components=50)
X_reduced = svd.fit_transform(X_sparse)

# Variance explained
print(f"Explained variance: {svd.explained_variance_ratio_.sum():.2%}")

Worked Example: PCA via SVD

Perform PCA on 2D data using SVD.

import numpy as np

# Generate correlated 2D data
np.random.seed(0)
X = np.random.randn(200, 2) @ np.array([[2, 1], [1, 3]])

# Center the data
X_centered = X - X.mean(axis=0)

# SVD
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Principal components are rows of Vt
print("PC1 direction:", Vt[0])
print("PC2 direction:", Vt[1])

# Variance explained by each component
variance_explained = sigma**2 / (sigma**2).sum()
print(f"PC1 explains {variance_explained[0]:.1%} of variance")
print(f"PC2 explains {variance_explained[1]:.1%} of variance")

# Project onto first principal component
X_1d = X_centered @ Vt[0]

Choosing the Right Decomposition

Situation	Best Decomposition	Reason
Solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ (square)	LU	Fast, general
Solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ (SPD)	Cholesky	2x faster, stable
Least squares $\mathbf{A}\mathbf{x} \approx \mathbf{b}$	QR	Numerically stable
Dimensionality reduction	SVD / Eigendecomposition	Optimal low-rank approximation
Spectral analysis	Eigendecomposition	Reveals intrinsic structure
Matrix compression	Truncated SVD	Best rank- $k$ approximation
Sampling from Gaussian	Cholesky	Efficient and stable
Pseudoinverse	SVD	Handles any matrix

Summary

SVD ( $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ ) works for any matrix and reveals its geometric structure as rotation-scaling-rotation.
Truncated SVD gives the optimal low-rank approximation, powering PCA, compression, and denoising.
LU factors a square matrix for efficient linear system solving.
QR provides numerically stable least squares via orthogonal factorization.
Cholesky ( $\mathbf{A} = \mathbf{L}\mathbf{L}^T$ ) is the fastest decomposition for symmetric positive definite matrices.
The pseudoinverse $\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^T$ generalizes the inverse via SVD.
In ML, the right decomposition choice determines both correctness and computational efficiency.
Next, we bring everything together in linear algebra in machine learning, connecting all these tools to real ML algorithms.

References

Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. math.mit.edu/~gs/linearalgebra
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.
Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 2. MIT Press. deeplearningbook.org