- 01 Vectors and Vector Spaces: The Language of Data 02 Matrices and Matrix Operations: Organizing Linear Computation 03 Systems of Linear Equations: From Geometry to Algorithms 04 Determinants: The Volume Factor of Linear Maps 05 Linear Transformations: Matrices as Functions 06 Inner Products, Norms, and Orthogonality: Measuring Geometry 07 Eigenvalues and Eigenvectors: The DNA of a Matrix 08 Matrix Decompositions: Breaking Matrices into Simpler Pieces 09 Linear Algebra in Machine Learning: Putting It All Together 10 Matrix Calculus: Derivatives for Machine Learning 11 Tensor Operations: Beyond Matrices 12 Sparse Matrices and Efficient Computation 13 Randomized Linear Algebra: Speed Through Randomness
Introduction
Matrix decompositions factor a matrix into a product of simpler, structured matrices. Just as prime factorization reveals the building blocks of an integer, matrix factorizations reveal the geometric and algebraic structure hidden inside a matrix.
In ML, decompositions are not just theoretical — they are computational workhorses. SVD powers PCA and recommendation systems. QR solves least squares stably. Cholesky enables efficient sampling from Gaussians. This article brings together ideas from eigenvalues and orthogonality to present the most important factorizations.
Overview of Key Decompositions
| Decomposition | Form | Requirements | Primary Use |
|---|---|---|---|
| Eigendecomposition | Square, diagonalizable | Spectral analysis, matrix powers | |
| SVD | Any matrix | PCA, compression, pseudoinverse | |
| LU | Square | Solving linear systems | |
| QR | Any matrix | Least squares, eigenvalue algorithms | |
| Cholesky | Symmetric positive definite | Efficient solving, sampling |
Singular Value Decomposition (SVD)
The singular value decomposition is the most important and versatile matrix decomposition. Every matrix — any shape, any rank — has an SVD.
Definition
For any :
where:
- — orthogonal matrix of left singular vectors
- — diagonal matrix of singular values
- — orthogonal matrix of right singular vectors
- is the number of nonzero singular values
Geometric Interpretation
The SVD says that every linear transformation can be decomposed into three steps:
- Rotate (by ) — align with the input basis
- Scale (by ) — stretch/compress along each axis
- Rotate (by ) — align with the output basis
Any linear transformation, no matter how complex, is just a rotation, a scaling, and another rotation.
Key insight: The singular values are the “importance” of each dimension. Large singular values indicate directions with high variance or strong signal; small ones indicate noise or redundancy. This is why truncating small singular values gives compression and denoising.
Connection to Eigenvalues
The SVD is intimately related to eigendecomposition:
- The singular values of are the square roots of the eigenvalues of (or ):
- The right singular vectors are eigenvectors of
- The left singular vectors are eigenvectors of
For a symmetric matrix, SVD and eigendecomposition coincide (up to signs), with singular values being the absolute values of eigenvalues.
Compact and Truncated SVD
The compact SVD keeps only the nonzero singular values:
where , , .
The truncated SVD keeps only the top singular values:
This is the best rank- approximation to in both the Frobenius norm and the spectral norm. This is the Eckart-Young-Mirsky theorem.
The approximation error is:
Key insight: This is why PCA works. The data matrix (centered) has an SVD. Keeping the top singular values/vectors gives the best -dimensional representation of the data — the one that preserves the most variance.
SVD in Practice
import numpy as np
# Random data matrix: 100 samples, 5 features
np.random.seed(42)
X = np.random.randn(100, 5)
# Full SVD
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
print(f"U shape: {U.shape}") # (100, 5)
print(f"Sigma: {sigma}") # 5 singular values
print(f"Vt shape: {Vt.shape}") # (5, 5)
# Reconstruct with top 3 components
k = 3
X_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
error = np.linalg.norm(X - X_approx, 'fro')
total = np.linalg.norm(X, 'fro')
print(f"Relative error: {error/total:.4f}")
The Pseudoinverse
The SVD gives us the Moore-Penrose pseudoinverse , which generalizes the inverse to non-square and singular matrices:
where is formed by reciprocating the nonzero singular values:
The pseudoinverse gives the least squares solution: minimizes , and among all minimizers, it has the smallest .
LU Decomposition
LU decomposition factors a square matrix into lower and upper triangular factors:
In practice, row permutations are often needed for numerical stability:
where is a permutation matrix (partial pivoting).
When to Use LU
LU is the standard method for solving square linear systems :
- Factor once ()
- Solve via forward substitution ()
- Solve via back substitution ()
If you need to solve for many different vectors, the factorization is computed once and reused.
from scipy.linalg import lu_factor, lu_solve
A = np.array([[2, 1, 1],
[4, 3, 3],
[8, 7, 9]], dtype=float)
# Factor once
lu, piv = lu_factor(A)
# Solve for multiple right-hand sides
b1 = np.array([4, 10, 24])
b2 = np.array([1, 2, 3])
x1 = lu_solve((lu, piv), b1)
x2 = lu_solve((lu, piv), b2)
QR Decomposition
QR decomposition factors any matrix () into:
where has orthonormal columns and is upper triangular.
QR for Least Squares
QR provides the most numerically stable method for solving least squares:
The last equation is a triangular system — easy and stable to solve. No need to form (which squares the condition number).
QR Methods
| Method | Cost | Stability | Notes |
|---|---|---|---|
| Classical Gram-Schmidt | Poor | Loses orthogonality for ill-conditioned | |
| Modified Gram-Schmidt | Good | Recomputes projections at each step | |
| Householder reflections | Excellent | Standard in LAPACK; most commonly used | |
| Givens rotations | Excellent | Good for sparse matrices |
Q, R = np.linalg.qr(X)
# Solve least squares via QR
b = np.random.randn(100)
x_ls = np.linalg.solve(R, Q.T @ b)
Cholesky Decomposition
For a symmetric positive definite matrix , the Cholesky decomposition gives:
where is lower triangular with positive diagonal entries. This is unique.
Why Cholesky Is Special
- Twice as fast as LU (exploits symmetry): vs.
- Numerically stable without pivoting
- Existence test: If the decomposition fails, is not positive definite
Applications in ML
Cholesky is the workhorse for symmetric positive definite systems:
- Normal equation: — the matrix is SPD (assuming full rank)
- Gaussian processes: Kernel matrix is SPD; Cholesky solves the prediction equations
- Sampling from multivariate Gaussians: To sample , compute where , then where
# Sampling from a multivariate Gaussian
mean = np.array([1, 2])
cov = np.array([[2, 1],
[1, 3]])
L = np.linalg.cholesky(cov)
z = np.random.randn(2)
sample = mean + L @ z
Key insight: Cholesky is the preferred method whenever you have an SPD matrix. In Gaussian processes, it is the dominant computational cost — factoring an kernel matrix costs , which limits GP scalability.
Eigendecomposition vs. SVD
| Feature | Eigendecomposition | SVD |
|---|---|---|
| Applies to | Square matrices | Any matrix |
| Factors | ||
| Orthogonal factors? | Only if symmetric | Always |
| Values can be negative? | Yes (eigenvalues) | No (singular values ) |
| Existence | Not always (non-diagonalizable) | Always exists |
| Primary ML use | PCA (via covariance), spectral methods | PCA (direct), compression, pseudoinverse |
For PCA, you can either:
- Compute the covariance matrix and eigendecompose it
- Compute the SVD of the centered data matrix directly
Option 2 is more numerically stable and avoids forming .
Low-Rank Approximation
Many real-world matrices are approximately low-rank — a few singular values dominate while the rest are small. The truncated SVD exploits this:
This outer product form shows that a rank- matrix is a sum of rank-1 matrices, each weighted by a singular value.
Storage savings: Instead of storing entries, a rank- approximation needs only entries.
Applications
- Image compression: Keep the top singular values of the pixel matrix
- Recommendation systems: Approximate the user-item matrix as low-rank (matrix factorization)
- Latent Semantic Analysis: Truncated SVD of the term-document matrix reveals latent topics
- Denoising: Small singular values often correspond to noise; truncating them cleans the signal
from sklearn.decomposition import TruncatedSVD
# Dimensionality reduction via truncated SVD
svd = TruncatedSVD(n_components=50)
X_reduced = svd.fit_transform(X_sparse)
# Variance explained
print(f"Explained variance: {svd.explained_variance_ratio_.sum():.2%}")
Worked Example: PCA via SVD
Perform PCA on 2D data using SVD.
import numpy as np
# Generate correlated 2D data
np.random.seed(0)
X = np.random.randn(200, 2) @ np.array([[2, 1], [1, 3]])
# Center the data
X_centered = X - X.mean(axis=0)
# SVD
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Principal components are rows of Vt
print("PC1 direction:", Vt[0])
print("PC2 direction:", Vt[1])
# Variance explained by each component
variance_explained = sigma**2 / (sigma**2).sum()
print(f"PC1 explains {variance_explained[0]:.1%} of variance")
print(f"PC2 explains {variance_explained[1]:.1%} of variance")
# Project onto first principal component
X_1d = X_centered @ Vt[0]
Choosing the Right Decomposition
| Situation | Best Decomposition | Reason |
|---|---|---|
| Solve (square) | LU | Fast, general |
| Solve (SPD) | Cholesky | 2x faster, stable |
| Least squares | QR | Numerically stable |
| Dimensionality reduction | SVD / Eigendecomposition | Optimal low-rank approximation |
| Spectral analysis | Eigendecomposition | Reveals intrinsic structure |
| Matrix compression | Truncated SVD | Best rank- approximation |
| Sampling from Gaussian | Cholesky | Efficient and stable |
| Pseudoinverse | SVD | Handles any matrix |
Summary
- SVD () works for any matrix and reveals its geometric structure as rotation-scaling-rotation.
- Truncated SVD gives the optimal low-rank approximation, powering PCA, compression, and denoising.
- LU factors a square matrix for efficient linear system solving.
- QR provides numerically stable least squares via orthogonal factorization.
- Cholesky () is the fastest decomposition for symmetric positive definite matrices.
- The pseudoinverse generalizes the inverse via SVD.
- In ML, the right decomposition choice determines both correctness and computational efficiency.
- Next, we bring everything together in linear algebra in machine learning, connecting all these tools to real ML algorithms.
References
- Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. math.mit.edu/~gs/linearalgebra
- Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.
- Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 2. MIT Press. deeplearningbook.org