- 01 Vectors and Vector Spaces: The Language of Data 02 Matrices and Matrix Operations: Organizing Linear Computation 03 Systems of Linear Equations: From Geometry to Algorithms 04 Determinants: The Volume Factor of Linear Maps 05 Linear Transformations: Matrices as Functions 06 Inner Products, Norms, and Orthogonality: Measuring Geometry 07 Eigenvalues and Eigenvectors: The DNA of a Matrix 08 Matrix Decompositions: Breaking Matrices into Simpler Pieces 09 Linear Algebra in Machine Learning: Putting It All Together 10 Matrix Calculus: Derivatives for Machine Learning 11 Tensor Operations: Beyond Matrices 12 Sparse Matrices and Efficient Computation 13 Randomized Linear Algebra: Speed Through Randomness
Introduction
Throughout this series, we have built the machinery of linear algebra piece by piece: vectors, matrices, linear systems, determinants, transformations, inner products, eigenvalues, and decompositions.
Now we bring it all together. This article walks through the major ML algorithms and shows exactly where linear algebra appears — not as abstract math, but as the computational engine that makes these algorithms work.
Linear Regression as a Projection
Linear regression is perhaps the purest application of linear algebra in ML. Given data matrix and targets , we seek that minimizes:
The Normal Equation
Setting the gradient to zero yields the normal equation:
This is a linear system where is the Gram matrix (symmetric, positive semi-definite).
The solution is an orthogonal projection: the predicted values are the projection of onto the column space of , and the residual is orthogonal to every column of .
Solving in Practice
| Method | When | Complexity |
|---|---|---|
| Cholesky on | is well-conditioned | |
| QR on | Better numerical stability | |
| SVD on | Rank-deficient or ill-conditioned | |
| Gradient descent | or very large | for iterations |
import numpy as np
X = np.random.randn(1000, 10)
y = X @ np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + np.random.randn(1000) * 0.1
# Method 1: Normal equation via Cholesky
L = np.linalg.cholesky(X.T @ X)
w_chol = np.linalg.solve(L @ L.T, X.T @ y)
# Method 2: QR decomposition
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)
# Method 3: SVD (pseudoinverse)
w_svd = np.linalg.lstsq(X, y, rcond=None)[0]
Regularization as Eigenvalue Shifting
Ridge regression adds to :
In terms of eigenvalues, if has eigenvalues , then has eigenvalues .
Key insight: Ridge regularization shifts all eigenvalues by , turning small eigenvalues into larger ones. This improves the condition number from to , stabilizing the solution and preventing overfitting.
PCA: The Eigenvalue Problem in Disguise
Principal Component Analysis finds the directions of maximum variance in data. Given centered data :
- Compute the covariance matrix
- Eigendecompose:
- The eigenvectors (columns of ) are principal components
- The eigenvalues (diagonal of ) are variances along each component
Equivalently, via SVD of :
The right singular vectors are the principal components, and gives the variance along the -th component.
Dimensionality Reduction
Project data onto the top components:
The fraction of variance retained:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95) # Keep 95% of variance
X_reduced = pca.fit_transform(X)
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")
Neural Networks: Layers of Linear Algebra
A fully connected neural network layer computes:
This is an affine transformation followed by a nonlinearity .
Weight Matrix Dimensions
For a network with layer widths :
- maps -dimensional inputs to -dimensional outputs
- The total parameters in layer : (weights + biases)
Backpropagation is Matrix Calculus
The gradient of the loss with respect to weights involves transposed weight matrices:
The chain rule through layers produces products of weight matrices (transposed). This is why eigenvalues matter for training: if the product has eigenvalues much larger or smaller than 1, gradients explode or vanish.
Weight Initialization
Good initialization ensures eigenvalues of weight matrices stay near 1:
| Method | Formula | When to Use |
|---|---|---|
| Xavier/Glorot | Sigmoid, tanh | |
| He/Kaiming | ReLU | |
| Orthogonal | (orthogonal matrix) | RNNs, deep networks |
Orthogonal initialization sets to a random orthogonal matrix, guaranteeing all singular values equal 1. This perfectly preserves gradient magnitudes through the layer.
Attention as Matrix Operations
The self-attention mechanism in Transformers is entirely built from matrix operations.
Given input (sequence of tokens, each -dimensional):
The attention output:
Breaking this down:
- Linear projections , , — matrix multiplications
- Similarity matrix — a matrix of dot products (inner products between queries and keys)
- Softmax — applied row-wise (nonlinear, but to a matrix)
- Weighted aggregation — another matrix multiplication
Key insight: The attention matrix is an matrix where entry measures how much token should attend to token . This is a generalized inner product — the same cosine similarity idea from the inner products article, lifted to learned representation spaces.
Embeddings and Similarity
Word embeddings (Word2Vec, GloVe) and modern contextual embeddings (BERT, GPT) map tokens to vectors in .
The core operations are all linear algebra:
- Similarity: Cosine similarity between embedding vectors
- Analogies: — vector arithmetic
- Retrieval: Nearest neighbor search in embedding space
- Embedding matrix: The lookup table is a matrix where row is the embedding of the -th token
Recommendation Systems: Matrix Factorization
Collaborative filtering models the user-item interaction matrix as a low-rank product:
where (user factors) and (item factors).
Each user is a -dimensional vector. Each item is a -dimensional vector. The predicted rating is their dot product — a direct application of inner products.
This is a truncated SVD problem: the best rank- approximation of (in the Frobenius norm) is given by keeping the top singular values and vectors.
Kernel Methods: Inner Products in Feature Space
Support Vector Machines and other kernel methods use the kernel trick: instead of computing features explicitly, compute inner products in a high-dimensional (possibly infinite-dimensional) feature space:
The kernel matrix (Gram matrix) with entries must be symmetric positive semi-definite — a direct consequence of being a matrix of inner products.
Common kernels:
| Kernel | Formula | Feature Space |
|---|---|---|
| Linear | Original space | |
| Polynomial | Polynomial features | |
| RBF (Gaussian) | Infinite-dimensional |
Graph Neural Networks: The Graph Laplacian
Graph-based ML uses the graph Laplacian matrix:
where is the adjacency matrix and is the degree matrix.
The eigenvalues of encode graph structure:
- The number of zero eigenvalues equals the number of connected components
- The second-smallest eigenvalue (Fiedler value) measures graph connectivity
- Eigenvectors of define spectral embeddings for clustering
Spectral clustering partitions data by:
- Building a similarity graph
- Computing eigenvectors of the graph Laplacian
- Clustering in the eigenvector space
Batch Normalization: Statistics as Linear Algebra
Batch normalization normalizes activations using batch statistics:
The mean and variance are computed from the batch — these are the first moments of the data, which are linear and quadratic operations on vectors.
The whitening generalization transforms the batch so that features are uncorrelated:
Computing requires eigendecomposition of the covariance matrix:
Computational Considerations
GPU Acceleration
GPUs are essentially parallel matrix multiplication engines. The reason deep learning became practical in the 2010s is that neural network training reduces to massive matrix multiplications, which GPUs handle orders of magnitude faster than CPUs.
| Operation | CPU | GPU |
|---|---|---|
| Matrix multiply () | ~seconds | ~milliseconds |
| Batch of matrix multiplies | Sequential | Parallel |
| Memory bandwidth | ~50 GB/s | ~900+ GB/s |
Numerical Stability
Linear algebra operations can amplify errors when matrices are ill-conditioned:
- Never compute — solve the system instead
- Use QR instead of the normal equation for least squares
- Add regularization () to improve conditioning
- Use mixed precision carefully — half-precision (FP16) matrix multiplication can lose accuracy
- Monitor condition numbers when debugging numerical issues
Summary
Linear algebra is not a prerequisite you learn and forget — it is the active computational substrate of ML:
- Linear regression is an orthogonal projection; the normal equation is a linear system.
- PCA is an eigenvalue problem on the covariance matrix (or SVD on the data matrix).
- Neural network layers are affine transformations; eigenvalues of weight matrices control gradient flow.
- Attention in Transformers is built from matrix multiplications and inner products.
- Embeddings live in vector spaces; similarity is measured by inner products.
- Recommendation systems factorize the user-item matrix (truncated SVD).
- Kernel methods compute inner products in high-dimensional feature spaces.
- Spectral methods use eigenvectors of graph Laplacians for clustering.
- Numerical stability requires choosing the right decomposition and avoiding explicit inverses.
The tools from this linear algebra series — vectors, matrices, projections, eigenvalues, and decompositions — are not just mathematical background. They are the language in which ML algorithms think, compute, and learn.
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
- Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.
- Boyd, S., & Vandenberghe, L. (2018). Introduction to Applied Linear Algebra. Cambridge University Press. vmls-book.stanford.edu
- Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762
- Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 42(8), 30-37.
- Stanford CS229 — Machine Learning. cs229.stanford.edu