Linear Algebra in Machine Learning: Putting It All Together

See how vectors, matrices, eigenvalues, and decompositions drive real ML algorithms — from linear regression to Transformers and beyond.

Linear Algebra March 6, 2026 9 min read

Introduction

Throughout this series, we have built the machinery of linear algebra piece by piece: vectors, matrices, linear systems, determinants, transformations, inner products, eigenvalues, and decompositions.

Now we bring it all together. This article walks through the major ML algorithms and shows exactly where linear algebra appears — not as abstract math, but as the computational engine that makes these algorithms work.

Linear Regression as a Projection

Linear regression is perhaps the purest application of linear algebra in ML. Given data matrix XRm×n\mathbf{X} \in \mathbb{R}^{m \times n} and targets yRm\mathbf{y} \in \mathbb{R}^m, we seek w^\hat{\mathbf{w}} that minimizes:

Xwy22\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2

The Normal Equation

Setting the gradient to zero yields the normal equation:

XTXw^=XTy\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}} = \mathbf{X}^T\mathbf{y}

This is a linear system where XTX\mathbf{X}^T\mathbf{X} is the Gram matrix (symmetric, positive semi-definite).

The solution w^=(XTX)1XTy\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} is an orthogonal projection: the predicted values y^=Xw^\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}} are the projection of y\mathbf{y} onto the column space of X\mathbf{X}, and the residual yy^\mathbf{y} - \hat{\mathbf{y}} is orthogonal to every column of X\mathbf{X}.

Solving in Practice

MethodWhenComplexity
Cholesky on XTX\mathbf{X}^T\mathbf{X}XTX\mathbf{X}^T\mathbf{X} is well-conditionedO(mn2+n3/3)O(mn^2 + n^3/3)
QR on X\mathbf{X}Better numerical stabilityO(mn2)O(mn^2)
SVD on X\mathbf{X}Rank-deficient or ill-conditionedO(mn2)O(mn^2)
Gradient descentmm or nn very largeO(mnk)O(mnk) for kk iterations
import numpy as np

X = np.random.randn(1000, 10)
y = X @ np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + np.random.randn(1000) * 0.1

# Method 1: Normal equation via Cholesky
L = np.linalg.cholesky(X.T @ X)
w_chol = np.linalg.solve(L @ L.T, X.T @ y)

# Method 2: QR decomposition
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

# Method 3: SVD (pseudoinverse)
w_svd = np.linalg.lstsq(X, y, rcond=None)[0]

Regularization as Eigenvalue Shifting

Ridge regression adds λI\lambda\mathbf{I} to XTX\mathbf{X}^T\mathbf{X}:

w^ridge=(XTX+λI)1XTy\hat{\mathbf{w}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}

In terms of eigenvalues, if XTX\mathbf{X}^T\mathbf{X} has eigenvalues σ12,,σn2\sigma_1^2, \ldots, \sigma_n^2, then XTX+λI\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I} has eigenvalues σi2+λ\sigma_i^2 + \lambda.

Key insight: Ridge regularization shifts all eigenvalues by λ\lambda, turning small eigenvalues into larger ones. This improves the condition number from σ12/σn2\sigma_1^2/\sigma_n^2 to (σ12+λ)/(σn2+λ)(\sigma_1^2 + \lambda)/(\sigma_n^2 + \lambda), stabilizing the solution and preventing overfitting.

PCA: The Eigenvalue Problem in Disguise

Principal Component Analysis finds the directions of maximum variance in data. Given centered data XRm×n\mathbf{X} \in \mathbb{R}^{m \times n}:

  1. Compute the covariance matrix Σ=1m1XTX\boldsymbol{\Sigma} = \frac{1}{m-1}\mathbf{X}^T\mathbf{X}
  2. Eigendecompose: Σ=QΛQT\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T
  3. The eigenvectors (columns of Q\mathbf{Q}) are principal components
  4. The eigenvalues (diagonal of Λ\boldsymbol{\Lambda}) are variances along each component

Equivalently, via SVD of X\mathbf{X}:

X=UΣVT\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T

The right singular vectors V\mathbf{V} are the principal components, and σi2/(m1)\sigma_i^2/(m-1) gives the variance along the ii-th component.

Dimensionality Reduction

Project data onto the top kk components:

Z=XVkRm×k\mathbf{Z} = \mathbf{X}\mathbf{V}_k \in \mathbb{R}^{m \times k}

The fraction of variance retained:

explained variance=i=1kλii=1nλi\text{explained variance} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # Keep 95% of variance
X_reduced = pca.fit_transform(X)
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")

Neural Networks: Layers of Linear Algebra

A fully connected neural network layer computes:

h(l)=σ(W(l)h(l1)+b(l))\mathbf{h}^{(l)} = \sigma\left(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)

This is an affine transformation Wh+b\mathbf{W}\mathbf{h} + \mathbf{b} followed by a nonlinearity σ\sigma.

Weight Matrix Dimensions

For a network with layer widths d0d1d2dLd_0 \to d_1 \to d_2 \to \cdots \to d_L:

  • W(l)Rdl×dl1\mathbf{W}^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}} maps dl1d_{l-1}-dimensional inputs to dld_l-dimensional outputs
  • The total parameters in layer ll: dldl1+dld_l \cdot d_{l-1} + d_l (weights + biases)

Backpropagation is Matrix Calculus

The gradient of the loss with respect to weights involves transposed weight matrices:

LW(l)=Lh(l)h(l)W(l)\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} \cdot \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{W}^{(l)}}

The chain rule through layers produces products of weight matrices (transposed). This is why eigenvalues matter for training: if the product W(L)W(2)W(1)\mathbf{W}^{(L)} \cdots \mathbf{W}^{(2)} \mathbf{W}^{(1)} has eigenvalues much larger or smaller than 1, gradients explode or vanish.

Weight Initialization

Good initialization ensures eigenvalues of weight matrices stay near 1:

MethodFormulaWhen to Use
Xavier/GlorotwN(0,2din+dout)w \sim \mathcal{N}\left(0, \frac{2}{d_{\text{in}} + d_{\text{out}}}\right)Sigmoid, tanh
He/KaimingwN(0,2din)w \sim \mathcal{N}\left(0, \frac{2}{d_{\text{in}}}\right)ReLU
OrthogonalW=Q\mathbf{W} = \mathbf{Q} (orthogonal matrix)RNNs, deep networks

Orthogonal initialization sets W\mathbf{W} to a random orthogonal matrix, guaranteeing all singular values equal 1. This perfectly preserves gradient magnitudes through the layer.

Attention as Matrix Operations

The self-attention mechanism in Transformers is entirely built from matrix operations.

Given input XRn×d\mathbf{X} \in \mathbb{R}^{n \times d} (sequence of nn tokens, each dd-dimensional):

Q=XWQK=XWKV=XWV\begin{aligned} \mathbf{Q} &= \mathbf{X}\mathbf{W}_Q \\[6pt] \mathbf{K} &= \mathbf{X}\mathbf{W}_K \\[6pt] \mathbf{V} &= \mathbf{X}\mathbf{W}_V \end{aligned}

The attention output:

Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}

Breaking this down:

  1. Linear projections XWQ\mathbf{X}\mathbf{W}_Q, XWK\mathbf{X}\mathbf{W}_K, XWV\mathbf{X}\mathbf{W}_V — matrix multiplications
  2. Similarity matrix QKT\mathbf{Q}\mathbf{K}^T — a matrix of dot products (inner products between queries and keys)
  3. Softmax — applied row-wise (nonlinear, but to a matrix)
  4. Weighted aggregation softmax()V\text{softmax}(\cdot)\mathbf{V} — another matrix multiplication

Key insight: The attention matrix QKT\mathbf{Q}\mathbf{K}^T is an n×nn \times n matrix where entry (i,j)(i,j) measures how much token ii should attend to token jj. This is a generalized inner product — the same cosine similarity idea from the inner products article, lifted to learned representation spaces.

Embeddings and Similarity

Word embeddings (Word2Vec, GloVe) and modern contextual embeddings (BERT, GPT) map tokens to vectors in Rd\mathbb{R}^d.

The core operations are all linear algebra:

  • Similarity: Cosine similarity uTvuv\frac{\mathbf{u}^T\mathbf{v}}{\|\mathbf{u}\|\|\mathbf{v}\|} between embedding vectors
  • Analogies: vkingvman+vwomanvqueen\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}} — vector arithmetic
  • Retrieval: Nearest neighbor search in embedding space
  • Embedding matrix: The lookup table is a matrix ERV×d\mathbf{E} \in \mathbb{R}^{|V| \times d} where row ii is the embedding of the ii-th token

Recommendation Systems: Matrix Factorization

Collaborative filtering models the user-item interaction matrix RRm×n\mathbf{R} \in \mathbb{R}^{m \times n} as a low-rank product:

RUVT\mathbf{R} \approx \mathbf{U}\mathbf{V}^T

where URm×k\mathbf{U} \in \mathbb{R}^{m \times k} (user factors) and VRn×k\mathbf{V} \in \mathbb{R}^{n \times k} (item factors).

Each user is a kk-dimensional vector. Each item is a kk-dimensional vector. The predicted rating is their dot product — a direct application of inner products.

This is a truncated SVD problem: the best rank-kk approximation of R\mathbf{R} (in the Frobenius norm) is given by keeping the top kk singular values and vectors.

Kernel Methods: Inner Products in Feature Space

Support Vector Machines and other kernel methods use the kernel trick: instead of computing features explicitly, compute inner products in a high-dimensional (possibly infinite-dimensional) feature space:

k(x,x)=ϕ(x),ϕ(x)k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle

The kernel matrix (Gram matrix) K\mathbf{K} with entries Kij=k(xi,xj)K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) must be symmetric positive semi-definite — a direct consequence of being a matrix of inner products.

Common kernels:

KernelFormulaFeature Space
LinearxTx\mathbf{x}^T\mathbf{x}'Original space
Polynomial(xTx+c)d(\mathbf{x}^T\mathbf{x}' + c)^dPolynomial features
RBF (Gaussian)exp(xx22σ2)\exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)Infinite-dimensional

Graph Neural Networks: The Graph Laplacian

Graph-based ML uses the graph Laplacian matrix:

L=DA\mathbf{L} = \mathbf{D} - \mathbf{A}

where A\mathbf{A} is the adjacency matrix and D\mathbf{D} is the degree matrix.

The eigenvalues of L\mathbf{L} encode graph structure:

  • The number of zero eigenvalues equals the number of connected components
  • The second-smallest eigenvalue (Fiedler value) measures graph connectivity
  • Eigenvectors of L\mathbf{L} define spectral embeddings for clustering

Spectral clustering partitions data by:

  1. Building a similarity graph
  2. Computing eigenvectors of the graph Laplacian
  3. Clustering in the eigenvector space

Batch Normalization: Statistics as Linear Algebra

Batch normalization normalizes activations using batch statistics:

x^i=xiμBσB2+ϵ\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

The mean μB\mu_B and variance σB2\sigma_B^2 are computed from the batch — these are the first moments of the data, which are linear and quadratic operations on vectors.

The whitening generalization transforms the batch so that features are uncorrelated:

x^=Σ1/2(xμ)\hat{\mathbf{x}} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu})

Computing Σ1/2\boldsymbol{\Sigma}^{-1/2} requires eigendecomposition of the covariance matrix:

Σ1/2=QΛ1/2QT\boldsymbol{\Sigma}^{-1/2} = \mathbf{Q}\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T

Computational Considerations

GPU Acceleration

GPUs are essentially parallel matrix multiplication engines. The reason deep learning became practical in the 2010s is that neural network training reduces to massive matrix multiplications, which GPUs handle orders of magnitude faster than CPUs.

OperationCPUGPU
Matrix multiply (4096×40964096 \times 4096)~seconds~milliseconds
Batch of matrix multipliesSequentialParallel
Memory bandwidth~50 GB/s~900+ GB/s

Numerical Stability

Linear algebra operations can amplify errors when matrices are ill-conditioned:

  • Never compute A1b\mathbf{A}^{-1}\mathbf{b} — solve the system Ax=b\mathbf{A}\mathbf{x} = \mathbf{b} instead
  • Use QR instead of the normal equation for least squares
  • Add regularization (λI\lambda\mathbf{I}) to improve conditioning
  • Use mixed precision carefully — half-precision (FP16) matrix multiplication can lose accuracy
  • Monitor condition numbers when debugging numerical issues

Summary

Linear algebra is not a prerequisite you learn and forget — it is the active computational substrate of ML:

  • Linear regression is an orthogonal projection; the normal equation is a linear system.
  • PCA is an eigenvalue problem on the covariance matrix (or SVD on the data matrix).
  • Neural network layers are affine transformations; eigenvalues of weight matrices control gradient flow.
  • Attention in Transformers is built from matrix multiplications and inner products.
  • Embeddings live in vector spaces; similarity is measured by inner products.
  • Recommendation systems factorize the user-item matrix (truncated SVD).
  • Kernel methods compute inner products in high-dimensional feature spaces.
  • Spectral methods use eigenvectors of graph Laplacians for clustering.
  • Numerical stability requires choosing the right decomposition and avoiding explicit inverses.

The tools from this linear algebra series — vectors, matrices, projections, eigenvalues, and decompositions — are not just mathematical background. They are the language in which ML algorithms think, compute, and learn.

References

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
  • Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.
  • Boyd, S., & Vandenberghe, L. (2018). Introduction to Applied Linear Algebra. Cambridge University Press. vmls-book.stanford.edu
  • Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762
  • Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 42(8), 30-37.
  • Stanford CS229 — Machine Learning. cs229.stanford.edu

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay