Linear Algebra in Machine Learning: Putting It All Together

Linear Algebra Series 9 / 13

Introduction

Throughout this series, we have built the machinery of linear algebra piece by piece: vectors, matrices, linear systems, determinants, transformations, inner products, eigenvalues, and decompositions.

Now we bring it all together. This article walks through the major ML algorithms and shows exactly where linear algebra appears — not as abstract math, but as the computational engine that makes these algorithms work.

Linear Regression as a Projection

Linear regression is perhaps the purest application of linear algebra in ML. Given data matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ and targets $\mathbf{y} \in \mathbb{R}^m$ , we seek $\hat{\mathbf{w}}$ that minimizes:

\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2

The Normal Equation

Setting the gradient to zero yields the normal equation:

\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}} = \mathbf{X}^T\mathbf{y}

This is a linear system where $\mathbf{X}^T\mathbf{X}$ is the Gram matrix (symmetric, positive semi-definite).

The solution $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ is an orthogonal projection: the predicted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}$ are the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$ , and the residual $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to every column of $\mathbf{X}$ .

Solving in Practice

Method	When	Complexity
Cholesky on $\mathbf{X}^T\mathbf{X}$	$\mathbf{X}^T\mathbf{X}$ is well-conditioned	$O(mn^2 + n^3/3)$
QR on $\mathbf{X}$	Better numerical stability	$O(mn^2)$
SVD on $\mathbf{X}$	Rank-deficient or ill-conditioned	$O(mn^2)$
Gradient descent	$m$ or $n$ very large	$O(mnk)$ for $k$ iterations

import numpy as np

X = np.random.randn(1000, 10)
y = X @ np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + np.random.randn(1000) * 0.1

# Method 1: Normal equation via Cholesky
L = np.linalg.cholesky(X.T @ X)
w_chol = np.linalg.solve(L @ L.T, X.T @ y)

# Method 2: QR decomposition
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

# Method 3: SVD (pseudoinverse)
w_svd = np.linalg.lstsq(X, y, rcond=None)[0]

Regularization as Eigenvalue Shifting

Ridge regression adds $\lambda\mathbf{I}$ to $\mathbf{X}^T\mathbf{X}$ :

\hat{\mathbf{w}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}

In terms of eigenvalues, if $\mathbf{X}^T\mathbf{X}$ has eigenvalues $\sigma_1^2, \ldots, \sigma_n^2$ , then $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ has eigenvalues $\sigma_i^2 + \lambda$ .

Key insight: Ridge regularization shifts all eigenvalues by $\lambda$ , turning small eigenvalues into larger ones. This improves the condition number from $\sigma_1^2/\sigma_n^2$ to $(\sigma_1^2 + \lambda)/(\sigma_n^2 + \lambda)$ , stabilizing the solution and preventing overfitting.

PCA: The Eigenvalue Problem in Disguise

Principal Component Analysis finds the directions of maximum variance in data. Given centered data $\mathbf{X} \in \mathbb{R}^{m \times n}$ :

Compute the covariance matrix $\boldsymbol{\Sigma} = \frac{1}{m-1}\mathbf{X}^T\mathbf{X}$
Eigendecompose: $\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$
The eigenvectors (columns of $\mathbf{Q}$ ) are principal components
The eigenvalues (diagonal of $\boldsymbol{\Lambda}$ ) are variances along each component

Equivalently, via SVD of $\mathbf{X}$ :

\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T

The right singular vectors $\mathbf{V}$ are the principal components, and $\sigma_i^2/(m-1)$ gives the variance along the $i$ -th component.

Dimensionality Reduction

Project data onto the top $k$ components:

\mathbf{Z} = \mathbf{X}\mathbf{V}_k \in \mathbb{R}^{m \times k}

The fraction of variance retained:

\text{explained variance} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # Keep 95% of variance
X_reduced = pca.fit_transform(X)
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")

Neural Networks: Layers of Linear Algebra

A fully connected neural network layer computes:

\mathbf{h}^{(l)} = \sigma\left(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)

This is an affine transformation $\mathbf{W}\mathbf{h} + \mathbf{b}$ followed by a nonlinearity $\sigma$ .

Weight Matrix Dimensions

For a network with layer widths $d_0 \to d_1 \to d_2 \to \cdots \to d_L$ :

$\mathbf{W}^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ maps $d_{l-1}$ -dimensional inputs to $d_l$ -dimensional outputs
The total parameters in layer $l$ : $d_l \cdot d_{l-1} + d_l$ (weights + biases)

Backpropagation is Matrix Calculus

The gradient of the loss with respect to weights involves transposed weight matrices:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} \cdot \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{W}^{(l)}}

The chain rule through layers produces products of weight matrices (transposed). This is why eigenvalues matter for training: if the product $\mathbf{W}^{(L)} \cdots \mathbf{W}^{(2)} \mathbf{W}^{(1)}$ has eigenvalues much larger or smaller than 1, gradients explode or vanish.

Weight Initialization

Good initialization ensures eigenvalues of weight matrices stay near 1:

Method	Formula	When to Use
Xavier/Glorot	$w \sim \mathcal{N}\left(0, \frac{2}{d_{\text{in}} + d_{\text{out}}}\right)$	Sigmoid, tanh
He/Kaiming	$w \sim \mathcal{N}\left(0, \frac{2}{d_{\text{in}}}\right)$	ReLU
Orthogonal	$\mathbf{W} = \mathbf{Q}$ (orthogonal matrix)	RNNs, deep networks

Orthogonal initialization sets $\mathbf{W}$ to a random orthogonal matrix, guaranteeing all singular values equal 1. This perfectly preserves gradient magnitudes through the layer.

Attention as Matrix Operations

The self-attention mechanism in Transformers is entirely built from matrix operations.

Given input $\mathbf{X} \in \mathbb{R}^{n \times d}$ (sequence of $n$ tokens, each $d$ -dimensional):

\begin{aligned} \mathbf{Q} &= \mathbf{X}\mathbf{W}_Q \\[6pt] \mathbf{K} &= \mathbf{X}\mathbf{W}_K \\[6pt] \mathbf{V} &= \mathbf{X}\mathbf{W}_V \end{aligned}

The attention output:

\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}

Breaking this down:

Linear projections $\mathbf{X}\mathbf{W}_Q$ , $\mathbf{X}\mathbf{W}_K$ , $\mathbf{X}\mathbf{W}_V$ — matrix multiplications
Similarity matrix $\mathbf{Q}\mathbf{K}^T$ — a matrix of dot products (inner products between queries and keys)
Softmax — applied row-wise (nonlinear, but to a matrix)
Weighted aggregation $\text{softmax}(\cdot)\mathbf{V}$ — another matrix multiplication

Key insight: The attention matrix $\mathbf{Q}\mathbf{K}^T$ is an $n \times n$ matrix where entry $(i,j)$ measures how much token $i$ should attend to token $j$ . This is a generalized inner product — the same cosine similarity idea from the inner products article, lifted to learned representation spaces.

Embeddings and Similarity

Word embeddings (Word2Vec, GloVe) and modern contextual embeddings (BERT, GPT) map tokens to vectors in $\mathbb{R}^d$ .

The core operations are all linear algebra:

Similarity: Cosine similarity $\frac{\mathbf{u}^T\mathbf{v}}{\|\mathbf{u}\|\|\mathbf{v}\|}$ between embedding vectors
Analogies: $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$ — vector arithmetic
Retrieval: Nearest neighbor search in embedding space
Embedding matrix: The lookup table is a matrix $\mathbf{E} \in \mathbb{R}^{|V| \times d}$ where row $i$ is the embedding of the $i$ -th token

Recommendation Systems: Matrix Factorization

Collaborative filtering models the user-item interaction matrix $\mathbf{R} \in \mathbb{R}^{m \times n}$ as a low-rank product:

\mathbf{R} \approx \mathbf{U}\mathbf{V}^T

where $\mathbf{U} \in \mathbb{R}^{m \times k}$ (user factors) and $\mathbf{V} \in \mathbb{R}^{n \times k}$ (item factors).

Each user is a $k$ -dimensional vector. Each item is a $k$ -dimensional vector. The predicted rating is their dot product — a direct application of inner products.

This is a truncated SVD problem: the best rank- $k$ approximation of $\mathbf{R}$ (in the Frobenius norm) is given by keeping the top $k$ singular values and vectors.

Kernel Methods: Inner Products in Feature Space

Support Vector Machines and other kernel methods use the kernel trick: instead of computing features explicitly, compute inner products in a high-dimensional (possibly infinite-dimensional) feature space:

k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle

The kernel matrix (Gram matrix) $\mathbf{K}$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ must be symmetric positive semi-definite — a direct consequence of being a matrix of inner products.

Common kernels:

Kernel	Formula	Feature Space
Linear	$\mathbf{x}^T\mathbf{x}'$	Original space
Polynomial	$(\mathbf{x}^T\mathbf{x}' + c)^d$	Polynomial features
RBF (Gaussian)	$\exp\left(-\frac{\\|\mathbf{x} - \mathbf{x}'\\|^2}{2\sigma^2}\right)$	Infinite-dimensional

Graph Neural Networks: The Graph Laplacian

Graph-based ML uses the graph Laplacian matrix:

\mathbf{L} = \mathbf{D} - \mathbf{A}

where $\mathbf{A}$ is the adjacency matrix and $\mathbf{D}$ is the degree matrix.

The eigenvalues of $\mathbf{L}$ encode graph structure:

The number of zero eigenvalues equals the number of connected components
The second-smallest eigenvalue (Fiedler value) measures graph connectivity
Eigenvectors of $\mathbf{L}$ define spectral embeddings for clustering

Spectral clustering partitions data by:

Building a similarity graph
Computing eigenvectors of the graph Laplacian
Clustering in the eigenvector space

Batch Normalization: Statistics as Linear Algebra

Batch normalization normalizes activations using batch statistics:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

The mean $\mu_B$ and variance $\sigma_B^2$ are computed from the batch — these are the first moments of the data, which are linear and quadratic operations on vectors.

The whitening generalization transforms the batch so that features are uncorrelated:

\hat{\mathbf{x}} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu})

Computing $\boldsymbol{\Sigma}^{-1/2}$ requires eigendecomposition of the covariance matrix:

\boldsymbol{\Sigma}^{-1/2} = \mathbf{Q}\boldsymbol{\Lambda}^{-1/2}\mathbf{Q}^T

Computational Considerations

GPU Acceleration

GPUs are essentially parallel matrix multiplication engines. The reason deep learning became practical in the 2010s is that neural network training reduces to massive matrix multiplications, which GPUs handle orders of magnitude faster than CPUs.

Operation	CPU	GPU
Matrix multiply ( $4096 \times 4096$ )	~seconds	~milliseconds
Batch of matrix multiplies	Sequential	Parallel
Memory bandwidth	~50 GB/s	~900+ GB/s

Numerical Stability

Linear algebra operations can amplify errors when matrices are ill-conditioned:

Never compute $\mathbf{A}^{-1}\mathbf{b}$ — solve the system $\mathbf{A}\mathbf{x} = \mathbf{b}$ instead
Use QR instead of the normal equation for least squares
Add regularization ( $\lambda\mathbf{I}$ ) to improve conditioning
Use mixed precision carefully — half-precision (FP16) matrix multiplication can lose accuracy
Monitor condition numbers when debugging numerical issues

Summary

Linear algebra is not a prerequisite you learn and forget — it is the active computational substrate of ML:

Linear regression is an orthogonal projection; the normal equation is a linear system.
PCA is an eigenvalue problem on the covariance matrix (or SVD on the data matrix).
Neural network layers are affine transformations; eigenvalues of weight matrices control gradient flow.
Attention in Transformers is built from matrix multiplications and inner products.
Embeddings live in vector spaces; similarity is measured by inner products.
Recommendation systems factorize the user-item matrix (truncated SVD).
Kernel methods compute inner products in high-dimensional feature spaces.
Spectral methods use eigenvectors of graph Laplacians for clustering.
Numerical stability requires choosing the right decomposition and avoiding explicit inverses.

The tools from this linear algebra series — vectors, matrices, projections, eigenvalues, and decompositions — are not just mathematical background. They are the language in which ML algorithms think, compute, and learn.

References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.
Boyd, S., & Vandenberghe, L. (2018). Introduction to Applied Linear Algebra. Cambridge University Press. vmls-book.stanford.edu
Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 42(8), 30-37.
Stanford CS229 — Machine Learning. cs229.stanford.edu