Systems of Linear Equations: From Geometry to Algorithms

Linear Algebra Series 3 / 13

Introduction

A linear system is a set of equations where each equation is linear in the unknowns. Solving linear systems is one of the most fundamental operations in computational science — and in machine learning, it appears everywhere: fitting a linear model, computing the normal equation, solving for network weights, and running iterative optimization.

This article covers how to solve systems systematically, what the solution set looks like geometrically, and when a solution does or does not exist. We build on the matrix operations from the previous article.

Setting Up the Problem

A system of $m$ linear equations in $n$ unknowns can be written as:

\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\[6pt] a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\[6pt] &\;\vdots \\[6pt] a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m \end{aligned}

In matrix form, this becomes:

\mathbf{A}\mathbf{x} = \mathbf{b}

where $\mathbf{A} \in \mathbb{R}^{m \times n}$ is the coefficient matrix, $\mathbf{x} \in \mathbb{R}^n$ is the unknown vector, and $\mathbf{b} \in \mathbb{R}^m$ is the right-hand side.

Key insight: The equation $\mathbf{A}\mathbf{x} = \mathbf{b}$ asks: “Can $\mathbf{b}$ be written as a linear combination of the columns of $\mathbf{A}$ ?” The solution $\mathbf{x}$ gives the coefficients of that combination.

Geometric Interpretation

There are two dual ways to visualize $\mathbf{A}\mathbf{x} = \mathbf{b}$ :

Row Picture

Each equation defines a hyperplane in $\mathbb{R}^n$ . The solution is the intersection of all hyperplanes.

In 2D, each equation is a line. Two lines can:

Intersect at one point (unique solution)
Be parallel (no solution)
Be the same line (infinitely many solutions)

Column Picture

The equation asks whether $\mathbf{b}$ lies in the column space of $\mathbf{A}$ — the span of the columns of $\mathbf{A}$ .

If $\mathbf{b}$ is in the column space, a solution exists. If not, no exact solution exists — but we can find the closest approximation via least squares, which is the foundation of linear regression.

Gaussian Elimination

Gaussian elimination is the systematic algorithm for solving linear systems. It uses three elementary row operations:

Swap two rows: $R_i \leftrightarrow R_j$
Scale a row: $R_i \leftarrow cR_i$ where $c \neq 0$
Add a multiple of one row to another: $R_i \leftarrow R_i + cR_j$

These operations do not change the solution set.

The Augmented Matrix

We work with the augmented matrix $[\mathbf{A} \mid \mathbf{b}]$ , which appends $\mathbf{b}$ as an extra column:

[\mathbf{A} \mid \mathbf{b}] = \left[\begin{array}{ccc|c} a_{11} & a_{12} & a_{13} & b_1 \\ a_{21} & a_{22} & a_{23} & b_2 \\ a_{31} & a_{32} & a_{33} & b_3 \end{array}\right]

Algorithm

Start from the leftmost column.
Find a nonzero entry in the current column (the pivot). If necessary, swap rows to bring it to the top.
Use the pivot to eliminate all entries below it by subtracting appropriate multiples of the pivot row.
Move to the next column and repeat.
The result is row echelon form (REF).
Optionally, continue to reduced row echelon form (RREF) by eliminating entries above each pivot and scaling pivots to 1.

Worked Example

Solve the system:

\begin{aligned} x + 2y + z &= 9 \\[6pt] 2x + 4y + 3z &= 21 \\[6pt] 3x + 7y + 4z &= 30 \end{aligned}

Form the augmented matrix and apply row operations:

\left[\begin{array}{ccc|c} 1 & 2 & 1 & 9 \\ 2 & 4 & 3 & 21 \\ 3 & 7 & 4 & 30 \end{array}\right] \xrightarrow{R_2 - 2R_1} \left[\begin{array}{ccc|c} 1 & 2 & 1 & 9 \\ 0 & 0 & 1 & 3 \\ 3 & 7 & 4 & 30 \end{array}\right]

\xrightarrow{R_3 - 3R_1} \left[\begin{array}{ccc|c} 1 & 2 & 1 & 9 \\ 0 & 0 & 1 & 3 \\ 0 & 1 & 1 & 3 \end{array}\right] \xrightarrow{R_2 \leftrightarrow R_3} \left[\begin{array}{ccc|c} 1 & 2 & 1 & 9 \\ 0 & 1 & 1 & 3 \\ 0 & 0 & 1 & 3 \end{array}\right]

This is row echelon form. Back-substitute:

\begin{aligned} z &= 3 \\[6pt] y + z &= 3 \implies y = 0 \\[6pt] x + 2y + z &= 9 \implies x = 6 \end{aligned}

The unique solution is $\mathbf{x} = [6, 0, 3]^T$ .

import numpy as np

A = np.array([[1, 2, 1],
              [2, 4, 3],
              [3, 7, 4]])
b = np.array([9, 21, 30])

x = np.linalg.solve(A, b)
print(x)  # [6. 0. 3.]

Row Echelon Form and Pivots

A matrix is in row echelon form (REF) if:

All zero rows are at the bottom.
The leading nonzero entry (pivot) in each row is strictly to the right of the pivot in the row above.
All entries below a pivot are zero.

It is in reduced row echelon form (RREF) if additionally:

Each pivot is 1.
Each pivot is the only nonzero entry in its column.

The positions of the pivots determine everything about the solution:

Pivot columns correspond to basic variables (determined by the system).
Non-pivot columns correspond to free variables (can take any value).

The Three Cases

The number of solutions depends on the relationship between equations and unknowns:

Case	Condition	Solution Set
Unique solution	$\text{rank}(\mathbf{A}) = \text{rank}([\mathbf{A} \mid \mathbf{b}]) = n$	One point
No solution	$\text{rank}(\mathbf{A}) < \text{rank}([\mathbf{A} \mid \mathbf{b}])$	Empty (inconsistent)
Infinite solutions	$\text{rank}(\mathbf{A}) = \text{rank}([\mathbf{A} \mid \mathbf{b}]) < n$	Affine subspace

This is the Rouché-Capelli theorem (also called the Kronecker-Capelli theorem).

Key insight: In ML, overdetermined systems ( $m > n$ , more equations than unknowns) typically have no exact solution. Instead of solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ exactly, we minimize $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ — this is least squares regression.

Homogeneous Systems

A homogeneous system has $\mathbf{b} = \mathbf{0}$ :

\mathbf{A}\mathbf{x} = \mathbf{0}

The trivial solution $\mathbf{x} = \mathbf{0}$ always exists. Nontrivial solutions exist if and only if $\text{rank}(\mathbf{A}) < n$ — that is, there are free variables.

The set of all solutions to $\mathbf{A}\mathbf{x} = \mathbf{0}$ is the null space (or kernel) of $\mathbf{A}$ :

\text{null}(\mathbf{A}) = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{A}\mathbf{x} = \mathbf{0}\}

The null space is always a subspace of $\mathbb{R}^n$ . Its dimension is called the nullity:

\text{nullity}(\mathbf{A}) = n - \text{rank}(\mathbf{A})

This is the rank-nullity theorem, one of the most important results in linear algebra.

The Four Fundamental Subspaces

Every matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ defines four subspaces:

Subspace	Definition	Dimension
Column space $C(\mathbf{A})$	$\{\mathbf{A}\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$	$r = \text{rank}(\mathbf{A})$
Row space $C(\mathbf{A}^T)$	Span of the rows of $\mathbf{A}$	$r$
Null space $N(\mathbf{A})$	$\{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\}$	$n - r$
Left null space $N(\mathbf{A}^T)$	$\{\mathbf{y} : \mathbf{A}^T\mathbf{y} = \mathbf{0}\}$	$m - r$

These four subspaces satisfy orthogonal complement relationships:

The row space and null space are orthogonal complements in $\mathbb{R}^n$
The column space and left null space are orthogonal complements in $\mathbb{R}^m$

Geometric interpretation: When solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ , the column space tells you which $\mathbf{b}$ vectors are reachable. The null space tells you the directions you can move in $\mathbf{x}$ -space without changing $\mathbf{A}\mathbf{x}$ .

LU Decomposition

In practice, Gaussian elimination is implemented as LU decomposition: factoring $\mathbf{A}$ into a lower triangular matrix $\mathbf{L}$ and an upper triangular matrix $\mathbf{U}$ :

\mathbf{A} = \mathbf{L}\mathbf{U}

Solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ then becomes two triangular solves:

Solve $\mathbf{L}\mathbf{y} = \mathbf{b}$ for $\mathbf{y}$ (forward substitution)
Solve $\mathbf{U}\mathbf{x} = \mathbf{y}$ for $\mathbf{x}$ (back substitution)

Each triangular solve costs $O(n^2)$ , and the factorization costs $O(n^3)$ . The advantage: once you have the factorization, solving for a new $\mathbf{b}$ only costs $O(n^2)$ .

import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[2, 1, 1],
              [4, 3, 3],
              [8, 7, 9]])

P, L, U = lu(A)

# Solve Ax = b using LU
b = np.array([4, 10, 24])
y = solve_triangular(L, P.T @ b, lower=True)
x = solve_triangular(U, y, lower=False)
print(x)  # [1. 1. 1.]

Computational Complexity

Method	Cost	When to Use
Gaussian elimination	$O(n^3)$	General dense systems
LU decomposition	$O(n^3)$ once, $O(n^2)$ per solve	Multiple right-hand sides
Cholesky ( $\mathbf{A} = \mathbf{L}\mathbf{L}^T$ )	$O(n^3/3)$	Symmetric positive definite
Iterative methods (CG, GMRES)	$O(nk)$ per iteration	Large sparse systems

For the massive systems that arise in ML (millions of parameters), direct methods are too expensive. Instead, we use iterative methods — especially gradient descent, which can be viewed as an iterative solver for the normal equations.

Why This Matters for ML

Linear systems are the computational engine behind many ML methods:

Normal equation: The closed-form solution to linear regression is $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ , which is solved as a linear system $(\mathbf{X}^T\mathbf{X})\hat{\mathbf{w}} = \mathbf{X}^T\mathbf{y}$ .
Least squares: When $\mathbf{A}\mathbf{x} = \mathbf{b}$ has no solution, the least squares solution minimizes $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ .
Gaussian processes: Predictions require solving systems involving the kernel matrix.
Newton’s method: Each step solves $\mathbf{H}\Delta\mathbf{x} = -\mathbf{g}$ , where $\mathbf{H}$ is the Hessian and $\mathbf{g}$ is the gradient.
Rank and redundancy: Rank-deficient systems signal multicollinearity in features, which destabilizes model fitting.

Summary

A linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ asks whether $\mathbf{b}$ is a linear combination of the columns of $\mathbf{A}$ .
Gaussian elimination transforms the system to row echelon form using elementary row operations.
The system has a unique solution, no solution, or infinitely many solutions, determined by the ranks of $\mathbf{A}$ and $[\mathbf{A} \mid \mathbf{b}]$ .
The null space captures all directions of freedom in the solution.
The four fundamental subspaces provide a complete geometric picture of any matrix.
LU decomposition is the practical implementation of Gaussian elimination.
In ML, overdetermined systems lead to least squares — the mathematical foundation of regression.
Next, we explore determinants, which provide a single number that captures whether a system has a unique solution.

References

Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. math.mit.edu/~gs/linearalgebra
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM.
Boyd, S., & Vandenberghe, L. (2018). Introduction to Applied Linear Algebra. Cambridge University Press. vmls-book.stanford.edu