Matrices and Matrix Operations: Organizing Linear Computation

Master matrix arithmetic, special matrix types, and the rules of matrix algebra that power every ML algorithm.

Linear Algebra March 6, 2026 9 min read

Introduction

In the previous article, we introduced vectors as the atoms of data. Matrices are the molecules — structured arrays of numbers that encode relationships between vectors, represent data sets, and describe transformations.

Every ML pipeline depends on matrices: a dataset is a matrix, a neural network layer is a matrix multiplication, and a covariance structure is a matrix. This article covers the mechanics of working with matrices — the operations, the rules, and the special types you will encounter constantly.

What Is a Matrix?

A matrix is a rectangular array of numbers arranged in rows and columns. A matrix A\mathbf{A} with mm rows and nn columns belongs to Rm×n\mathbb{R}^{m \times n}:

A=[a11a12a1na21a22a2nam1am2amn]\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}

We denote the entry in row ii, column jj as aija_{ij} or [A]ij[\mathbf{A}]_{ij}.

Key insight: In ML, a dataset with mm samples and nn features is stored as a matrix XRm×n\mathbf{X} \in \mathbb{R}^{m \times n}. Each row is a data point. Each column is a feature. Matrix operations on X\mathbf{X} process the entire dataset at once.

Basic Matrix Operations

Addition and Scalar Multiplication

Matrices of the same size can be added element-wise. Scalar multiplication scales every entry:

A+B=[aij+bij],cA=[caij]\mathbf{A} + \mathbf{B} = [a_{ij} + b_{ij}], \qquad c\mathbf{A} = [c \cdot a_{ij}]

These operations inherit all the vector space properties from Rmn\mathbb{R}^{mn} — the set of m×nm \times n matrices is itself a vector space.

Matrix Multiplication

The product of ARm×p\mathbf{A} \in \mathbb{R}^{m \times p} and BRp×n\mathbf{B} \in \mathbb{R}^{p \times n} is a matrix CRm×n\mathbf{C} \in \mathbb{R}^{m \times n}:

cij=k=1paikbkjc_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}

Each entry cijc_{ij} is the dot product of row ii of A\mathbf{A} with column jj of B\mathbf{B}.

Warning: Matrix multiplication requires the inner dimensions to match: A\mathbf{A} is m×pm \times p and B\mathbf{B} is p×np \times n. The result is m×nm \times n.

Four Ways to Think About Matrix Multiplication

Understanding C=AB\mathbf{C} = \mathbf{A}\mathbf{B} from multiple angles builds deep intuition:

  1. Dot product view: cijc_{ij} = dot product of row ii of A\mathbf{A} with column jj of B\mathbf{B}.

  2. Column view: Each column of C\mathbf{C} is a linear combination of the columns of A\mathbf{A}, with coefficients from the corresponding column of B\mathbf{B}.

  3. Row view: Each row of C\mathbf{C} is a linear combination of the rows of B\mathbf{B}, with coefficients from the corresponding row of A\mathbf{A}.

  4. Outer product view: C=k=1pakbkT\mathbf{C} = \sum_{k=1}^{p} \mathbf{a}_k \mathbf{b}_k^T, where ak\mathbf{a}_k is column kk of A\mathbf{A} and bkT\mathbf{b}_k^T is row kk of B\mathbf{B}.

The column view is especially important: it tells us that multiplying Ax\mathbf{A}\mathbf{x} produces a linear combination of the columns of A\mathbf{A}.

Properties of Matrix Multiplication

PropertyStatementNotes
Associativity(AB)C=A(BC)(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})Always holds
DistributivityA(B+C)=AB+AC\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}Always holds
Not commutativeABBA\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A} in generalCritical difference from scalars
Scalar compatibilityc(AB)=(cA)B=A(cB)c(\mathbf{A}\mathbf{B}) = (c\mathbf{A})\mathbf{B} = \mathbf{A}(c\mathbf{B})Always holds

Warning: The non-commutativity of matrix multiplication is a constant source of bugs and errors. The order matters: AB\mathbf{A}\mathbf{B} and BA\mathbf{B}\mathbf{A} may not even have the same dimensions, let alone the same value.

The Transpose

The transpose of ARm×n\mathbf{A} \in \mathbb{R}^{m \times n} is ATRn×m\mathbf{A}^T \in \mathbb{R}^{n \times m}, obtained by swapping rows and columns:

[AT]ij=[A]ji[\mathbf{A}^T]_{ij} = [\mathbf{A}]_{ji}

Transpose Properties

(AT)T=A(A+B)T=AT+BT(cA)T=cAT(AB)T=BTAT\begin{aligned} (\mathbf{A}^T)^T &= \mathbf{A} \\[6pt] (\mathbf{A} + \mathbf{B})^T &= \mathbf{A}^T + \mathbf{B}^T \\[6pt] (c\mathbf{A})^T &= c\mathbf{A}^T \\[6pt] (\mathbf{A}\mathbf{B})^T &= \mathbf{B}^T \mathbf{A}^T \end{aligned}

The last rule — the reverse order law — is crucial. When transposing a product, the order reverses. This extends to any number of factors: (ABC)T=CTBTAT(\mathbf{A}\mathbf{B}\mathbf{C})^T = \mathbf{C}^T \mathbf{B}^T \mathbf{A}^T.

Key insight: The dot product of two vectors can be written as a matrix product: uv=uTv\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v}. This notation is universal in ML literature.

Special Matrices

Identity Matrix

The identity matrix In\mathbf{I}_n has ones on the diagonal and zeros elsewhere:

I3=[100010001]\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

It is the multiplicative identity: AI=IA=A\mathbf{A}\mathbf{I} = \mathbf{I}\mathbf{A} = \mathbf{A}.

Diagonal Matrix

A diagonal matrix has nonzero entries only on the main diagonal:

D=diag(d1,d2,,dn)=[d1000d2000dn]\mathbf{D} = \text{diag}(d_1, d_2, \ldots, d_n) = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{bmatrix}

Multiplying by a diagonal matrix scales rows or columns. Diagonal matrices are computationally cheap — inversion, powers, and exponentiation all reduce to per-element operations.

Symmetric Matrix

A matrix is symmetric if A=AT\mathbf{A} = \mathbf{A}^T, meaning aij=ajia_{ij} = a_{ji} for all i,ji, j.

Symmetric matrices arise naturally in ML:

  • Covariance matrices: Σ=1nXTX\boldsymbol{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X} is always symmetric
  • Gram matrices: K=XXT\mathbf{K} = \mathbf{X}\mathbf{X}^T is always symmetric
  • Hessian matrices: The matrix of second derivatives of a smooth function

Symmetric matrices have remarkable properties: all eigenvalues are real, eigenvectors can be chosen orthogonal, and they can always be diagonalized. We explore this in detail in the eigenvalues article.

Orthogonal Matrix

A square matrix Q\mathbf{Q} is orthogonal if its columns are orthonormal vectors:

QTQ=QQT=I\mathbf{Q}^T\mathbf{Q} = \mathbf{Q}\mathbf{Q}^T = \mathbf{I}

This means Q1=QT\mathbf{Q}^{-1} = \mathbf{Q}^T — the inverse is just the transpose, which is computationally free.

Orthogonal matrices preserve lengths and angles: Qx=x\|\mathbf{Q}\mathbf{x}\| = \|\mathbf{x}\|. Geometrically, they represent rotations and reflections.

Triangular Matrices

An upper triangular matrix has zeros below the diagonal. A lower triangular matrix has zeros above. They arise in Gaussian elimination (LU decomposition) and are efficient to solve against.

L=[l1100l21l220l31l32l33],U=[u11u12u130u22u2300u33]\mathbf{L} = \begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix}, \qquad \mathbf{U} = \begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix}

Positive Definite and Positive Semi-Definite

A symmetric matrix A\mathbf{A} is:

  • Positive definite (PD) if xTAx>0\mathbf{x}^T\mathbf{A}\mathbf{x} > 0 for all x0\mathbf{x} \neq \mathbf{0}
  • Positive semi-definite (PSD) if xTAx0\mathbf{x}^T\mathbf{A}\mathbf{x} \geq 0 for all x\mathbf{x}

Covariance matrices are always PSD. A positive definite matrix has all positive eigenvalues, guaranteeing that optimization problems like minxxTAx\min_\mathbf{x} \mathbf{x}^T\mathbf{A}\mathbf{x} have a unique minimum.

Key insight: When the Hessian matrix of a loss function is positive definite at a point, that point is a local minimum. This is how we verify convergence in optimization.

The Inverse

The inverse of a square matrix ARn×n\mathbf{A} \in \mathbb{R}^{n \times n}, if it exists, is the unique matrix A1\mathbf{A}^{-1} satisfying:

AA1=A1A=I\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}

A matrix is invertible (or nonsingular) if and only if its determinant is nonzero, its rank equals nn, or equivalently, its columns are linearly independent.

Inverse Properties

(A1)1=A(AB)1=B1A1(AT)1=(A1)T\begin{aligned} (\mathbf{A}^{-1})^{-1} &= \mathbf{A} \\[6pt] (\mathbf{A}\mathbf{B})^{-1} &= \mathbf{B}^{-1}\mathbf{A}^{-1} \\[6pt] (\mathbf{A}^T)^{-1} &= (\mathbf{A}^{-1})^T \end{aligned}

Note the reverse order again: (AB)1=B1A1(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}.

Warning: In practice, you almost never compute A1\mathbf{A}^{-1} explicitly. Solving Ax=b\mathbf{A}\mathbf{x} = \mathbf{b} via factorization (LU, QR, Cholesky) is faster and more numerically stable. The notation A1b\mathbf{A}^{-1}\mathbf{b} means “solve the system,” not “compute the inverse and multiply.”

The Trace

The trace of a square matrix is the sum of its diagonal entries:

tr(A)=i=1naii\text{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}

Trace Properties

tr(A+B)=tr(A)+tr(B)tr(cA)=ctr(A)tr(AT)=tr(A)tr(AB)=tr(BA)\begin{aligned} \text{tr}(\mathbf{A} + \mathbf{B}) &= \text{tr}(\mathbf{A}) + \text{tr}(\mathbf{B}) \\[6pt] \text{tr}(c\mathbf{A}) &= c \cdot \text{tr}(\mathbf{A}) \\[6pt] \text{tr}(\mathbf{A}^T) &= \text{tr}(\mathbf{A}) \\[6pt] \text{tr}(\mathbf{A}\mathbf{B}) &= \text{tr}(\mathbf{B}\mathbf{A}) \end{aligned}

The cyclic property tr(ABC)=tr(CAB)=tr(BCA)\text{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \text{tr}(\mathbf{C}\mathbf{A}\mathbf{B}) = \text{tr}(\mathbf{B}\mathbf{C}\mathbf{A}) is used extensively in matrix calculus and ML derivations.

The trace also equals the sum of eigenvalues: tr(A)=iλi\text{tr}(\mathbf{A}) = \sum_i \lambda_i.

Matrix Rank

The rank of a matrix is the number of linearly independent columns (equivalently, the number of linearly independent rows — these are always equal):

rank(A)=dim(column space of A)\text{rank}(\mathbf{A}) = \dim(\text{column space of } \mathbf{A})
ConditionName
rank(A)=min(m,n)\text{rank}(\mathbf{A}) = \min(m, n)Full rank
rank(A)<min(m,n)\text{rank}(\mathbf{A}) < \min(m, n)Rank deficient
rank(A)=n\text{rank}(\mathbf{A}) = n for ARm×n\mathbf{A} \in \mathbb{R}^{m \times n}Full column rank
rank(A)=m\text{rank}(\mathbf{A}) = m for ARm×n\mathbf{A} \in \mathbb{R}^{m \times n}Full row rank

A square matrix is invertible if and only if it has full rank. In ML, rank deficiency signals redundant features (multicollinearity) or a degenerate covariance structure.

Worked Example: Matrix Multiplication in NumPy

import numpy as np

# Dataset: 3 samples, 2 features
X = np.array([[1, 2],
              [3, 4],
              [5, 6]])

# Weight vector for linear model
w = np.array([[0.5],
              [-0.3]])

# Predictions: matrix-vector product
y_hat = X @ w
print(y_hat)
# [[-0.1], [0.3], [0.7]]

# Gram matrix: X^T X (2x2 symmetric matrix)
gram = X.T @ X
print(gram)
# [[35, 44],
#  [44, 56]]

# Verify symmetry
print(np.allclose(gram, gram.T))  # True

Why This Matters for ML

  • Data representation: The entire training set is a matrix XRm×n\mathbf{X} \in \mathbb{R}^{m \times n}.
  • Linear models: A prediction is y^=Xw\hat{y} = \mathbf{X}\mathbf{w}, a matrix-vector product.
  • Neural networks: Each layer computes h=σ(Wx+b)\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b}) — a matrix multiplication followed by a nonlinearity.
  • Covariance: The covariance matrix Σ=1nXTX\boldsymbol{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X} encodes feature correlations.
  • Optimization: The Hessian matrix determines the curvature of the loss landscape.
  • GPU acceleration: Modern ML is fast because matrix multiplication is massively parallelizable on GPUs.

Summary

  • A matrix is a rectangular array of numbers, representing data, transformations, or relationships.
  • Matrix multiplication is not commutative — order matters.
  • The transpose reverses order in products: (AB)T=BTAT(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T.
  • Special matrices (symmetric, orthogonal, diagonal, PD) have properties that ML exploits heavily.
  • The inverse solves linear systems but should rarely be computed explicitly.
  • The trace has a cyclic property essential for matrix calculus.
  • The rank measures the true dimensionality of the information in a matrix.
  • Next, we use matrices to represent and solve systems of linear equations.

References

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay