Matrices and Matrix Operations: Organizing Linear Computation

Linear Algebra Series 2 / 13

Introduction

In the previous article, we introduced vectors as the atoms of data. Matrices are the molecules — structured arrays of numbers that encode relationships between vectors, represent data sets, and describe transformations.

Every ML pipeline depends on matrices: a dataset is a matrix, a neural network layer is a matrix multiplication, and a covariance structure is a matrix. This article covers the mechanics of working with matrices — the operations, the rules, and the special types you will encounter constantly.

What Is a Matrix?

A matrix is a rectangular array of numbers arranged in rows and columns. A matrix $\mathbf{A}$ with $m$ rows and $n$ columns belongs to $\mathbb{R}^{m \times n}$ :

\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}

We denote the entry in row $i$ , column $j$ as $a_{ij}$ or $[\mathbf{A}]_{ij}$ .

Key insight: In ML, a dataset with $m$ samples and $n$ features is stored as a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ . Each row is a data point. Each column is a feature. Matrix operations on $\mathbf{X}$ process the entire dataset at once.

Basic Matrix Operations

Addition and Scalar Multiplication

Matrices of the same size can be added element-wise. Scalar multiplication scales every entry:

\mathbf{A} + \mathbf{B} = [a_{ij} + b_{ij}], \qquad c\mathbf{A} = [c \cdot a_{ij}]

These operations inherit all the vector space properties from $\mathbb{R}^{mn}$ — the set of $m \times n$ matrices is itself a vector space.

Matrix Multiplication

The product of $\mathbf{A} \in \mathbb{R}^{m \times p}$ and $\mathbf{B} \in \mathbb{R}^{p \times n}$ is a matrix $\mathbf{C} \in \mathbb{R}^{m \times n}$ :

c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}

Each entry $c_{ij}$ is the dot product of row $i$ of $\mathbf{A}$ with column $j$ of $\mathbf{B}$ .

Warning: Matrix multiplication requires the inner dimensions to match: $\mathbf{A}$ is $m \times p$ and $\mathbf{B}$ is $p \times n$ . The result is $m \times n$ .

Four Ways to Think About Matrix Multiplication

Understanding $\mathbf{C} = \mathbf{A}\mathbf{B}$ from multiple angles builds deep intuition:

Dot product view: $c_{ij}$ = dot product of row $i$ of $\mathbf{A}$ with column $j$ of $\mathbf{B}$ .
Column view: Each column of $\mathbf{C}$ is a linear combination of the columns of $\mathbf{A}$ , with coefficients from the corresponding column of $\mathbf{B}$ .
Row view: Each row of $\mathbf{C}$ is a linear combination of the rows of $\mathbf{B}$ , with coefficients from the corresponding row of $\mathbf{A}$ .
Outer product view: $\mathbf{C} = \sum_{k=1}^{p} \mathbf{a}_k \mathbf{b}_k^T$ , where $\mathbf{a}_k$ is column $k$ of $\mathbf{A}$ and $\mathbf{b}_k^T$ is row $k$ of $\mathbf{B}$ .

The column view is especially important: it tells us that multiplying $\mathbf{A}\mathbf{x}$ produces a linear combination of the columns of $\mathbf{A}$ .

Properties of Matrix Multiplication

Property	Statement	Notes
Associativity	$(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})$	Always holds
Distributivity	$\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$	Always holds
Not commutative	$\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$ in general	Critical difference from scalars
Scalar compatibility	$c(\mathbf{A}\mathbf{B}) = (c\mathbf{A})\mathbf{B} = \mathbf{A}(c\mathbf{B})$	Always holds

Warning: The non-commutativity of matrix multiplication is a constant source of bugs and errors. The order matters: $\mathbf{A}\mathbf{B}$ and $\mathbf{B}\mathbf{A}$ may not even have the same dimensions, let alone the same value.

The Transpose

The transpose of $\mathbf{A} \in \mathbb{R}^{m \times n}$ is $\mathbf{A}^T \in \mathbb{R}^{n \times m}$ , obtained by swapping rows and columns:

[\mathbf{A}^T]_{ij} = [\mathbf{A}]_{ji}

Transpose Properties

\begin{aligned} (\mathbf{A}^T)^T &= \mathbf{A} \\[6pt] (\mathbf{A} + \mathbf{B})^T &= \mathbf{A}^T + \mathbf{B}^T \\[6pt] (c\mathbf{A})^T &= c\mathbf{A}^T \\[6pt] (\mathbf{A}\mathbf{B})^T &= \mathbf{B}^T \mathbf{A}^T \end{aligned}

The last rule — the reverse order law — is crucial. When transposing a product, the order reverses. This extends to any number of factors: $(\mathbf{A}\mathbf{B}\mathbf{C})^T = \mathbf{C}^T \mathbf{B}^T \mathbf{A}^T$ .

Key insight: The dot product of two vectors can be written as a matrix product: $\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v}$ . This notation is universal in ML literature.

Special Matrices

Identity Matrix

The identity matrix $\mathbf{I}_n$ has ones on the diagonal and zeros elsewhere:

\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

It is the multiplicative identity: $\mathbf{A}\mathbf{I} = \mathbf{I}\mathbf{A} = \mathbf{A}$ .

Diagonal Matrix

A diagonal matrix has nonzero entries only on the main diagonal:

\mathbf{D} = \text{diag}(d_1, d_2, \ldots, d_n) = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{bmatrix}

Multiplying by a diagonal matrix scales rows or columns. Diagonal matrices are computationally cheap — inversion, powers, and exponentiation all reduce to per-element operations.

Symmetric Matrix

A matrix is symmetric if $\mathbf{A} = \mathbf{A}^T$ , meaning $a_{ij} = a_{ji}$ for all $i, j$ .

Symmetric matrices arise naturally in ML:

Covariance matrices: $\boldsymbol{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$ is always symmetric
Gram matrices: $\mathbf{K} = \mathbf{X}\mathbf{X}^T$ is always symmetric
Hessian matrices: The matrix of second derivatives of a smooth function

Symmetric matrices have remarkable properties: all eigenvalues are real, eigenvectors can be chosen orthogonal, and they can always be diagonalized. We explore this in detail in the eigenvalues article.

Orthogonal Matrix

A square matrix $\mathbf{Q}$ is orthogonal if its columns are orthonormal vectors:

\mathbf{Q}^T\mathbf{Q} = \mathbf{Q}\mathbf{Q}^T = \mathbf{I}

This means $\mathbf{Q}^{-1} = \mathbf{Q}^T$ — the inverse is just the transpose, which is computationally free.

Orthogonal matrices preserve lengths and angles: $\|\mathbf{Q}\mathbf{x}\| = \|\mathbf{x}\|$ . Geometrically, they represent rotations and reflections.

Triangular Matrices

An upper triangular matrix has zeros below the diagonal. A lower triangular matrix has zeros above. They arise in Gaussian elimination (LU decomposition) and are efficient to solve against.

\mathbf{L} = \begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix}, \qquad \mathbf{U} = \begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix}

Positive Definite and Positive Semi-Definite

A symmetric matrix $\mathbf{A}$ is:

Positive definite (PD) if $\mathbf{x}^T\mathbf{A}\mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$
Positive semi-definite (PSD) if $\mathbf{x}^T\mathbf{A}\mathbf{x} \geq 0$ for all $\mathbf{x}$

Covariance matrices are always PSD. A positive definite matrix has all positive eigenvalues, guaranteeing that optimization problems like $\min_\mathbf{x} \mathbf{x}^T\mathbf{A}\mathbf{x}$ have a unique minimum.

Key insight: When the Hessian matrix of a loss function is positive definite at a point, that point is a local minimum. This is how we verify convergence in optimization.

The Inverse

The inverse of a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ , if it exists, is the unique matrix $\mathbf{A}^{-1}$ satisfying:

\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}

A matrix is invertible (or nonsingular) if and only if its determinant is nonzero, its rank equals $n$ , or equivalently, its columns are linearly independent.

Inverse Properties

\begin{aligned} (\mathbf{A}^{-1})^{-1} &= \mathbf{A} \\[6pt] (\mathbf{A}\mathbf{B})^{-1} &= \mathbf{B}^{-1}\mathbf{A}^{-1} \\[6pt] (\mathbf{A}^T)^{-1} &= (\mathbf{A}^{-1})^T \end{aligned}

Note the reverse order again: $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ .

Warning: In practice, you almost never compute $\mathbf{A}^{-1}$ explicitly. Solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ via factorization (LU, QR, Cholesky) is faster and more numerically stable. The notation $\mathbf{A}^{-1}\mathbf{b}$ means “solve the system,” not “compute the inverse and multiply.”

The Trace

The trace of a square matrix is the sum of its diagonal entries:

\text{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}

Trace Properties

\begin{aligned} \text{tr}(\mathbf{A} + \mathbf{B}) &= \text{tr}(\mathbf{A}) + \text{tr}(\mathbf{B}) \\[6pt] \text{tr}(c\mathbf{A}) &= c \cdot \text{tr}(\mathbf{A}) \\[6pt] \text{tr}(\mathbf{A}^T) &= \text{tr}(\mathbf{A}) \\[6pt] \text{tr}(\mathbf{A}\mathbf{B}) &= \text{tr}(\mathbf{B}\mathbf{A}) \end{aligned}

The cyclic property $\text{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \text{tr}(\mathbf{C}\mathbf{A}\mathbf{B}) = \text{tr}(\mathbf{B}\mathbf{C}\mathbf{A})$ is used extensively in matrix calculus and ML derivations.

The trace also equals the sum of eigenvalues: $\text{tr}(\mathbf{A}) = \sum_i \lambda_i$ .

Matrix Rank

The rank of a matrix is the number of linearly independent columns (equivalently, the number of linearly independent rows — these are always equal):

\text{rank}(\mathbf{A}) = \dim(\text{column space of } \mathbf{A})

Condition	Name
$\text{rank}(\mathbf{A}) = \min(m, n)$	Full rank
$\text{rank}(\mathbf{A}) < \min(m, n)$	Rank deficient
$\text{rank}(\mathbf{A}) = n$ for $\mathbf{A} \in \mathbb{R}^{m \times n}$	Full column rank
$\text{rank}(\mathbf{A}) = m$ for $\mathbf{A} \in \mathbb{R}^{m \times n}$	Full row rank

A square matrix is invertible if and only if it has full rank. In ML, rank deficiency signals redundant features (multicollinearity) or a degenerate covariance structure.

Worked Example: Matrix Multiplication in NumPy

import numpy as np

# Dataset: 3 samples, 2 features
X = np.array([[1, 2],
              [3, 4],
              [5, 6]])

# Weight vector for linear model
w = np.array([[0.5],
              [-0.3]])

# Predictions: matrix-vector product
y_hat = X @ w
print(y_hat)
# [[-0.1], [0.3], [0.7]]

# Gram matrix: X^T X (2x2 symmetric matrix)
gram = X.T @ X
print(gram)
# [[35, 44],
#  [44, 56]]

# Verify symmetry
print(np.allclose(gram, gram.T))  # True

Why This Matters for ML

Data representation: The entire training set is a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ .
Linear models: A prediction is $\hat{y} = \mathbf{X}\mathbf{w}$ , a matrix-vector product.
Neural networks: Each layer computes $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$ — a matrix multiplication followed by a nonlinearity.
Covariance: The covariance matrix $\boldsymbol{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$ encodes feature correlations.
Optimization: The Hessian matrix determines the curvature of the loss landscape.
GPU acceleration: Modern ML is fast because matrix multiplication is massively parallelizable on GPUs.

Summary

A matrix is a rectangular array of numbers, representing data, transformations, or relationships.
Matrix multiplication is not commutative — order matters.
The transpose reverses order in products: $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$ .
Special matrices (symmetric, orthogonal, diagonal, PD) have properties that ML exploits heavily.
The inverse solves linear systems but should rarely be computed explicitly.
The trace has a cyclic property essential for matrix calculus.
The rank measures the true dimensionality of the information in a matrix.
Next, we use matrices to represent and solve systems of linear equations.

References

Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. math.mit.edu/~gs/linearalgebra
Boyd, S., & Vandenberghe, L. (2018). Introduction to Applied Linear Algebra. Cambridge University Press. vmls-book.stanford.edu
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 2. MIT Press. deeplearningbook.org