Introduction
In the previous article, we introduced vectors as the atoms of data. Matrices are the molecules — structured arrays of numbers that encode relationships between vectors, represent data sets, and describe transformations.
Every ML pipeline depends on matrices: a dataset is a matrix, a neural network layer is a matrix multiplication, and a covariance structure is a matrix. This article covers the mechanics of working with matrices — the operations, the rules, and the special types you will encounter constantly.
What Is a Matrix?
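A matrix is a rectangular array of numbers arranged in rows and columns. A matrix $A$ with $m$ rows and $n$ columns belongs to $\mathbb{R}^{m \times n}$:

$$
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
$$

We denote the entry in row $i$, column $j$ as $a_{ij}$ or $A_{ij}$.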
Key insight: In ML, a dataset with $n$ samples and $d$ features is stored as a matrix $X \in \mathbb{R}^{n \times d}$. Each row is a data point. Each column is a feature. Matrix operations on $X$ process the entire dataset at once.
Basic Matrix Operations
Addition and Scalar Multiplication
Matrices of the same size can be added element-wise. Scalar multiplication scales every entry:

$$
(A + B)_{ij} = a_{ij} + b_{ij}, \qquad (cA)_{ij} = c \, a_{ij}
$$

These operations inherit all the vector space properties from $\mathbb{R}^n$: the set of $m \times n$ matrices is itself a vector space.
Matrix Multiplication
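The product of $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is a matrix $C = AB \in \mathbb{R}^{m \times p}$:

$$
c_{ij} = \sum_{k=1}^{n} a_{ik} \, b_{kj}
$$

Each entry $c_{ij}$ is the dot product of row $i$ of $A$ with column $j$ of $B$.
Warning: Matrix multiplication requires the inner dimensions to match: $A$ is $m \times n$ and $B$ is $n \times p$. The result is $m \times p$.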
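To make the entry-wise definition concrete, here is a minimal sketch that multiplies two matrices with explicit loops and compares the result with NumPy's `@` operator. The helper name `matmul_naive` and the example matrices are illustrative assumptions, not part of any library.

```python
import numpy as np

def matmul_naive(A, B):
    """Multiply A (m x n) by B (n x p) directly from the entry-wise definition."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        # The inner dimensions must match, otherwise the product is undefined.
        raise ValueError(f"inner dimensions differ: {n} vs {n_b}")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

# Hypothetical example matrices
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # 2 x 3
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -1.0]])       # 3 x 2

C = matmul_naive(A, B)
print(C.shape)                    # (2, 2)
print(np.allclose(C, A @ B))      # True
```

The triple loop is only for illustration; in practice `A @ B` dispatches to optimized routines and is far faster.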
Four Ways to Think About Matrix Multiplication
Understanding $C = AB$ from multiple angles builds deep intuition:
- Dot product view: $c_{ij}$ = dot product of row $i$ of $A$ with column $j$ of $B$.
- Column view: Each column of $C$ is a linear combination of the columns of $A$, with coefficients from the corresponding column of $B$.
- Row view: Each row of $C$ is a linear combination of the rows of $B$, with coefficients from the corresponding row of $A$.
- Outer product view: $AB = \sum_{k=1}^{n} \mathbf{a}_k \mathbf{b}_k^T$, where $\mathbf{a}_k$ is column $k$ of $A$ and $\mathbf{b}_k^T$ is row $k$ of $B$.
The column view is especially important: it tells us that multiplying $A\mathbf{x}$ produces a linear combination of the columns of $A$.
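The four views can be checked numerically. The sketch below builds the same product four ways and compares each against `A @ B`; the matrices are arbitrary illustrative values.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # 3 x 2
B = np.array([[1.0, 0.0, 2.0],
              [-1.0, 3.0, 1.0]])    # 2 x 3
C = A @ B                            # 3 x 3 reference result

# 1. Dot product view: entry (i, j) is row i of A dotted with column j of B
C_dot = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
                  for i in range(A.shape[0])])

# 2. Column view: column j of C is A applied to column j of B
C_col = np.column_stack([A @ B[:, j] for j in range(B.shape[1])])

# 3. Row view: row i of C is row i of A applied to B
C_row = np.vstack([A[i, :] @ B for i in range(A.shape[0])])

# 4. Outer product view: sum over k of (column k of A) times (row k of B)
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

for name, M in [("dot", C_dot), ("column", C_col),
                ("row", C_row), ("outer", C_outer)]:
    print(name, np.allclose(M, C))   # each line prints True
```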
Properties of Matrix Multiplication
| Property | Statement | Notes |
|---|---|---|
| Associativity | $(AB)C = A(BC)$ | Always holds |
| Distributivity | $A(B + C) = AB + AC$ | Always holds |
| Not commutative | $AB \neq BA$ in general | Critical difference from scalars |
| Scalar compatibility | $c(AB) = (cA)B = A(cB)$ | Always holds |
Warning: The non-commutativity of matrix multiplication is a constant source of bugs. The order matters: $AB$ and $BA$ may not even have the same dimensions, let alone the same value.
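A small sketch of both failure modes, using illustrative matrices: two square matrices whose products differ, and two rectangular matrices whose products have different shapes.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])   # swaps two coordinates

print(A @ B)   # [[2 1], [4 3]]: columns of A swapped
print(B @ A)   # [[3 4], [1 2]]: rows of A swapped
print(np.array_equal(A @ B, B @ A))   # False

# With rectangular factors the two products do not even share a shape.
P = np.ones((2, 3))
Q = np.ones((3, 2))
print((P @ Q).shape, (Q @ P).shape)   # (2, 2) (3, 3)
```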
The Transpose
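The transpose of $A \in \mathbb{R}^{m \times n}$ is $A^T \in \mathbb{R}^{n \times m}$, obtained by swapping rows and columns:

$$
(A^T)_{ij} = a_{ji}
$$

Transpose Properties

$$
(A^T)^T = A, \qquad (A + B)^T = A^T + B^T, \qquad (cA)^T = c A^T, \qquad (AB)^T = B^T A^T
$$

The last rule, the reverse order law, is crucial. When transposing a product, the order reverses. This extends to any number of factors: $(ABC)^T = C^T B^T A^T$.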
Key insight: The dot product of two vectors can be written as a matrix product: $\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v}$. This notation is universal in ML literature.
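As a quick numerical check, the sketch below verifies the reverse order law on random matrices and shows the dot product written as a row vector times a column vector. The seed, shapes, and vectors are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

# Reverse order law: transposing a product reverses the factors
print(np.allclose((A @ B).T, B.T @ A.T))   # True

# Dot product as a matrix product of a 1 x 3 row with a 3 x 1 column
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -1.0, 0.5])
u_col = u.reshape(-1, 1)
v_col = v.reshape(-1, 1)

print(np.dot(u, v))                  # 3.5
print((u_col.T @ v_col).item())      # 3.5, extracted from the 1 x 1 result
```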
Special Matrices
Identity Matrix
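The identity matrix $I_n$ has ones on the diagonal and zeros elsewhere:

$$
I_n = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}
$$

It is the multiplicative identity: $AI = IA = A$.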
Diagonal Matrix
A diagonal matrix has nonzero entries only on the main diagonal:

$$
D = \operatorname{diag}(d_1, d_2, \ldots, d_n)
$$

Left-multiplying by a diagonal matrix scales rows; right-multiplying scales columns. Diagonal matrices are computationally cheap: inversion, powers, and exponentiation all reduce to per-element operations on the diagonal.
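A minimal sketch of these facts, assuming an arbitrary diagonal `d` with nonzero entries: multiplication on the left scales rows, multiplication on the right scales columns, and the inverse and powers act entry by entry on the diagonal.

```python
import numpy as np

d = np.array([2.0, 0.5, 4.0])           # illustrative diagonal entries (all nonzero)
D = np.diag(d)
A = np.arange(1.0, 10.0).reshape(3, 3)

# D @ A scales row i of A by d[i]; A @ D scales column j of A by d[j]
print(np.allclose(D @ A, d[:, None] * A))   # True
print(np.allclose(A @ D, A * d[None, :]))   # True

# Inverse and powers only touch the diagonal entries
print(np.allclose(np.linalg.inv(D), np.diag(1.0 / d)))              # True
print(np.allclose(np.linalg.matrix_power(D, 3), np.diag(d ** 3)))   # True
```

Because of this, code usually stores only the vector `d` and uses broadcasting instead of forming `D` at all.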
Symmetric Matrix
A matrix $A$ is symmetric if $A = A^T$, meaning $a_{ij} = a_{ji}$ for all $i, j$.
Symmetric matrices arise naturally in ML:
- Covariance matrices: $\Sigma$ is always symmetric
- Gram matrices: $X^T X$ is always symmetric
- Hessian matrices: The matrix of second derivatives of a smooth function
Symmetric matrices have remarkable properties: all eigenvalues are real, eigenvectors can be chosen orthogonal, and they can always be diagonalized. We explore this in detail in the eigenvalues article.
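The sketch below illustrates these properties on a sample covariance matrix. It assumes the usual estimator (mean-centered data, divided by $n - 1$) and synthetic random data; `np.linalg.eigh` is NumPy's eigensolver for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))        # synthetic data: 100 samples, 3 features

# Sample covariance from mean-centered data (assumed n - 1 normalization)
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (X.shape[0] - 1)

print(np.allclose(S, S.T))                         # True: symmetric
print(np.allclose(S, np.cov(X, rowvar=False)))     # True: matches np.cov

# eigh returns real eigenvalues and orthonormal eigenvectors
eigvals, V = np.linalg.eigh(S)
print(np.isrealobj(eigvals))                          # True
print(np.allclose(V.T @ V, np.eye(3)))                # True: orthonormal columns
print(np.allclose(V @ np.diag(eigvals) @ V.T, S))     # True: S is diagonalized
```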
Orthogonal Matrix
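A square matrix $Q$ is orthogonal if its columns are orthonormal vectors:

$$
Q^T Q = Q Q^T = I
$$

This means $Q^{-1} = Q^T$: the inverse is just the transpose, so it costs essentially nothing to obtain.
Orthogonal matrices preserve lengths and angles: $\|Q\mathbf{x}\| = \|\mathbf{x}\|$ and $(Q\mathbf{x})^T (Q\mathbf{y}) = \mathbf{x}^T \mathbf{y}$. Geometrically, they represent rotations and reflections.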
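A 2D rotation is a convenient example of an orthogonal matrix. The sketch below, with an arbitrarily chosen angle and vectors, checks that the transpose acts as the inverse and that norms and dot products are unchanged.

```python
import numpy as np

theta = np.pi / 6   # illustrative rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))         # True: columns are orthonormal
print(np.allclose(np.linalg.inv(Q), Q.T))      # True: inverse equals transpose

x = np.array([3.0, 4.0])
y = np.array([-1.0, 2.0])

# Lengths and dot products (hence angles) are preserved
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))   # True
print(np.isclose((Q @ x) @ (Q @ y), x @ y))                   # True
```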
Triangular Matrices
An upper triangular matrix has zeros below the diagonal. A lower triangular matrix has zeros above. They arise in Gaussian elimination (LU decomposition) and are efficient to solve against.
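To see why triangular systems are cheap, here is a minimal back-substitution sketch for an upper triangular system. The helper name `back_substitution` and the example system are illustrative assumptions; it presumes a nonzero diagonal and does no pivoting or error checking.

```python
import numpy as np

def back_substitution(U, b):
    """Solve U x = b for upper triangular U with a nonzero diagonal."""
    n = U.shape[0]
    x = np.zeros(n)
    # Work from the last equation upward: each row has one new unknown.
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

U = np.array([[2.0, 1.0, -1.0],
              [0.0, 3.0,  2.0],
              [0.0, 0.0,  4.0]])
b = np.array([3.0, 7.0, 8.0])

x = back_substitution(U, b)
print(x)                                        # [2. 1. 2.]
print(np.allclose(U @ x, b))                    # True
print(np.allclose(x, np.linalg.solve(U, b)))    # True
```

Each unknown costs one short dot product, so the whole solve takes on the order of $n^2$ operations instead of the $n^3$ needed for a general dense system.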
Positive Definite and Positive Semi-Definite
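A symmetric matrix $A$ is:
- Positive definite (PD) if $\mathbf{x}^T A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$
- Positive semi-definite (PSD) if $\mathbf{x}^T A \mathbf{x} \geq 0$ for all $\mathbf{x}$
Covariance matrices are always PSD. A positive definite matrix has all positive eigenvalues, guaranteeing that optimization problems like $\min_{\mathbf{x}} \tfrac{1}{2} \mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}$ have a unique minimum.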
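Key insight: When the gradient of a loss function vanishes at a point and the Hessian matrix is positive definite there, that point is a strict local minimum. This second-order condition is how we confirm that a stationary point found by an optimizer is a minimum rather than a saddle point or a maximum.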
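One common way to test positive definiteness numerically is to attempt a Cholesky factorization, which succeeds only for positive definite input. The sketch below wraps that in a hypothetical helper `is_positive_definite` and then minimizes a quadratic of the form $\tfrac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}$; the matrices and the vector `b` are illustrative.

```python
import numpy as np

def is_positive_definite(A):
    """Return True if A is symmetric and its Cholesky factorization succeeds."""
    if not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A_pd = np.array([[2.0, -1.0],
                 [-1.0, 2.0]])     # eigenvalues 1 and 3
A_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])   # eigenvalues -1 and 3

print(is_positive_definite(A_pd))      # True
print(is_positive_definite(A_indef))   # False
print(np.linalg.eigvalsh(A_pd))        # eigenvalues in ascending order

# For PD A, the quadratic has its unique minimum where the gradient A x - b is zero.
b = np.array([1.0, 0.0])
x_star = np.linalg.solve(A_pd, b)

def f(x):
    return 0.5 * x @ A_pd @ x - b @ x

print(f(x_star) < f(x_star + np.array([0.1, -0.2])))   # True: any step increases f
```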
The Inverse
The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$, if it exists, is the unique matrix $A^{-1}$ satisfying:

$$
A A^{-1} = A^{-1} A = I
$$

A matrix is invertible (or nonsingular) if and only if its determinant is nonzero; equivalently, its rank equals $n$, or its columns are linearly independent.
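Inverse Properties

$$
(A^{-1})^{-1} = A, \qquad (A^T)^{-1} = (A^{-1})^T, \qquad (AB)^{-1} = B^{-1} A^{-1}
$$

Note the reverse order again: $(AB)^{-1} = B^{-1} A^{-1}$.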
Warning: In practice, you almost never compute $A^{-1}$ explicitly. Solving $A\mathbf{x} = \mathbf{b}$ via factorization (LU, QR, Cholesky) is faster and more numerically stable. The notation $\mathbf{x} = A^{-1}\mathbf{b}$ means “solve the system,” not “compute the inverse and multiply.”
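In NumPy terms, this means preferring `np.linalg.solve` over `np.linalg.inv` followed by a multiplication. The sketch below runs both on a synthetic, deliberately well-conditioned system (random entries plus a large diagonal, an assumption made so that both approaches succeed) and compares residuals.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Synthetic test matrix: the added n * I keeps it comfortably invertible
A = rng.standard_normal((n, n)) + n * np.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true

# Preferred: factorize and solve without ever forming the inverse
x_solve = np.linalg.solve(A, b)

# Discouraged: form the inverse explicitly, then multiply
x_inv = np.linalg.inv(A) @ b

print(np.allclose(x_solve, x_true))   # True
print(np.allclose(x_inv, x_true))     # True on this easy problem

# Residual norms measure how well each answer satisfies A x = b
print(np.linalg.norm(A @ x_solve - b), np.linalg.norm(A @ x_inv - b))
```

On a well-conditioned system like this one the answers agree; the gap in accuracy and cost tends to show up as matrices become larger or closer to singular.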
The Trace
The trace of a square matrix $A \in \mathbb{R}^{n \times n}$ is the sum of its diagonal entries:

$$
\operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii}
$$
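Trace Properties

$$
\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B), \qquad \operatorname{tr}(cA) = c \operatorname{tr}(A), \qquad \operatorname{tr}(A^T) = \operatorname{tr}(A)
$$

$$
\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB)
$$

The cyclic property is used extensively in matrix calculus and ML derivations.
The trace also equals the sum of eigenvalues: $\operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i$.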
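Both facts are easy to check numerically. In the sketch below the three cyclic products have different sizes (2 x 2, 3 x 3, and 4 x 4) yet share one trace; the random matrices and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

# Cyclic property: rotating the factors leaves the trace unchanged
t_abc = np.trace(A @ B @ C)   # trace of a 2 x 2 matrix
t_bca = np.trace(B @ C @ A)   # trace of a 3 x 3 matrix
t_cab = np.trace(C @ A @ B)   # trace of a 4 x 4 matrix
print(np.isclose(t_abc, t_bca), np.isclose(t_abc, t_cab))   # True True

# Trace equals the sum of eigenvalues (complex pairs cancel their imaginary parts)
M = rng.standard_normal((4, 4))
eig_sum = np.linalg.eigvals(M).sum()
print(np.isclose(np.trace(M), eig_sum.real))   # True
```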
Matrix Rank
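The rank of a matrix $A \in \mathbb{R}^{m \times n}$ is the number of linearly independent columns (equivalently, the number of linearly independent rows; these are always equal). It can never exceed the smaller dimension:

$$
\operatorname{rank}(A) \leq \min(m, n)
$$

| Condition | Name |
|---|---|
| $\operatorname{rank}(A) = \min(m, n)$ | Full rank |
| $\operatorname{rank}(A) < \min(m, n)$ | Rank deficient |
| $\operatorname{rank}(A) = n$ for $m \geq n$ | Full column rank |
| $\operatorname{rank}(A) = m$ for $m \leq n$ | Full row rank |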
A square matrix is invertible if and only if it has full rank. In ML, rank deficiency signals redundant features (multicollinearity) or a degenerate covariance structure.
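A minimal sketch of rank deficiency caused by a redundant feature: in the made-up matrix below the third column is the sum of the first two, so the rank drops to 2 and the Gram matrix becomes singular.

```python
import numpy as np

# Illustrative dataset: 4 samples, 3 features, third feature = first + second
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 1.0, 4.0]])

print(np.linalg.matrix_rank(X))         # 2: rank deficient
print(np.linalg.matrix_rank(X.T @ X))   # 2: the 3 x 3 Gram matrix is singular

# Replacing the redundant column with an independent one restores full column rank
X_full = X.copy()
X_full[:, 2] = [1.0, 0.0, 0.0, 0.0]
print(np.linalg.matrix_rank(X_full))    # 3
```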
Worked Example: Matrix Multiplication in NumPy
```python
import numpy as np

# Dataset: 3 samples, 2 features
X = np.array([[1, 2],
              [3, 4],
              [5, 6]])

# Weight vector for linear model (2 x 1 column)
w = np.array([[0.5],
              [-0.3]])

# Predictions: matrix-vector product
y_hat = X @ w
print(y_hat)
# [[-0.1], [0.3], [0.7]]

# Gram matrix: X^T X (2x2 symmetric matrix)
gram = X.T @ X
print(gram)
# [[35, 44],
#  [44, 56]]

# Verify symmetry
print(np.allclose(gram, gram.T))  # True
```
Why This Matters for ML
- Data representation: The entire training set is a matrix $X \in \mathbb{R}^{n \times d}$.
- Linear models: A prediction is $\hat{\mathbf{y}} = X\mathbf{w}$, a matrix-vector product.
- Neural networks: Each layer computes $\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$, a matrix multiplication followed by a nonlinearity.
- Covariance: The covariance matrix $\Sigma$ encodes feature correlations.
- Optimization: The Hessian matrix determines the curvature of the loss landscape.
- GPU acceleration: Modern ML is fast because matrix multiplication is massively parallelizable on GPUs.
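As one concrete instance of the neural network item, the sketch below runs a single dense layer over a whole batch with one matrix multiplication. The weights, bias, and the choice of ReLU as the nonlinearity are illustrative assumptions.

```python
import numpy as np

# Batch of 3 samples with 2 features each (rows are data points)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Hypothetical layer with 3 units: W is (units x features), b is (units,)
W = np.array([[0.5, -0.3],
              [-1.0, 1.0],
              [0.0, 0.0]])
b = np.array([0.0, 0.5, 1.0])

def relu(z):
    return np.maximum(0.0, z)

# Whole batch at once: row i of H is relu(W @ X[i] + b)
H = relu(X @ W.T + b)
print(H.shape)   # (3, 3)

# Same result one sample at a time
H_loop = np.vstack([relu(W @ x + b) for x in X])
print(np.allclose(H, H_loop))   # True
```

Writing the layer as `X @ W.T` rather than a Python loop is what lets the work be handed to optimized, parallel matrix multiplication routines.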
Summary
- A matrix is a rectangular array of numbers, representing data, transformations, or relationships.
- Matrix multiplication is not commutative — order matters.
- The transpose reverses order in products: $(AB)^T = B^T A^T$.
- Special matrices (symmetric, orthogonal, diagonal, PD) have properties that ML exploits heavily.
- The inverse solves linear systems but should rarely be computed explicitly.
- The trace has a cyclic property essential for matrix calculus.
- The rank measures the true dimensionality of the information in a matrix.
- Next, we use matrices to represent and solve systems of linear equations.