Introduction
So far, we have treated matrices as static tables of numbers. But a matrix is also a function — it takes a vector as input and produces a vector as output. This perspective transforms linear algebra from bookkeeping into geometry: matrices rotate, stretch, reflect, project, and shear space.
A typical neural network layer applies a linear (strictly speaking, affine) transformation followed by a nonlinearity. Understanding what linear transformations can and cannot do explains why nonlinearities are necessary and what each layer “sees.” This article builds on determinants and connects directly to how ML models manipulate data.
Definition
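A function $T: \mathbb{R}^n \to \mathbb{R}^m$ is a linear transformation if it satisfies two properties for all vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ and scalars $c$:
- Additivity: $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$
- Homogeneity: $T(c\mathbf{v}) = c\,T(\mathbf{v})$
These can be combined into a single condition:
$$T(a\mathbf{u} + b\mathbf{v}) = a\,T(\mathbf{u}) + b\,T(\mathbf{v}) \quad \text{for all scalars } a, b$$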
A linear transformation preserves linear combinations. This is both its strength (predictable, analyzable) and its limitation (it cannot model nonlinear patterns alone).
Key insight: Every linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$ can be represented as multiplication by a unique $m \times n$ matrix. Conversely, every $m \times n$ matrix defines a linear transformation. Matrices and linear transformations are two views of the same object.
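The combined condition can be spot-checked numerically. The sketch below is illustrative only: the helper `is_linear`, the matrix `A`, and the vector `shift` are hypothetical names and values chosen for the example. A random test like this can refute linearity, but passing it does not prove linearity.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable


def is_linear(f, n, trials=100, tol=1e-9):
    """Test f(a*u + b*v) == a*f(u) + b*f(v) on random inputs from R^n."""
    for _ in range(trials):
        u, v = rng.normal(size=n), rng.normal(size=n)
        a, b = rng.normal(size=2)
        if not np.allclose(f(a * u + b * v), a * f(u) + b * f(v), atol=tol):
            return False
    return True


A = np.array([[2.0, 1.0, 0.0],
              [0.0, -1.0, 3.0]])   # arbitrary 2x3 example matrix
shift = np.array([1.0, -2.0])      # arbitrary translation vector


def linear_map(x):
    return A @ x                   # T(x) = Ax


def shifted_map(x):
    return A @ x + shift           # adds a translation, so T(0) != 0


print(is_linear(linear_map, 3))    # True
print(is_linear(shifted_map, 3))   # False
```

The shifted map fails because the two sides differ by a multiple of `shift` whenever `a + b != 1`.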
From Transformation to Matrix
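Given a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$, its matrix representation $A$ is built from where it sends the standard basis vectors $\mathbf{e}_1, \dots, \mathbf{e}_n$:
$$A = \begin{pmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{pmatrix}$$
Column $j$ of $A$ is the image of the $j$-th standard basis vector under $T$.
Then for any vector $\mathbf{x} = x_1\mathbf{e}_1 + \cdots + x_n\mathbf{e}_n$:
$$T(\mathbf{x}) = x_1 T(\mathbf{e}_1) + \cdots + x_n T(\mathbf{e}_n) = A\mathbf{x}$$
This is the column view of matrix multiplication: $A\mathbf{x}$ is a linear combination of the columns of $A$.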
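As a minimal sketch of this construction, the code below recovers the matrix of a map that is defined only as a Python function. The function `T` is a hypothetical example, not something from the text above.

```python
import numpy as np


def T(v):
    """Hypothetical linear map from R^2 to R^3, written without a matrix."""
    x, y = v
    return np.array([x + 2 * y, 3 * x, -y])


n = 2
basis = np.eye(n)                             # rows are e_1 and e_2
A = np.column_stack([T(e) for e in basis])    # column j is T(e_j)

x = np.array([4.0, -1.0])
print(A)
print(np.allclose(A @ x, T(x)))               # matrix and function agree
# Column view: Ax is x_1 * (column 1) + x_2 * (column 2)
print(np.allclose(A @ x, x[0] * A[:, 0] + x[1] * A[:, 1]))
```

Only the $n$ basis images are needed, because linearity determines the output for every other input.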
Gallery of 2D Transformations
Visualizing 2D transformations builds geometric intuition that extends to higher dimensions.
Scaling
$$S = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}$$
Stretches by $s_x$ along the $x$-axis and $s_y$ along the $y$-axis. Uniform scaling ($s_x = s_y$) preserves shape; non-uniform scaling distorts it.
Rotation
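Rotation by angle $\theta$ counterclockwise:
$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
$\det R(\theta) = 1$, so rotations preserve area and orientation. $R(\theta)$ is an orthogonal matrix: $R(\theta)^\top R(\theta) = I$.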
Reflection
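Reflection across the $x$-axis:
$$\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$$
Reflection across a line through the origin at angle $\theta$:
$$\begin{pmatrix} \cos 2\theta & \sin 2\theta \\ \sin 2\theta & -\cos 2\theta \end{pmatrix}$$
The determinant is $-1$ in both cases, so reflections reverse orientation.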
Shear
Horizontal shear by factor $k$:
$$\begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix}$$
Shears shift points parallel to one axis. The determinant is $1$, so area is preserved.
Projection
Projection onto the $x$-axis:
$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$$
The determinant is $0$, so projections collapse a dimension and are not invertible.
| Transformation | Invertible | Preserves | Determinant |
|---|---|---|---|
| Rotation | Yes | Lengths, angles, area | $1$ |
| Reflection | Yes | Lengths, angles, area (reverses orientation) | $-1$ |
| Scaling | Yes (if $s_x, s_y \neq 0$) | Angles (if uniform) | $s_x s_y$ |
| Shear | Yes | Area | $1$ |
| Projection | No | Nothing fully | $0$ |
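The properties in the table can be checked numerically. The following sketch builds one example of each transformation and reports its determinant, whether it is invertible, and whether it is orthogonal (which is what preserving lengths and angles amounts to). The angle, shear factor, scale factors, and the choice of the $x$-axis for the reflection and projection are arbitrary assumptions made for illustration.

```python
import numpy as np

theta, k = np.pi / 6, 1.5                 # arbitrary angle and shear factor
c, s = np.cos(theta), np.sin(theta)

gallery = {
    "rotation":   np.array([[c, -s], [s, c]]),
    "reflection": np.array([[1.0, 0.0], [0.0, -1.0]]),   # across the x-axis
    "scaling":    np.array([[2.0, 0.0], [0.0, 3.0]]),
    "shear":      np.array([[1.0, k], [0.0, 1.0]]),
    "projection": np.array([[1.0, 0.0], [0.0, 0.0]]),    # onto the x-axis
}

for name, M in gallery.items():
    det = np.linalg.det(M)
    invertible = not np.isclose(det, 0.0)
    # M^T M = I holds exactly when M preserves lengths and angles
    orthogonal = np.allclose(M.T @ M, np.eye(2))
    print(f"{name:10s} det={det:+.3f} invertible={invertible} orthogonal={orthogonal}")
```

Only the rotation and the reflection come out orthogonal; the sign of the determinant separates the two.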
Composition of Transformations
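Applying transformation $T_1$ (matrix $A$) followed by $T_2$ (matrix $B$) is the composition $T_2 \circ T_1$, represented by the product $BA$:
$$(T_2 \circ T_1)(\mathbf{x}) = B(A\mathbf{x}) = (BA)\mathbf{x}$$
Warning: Note the order. $A$ is applied first but appears on the right in the matrix product, because the matrix closest to the vector acts on it first. In general $BA \neq AB$, so swapping the order changes the result.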
This is exactly what happens in a neural network: each layer applies a transformation, and the overall network is a composition of transformations (with nonlinearities between them).
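A small sketch of the ordering rule, using two arbitrary example matrices (a 90° rotation and a horizontal stretch, chosen only for illustration):

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])    # first map: rotate 90 degrees counterclockwise
B = np.array([[2.0, 0.0],
              [0.0, 1.0]])     # second map: stretch the x-coordinate by 2

x = np.array([1.0, 0.0])

step_by_step = B @ (A @ x)     # apply A, then B
composed = (B @ A) @ x         # one matrix for "A then B"
wrong_order = (A @ B) @ x      # this computes "B then A" instead

print(step_by_step, composed, wrong_order)
print(np.allclose(step_by_step, composed))     # True
print(np.allclose(composed, wrong_order))      # False
```

Rotating first sends the vector to $(0, 1)$, which the stretch leaves alone; stretching first gives $(2, 0)$, which then rotates to $(0, 2)$.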
The Image and Kernel
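For a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ with matrix $A$:
The image (or range) is the set of all possible outputs:
$$\operatorname{Im}(T) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$$
This is the column space of $A$.
The kernel (or null space) is the set of inputs that map to zero:
$$\ker(T) = \{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} = \mathbf{0}\}$$
The rank-nullity theorem connects them:
$$\dim \operatorname{Im}(T) + \dim \ker(T) = n$$
Key insight: A transformation is injective (one-to-one) if and only if $\ker(T) = \{\mathbf{0}\}$, meaning nothing nonzero gets mapped to zero. It is surjective (onto) if and only if $\operatorname{Im}(T) = \mathbb{R}^m$, meaning every output is reachable.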
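One way to compute these quantities numerically is through the singular value decomposition: the rank is the number of nonzero singular values, and the right singular vectors belonging to zero singular values span the kernel. The matrix below is an arbitrary rank-deficient example, and the tolerance is one common choice rather than the only valid one.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])    # row 2 is twice row 1, so the rank is 2

n = A.shape[1]
rank = np.linalg.matrix_rank(A)    # dimension of the image (column space)

# Rows of Vt whose singular value is numerically zero span the kernel.
_, sing_vals, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * sing_vals.max()
kernel_basis = Vt[np.sum(sing_vals > tol):].T    # columns span ker(A)
nullity = kernel_basis.shape[1]

print(rank, nullity, rank + nullity == n)        # rank-nullity check
print(np.allclose(A @ kernel_basis, 0.0))        # kernel vectors map to zero
```

Here the kernel is one-dimensional, so the map is not injective; the rank is below 3, so it is not surjective either.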
Change of Basis
A linear transformation is an intrinsic geometric operation, but its matrix representation depends on the choice of basis. If $A$ is the matrix of $T$ in the standard basis and $P$ is a change-of-basis matrix (columns are the new basis vectors), then the matrix of $T$ in the new basis is:
$$A' = P^{-1} A P$$
This operation is called a similarity transformation. Matrices related by similarity represent the same linear transformation in different coordinate systems.
Key insight: Diagonalization is finding a basis where the transformation is just scaling along each axis. If $A = P D P^{-1}$ where $D$ is diagonal, then in the basis defined by the columns of $P$, the transformation simply scales each coordinate independently.
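The sketch below illustrates a similarity transformation for one arbitrary symmetric matrix, using its eigenvectors as the new basis. It assumes the matrix is diagonalizable, which holds for this example but not for every matrix.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # example matrix with eigenvalues 1 and 3

eigvals, P = np.linalg.eig(A)       # columns of P are eigenvectors (new basis)
P_inv = np.linalg.inv(P)

D = P_inv @ A @ P                   # the same map expressed in the eigenbasis

print(np.round(D, 6))
print(np.allclose(D, np.diag(eigvals)))    # off-diagonal entries vanish
print(np.allclose(P @ D @ P_inv, A))       # changing back recovers A
```

In the eigenbasis the map only scales each coordinate, by the corresponding eigenvalue.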
Affine Transformations
In practice, ML models use affine transformations, which combine a linear transformation with a translation:
$$f(\mathbf{x}) = W\mathbf{x} + \mathbf{b}$$
This is not linear unless $\mathbf{b} = \mathbf{0}$ (it does not preserve the origin), but it is the core computation of every fully connected neural network layer.
The translation $\mathbf{b}$ (the bias term) lets the model shift the decision boundary away from the origin. Without it, every decision hyperplane would pass through the origin, which is a severe limitation.
Homogeneous Coordinates
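Affine transformations can be made linear by adding an extra dimension. Appending a constant $1$ to the input turns $\mathbf{x} \mapsto W\mathbf{x} + \mathbf{b}$ into a single matrix multiplication:
$$\begin{pmatrix} W & \mathbf{b} \\ \mathbf{0}^\top & 1 \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} = \begin{pmatrix} W\mathbf{x} + \mathbf{b} \\ 1 \end{pmatrix}$$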
This trick is standard in computer graphics and sometimes appears in ML theory.
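A minimal sketch of the homogeneous-coordinates construction, with arbitrary example values for the linear part `W`, the translation `b`, and the input `x`:

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [0.0, 1.0]])     # arbitrary linear part
b = np.array([3.0, -1.0])      # arbitrary translation
x = np.array([2.0, 5.0])

# Augmented matrix [[W, b], [0, 1]] acting on the augmented vector [x, 1]
M = np.block([[W, b[:, None]],
              [np.zeros((1, 2)), np.ones((1, 1))]])
x_h = np.append(x, 1.0)

y_h = M @ x_h
print(y_h)                                  # last entry stays 1
print(np.allclose(y_h[:2], W @ x + b))      # first entries equal Wx + b
```

Because the last coordinate stays at 1, augmented matrices can be multiplied together to compose several affine maps.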
Linear Transformations in Neural Networks
A neural network layer computes:
$$\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$$
The part $W\mathbf{x} + \mathbf{b}$ is an affine transformation. The nonlinearity $\sigma$ (ReLU, sigmoid, etc.) is applied element-wise.
Why are nonlinearities essential? Because the composition of linear transformations is still linear:
$$W_2(W_1\mathbf{x}) = (W_2 W_1)\mathbf{x}$$
Without nonlinearities, a 100-layer network would be equivalent to a single matrix multiplication (plus a single bias vector, if the layers have biases). The nonlinearities break linearity and allow the network to learn complex, curved decision boundaries.
Key insight: Each layer of a neural network rotates, stretches, and shifts the data (linear/affine part), then bends and folds the space (nonlinearity). Stacking many such operations creates the complex mappings that make deep learning powerful.
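The collapse of stacked affine layers can be verified directly. In the sketch below the weights, biases, and inputs are small hand-picked values chosen only for illustration; the merged bias `W2 @ b1 + b2` follows from expanding $W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [0.5,  2.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])

X = np.array([[1.0, 2.0],
              [2.0, 0.0]])         # two input vectors, one per row


def relu(z):
    return np.maximum(z, 0.0)


# Two affine layers with no nonlinearity in between ...
two_layers = (X @ W1.T + b1) @ W2.T + b2
# ... equal one affine layer with merged weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = X @ W.T + b

# Inserting ReLU between the layers breaks the equivalence.
with_relu = relu(X @ W1.T + b1) @ W2.T + b2

print(np.allclose(two_layers, one_layer))    # True
print(np.allclose(with_relu, one_layer))     # False
```

The first input produces a negative pre-activation that ReLU clips to zero, which is why the ReLU network's output differs from the merged layer's.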
Worked Example: Rotation Followed by Scaling
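Apply a 45° rotation followed by scaling by 2 along the $x$-axis and 0.5 along the $y$-axis.
Rotation matrix:
$$R = \begin{pmatrix} \cos 45^\circ & -\sin 45^\circ \\ \sin 45^\circ & \cos 45^\circ \end{pmatrix} = \frac{\sqrt{2}}{2} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$$
Scaling matrix:
$$S = \begin{pmatrix} 2 & 0 \\ 0 & 0.5 \end{pmatrix}$$
Combined transformation (scaling after rotation):
$$A = SR = \begin{pmatrix} \sqrt{2} & -\sqrt{2} \\ \sqrt{2}/4 & \sqrt{2}/4 \end{pmatrix} \approx \begin{pmatrix} 1.414 & -1.414 \\ 0.354 & 0.354 \end{pmatrix}$$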
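```python
import numpy as np

theta = np.pi / 4  # 45 degrees
# Counterclockwise rotation by theta
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# Scale x by 2 and y by 0.5
S = np.array([[2.0, 0.0],
              [0.0, 0.5]])

A = S @ R  # rotation is applied first, so R sits on the right
print(np.round(A, 3))
# [[ 1.414 -1.414]
#  [ 0.354  0.354]]

# Apply to a unit vector
x = np.array([1, 0])
print(np.round(A @ x, 3))  # [1.414 0.354]
```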
Why This Matters for ML
- Neural network layers are affine transformations followed by nonlinearities. Understanding what linear maps can do helps you understand network capacity.
- Data preprocessing: Standardization, whitening, and PCA are linear transformations applied to data before training.
- Attention mechanisms in Transformers use linear projections (query, key, value matrices) to transform token representations.
- Convolutional layers are a special case of linear transformation with weight sharing and locality constraints.
- Feature maps: Each layer’s weight matrix defines which linear combinations of input features to extract.
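As one concrete instance of the preprocessing point, standardization can be written in the form $W\mathbf{x} + \mathbf{b}$: a diagonal scaling plus a shift that centers the data. Strictly speaking this makes it affine rather than linear, because of the mean subtraction. The data matrix below is made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])          # three samples, two features

mu, sigma = X.mean(axis=0), X.std(axis=0)

W = np.diag(1.0 / sigma)             # linear part: per-feature scaling
b = -W @ mu                          # translation that centers the data

Z = X @ W.T + b                      # standardization written as Wx + b
print(np.allclose(Z, (X - mu) / sigma))
print(np.allclose(Z.mean(axis=0), 0.0), np.allclose(Z.std(axis=0), 1.0))
```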
Summary
- A linear transformation is a function that preserves linear combinations: $T(a\mathbf{u} + b\mathbf{v}) = a\,T(\mathbf{u}) + b\,T(\mathbf{v})$.
- Every linear transformation corresponds to a unique matrix.
- Standard transformations include rotations, reflections, scaling, shear, and projections.
- Composition of transformations corresponds to matrix multiplication (applied right to left).
- The image is the column space (reachable outputs) and the kernel is the null space (inputs mapped to zero).
- Change of basis gives different matrix representations of the same transformation.
- Neural networks compose affine transformations with nonlinearities — without the nonlinearities, depth would be meaningless.
- Next, we formalize angles and distances in inner products, norms, and orthogonality.