Linear Transformations: Matrices as Functions

See matrices as geometric transformations — rotations, reflections, projections, and shears — and understand the connection to neural networks.

Linear Algebra March 6, 2026 8 min read

Introduction

So far, we have treated matrices as static tables of numbers. But a matrix is also a function — it takes a vector as input and produces a vector as output. This perspective transforms linear algebra from bookkeeping into geometry: matrices rotate, stretch, reflect, project, and shear space.

Every layer of a neural network applies a linear transformation followed by a nonlinearity. Understanding what linear transformations can and cannot do explains why nonlinearities are necessary and what each layer “sees.” This article builds on determinants and connects directly to how ML models manipulate data.

Definition

A function $T: \mathbb{R}^n \to \mathbb{R}^m$ is a linear transformation if it satisfies two properties for all vectors $\mathbf{u}, \mathbf{v}$ and scalars $c$:

  1. Additivity: $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$
  2. Homogeneity: $T(c\mathbf{u}) = cT(\mathbf{u})$

These can be combined into a single condition:

$$T(c_1\mathbf{u} + c_2\mathbf{v}) = c_1 T(\mathbf{u}) + c_2 T(\mathbf{v})$$

A linear transformation preserves linear combinations. This is both its strength (predictable, analyzable) and its limitation (it cannot model nonlinear patterns alone).
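The combined condition can be spot-checked numerically. The sketch below is illustrative only: the matrix `M`, the maps `T` and `shifted`, and the helper `preserves_combinations` are hypothetical names introduced here, and passing a random test suggests linearity but does not prove it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example map: multiplication by a fixed 2x3 matrix
M = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])

def T(x):
    return M @ x

def shifted(x):
    # Adds a constant vector, so it should fail the linearity check
    return M @ x + np.array([1.0, 1.0])

def preserves_combinations(f, n, trials=100):
    """Check f(c1*u + c2*v) == c1*f(u) + c2*f(v) on random inputs."""
    for _ in range(trials):
        u, v = rng.normal(size=n), rng.normal(size=n)
        c1, c2 = rng.normal(size=2)
        if not np.allclose(f(c1 * u + c2 * v), c1 * f(u) + c2 * f(v)):
            return False
    return True

print(preserves_combinations(T, 3))        # True
print(preserves_combinations(shifted, 3))  # False
```

The shifted map fails because the constant term is counted once on the left-hand side but scaled by $c_1 + c_2$ on the right.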

Key insight: Every linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$ can be represented as multiplication by a unique $m \times n$ matrix. Conversely, every $m \times n$ matrix defines a linear transformation. Matrices and linear transformations are two views of the same object.

From Transformation to Matrix

Given a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$, its matrix representation is built from where it sends the standard basis vectors:

$$\mathbf{A} = \begin{bmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{bmatrix}$$

Column $j$ of $\mathbf{A}$ is the image of the $j$-th standard basis vector under $T$.

Then for any vector $\mathbf{x}$:

$$T(\mathbf{x}) = \mathbf{A}\mathbf{x} = x_1 T(\mathbf{e}_1) + x_2 T(\mathbf{e}_2) + \cdots + x_n T(\mathbf{e}_n)$$

This is the column view of matrix multiplication: $\mathbf{A}\mathbf{x}$ is a linear combination of the columns of $\mathbf{A}$.
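As a minimal sketch of this construction, the snippet below recovers the matrix of a map by feeding it the standard basis vectors. The function `T` is a hypothetical example written without an explicit matrix; any linear function of the same shape would do.

```python
import numpy as np

def T(x):
    # Hypothetical linear map R^3 -> R^2, written without a matrix
    return np.array([x[0] + 2.0 * x[1], 3.0 * x[2] - x[1]])

n = 3
# Column j of A is T(e_j); the rows of np.eye(n) are the standard basis vectors
A = np.column_stack([T(e) for e in np.eye(n)])
print(A)

x = np.array([1.0, -2.0, 0.5])
print(np.allclose(A @ x, T(x)))   # True

# Column view: A @ x is a linear combination of the columns of A
combo = sum(x[j] * A[:, j] for j in range(n))
print(np.allclose(A @ x, combo))  # True
```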

Visualizing 2D transformations builds geometric intuition that extends to higher dimensions.

Scaling

$$\mathbf{A} = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}$$

Stretches by $s_x$ along the $x$-axis and $s_y$ along the $y$-axis. Uniform scaling ($s_x = s_y$) preserves shape; non-uniform scaling distorts it.

Rotation

Rotation by angle $\theta$ counterclockwise:

$$\mathbf{R}_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

$\det(\mathbf{R}_\theta) = 1$, so rotations preserve area and orientation. $\mathbf{R}_\theta$ is an orthogonal matrix: $\mathbf{R}_\theta^{-1} = \mathbf{R}_\theta^T = \mathbf{R}_{-\theta}$.
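These rotation properties are easy to confirm numerically. The following is a small sketch; the helper name `rotation` and the test angle are arbitrary choices made for illustration.

```python
import numpy as np

def rotation(theta):
    """2D counterclockwise rotation matrix for an angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

theta = 0.7  # arbitrary angle in radians
R = rotation(theta)

print(np.isclose(np.linalg.det(R), 1.0))       # True
print(np.allclose(np.linalg.inv(R), R.T))      # True
print(np.allclose(R.T, rotation(-theta)))      # True

# Lengths are preserved: ||R x|| equals ||x||
x = np.array([3.0, 4.0])
print(np.isclose(np.linalg.norm(R @ x), 5.0))  # True
```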

Reflection

Reflection across the $x$-axis:

$$\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

Reflection across a line through the origin at angle $\theta/2$:

$$\begin{bmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{bmatrix}$$

$\det = -1$, so reflections reverse orientation.
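One way to sanity-check the general reflection formula is to test what a reflection must do: leave points on the mirror line fixed and undo itself when applied twice. The sketch below uses a hypothetical helper `reflection` and picks $\theta = \pi/2$, so the mirror line is $y = x$.

```python
import numpy as np

def reflection(theta):
    """Reflection across the line through the origin at angle theta / 2."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,  s],
                     [s, -c]])

theta = np.pi / 2            # mirror line at 45 degrees, i.e. y = x
F = reflection(theta)

# A unit vector lying on the mirror line
on_line = np.array([np.cos(theta / 2), np.sin(theta / 2)])

print(np.allclose(F @ on_line, on_line))                   # True: points on the line stay put
print(np.allclose(F @ np.array([1.0, 0.0]), [0.0, 1.0]))   # True: e1 is sent to e2
print(np.allclose(F @ F, np.eye(2)))                       # True: reflecting twice undoes it
print(np.isclose(np.linalg.det(F), -1.0))                  # True
```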

Shear

Horizontal shear by factor $k$:

$$\begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}$$

Shears shift points parallel to one axis: this horizontal shear sends $(x, y)$ to $(x + ky, y)$. $\det = 1$, so area is preserved.

Projection

Projection onto the $x$-axis:

$$\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$

$\det = 0$, so projections collapse a dimension and are not invertible.

| Transformation | $\det$ | Invertible | Preserves |
| --- | --- | --- | --- |
| Rotation | $1$ | Yes | Lengths, angles, area |
| Reflection | $-1$ | Yes | Lengths, angles, area (reverses orientation) |
| Scaling | $s_x s_y$ | Yes (if $s_x, s_y \neq 0$) | Angles (if uniform) |
| Shear | $1$ | Yes | Area |
| Projection | $0$ | No | Nothing fully |
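The determinant column can be reproduced with a few lines of NumPy. This is a sketch with arbitrary example parameters (`theta`, `k`, and the scaling factors are chosen here for illustration); the orthogonality check $\mathbf{M}^T\mathbf{M} = \mathbf{I}$ is the standard test for preserving lengths and angles.

```python
import numpy as np

theta = np.pi / 3   # arbitrary example angle
k = 1.5             # arbitrary shear factor

transforms = {
    "rotation":   np.array([[np.cos(theta), -np.sin(theta)],
                            [np.sin(theta),  np.cos(theta)]]),
    "reflection": np.array([[np.cos(theta),  np.sin(theta)],
                            [np.sin(theta), -np.cos(theta)]]),
    "scaling":    np.array([[2.0, 0.0], [0.0, 3.0]]),
    "shear":      np.array([[1.0, k], [0.0, 1.0]]),
    "projection": np.array([[1.0, 0.0], [0.0, 0.0]]),
}

for name, M in transforms.items():
    det = np.linalg.det(M)
    # M preserves lengths and angles exactly when M^T M = I (orthogonal)
    orthogonal = np.allclose(M.T @ M, np.eye(2))
    print(f"{name:<10} det={det: .3f} orthogonal={orthogonal}")
```

Only the rotation and the reflection come out orthogonal, which matches the "Lengths, angles" entries in the table.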

Composition of Transformations

Applying transformation $T_1$ (matrix $\mathbf{A}$) followed by $T_2$ (matrix $\mathbf{B}$) is the composition $T_2 \circ T_1$, represented by the product $\mathbf{B}\mathbf{A}$:

$$(T_2 \circ T_1)(\mathbf{x}) = \mathbf{B}(\mathbf{A}\mathbf{x}) = (\mathbf{B}\mathbf{A})\mathbf{x}$$

Warning: Note the order. $T_1$ is applied first but appears on the right in the matrix product, because the matrix closest to $\mathbf{x}$ acts on it first.

This is exactly what happens in a neural network: each layer applies a transformation, and the overall network is a composition of transformations (with nonlinearities between them).
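The ordering rule can be illustrated with a rotation and a non-uniform scaling, which do not commute. The matrices below are example choices made for this sketch, not part of the original text.

```python
import numpy as np

theta = np.pi / 2
A = np.array([[np.cos(theta), -np.sin(theta)],   # T1: rotate by 90 degrees
              [np.sin(theta),  np.cos(theta)]])
B = np.array([[2.0, 0.0],                        # T2: stretch x by 2
              [0.0, 1.0]])

x = np.array([1.0, 0.0])

step_by_step = B @ (A @ x)   # apply T1, then T2
combined = (B @ A) @ x       # single matrix for "T2 after T1"
print(np.allclose(step_by_step, combined))  # True

# Reversing the order generally gives a different map
print(np.round(B @ A @ x, 3))  # rotate, then stretch: approximately (0, 1)
print(np.round(A @ B @ x, 3))  # stretch, then rotate: approximately (0, 2)
```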

The Image and Kernel

For a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ with matrix $\mathbf{A}$:

The image (or range) is the set of all possible outputs:

$$\text{Im}(T) = \{\mathbf{A}\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} = C(\mathbf{A})$$

This is the column space of $\mathbf{A}$.

The kernel (or null space) is the set of inputs that map to zero:

$$\ker(T) = \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\} = N(\mathbf{A})$$

The rank-nullity theorem connects them:

$$\dim(\text{Im}(T)) + \dim(\ker(T)) = n$$

Key insight: A transformation is injective (one-to-one) if and only if $\ker(T) = \{\mathbf{0}\}$, meaning nothing nonzero gets mapped to zero. It is surjective (onto) if and only if $\text{Im}(T) = \mathbb{R}^m$, meaning every output is reachable.
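A numerical sketch of rank-nullity follows. The matrix is a hypothetical example built so that its third row is the sum of the first two; the kernel basis is taken from the SVD, which is one common approach rather than the only one.

```python
import numpy as np

# Hypothetical 3x4 matrix whose third row is the sum of the first two
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 2.0],
              [1.0, 3.0, 1.0, 3.0]])
m, n = A.shape

rank = np.linalg.matrix_rank(A)   # dim Im(T)

# Right singular vectors beyond the rank span the null space
_, _, Vt = np.linalg.svd(A)
kernel_basis = Vt[rank:].T        # columns span ker(T)
nullity = kernel_basis.shape[1]   # dim ker(T)

print(rank, nullity, rank + nullity == n)   # 2 2 True
print(np.allclose(A @ kernel_basis, 0.0))   # True

injective = nullity == 0    # ker(T) = {0}
surjective = rank == m      # Im(T) = R^m
print(injective, surjective)                # False False
```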

Change of Basis

A linear transformation is an intrinsic geometric operation, but its matrix representation depends on the choice of basis. If $\mathbf{A}$ is the matrix of $T$ in the standard basis and $\mathbf{P}$ is a change-of-basis matrix (columns are the new basis vectors), then the matrix of $T$ in the new basis is:

$$\mathbf{A}' = \mathbf{P}^{-1}\mathbf{A}\mathbf{P}$$

This operation is called a similarity transformation. Matrices related by similarity represent the same linear transformation in different coordinate systems.

Key insight: Diagonalization is finding a basis where the transformation is just scaling along each axis. If $\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}$ where $\mathbf{D}$ is diagonal, then in the basis defined by the columns of $\mathbf{P}$, the transformation simply scales each coordinate independently.
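The similarity formula can be tried out with an eigenvector basis. The matrix below is a hypothetical example chosen to have two distinct real eigenvalues (2 and 5), so it is diagonalizable; not every matrix is.

```python
import numpy as np

# Hypothetical example with distinct real eigenvalues
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Columns of P are eigenvectors; they form the new basis
eigvals, P = np.linalg.eig(A)

# Matrix of the same transformation expressed in the new basis
A_new = np.linalg.inv(P) @ A @ P
print(np.round(A_new, 6))                     # diagonal up to rounding
print(np.allclose(A_new, np.diag(eigvals)))   # True

# Similar matrices share determinant and trace
print(np.isclose(np.linalg.det(A_new), np.linalg.det(A)))  # True
print(np.isclose(np.trace(A_new), np.trace(A)))            # True
```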

Affine Transformations

In practice, ML models use affine transformations — a linear transformation plus a translation:

$$T(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}$$

This is not linear unless $\mathbf{b} = \mathbf{0}$ (otherwise it does not map the origin to the origin), but it is the core computation of every fully connected neural network layer.

The translation $\mathbf{b}$ (the bias term) lets the model shift the decision boundary away from the origin. Without it, every decision hyperplane would pass through the origin, which is a severe limitation.

Homogeneous Coordinates

Affine transformations can be made linear by adding an extra dimension:

$$\begin{bmatrix} \mathbf{W} & \mathbf{b} \\ \mathbf{0}^T & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{W}\mathbf{x} + \mathbf{b} \\ 1 \end{bmatrix}$$

This trick is standard in computer graphics and sometimes appears in ML theory.
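A minimal sketch of the homogeneous-coordinates trick is shown below. The values of `W`, `b`, and `x` are arbitrary examples, and `H` is a name introduced here for the augmented block matrix.

```python
import numpy as np

W = np.array([[2.0, 0.0],
              [1.0, 1.0]])     # example weights
b = np.array([3.0, -1.0])      # example bias
x = np.array([1.0, 2.0])

# Build the (n+1) x (n+1) block matrix [[W, b], [0^T, 1]]
H = np.block([[W, b[:, None]],
              [np.zeros((1, 2)), np.ones((1, 1))]])

x_h = np.append(x, 1.0)        # append the constant 1
y_h = H @ x_h                  # one matrix product, no separate addition

print(y_h[:-1], W @ x + b)     # the same vector twice
print(y_h[-1])                 # the extra coordinate stays 1
```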

Linear Transformations in Neural Networks

A neural network layer computes:

$$\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$$

The linear part $\mathbf{W}\mathbf{x} + \mathbf{b}$ is an affine transformation. The nonlinearity $\sigma$ (ReLU, sigmoid, etc.) is applied element-wise.

Why are nonlinearities essential? Because the composition of linear transformations is still linear:

$$\mathbf{W}_2(\mathbf{W}_1\mathbf{x}) = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x} = \mathbf{W}'\mathbf{x}$$

Without nonlinearities, a 100-layer network would be equivalent to a single matrix multiplication (or, once biases are included, a single affine map). The nonlinearities break linearity and allow the network to learn complex, curved decision boundaries.

Key insight: Each layer of a neural network rotates, stretches, and shifts the data (linear/affine part), then bends and folds the space (nonlinearity). Stacking many such operations creates the complex mappings that make deep learning powerful.
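The collapse of stacked linear layers, and the way ReLU prevents it, can be seen in a small sketch. The weights and biases below are hand-picked hypothetical values, not taken from any trained model; the check relies on the fact that an affine map $f$ always satisfies $f(\mathbf{u} + \mathbf{v}) = f(\mathbf{u}) + f(\mathbf{v}) - f(\mathbf{0})$.

```python
import numpy as np

# Hypothetical two-layer parameters (input dim 2, hidden dim 3, output dim 2)
W1 = np.array([[1.0, -1.0], [0.5, 2.0], [-1.0, 1.0]])
b1 = np.array([0.0, 1.0, -1.0])
W2 = np.array([[1.0, 2.0, -1.0], [0.0, 1.0, 1.0]])
b2 = np.array([0.5, -0.5])

def two_linear_layers(x):
    return W2 @ (W1 @ x + b1) + b2

# Without a nonlinearity the stack collapses to one affine map W' x + b'
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2

x = np.array([0.3, -1.2])
print(np.allclose(two_linear_layers(x), W_prime @ x + b_prime))  # True

def relu(z):
    return np.maximum(z, 0.0)

def f(x):
    # Same layers with ReLU in between
    return W2 @ relu(W1 @ x + b1) + b2

# Any affine map satisfies f(u + v) = f(u) + f(v) - f(0); ReLU breaks this
u, v = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
print(np.allclose(f(u + v), f(u) + f(v) - f(np.zeros(2))))  # False
```

Because the ReLU version violates the affine identity, no single pair $(\mathbf{W}', \mathbf{b}')$ can reproduce it.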

Worked Example: Rotation Followed by Scaling

Apply a 45° rotation followed by scaling by 2 along the $x$-axis and 0.5 along the $y$-axis.

Rotation matrix:

$$\mathbf{R}_{45^\circ} = \begin{bmatrix} \cos 45^\circ & -\sin 45^\circ \\ \sin 45^\circ & \cos 45^\circ \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix}$$

Scaling matrix:

$$\mathbf{S} = \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}$$

Combined transformation (scaling after rotation):

$$\mathbf{A} = \mathbf{S}\mathbf{R}_{45^\circ} = \begin{bmatrix} \sqrt{2} & -\sqrt{2} \\ \frac{\sqrt{2}}{4} & \frac{\sqrt{2}}{4} \end{bmatrix}$$

```python
import numpy as np

theta = np.pi / 4  # 45 degrees in radians
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

S = np.array([[2.0, 0.0],
              [0.0, 0.5]])

# Rotation is applied first, so R sits on the right of the product
A = S @ R
print(np.round(A, 3))
# [[ 1.414 -1.414]
#  [ 0.354  0.354]]

# Apply to the unit vector e1
x = np.array([1.0, 0.0])
print(np.round(A @ x, 3))  # [1.414 0.354]
```

Why This Matters for ML

  • Neural network layers are affine transformations followed by nonlinearities. Understanding what linear maps can do helps you understand network capacity.
  • Data preprocessing: Standardization, whitening, and PCA apply linear (or, once mean-centering is included, affine) transformations to data before training.
  • Attention mechanisms in Transformers use linear projections (query, key, value matrices) to transform token representations.
  • Convolutional layers are a special case of linear transformation with weight sharing and locality constraints.
  • Feature maps: Each layer’s weight matrix defines which linear combinations of input features to extract.
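To make the preprocessing bullet concrete, here is a sketch of standardization written as an affine map $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$ with a diagonal $\mathbf{W}$. The data is synthetic and the variable names are chosen for this example; real pipelines would typically use a library routine instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples (rows), 2 features with different scales
X = rng.normal(loc=[5.0, -2.0], scale=[3.0, 0.5], size=(200, 2))

mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Standardization as an affine map: z = W x + b with diagonal W
W = np.diag(1.0 / sigma)
b = -W @ mu

Z = X @ W.T + b   # apply the map to every row
print(np.allclose(Z, (X - mu) / sigma))   # True
print(np.allclose(Z.mean(axis=0), 0.0))   # True
print(np.allclose(Z.std(axis=0), 1.0))    # True
```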

Summary

  • A linear transformation is a function that preserves linear combinations: $T(c_1\mathbf{u} + c_2\mathbf{v}) = c_1 T(\mathbf{u}) + c_2 T(\mathbf{v})$.
  • Every linear transformation $\mathbb{R}^n \to \mathbb{R}^m$ corresponds to a unique $m \times n$ matrix.
  • Standard transformations include rotations, reflections, scaling, shear, and projections.
  • Composition of transformations corresponds to matrix multiplication (applied right to left).
  • The image is the column space (reachable outputs) and the kernel is the null space (inputs mapped to zero).
  • Change of basis gives different matrix representations of the same transformation.
  • Neural networks compose affine transformations with nonlinearities — without the nonlinearities, depth would be meaningless.
  • Next, we formalize angles and distances in inner products, norms, and orthogonality.
