Linear Transformations: Matrices as Functions

Linear Algebra Series 5 / 13

Introduction

So far, we have treated matrices as static tables of numbers. But a matrix is also a function — it takes a vector as input and produces a vector as output. This perspective transforms linear algebra from bookkeeping into geometry: matrices rotate, stretch, reflect, project, and shear space.

Every layer of a neural network applies a linear transformation followed by a nonlinearity. Understanding what linear transformations can and cannot do explains why nonlinearities are necessary and what each layer “sees.” This article builds on determinants and connects directly to how ML models manipulate data.

Definition

A function $T: \mathbb{R}^n \to \mathbb{R}^m$ is a linear transformation if it satisfies two properties for all vectors $\mathbf{u}, \mathbf{v}$ and scalars $c$ :

Additivity: $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$
Homogeneity: $T(c\mathbf{u}) = cT(\mathbf{u})$

These can be combined into a single condition:

T(c_1\mathbf{u} + c_2\mathbf{v}) = c_1 T(\mathbf{u}) + c_2 T(\mathbf{v})

A linear transformation preserves linear combinations. This is both its strength (predictable, analyzable) and its limitation (it cannot model nonlinear patterns alone).
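The combined condition can be spot-checked numerically. The sketch below is only illustrative: the matrix `M`, the helper `preserves_combinations`, and the contrasting map `T_square` are made-up names, and passing a random test suggests linearity but does not prove it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example map: multiplication by a fixed 2x3 matrix (R^3 -> R^2).
M = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])

def T_linear(x):
    return M @ x

def T_square(x):
    # Element-wise squaring: a simple nonlinear map for contrast.
    return x ** 2

def preserves_combinations(T, n, trials=100):
    """Spot-check T(c1*u + c2*v) == c1*T(u) + c2*T(v) on random inputs."""
    for _ in range(trials):
        u, v = rng.standard_normal(n), rng.standard_normal(n)
        c1, c2 = rng.standard_normal(2)
        if not np.allclose(T(c1 * u + c2 * v), c1 * T(u) + c2 * T(v)):
            return False
    return True

linear_ok = preserves_combinations(T_linear, 3)
square_ok = preserves_combinations(T_square, 3)
print(linear_ok, square_ok)
```

The matrix map passes every trial, while squaring fails on the first random input it meets.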

Key insight: Every linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$ can be represented, with respect to the standard bases, as multiplication by a unique $m \times n$ matrix. Conversely, every $m \times n$ matrix defines a linear transformation. Matrices and linear transformations are two views of the same object.

From Transformation to Matrix

Given a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ , its matrix representation is built from where it sends the standard basis vectors:

\mathbf{A} = \begin{bmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{bmatrix}

Column $j$ of $\mathbf{A}$ is the image of the $j$ -th standard basis vector under $T$ .

Then for any vector $\mathbf{x}$ :

T(\mathbf{x}) = \mathbf{A}\mathbf{x} = x_1 T(\mathbf{e}_1) + x_2 T(\mathbf{e}_2) + \cdots + x_n T(\mathbf{e}_n)

This is the column view of matrix multiplication: $\mathbf{A}\mathbf{x}$ is a linear combination of the columns of $\mathbf{A}$ .
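As a minimal sketch of this construction, the code below recovers the matrix of a map that is defined without any matrix. The function `T` is a hypothetical example chosen for illustration.

```python
import numpy as np

def T(x):
    # Hypothetical linear map R^3 -> R^2, written without any matrix.
    x1, x2, x3 = x
    return np.array([2 * x1 - x3, x2 + 4 * x3])

n = 3
# Column j of A is T(e_j); the rows of np.eye(n) are the standard basis vectors.
A = np.column_stack([T(e) for e in np.eye(n)])
print(A)

# The matrix reproduces T on an arbitrary vector.
x = np.array([1.0, -2.0, 0.5])
print(np.allclose(A @ x, T(x)))
```

Only $n$ evaluations of $T$ are needed: once the images of the basis vectors are known, linearity determines the output for every other input.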

Gallery of 2D Transformations

Visualizing 2D transformations builds geometric intuition that extends to higher dimensions.

Scaling

\mathbf{A} = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}

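Scales by $s_x$ along the $x$-axis and $s_y$ along the $y$-axis: a factor with magnitude above 1 stretches, one with magnitude below 1 compresses, and a negative factor also flips that axis. Uniform scaling ( $s_x = s_y$ ) preserves shape; non-uniform scaling distorts it.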

Rotation

Rotation by angle $\theta$ counterclockwise:

\mathbf{R}_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}

$\det(\mathbf{R}_\theta) = 1$ — rotations preserve area and orientation. $\mathbf{R}_\theta$ is an orthogonal matrix: $\mathbf{R}_\theta^{-1} = \mathbf{R}_\theta^T = \mathbf{R}_{-\theta}$ .
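These rotation properties are easy to confirm numerically. The helper `rotation` and the test angle below are arbitrary choices for illustration.

```python
import numpy as np

def rotation(theta):
    """2D counterclockwise rotation matrix for an angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

theta = np.pi / 6  # arbitrary test angle
R = rotation(theta)

det_R = np.linalg.det(R)
inverse_is_transpose = np.allclose(np.linalg.inv(R), R.T)
transpose_is_reverse = np.allclose(R.T, rotation(-theta))

# Rotations preserve lengths: |Rv| == |v|.
v = np.array([3.0, 4.0])
length_preserved = np.isclose(np.linalg.norm(R @ v), np.linalg.norm(v))

print(det_R, inverse_is_transpose, transpose_is_reverse, length_preserved)
```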

Reflection

Reflection across the $x$ -axis:

\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}

Reflection across a line through the origin at angle $\theta/2$ :

\begin{bmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{bmatrix}

$\det = -1$ — reflections reverse orientation.
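To see why the mirror line sits at angle $\theta/2$, one can check that the matrix leaves a vector along that line unchanged and negates a vector perpendicular to it. The variable names below are illustrative.

```python
import numpy as np

theta = np.pi / 3  # the mirror line sits at angle theta / 2
F = np.array([[np.cos(theta),  np.sin(theta)],
              [np.sin(theta), -np.cos(theta)]])

along = np.array([np.cos(theta / 2), np.sin(theta / 2)])    # direction of the line
across = np.array([-np.sin(theta / 2), np.cos(theta / 2)])  # perpendicular to it

fixes_line = np.allclose(F @ along, along)        # points on the line stay put
flips_normal = np.allclose(F @ across, -across)   # the normal direction is negated
is_involution = np.allclose(F @ F, np.eye(2))     # reflecting twice is the identity
det_F = np.linalg.det(F)

print(fixes_line, flips_normal, is_involution, det_F)
```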

Shear

Horizontal shear by factor $k$ :

\begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}

A horizontal shear shifts each point parallel to the $x$-axis by $k$ times its $y$-coordinate, so $(x, y) \mapsto (x + ky, y)$. $\det = 1$, so area is preserved.

Projection

Projection onto the $x$ -axis:

\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}

$\det = 0$ — projections collapse a dimension and are not invertible.

Transformation	$\det$	Invertible	Preserves
Rotation	$1$	Yes	Lengths, angles, area
Reflection	$-1$	Yes	Lengths, angles, area (reverses orientation)
Scaling	$s_x s_y$	Yes (if $s_x, s_y \neq 0$ )	Angles (if uniform)
Shear	$1$	Yes	Area
Projection	$0$	No	Only vectors already in the image
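The determinant column of the table can be reproduced with a few lines of NumPy. The angle, shear factor, and scale factors below are arbitrary example values.

```python
import numpy as np

theta = np.pi / 3  # example angle
k = 1.5            # example shear factor

gallery = {
    "rotation":   np.array([[np.cos(theta), -np.sin(theta)],
                            [np.sin(theta),  np.cos(theta)]]),
    "reflection": np.array([[np.cos(theta),  np.sin(theta)],
                            [np.sin(theta), -np.cos(theta)]]),
    "scaling":    np.array([[2.0, 0.0],
                            [0.0, 3.0]]),
    "shear":      np.array([[1.0, k],
                            [0.0, 1.0]]),
    "projection": np.array([[1.0, 0.0],
                            [0.0, 0.0]]),
}

dets = {name: np.linalg.det(M) for name, M in gallery.items()}
for name, d in dets.items():
    # A matrix is invertible exactly when its determinant is nonzero.
    invertible = not np.isclose(d, 0.0)
    print(f"{name:<10} det = {d:+.3f}  invertible = {invertible}")
```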

Composition of Transformations

Applying transformation $T_1$ (matrix $\mathbf{A}$ ) followed by $T_2$ (matrix $\mathbf{B}$ ) is the composition $T_2 \circ T_1$ , represented by the product $\mathbf{B}\mathbf{A}$ :

(T_2 \circ T_1)(\mathbf{x}) = \mathbf{B}(\mathbf{A}\mathbf{x}) = (\mathbf{B}\mathbf{A})\mathbf{x}

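Warning: Note the order. $T_1$ is applied first but appears on the right in the matrix product, because in $\mathbf{B}\mathbf{A}\mathbf{x}$ the matrix nearest the vector acts first. The order matters: in general $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$.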
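A small sketch makes the ordering concrete. Here $T_1$ is taken to be a 90° rotation and $T_2$ a projection onto the $x$-axis; both are example choices, not part of the text above.

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # T1: rotate 90 degrees counterclockwise
B = np.array([[1.0, 0.0],
              [0.0, 0.0]])    # T2: project onto the x-axis

x = np.array([1.0, 2.0])

step_by_step = B @ (A @ x)     # apply T1, then T2       -> (-2, 0)
combined = (B @ A) @ x         # one matrix for T2 o T1  -> (-2, 0)
reversed_order = (A @ B) @ x   # T1 o T2 instead         -> (0, 1)

print(step_by_step, combined, reversed_order)
```

Rotating and then projecting gives a different result from projecting and then rotating, which is why the two products must not be swapped.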

This is exactly what happens in a neural network: each layer applies a transformation, and the overall network is a composition of transformations (with nonlinearities between them).

The Image and Kernel

For a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ with matrix $\mathbf{A}$ :

The image (or range) is the set of all possible outputs:

\text{Im}(T) = \{\mathbf{A}\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} = C(\mathbf{A})

This is the column space of $\mathbf{A}$ .

The kernel (or null space) is the set of inputs that map to zero:

\ker(T) = \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\} = N(\mathbf{A})

The rank-nullity theorem connects them:

\dim(\text{Im}(T)) + \dim(\ker(T)) = n

Key insight: A transformation is injective (one-to-one) if and only if $\ker(T) = \{\mathbf{0}\}$ — nothing nonzero gets mapped to zero. It is surjective (onto) if and only if $\text{Im}(T) = \mathbb{R}^m$ — every output is reachable.
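The rank-nullity theorem can be checked on a small example. This sketch assumes a kernel basis read off from the SVD, which is one common numerical approach; the matrix `A` is made up so that its rows are dependent.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])   # second row = 2 * first row
m, n = A.shape

rank = np.linalg.matrix_rank(A)   # dim Im(T)

# Rows of Vt beyond the first `rank` correspond to zero singular values,
# so they span the kernel of A.
_, s, Vt = np.linalg.svd(A)
kernel_basis = Vt[rank:].T        # columns span ker(T)
nullity = kernel_basis.shape[1]   # dim ker(T)

print(rank, nullity, n)
print(np.allclose(A @ kernel_basis, 0.0))

injective = nullity == 0   # kernel is {0}
surjective = rank == m     # image is all of R^m
```

For this matrix the map is neither injective (the kernel is a plane) nor surjective (the image is a line in $\mathbb{R}^2$).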

Change of Basis

A linear transformation is an intrinsic geometric operation, but its matrix representation depends on the choice of basis. If $\mathbf{A}$ is the matrix of $T$ in the standard basis and $\mathbf{P}$ is a change-of-basis matrix (columns are the new basis vectors), then the matrix of $T$ in the new basis is:

\mathbf{A}' = \mathbf{P}^{-1}\mathbf{A}\mathbf{P}

This operation is called a similarity transformation. Matrices related by similarity represent the same linear transformation in different coordinate systems.

Key insight: Diagonalization is finding a basis where the transformation is just scaling along each axis. If $\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}$ where $\mathbf{D}$ is diagonal, then in the basis defined by the columns of $\mathbf{P}$ , the transformation simply scales each coordinate independently.
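As an illustration, the sketch below diagonalizes a small symmetric matrix with `np.linalg.eig` and confirms that $\mathbf{P}^{-1}\mathbf{A}\mathbf{P}$ is diagonal. The matrix is an arbitrary example; not every matrix is diagonalizable, so this is not a general recipe.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric, so it has a full set of real eigenvectors

eigenvalues, P = np.linalg.eig(A)          # columns of P are eigenvectors
A_new_basis = np.linalg.inv(P) @ A @ P     # A' = P^{-1} A P
D = np.diag(eigenvalues)

is_diagonal = np.allclose(A_new_basis, D)
reconstructs = np.allclose(P @ D @ np.linalg.inv(P), A)

# Similar matrices represent the same map, so they share the determinant.
same_det = np.isclose(np.linalg.det(A_new_basis), np.linalg.det(A))

print(eigenvalues, is_diagonal, reconstructs, same_det)
```

In the eigenvector basis this map scales one coordinate by 3 and leaves the other unchanged.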

Affine Transformations

In practice, ML models use affine transformations — a linear transformation plus a translation:

T(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}

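This is not linear unless $\mathbf{b} = \mathbf{0}$: a linear map must send $\mathbf{0}$ to $\mathbf{0}$, whereas here $T(\mathbf{0}) = \mathbf{b}$. It is nonetheless the core computation of every fully connected neural network layer.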

The translation $\mathbf{b}$ (the bias term) lets the model shift the decision boundary away from the origin. Without it, every hyperplane would pass through the origin — a severe limitation.

Homogeneous Coordinates

Affine transformations can be made linear by adding an extra dimension:

\begin{bmatrix} \mathbf{W} & \mathbf{b} \\ \mathbf{0}^T & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{W}\mathbf{x} + \mathbf{b} \\ 1 \end{bmatrix}

This trick is standard in computer graphics and sometimes appears in ML theory.
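A minimal sketch of the augmented matrix, with made-up weights `W` and bias `b`, shows that a single matrix product reproduces $\mathbf{W}\mathbf{x} + \mathbf{b}$:

```python
import numpy as np

W = np.array([[1.0,  2.0],
              [0.0,  1.0],
              [3.0, -1.0]])        # hypothetical weights, R^2 -> R^3
b = np.array([0.5, -1.0, 2.0])     # hypothetical bias
m, n = W.shape

# Assemble [[W, b], [0^T, 1]] as an (m+1) x (n+1) matrix.
M = np.block([[W, b.reshape(-1, 1)],
              [np.zeros((1, n)), np.ones((1, 1))]])

x = np.array([4.0, -2.0])
x_h = np.append(x, 1.0)            # homogeneous coordinates [x; 1]

y_h = M @ x_h                      # equals [W x + b; 1]
print(y_h)
print(np.allclose(y_h[:-1], W @ x + b))
```

The trailing 1 is carried through unchanged, so several affine maps can be chained by multiplying their augmented matrices.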

Linear Transformations in Neural Networks

A neural network layer computes:

\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})

The pre-activation $\mathbf{W}\mathbf{x} + \mathbf{b}$ is an affine transformation: the linear map $\mathbf{W}\mathbf{x}$ plus the shift $\mathbf{b}$. The nonlinearity $\sigma$ (ReLU, sigmoid, etc.) is applied element-wise.

Why are nonlinearities essential? Because the composition of linear transformations is still linear:

\mathbf{W}_2(\mathbf{W}_1\mathbf{x}) = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x} = \mathbf{W}'\mathbf{x}

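Without nonlinearities, a 100-layer network would be equivalent to a single matrix multiplication. Bias terms do not change this: $\mathbf{W}_2(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x} + (\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2)$ is still a single affine map. The nonlinearities break linearity and allow the network to learn complex, curved decision boundaries.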
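The collapse can be demonstrated on a toy two-layer network. All weights, biases, and the input below are made-up values, chosen so that one hidden unit is negative and ReLU has a visible effect.

```python
import numpy as np

W1 = np.array([[ 1.0, -1.0],
               [-1.0,  1.0]])
b1 = np.array([0.5, -0.5])
W2 = np.array([[2.0, 1.0]])
b2 = np.array([0.1])
x = np.array([1.0, 0.0])

# Two affine layers with no activation in between...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...equal one affine layer with merged weights and bias.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged
collapses = np.allclose(two_layers, one_layer)

def relu(z):
    return np.maximum(z, 0.0)

# With ReLU between the layers, the merged affine map no longer matches.
with_relu = W2 @ relu(W1 @ x + b1) + b2
relu_changes_output = not np.allclose(with_relu, one_layer)

print(two_layers, one_layer, with_relu)
```

Here the hidden pre-activation is $(1.5, -1.5)$; ReLU zeroes the second entry, so the output moves from 1.6 to 3.1 and no single affine map reproduces the network on all inputs.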

Key insight: Each layer of a neural network rotates, stretches, and shifts the data (linear/affine part), then bends and folds the space (nonlinearity). Stacking many such operations creates the complex mappings that make deep learning powerful.

Worked Example: Rotation Followed by Scaling

Apply a 45° rotation followed by scaling by 2 along the $x$ -axis and 0.5 along the $y$ -axis.

Rotation matrix:

\mathbf{R}_{45°} = \begin{bmatrix} \cos 45° & -\sin 45° \\ \sin 45° & \cos 45° \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix}

Scaling matrix:

\mathbf{S} = \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}

Combined transformation (scaling after rotation):

\mathbf{A} = \mathbf{S}\mathbf{R}_{45°} = \begin{bmatrix} \sqrt{2} & -\sqrt{2} \\ \frac{\sqrt{2}}{4} & \frac{\sqrt{2}}{4} \end{bmatrix}

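import numpy as np

theta = np.pi / 4  # 45 degrees in radians
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

S = np.array([[2.0, 0.0],
              [0.0, 0.5]])

# Scaling after rotation: the rotation acts first, so it sits on the right.
A = S @ R
print(np.round(A, 3))
# [[ 1.414 -1.414]
#  [ 0.354  0.354]]

# Apply to the unit vector e_1; the result is the first column of A
x = np.array([1.0, 0.0])
print(np.round(A @ x, 3))  # [1.414 0.354]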

Why This Matters for ML

Neural network layers are affine transformations followed by nonlinearities. Understanding what linear maps can do helps you understand network capacity.
Data preprocessing: Standardization, whitening, and PCA are linear transformations applied to data before training.
Attention mechanisms in Transformers use linear projections (query, key, value matrices) to transform token representations.
Convolutional layers are a special case of linear transformation with weight sharing and locality constraints.
Feature maps: Each layer’s weight matrix defines which linear combinations of input features to extract.

Summary

A linear transformation is a function that preserves linear combinations: $T(c_1\mathbf{u} + c_2\mathbf{v}) = c_1 T(\mathbf{u}) + c_2 T(\mathbf{v})$ .
Every linear transformation $\mathbb{R}^n \to \mathbb{R}^m$ corresponds to a unique $m \times n$ matrix.
Standard transformations include rotations, reflections, scaling, shear, and projections.
Composition of transformations corresponds to matrix multiplication (applied right to left).
The image is the column space (reachable outputs) and the kernel is the null space (inputs mapped to zero).
Change of basis gives different matrix representations of the same transformation.
Neural networks compose affine transformations with nonlinearities — without the nonlinearities, depth would be meaningless.
Next, we formalize angles and distances in inner products, norms, and orthogonality.
