Training a neural network means minimizing a loss function with respect to millions of parameters. To do that, you need gradients — derivatives of a scalar loss with respect to vectors and matrices of weights. This is matrix calculus: the extension of single-variable calculus to vectors and matrices.
If you understand how to differentiate f(x)=x2, matrix calculus extends that intuition to f(x)=xTAx and beyond. This article builds on the full linear algebra series and connects directly to how backpropagation computes gradients in deep learning.
Notation Conventions
Matrix calculus has two competing layout conventions. We use the denominator layout (also called the Jacobian formulation), which is standard in ML and matches PyTorch/TensorFlow behavior:
Expression
Input
Output
Derivative Shape
∂x∂f
x∈Rn
f∈R
Rn (column vector)
∂x∂f
x∈Rn
f∈Rm
Rm×n (Jacobian)
∂X∂f
X∈Rm×n
f∈R
Rm×n
Key insight: The gradient ∇xf has the same shape as x. The gradient ∇Xf has the same shape as X. This “shape consistency” rule is the most practical check when deriving gradients.
Gradients of Scalar Functions
Gradient with Respect to a Vector
For f:Rn→R, the gradient is the vector of partial derivatives:
∇xf=∂x∂f=∂x1∂f∂x2∂f⋮∂xn∂f
The gradient points in the direction of steepest ascent. Its negation is the direction of steepest descent — the direction gradient descent follows.
Essential Gradient Identities
These identities appear constantly in ML derivations. Let a,x∈Rn and A∈Rn×n:
Linear function: f(x)=aTx
∇x(aTx)=a
Quadratic form: f(x)=xTAx
∇x(xTAx)=(A+AT)x
If A is symmetric (A=AT), this simplifies to:
∇x(xTAx)=2Ax
Squared norm: f(x)=∥x∥22=xTx
∇x(xTx)=2x
Squared error: f(x)=∥Ax−b∥22
∇x∥Ax−b∥22=2AT(Ax−b)
Setting this to zero gives the normal equation ATAx=ATb — the foundation of linear regression.
Key insight: The gradient of the squared error loss is 2AT(Ax−b). This is exactly the gradient descent update rule for linear regression: x←x−αAT(Ax−b).
Quick Reference Table
f(x)
∇xf
aTx
a
xTx
2x
xTAx (symmetric A)
2Ax
∥Ax−b∥2
2AT(Ax−b)
log(aTx)
aTxa
σ(wTx) (sigmoid)
σ(wTx)(1−σ(wTx))w
The Jacobian Matrix
For a vector-valued function f:Rn→Rm, the Jacobian is the m×n matrix of all partial derivatives:
The Hessian is the Jacobian of the gradient. It is always symmetric (for smooth functions) because ∂xi∂xj∂2f=∂xj∂xi∂2f.
Hessian and Optimization
The Hessian captures the curvature of the loss landscape:
f(x+δ)≈f(x)+∇fTδ+21δTHδ
Hessian Property
Meaning
H≻0 (positive definite)
Local minimum — the function curves upward in all directions
H≺0 (negative definite)
Local maximum
H indefinite
Saddle point — curves up in some directions, down in others
The condition number of the Hessian κ(H)=λmax/λmin determines how fast gradient descent converges. A large condition number means the loss landscape is elongated (like a narrow valley), causing gradient descent to zigzag.
Key insight: Newton’s method uses the Hessian to take optimal steps: x←x−H−1∇f. This accounts for curvature and converges much faster than gradient descent — but computing H−1 costs O(n3), which is prohibitive for neural networks with millions of parameters.
Example: Hessian of Quadratic Form
For f(x)=21xTAx−bTx with symmetric A:
∇f=Ax−b,H=A
The Hessian is constant — quadratic functions have constant curvature. If A is positive definite, there is exactly one minimum at x∗=A−1b.
The Chain Rule for Vectors and Matrices
The chain rule is the backbone of backpropagation. For composed functions, it generalizes from scalars to vectors via Jacobians.
Scalar Chain Rule
dxdf(g(x))=f′(g(x))⋅g′(x)
Vector Chain Rule
If f:Rn→Rm and g:Rp→Rn, then the Jacobian of the composition is the product of Jacobians:
∂x∂f(g(x))=∂g∂f⋅∂x∂g=Jf⋅Jg
The dimensions work out: (m×n)⋅(n×p)=(m×p).
Backpropagation as Chain Rule
A neural network layer computes h=σ(Wx+b). Let z=Wx+b. Then:
∂W∂L=∂h∂L⋅∂z∂h⋅∂W∂z
Breaking this down:
∂h∂L — the upstream gradient (from the next layer)
∂z∂h=diag(σ′(z)) — the Jacobian of the activation (diagonal for element-wise activations)
∂W∂z — the Jacobian of the linear layer
The result is:
∂W∂L=δxT,where δ=∂z∂L=∂h∂L⊙σ′(z)
The gradient with respect to the weight matrix is an outer product of the error signal and the input.
Gradients with Respect to Matrices
Derivative of Trace Expressions
Many loss functions in ML can be written using the trace. The trace has convenient derivative rules:
Setting ∇wL=0 gives XTXw=XTy — the normal equation.
import numpy as npm, n = 100, 5X = np.random.randn(m, n)y = np.random.randn(m)w = np.zeros(n)lr = 0.01for _ in range(1000): grad = (1/m) * X.T @ (X @ w - y) # The derived gradient w -= lr * grad# Compare with closed-form solutionw_exact = np.linalg.solve(X.T @ X, X.T @ y)print(f"Max difference: {np.max(np.abs(w - w_exact)):.6f}")
Automatic Differentiation
In practice, you rarely compute gradients by hand. Modern frameworks use automatic differentiation (autodiff), which applies the chain rule mechanically.
There are two modes:
Forward mode: Computes ∂xi∂f for one input at a time. Efficient when outputs >> inputs.
Reverse mode: Computes ∂x∂f for one output at a time. Efficient when inputs >> outputs.
Since ML loss functions have one scalar output and millions of inputs, reverse mode (backpropagation) is used. One backward pass computes gradients with respect to all parameters simultaneously.
Key insight: Backpropagation is just reverse-mode automatic differentiation. Understanding matrix calculus tells you what is being computed. Autodiff handles the how efficiently — but when debugging gradient issues, knowing the math is essential.
Why This Matters for ML
Gradient descent requires ∇wL — computing this is a matrix calculus problem.
Backpropagation is the chain rule applied to composed functions through network layers.
The Hessian controls convergence speed and determines whether a critical point is a minimum or saddle point.
Newton’s method and quasi-Newton methods (L-BFGS, Adam) approximate or use Hessian information.
Normalizing flows require efficient computation of Jacobian determinants.
Natural gradient descent uses the Fisher information matrix (expected Hessian of the log-likelihood).
Summary
The gradient∇xf points in the direction of steepest ascent and has the same shape as x.
The Jacobian is the matrix of all first partial derivatives of a vector-valued function.
The Hessian is the matrix of second derivatives, encoding curvature of the loss landscape.
The chain rule for vectors is Jacobian multiplication — this is backpropagation.