Matrix Calculus: Derivatives for Machine Learning

Master gradients, Jacobians, Hessians, and the chain rule for vectors and matrices — the math that makes backpropagation work.

Linear Algebra March 6, 2026 10 min read

Introduction

Training a neural network means minimizing a loss function with respect to millions of parameters. To do that, you need gradients — derivatives of a scalar loss with respect to vectors and matrices of weights. This is matrix calculus: the extension of single-variable calculus to vectors and matrices.

If you understand how to differentiate f(x)=x2f(x) = x^2, matrix calculus extends that intuition to f(x)=xTAxf(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x} and beyond. This article builds on the full linear algebra series and connects directly to how backpropagation computes gradients in deep learning.

Notation Conventions

Matrix calculus has two competing layout conventions. We use the denominator layout (also called the Jacobian formulation), which is standard in ML and matches PyTorch/TensorFlow behavior:

ExpressionInputOutputDerivative Shape
fx\frac{\partial f}{\partial \mathbf{x}}xRn\mathbf{x} \in \mathbb{R}^nfRf \in \mathbb{R}Rn\mathbb{R}^n (column vector)
fx\frac{\partial \mathbf{f}}{\partial \mathbf{x}}xRn\mathbf{x} \in \mathbb{R}^nfRm\mathbf{f} \in \mathbb{R}^mRm×n\mathbb{R}^{m \times n} (Jacobian)
fX\frac{\partial f}{\partial \mathbf{X}}XRm×n\mathbf{X} \in \mathbb{R}^{m \times n}fRf \in \mathbb{R}Rm×n\mathbb{R}^{m \times n}

Key insight: The gradient xf\nabla_\mathbf{x} f has the same shape as x\mathbf{x}. The gradient Xf\nabla_\mathbf{X} f has the same shape as X\mathbf{X}. This “shape consistency” rule is the most practical check when deriving gradients.

Gradients of Scalar Functions

Gradient with Respect to a Vector

For f:RnRf: \mathbb{R}^n \to \mathbb{R}, the gradient is the vector of partial derivatives:

xf=fx=[fx1fx2fxn]\nabla_\mathbf{x} f = \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

The gradient points in the direction of steepest ascent. Its negation is the direction of steepest descent — the direction gradient descent follows.

Essential Gradient Identities

These identities appear constantly in ML derivations. Let a,xRn\mathbf{a}, \mathbf{x} \in \mathbb{R}^n and ARn×n\mathbf{A} \in \mathbb{R}^{n \times n}:

Linear function: f(x)=aTxf(\mathbf{x}) = \mathbf{a}^T\mathbf{x}

x(aTx)=a\nabla_\mathbf{x} (\mathbf{a}^T\mathbf{x}) = \mathbf{a}

Quadratic form: f(x)=xTAxf(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}

x(xTAx)=(A+AT)x\nabla_\mathbf{x} (\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}

If A\mathbf{A} is symmetric (A=AT\mathbf{A} = \mathbf{A}^T), this simplifies to:

x(xTAx)=2Ax\nabla_\mathbf{x} (\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}

Squared norm: f(x)=x22=xTxf(\mathbf{x}) = \|\mathbf{x}\|_2^2 = \mathbf{x}^T\mathbf{x}

x(xTx)=2x\nabla_\mathbf{x} (\mathbf{x}^T\mathbf{x}) = 2\mathbf{x}

Squared error: f(x)=Axb22f(\mathbf{x}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2

xAxb22=2AT(Axb)\nabla_\mathbf{x} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 = 2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})

Setting this to zero gives the normal equation ATAx=ATb\mathbf{A}^T\mathbf{A}\mathbf{x} = \mathbf{A}^T\mathbf{b} — the foundation of linear regression.

Key insight: The gradient of the squared error loss is 2AT(Axb)2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b}). This is exactly the gradient descent update rule for linear regression: xxαAT(Axb)\mathbf{x} \leftarrow \mathbf{x} - \alpha \mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b}).

Quick Reference Table

f(x)f(\mathbf{x})xf\nabla_\mathbf{x} f
aTx\mathbf{a}^T\mathbf{x}a\mathbf{a}
xTx\mathbf{x}^T\mathbf{x}2x2\mathbf{x}
xTAx\mathbf{x}^T\mathbf{A}\mathbf{x} (symmetric A\mathbf{A})2Ax2\mathbf{A}\mathbf{x}
Axb2\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^22AT(Axb)2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})
log(aTx)\log(\mathbf{a}^T\mathbf{x})aaTx\frac{\mathbf{a}}{\mathbf{a}^T\mathbf{x}}
σ(wTx)\sigma(\mathbf{w}^T\mathbf{x}) (sigmoid)σ(wTx)(1σ(wTx))w\sigma(\mathbf{w}^T\mathbf{x})(1-\sigma(\mathbf{w}^T\mathbf{x}))\mathbf{w}

The Jacobian Matrix

For a vector-valued function f:RnRm\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m, the Jacobian is the m×nm \times n matrix of all partial derivatives:

J=fx=[f1x1f1x2f1xnf2x1f2x2f2xnfmx1fmx2fmxn]\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\[6pt] \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\[6pt] \vdots & \vdots & \ddots & \vdots \\[6pt] \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Each row is the gradient of one output component. The Jacobian is the best linear approximation to f\mathbf{f} near a point:

f(x+δ)f(x)+Jδ\mathbf{f}(\mathbf{x} + \boldsymbol{\delta}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J}\boldsymbol{\delta}

Important Jacobians

Linear function: f(x)=Ax\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}

J=A\mathbf{J} = \mathbf{A}

The Jacobian of a linear function is the matrix itself — linear functions are their own best linear approximation.

Element-wise function: f(x)=[g(x1),g(x2),,g(xn)]T\mathbf{f}(\mathbf{x}) = [g(x_1), g(x_2), \ldots, g(x_n)]^T

J=diag(g(x1),g(x2),,g(xn))\mathbf{J} = \text{diag}(g'(x_1), g'(x_2), \ldots, g'(x_n))

This is why element-wise activation functions (ReLU, sigmoid) produce diagonal Jacobians — making them computationally cheap.

Softmax: fi(z)=ezijezjf_i(\mathbf{z}) = \frac{e^{z_i}}{\sum_j e^{z_j}}

fizj=fi(δijfj)\frac{\partial f_i}{\partial z_j} = f_i(\delta_{ij} - f_j)

where δij\delta_{ij} is the Kronecker delta. The Jacobian is J=diag(f)ffT\mathbf{J} = \text{diag}(\mathbf{f}) - \mathbf{f}\mathbf{f}^T.

The Jacobian Determinant

The absolute value of det(J)\det(\mathbf{J}) measures how the function locally scales volumes. This connects back to determinants and is crucial in:

  • Change of variables in probability: py(y)=px(x)det(J1)p_y(y) = p_x(x) \cdot |\det(\mathbf{J}^{-1})|
  • Normalizing flows: Generative models that use invertible transformations and track the Jacobian determinant to compute exact likelihoods

The Hessian Matrix

For f:RnRf: \mathbb{R}^n \to \mathbb{R}, the Hessian is the n×nn \times n matrix of second partial derivatives:

H=2f=[2fx122fx1x22fx2x12fx22]\mathbf{H} = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\[6pt] \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\[6pt] \vdots & \vdots & \ddots \end{bmatrix}

The Hessian is the Jacobian of the gradient. It is always symmetric (for smooth functions) because 2fxixj=2fxjxi\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}.

Hessian and Optimization

The Hessian captures the curvature of the loss landscape:

f(x+δ)f(x)+fTδ+12δTHδf(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f^T \boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^T \mathbf{H} \boldsymbol{\delta}
Hessian PropertyMeaning
H0\mathbf{H} \succ 0 (positive definite)Local minimum — the function curves upward in all directions
H0\mathbf{H} \prec 0 (negative definite)Local maximum
H\mathbf{H} indefiniteSaddle point — curves up in some directions, down in others

The condition number of the Hessian κ(H)=λmax/λmin\kappa(\mathbf{H}) = \lambda_{\max}/\lambda_{\min} determines how fast gradient descent converges. A large condition number means the loss landscape is elongated (like a narrow valley), causing gradient descent to zigzag.

Key insight: Newton’s method uses the Hessian to take optimal steps: xxH1f\mathbf{x} \leftarrow \mathbf{x} - \mathbf{H}^{-1}\nabla f. This accounts for curvature and converges much faster than gradient descent — but computing H1\mathbf{H}^{-1} costs O(n3)O(n^3), which is prohibitive for neural networks with millions of parameters.

Example: Hessian of Quadratic Form

For f(x)=12xTAxbTxf(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x} with symmetric A\mathbf{A}:

f=Axb,H=A\nabla f = \mathbf{A}\mathbf{x} - \mathbf{b}, \qquad \mathbf{H} = \mathbf{A}

The Hessian is constant — quadratic functions have constant curvature. If A\mathbf{A} is positive definite, there is exactly one minimum at x=A1b\mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b}.

The Chain Rule for Vectors and Matrices

The chain rule is the backbone of backpropagation. For composed functions, it generalizes from scalars to vectors via Jacobians.

Scalar Chain Rule

ddxf(g(x))=f(g(x))g(x)\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)

Vector Chain Rule

If f:RnRm\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m and g:RpRn\mathbf{g}: \mathbb{R}^p \to \mathbb{R}^n, then the Jacobian of the composition is the product of Jacobians:

f(g(x))x=fggx=JfJg\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{x}))}{\partial \mathbf{x}} = \frac{\partial \mathbf{f}}{\partial \mathbf{g}} \cdot \frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \mathbf{J}_f \cdot \mathbf{J}_g

The dimensions work out: (m×n)(n×p)=(m×p)(m \times n) \cdot (n \times p) = (m \times p).

Backpropagation as Chain Rule

A neural network layer computes h=σ(Wx+b)\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b}). Let z=Wx+b\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}. Then:

LW=LhhzzW\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{W}}

Breaking this down:

  1. Lh\frac{\partial \mathcal{L}}{\partial \mathbf{h}} — the upstream gradient (from the next layer)
  2. hz=diag(σ(z))\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \text{diag}(\sigma'(\mathbf{z})) — the Jacobian of the activation (diagonal for element-wise activations)
  3. zW\frac{\partial \mathbf{z}}{\partial \mathbf{W}} — the Jacobian of the linear layer

The result is:

LW=δxT,where δ=Lz=Lhσ(z)\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \boldsymbol{\delta} \mathbf{x}^T, \quad \text{where } \boldsymbol{\delta} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}} \odot \sigma'(\mathbf{z})

The gradient with respect to the weight matrix is an outer product of the error signal and the input.

Gradients with Respect to Matrices

Derivative of Trace Expressions

Many loss functions in ML can be written using the trace. The trace has convenient derivative rules:

Xtr(AX)=ATXtr(XTA)=AXtr(XTAX)=(A+AT)XXtr(AXB)=ATBT\begin{aligned} \frac{\partial}{\partial \mathbf{X}} \text{tr}(\mathbf{A}\mathbf{X}) &= \mathbf{A}^T \\[6pt] \frac{\partial}{\partial \mathbf{X}} \text{tr}(\mathbf{X}^T\mathbf{A}) &= \mathbf{A} \\[6pt] \frac{\partial}{\partial \mathbf{X}} \text{tr}(\mathbf{X}^T\mathbf{A}\mathbf{X}) &= (\mathbf{A} + \mathbf{A}^T)\mathbf{X} \\[6pt] \frac{\partial}{\partial \mathbf{X}} \text{tr}(\mathbf{A}\mathbf{X}\mathbf{B}) &= \mathbf{A}^T\mathbf{B}^T \end{aligned}

Derivative of Log-Determinant

For a positive definite matrix X\mathbf{X}:

Xlogdet(X)=XT\frac{\partial}{\partial \mathbf{X}} \log\det(\mathbf{X}) = \mathbf{X}^{-T}

If X\mathbf{X} is symmetric: Xlogdet(X)=X1\frac{\partial}{\partial \mathbf{X}} \log\det(\mathbf{X}) = \mathbf{X}^{-1}.

This appears in maximum likelihood estimation for the multivariate Gaussian, where the log-likelihood involves logdet(Σ)\log\det(\boldsymbol{\Sigma}).

Worked Example: Deriving the Linear Regression Gradient

Derive the gradient of the mean squared error loss:

L(w)=12mXwy22\mathcal{L}(\mathbf{w}) = \frac{1}{2m}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2

Let r=Xwy\mathbf{r} = \mathbf{X}\mathbf{w} - \mathbf{y} be the residual. Then L=12mrTr\mathcal{L} = \frac{1}{2m}\mathbf{r}^T\mathbf{r}.

wL=12mw(Xwy)T(Xwy)=12mw(wTXTXw2yTXw+yTy)=12m(2XTXw2XTy)=1mXT(Xwy)\begin{aligned} \nabla_\mathbf{w} \mathcal{L} &= \frac{1}{2m} \nabla_\mathbf{w} (\mathbf{X}\mathbf{w} - \mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y}) \\[6pt] &= \frac{1}{2m} \nabla_\mathbf{w} (\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{y}^T\mathbf{y}) \\[6pt] &= \frac{1}{2m} (2\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\mathbf{X}^T\mathbf{y}) \\[6pt] &= \frac{1}{m}\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y}) \end{aligned}

Setting wL=0\nabla_\mathbf{w}\mathcal{L} = \mathbf{0} gives XTXw=XTy\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y} — the normal equation.

import numpy as np

m, n = 100, 5
X = np.random.randn(m, n)
y = np.random.randn(m)
w = np.zeros(n)

lr = 0.01
for _ in range(1000):
    grad = (1/m) * X.T @ (X @ w - y)  # The derived gradient
    w -= lr * grad

# Compare with closed-form solution
w_exact = np.linalg.solve(X.T @ X, X.T @ y)
print(f"Max difference: {np.max(np.abs(w - w_exact)):.6f}")

Automatic Differentiation

In practice, you rarely compute gradients by hand. Modern frameworks use automatic differentiation (autodiff), which applies the chain rule mechanically.

There are two modes:

Forward mode: Computes fxi\frac{\partial \mathbf{f}}{\partial x_i} for one input at a time. Efficient when outputs >> inputs.

Reverse mode: Computes fx\frac{\partial f}{\partial \mathbf{x}} for one output at a time. Efficient when inputs >> outputs.

Since ML loss functions have one scalar output and millions of inputs, reverse mode (backpropagation) is used. One backward pass computes gradients with respect to all parameters simultaneously.

Key insight: Backpropagation is just reverse-mode automatic differentiation. Understanding matrix calculus tells you what is being computed. Autodiff handles the how efficiently — but when debugging gradient issues, knowing the math is essential.

Why This Matters for ML

  • Gradient descent requires wL\nabla_\mathbf{w} \mathcal{L} — computing this is a matrix calculus problem.
  • Backpropagation is the chain rule applied to composed functions through network layers.
  • The Hessian controls convergence speed and determines whether a critical point is a minimum or saddle point.
  • Newton’s method and quasi-Newton methods (L-BFGS, Adam) approximate or use Hessian information.
  • Normalizing flows require efficient computation of Jacobian determinants.
  • Natural gradient descent uses the Fisher information matrix (expected Hessian of the log-likelihood).

Summary

  • The gradient xf\nabla_\mathbf{x} f points in the direction of steepest ascent and has the same shape as x\mathbf{x}.
  • The Jacobian is the matrix of all first partial derivatives of a vector-valued function.
  • The Hessian is the matrix of second derivatives, encoding curvature of the loss landscape.
  • The chain rule for vectors is Jacobian multiplication — this is backpropagation.
  • Key identities: (aTx)=a\nabla(\mathbf{a}^T\mathbf{x}) = \mathbf{a}, (xTAx)=2Ax\nabla(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x} (symmetric A\mathbf{A}), Axb2=2AT(Axb)\nabla\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = 2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b}).
  • Automatic differentiation (reverse mode = backpropagation) computes these gradients efficiently.
  • Next, we extend linear algebra to higher dimensions with tensor operations.

References

  • Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook. matrixcookbook.com
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 4, 6. MIT Press. deeplearningbook.org
  • Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. stanford.edu/~boyd/cvxbook
  • Baydin, A. G., et al. (2018). Automatic Differentiation in Machine Learning: a Survey. arXiv:1502.05767

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay