The Chain Rule and Computational Graphs: The Engine Behind Backpropagation

Calculus & Optimization Series 4 / 18

Why the Chain Rule is Everything

A neural network is a composition of functions — layer after layer of linear transformations and nonlinear activations. To train it, we need the derivative of the loss with respect to every parameter, which means differentiating through the entire composition.

The chain rule tells us exactly how to do this. It is the mathematical principle behind backpropagation, the algorithm that makes gradient-based training of deep networks practical. Without it, there would be no efficient way to compute gradients for networks with more than one layer.

This article builds on partial derivatives and gradients and connects directly to how frameworks like PyTorch and TensorFlow compute gradients.

The Single-Variable Chain Rule

If $y = f(g(x))$ , the derivative of the composition is:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

where $u = g(x)$ . In function notation:

[f(g(x))]' = f'(g(x)) \cdot g'(x)

Intuition: If $u$ changes at rate $g'(x)$ with respect to $x$ , and $y$ changes at rate $f'(u)$ with respect to $u$ , then $y$ changes at rate $f'(u) \cdot g'(x)$ with respect to $x$ . Rates of change multiply through compositions.

Worked Examples

Example 1: $y = e^{x^2}$ . Let $u = x^2$ , so $y = e^u$ .

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = e^u \cdot 2x = 2x \, e^{x^2}

Example 2: $y = \ln(\sin x)$ . Let $u = \sin x$ , so $y = \ln u$ .

\frac{dy}{dx} = \frac{1}{u} \cdot \cos x = \frac{\cos x}{\sin x} = \cot x
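Both results can be sanity-checked numerically. The sketch below compares each chain-rule derivative with a central finite difference; the helper name `central_difference`, the step size `h`, and the test point `x0 = 0.7` are illustrative assumptions, not part of the derivation.

```python
import math

def central_difference(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example 1: y = exp(x^2), chain-rule derivative 2x * exp(x^2)
def example1(x):
    return math.exp(x ** 2)

def example1_derivative(x):
    return 2 * x * math.exp(x ** 2)

# Example 2: y = ln(sin x), chain-rule derivative cot x (valid where sin x > 0)
def example2(x):
    return math.log(math.sin(x))

def example2_derivative(x):
    return math.cos(x) / math.sin(x)

x0 = 0.7  # arbitrary test point with sin(x0) > 0
err1 = abs(central_difference(example1, x0) - example1_derivative(x0))
err2 = abs(central_difference(example2, x0) - example2_derivative(x0))
print(err1, err2)  # both errors should be tiny
```

A finite difference only approximates the derivative, so agreement is expected up to a small numerical error rather than exactly.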

Extended Chain Rule

For longer compositions $y = f(g(h(x)))$ , the chain rule extends naturally:

\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)

Each link in the chain contributes a multiplicative factor. A neural network with $L$ layers is exactly this kind of long composition, so its gradient contains a product of roughly $L$ factors. This is why very deep networks can suffer from vanishing or exploding gradients.
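As a minimal sketch of "one factor per link", the function below walks a composition from the inside out and multiplies the local derivatives as it goes. The names `chain_derivative` and `links`, and the particular choice $f(g(h(x))) = e^{\sin(x^2)}$, are hypothetical and chosen only for illustration.

```python
import math

def chain_derivative(links, x):
    """Value and derivative of a composition given (function, derivative) pairs.

    `links` is ordered innermost first, so [(h, h'), (g, g'), (f, f')]
    represents f(g(h(x))).
    """
    value, derivative = x, 1.0
    for func, dfunc in links:
        derivative *= dfunc(value)  # local derivative at this link's input
        value = func(value)         # output becomes the next link's input
    return value, derivative

links = [
    (lambda t: t ** 2, lambda t: 2 * t),  # h(x) = x^2
    (math.sin, math.cos),                 # g(u) = sin u
    (math.exp, math.exp),                 # f(v) = exp v
]
x0 = 0.5
y, dy_dx = chain_derivative(links, x0)

# Hand-derived result: exp(sin(x^2)) * cos(x^2) * 2x
expected = math.exp(math.sin(x0 ** 2)) * math.cos(x0 ** 2) * 2 * x0
```

Adding another layer to the composition simply appends one more pair to `links` and one more factor to the product.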

The Multivariable Chain Rule

When intermediate quantities depend on multiple inputs, the chain rule involves summation. If $f$ depends on $z_1, z_2, \ldots, z_k$ , each of which depends on $x$ , then:

\frac{\partial f}{\partial x} = \sum_{i=1}^{k} \frac{\partial f}{\partial z_i} \cdot \frac{\partial z_i}{\partial x}

This summation is the mathematical reason why gradients from different paths through a network add up at shared parameters.

Worked Example

Let $f(u, v) = u^2 + uv$ where $u = 2x + y$ and $v = x - 3y$ . Find $\frac{\partial f}{\partial x}$ :

\begin{aligned} \frac{\partial f}{\partial x} &= \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial x} + \frac{\partial f}{\partial v} \cdot \frac{\partial v}{\partial x} \\[6pt] &= (2u + v) \cdot 2 + u \cdot 1 \\[6pt] &= 2(2u + v) + u \\[6pt] &= 5u + 2v \end{aligned}

Substituting back: $5(2x + y) + 2(x - 3y) = 12x - y$ .
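The result can be checked in code. The sketch below evaluates the two-path sum directly, compares it with the closed form $12x - y$, and cross-checks against a finite difference; the function names and the test point are assumptions made for the example.

```python
def f(u, v):
    return u ** 2 + u * v

def df_dx_chain(x, y):
    """Sum the contributions of the two paths x -> u -> f and x -> v -> f."""
    u, v = 2 * x + y, x - 3 * y
    df_du, df_dv = 2 * u + v, u   # partials of f with respect to u and v
    du_dx, dv_dx = 2.0, 1.0       # partials of u and v with respect to x
    return df_du * du_dx + df_dv * dv_dx

def df_dx_closed_form(x, y):
    return 12 * x - y

def df_dx_numeric(x, y, h=1e-6):
    """Central finite difference in x with y held fixed."""
    def composed(x_):
        return f(2 * x_ + y, x_ - 3 * y)
    return (composed(x + h) - composed(x - h)) / (2 * h)

x0, y0 = 1.5, -2.0  # arbitrary test point
print(df_dx_chain(x0, y0), df_dx_closed_form(x0, y0), df_dx_numeric(x0, y0))
```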

Key insight: When a variable influences the output through multiple paths, the total derivative is the sum of the contributions from each path. Backpropagation depends on this: it systematically accumulates the contributions from every path through the computational graph.

Computational Graphs

A computational graph represents a composite function as a directed acyclic graph (DAG). Each node performs a single operation, and edges carry values between operations.

Building a Graph

Consider the loss for a single training example with a linear model:

L = (wx + b - y)^2

The computational graph breaks this into elementary operations:

x, w --> [*] --> p = wx
p, b --> [+] --> q = wx + b
q, y --> [-] --> r = wx + b - y
r    --> [^2] --> L = r^2

Each node stores its output during the forward pass. During the backward pass, it uses those stored values to evaluate its local derivative.

Forward Pass

The forward pass evaluates the function from inputs to output. With $w = 2$ , $x = 3$ , $b = 1$ , $y = 5$ :

\begin{aligned} p &= wx = 6 \\[6pt] q &= p + b = 7 \\[6pt] r &= q - y = 2 \\[6pt] L &= r^2 = 4 \end{aligned}

Backward Pass (Backpropagation)

The backward pass applies the chain rule in reverse, starting from the output and propagating gradients back to each input.

Each node computes: (incoming gradient) $\times$ (local derivative).

Step 1 — Start at the output:

\frac{\partial L}{\partial L} = 1

Step 2 — Through the squaring node ( $L = r^2$ , local derivative $2r$ ):

\frac{\partial L}{\partial r} = 1 \cdot 2r = 2(2) = 4

Step 3 — Through the subtraction node ( $r = q - y$ ):

\frac{\partial L}{\partial q} = 4 \cdot 1 = 4 \qquad \frac{\partial L}{\partial y} = 4 \cdot (-1) = -4

Step 4 — Through the addition node ( $q = p + b$ ):

\frac{\partial L}{\partial p} = 4 \cdot 1 = 4 \qquad \frac{\partial L}{\partial b} = 4 \cdot 1 = 4

Step 5 — Through the multiplication node ( $p = wx$ ):

\frac{\partial L}{\partial w} = 4 \cdot x = 4(3) = 12 \qquad \frac{\partial L}{\partial x} = 4 \cdot w = 4(2) = 8

We can verify: $L = (wx + b - y)^2$ , so $\frac{\partial L}{\partial w} = 2(wx + b - y) \cdot x = 2(2)(3) = 12$ . It matches.
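The five steps above translate almost line for line into code. This is a hand-written sketch for this one graph, not a general implementation; the function names `forward` and `backward` and the `cache` dictionary are illustrative choices.

```python
def forward(w, x, b, y):
    """Evaluate L = (w*x + b - y)^2 and keep the intermediate values."""
    p = w * x
    q = p + b
    r = q - y
    loss = r ** 2
    return loss, {"p": p, "q": q, "r": r}

def backward(w, x, cache):
    """Apply (incoming gradient) * (local derivative) node by node, in reverse."""
    grad_loss = 1.0                      # Step 1: dL/dL
    grad_r = grad_loss * 2 * cache["r"]  # Step 2: squaring node
    grad_q = grad_r * 1.0                # Step 3: subtraction node, q branch
    grad_y = grad_r * -1.0               # Step 3: subtraction node, y branch
    grad_p = grad_q * 1.0                # Step 4: addition node, p branch
    grad_b = grad_q * 1.0                # Step 4: addition node, b branch
    grad_w = grad_p * x                  # Step 5: multiplication node, w branch
    grad_x = grad_p * w                  # Step 5: multiplication node, x branch
    return {"w": grad_w, "x": grad_x, "b": grad_b, "y": grad_y}

loss, cache = forward(w=2.0, x=3.0, b=1.0, y=5.0)
grads = backward(w=2.0, x=3.0, cache=cache)
print(loss, grads)
```

With the values used in the text, this reproduces $L = 4$ and the gradients 12, 8, 4, and $-4$ for $w$, $x$, $b$, and $y$.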

Backpropagation Through a Neuron

A single neuron with sigmoid activation computes:

z = \mathbf{w}^T\mathbf{x} + b, \qquad a = \sigma(z), \qquad L = (a - y)^2

Forward pass (with $\mathbf{w} = [0.5, -0.3]^T$ , $\mathbf{x} = [1, 2]^T$ , $b = 0.1$ , $y = 1$ ):

\begin{aligned} z &= 0.5(1) + (-0.3)(2) + 0.1 = 0.0 \\[6pt] a &= \sigma(0) = 0.5 \\[6pt] L &= (0.5 - 1)^2 = 0.25 \end{aligned}

Backward pass:

\begin{aligned} \frac{\partial L}{\partial a} &= 2(a - y) = 2(0.5 - 1) = -1.0 \\[6pt] \frac{\partial L}{\partial z} &= \frac{\partial L}{\partial a} \cdot \sigma'(z) = (-1.0)(0.25) = -0.25 \\[6pt] \frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial z} \cdot x_1 = (-0.25)(1) = -0.25 \\[6pt] \frac{\partial L}{\partial w_2} &= \frac{\partial L}{\partial z} \cdot x_2 = (-0.25)(2) = -0.50 \\[6pt] \frac{\partial L}{\partial b} &= \frac{\partial L}{\partial z} \cdot 1 = -0.25 \end{aligned}

The negative gradients tell us to increase all three parameters to reduce the loss — which makes sense because the prediction $a = 0.5$ is below the target $y = 1$ .
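The neuron example can be reproduced with a few lines of plain Python. The sketch below also takes one gradient-descent step to show the loss going down; the learning rate `lr = 0.1` and the function name `neuron_loss_and_grads` are assumptions introduced for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_loss_and_grads(w, x, b, y):
    """Forward and backward pass for a = sigmoid(w.x + b), L = (a - y)^2."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(z)
    loss = (a - y) ** 2
    dL_da = 2 * (a - y)
    dL_dz = dL_da * a * (1 - a)        # sigmoid'(z) = a * (1 - a)
    grad_w = [dL_dz * xi for xi in x]  # dz/dw_i = x_i
    grad_b = dL_dz                     # dz/db = 1
    return loss, grad_w, grad_b

w, x, b, y = [0.5, -0.3], [1.0, 2.0], 0.1, 1.0
loss, grad_w, grad_b = neuron_loss_and_grads(w, x, b, y)

# One gradient-descent step with a hypothetical learning rate
lr = 0.1
w_new = [wi - lr * gi for wi, gi in zip(w, grad_w)]
b_new = b - lr * grad_b
new_loss, _, _ = neuron_loss_and_grads(w_new, x, b_new, y)
print(loss, new_loss)
```

Because every gradient is negative, subtracting `lr` times the gradient increases all three parameters, and the loss after the step is lower than before.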

Automatic Differentiation

Modern deep learning frameworks do not require manual derivation of gradients. They use automatic differentiation (autodiff), which mechanically applies the chain rule to computational graphs.

Forward Mode vs Reverse Mode

There are two ways to propagate derivatives through a graph:

	Forward mode	Reverse mode
Direction	Input $\to$ output	Output $\to$ input
Computes	Jacobian-vector product $\mathbf{J}\mathbf{v}$	Vector-Jacobian product $\mathbf{v}^T\mathbf{J}$
Passes for a full Jacobian	One pass per input	One pass per output
Efficient when	Few inputs, many outputs	Many inputs, few outputs

Key insight: Neural network training has many inputs (millions of parameters) and one output (scalar loss). This makes reverse mode autodiff — which is exactly backpropagation — the efficient choice. One backward pass gives the gradient with respect to all parameters simultaneously.

Forward mode would require a separate pass for each parameter — millions of passes versus one. This asymmetry is why backpropagation was such a breakthrough.
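To make the "one pass per input" cost concrete, here is a minimal forward-mode sketch based on dual numbers, applied to the linear-model loss $L = (wx + b - y)^2$ from earlier. The `Dual` class and `forward_mode_gradient` helper are hypothetical names, and only the three operations this loss needs are implemented.

```python
class Dual:
    """A value paired with its derivative (tangent) along one input direction."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    @staticmethod
    def lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.tangent - other.tangent)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule carried along with the value
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__


def loss(w, x, b, y):
    r = w * x + b - y
    return r * r


def forward_mode_gradient(func, inputs):
    """One forward pass per input: seed that input's tangent with 1."""
    grads = []
    for i in range(len(inputs)):
        seeded = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(inputs)]
        grads.append(func(*seeded).tangent)
    return grads

grads = forward_mode_gradient(loss, [2.0, 3.0, 1.0, 5.0])  # order: w, x, b, y
print(grads)
```

The loop in `forward_mode_gradient` runs the function once per input, which is harmless for four inputs but becomes the bottleneck when the inputs are millions of parameters.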

How Frameworks Implement Autodiff

PyTorch builds the computational graph dynamically as operations execute, whereas TensorFlow 1.x defines a static graph before running it (TensorFlow 2.x executes eagerly by default). In either case, each operation registers:

Its output tensor
A reference to the backward function
References to input tensors

In PyTorch, calling .backward() on the output traverses the graph in reverse topological order, calling each backward function and accumulating gradients.
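The bookkeeping described above can be imitated in a toy scalar engine. This is a sketch of the general idea only and not how PyTorch or TensorFlow are implemented internally; the `Node` class, its attribute names, and the restriction to three operations between `Node` objects are all assumptions made for brevity.

```python
class Node:
    """Scalar graph node: output value, input references, backward function."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out.backward_fn = backward_fn
        return out

    def __sub__(self, other):
        out = Node(self.value - other.value, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad -= out.grad
        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other))
        def backward_fn():
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out.backward_fn = backward_fn
        return out

    def backward(self):
        """Visit nodes in reverse topological order, accumulating gradients."""
        order, visited = [], set()
        def visit(node):
            if id(node) not in visited:
                visited.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0  # dL/dL
        for node in reversed(order):
            node.backward_fn()

w, x, b, y = Node(2.0), Node(3.0), Node(1.0), Node(5.0)
r = w * x + b - y
loss = r * r
loss.backward()
print(w.grad, x.grad, b.grad, y.grad)
```

Note the use of `+=` in every backward function: `r` feeds the final multiplication twice, and accumulation is what sums the two contributions.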

Common Gradient Patterns

Certain operations appear so frequently that their local gradients are worth memorizing:

Operation	Forward	Local gradient (backward)
Addition: $c = a + b$	$c = a + b$	$\frac{\partial c}{\partial a} = 1, \; \frac{\partial c}{\partial b} = 1$
Multiplication: $c = a \cdot b$	$c = ab$	$\frac{\partial c}{\partial a} = b, \; \frac{\partial c}{\partial b} = a$
ReLU: $c = \max(0, a)$	$c = \max(0, a)$	$\frac{\partial c}{\partial a} = \mathbb{1}[a > 0]$
Sigmoid: $c = \sigma(a)$	$c = \sigma(a)$	$\frac{\partial c}{\partial a} = c(1-c)$
Matrix multiply: $\mathbf{C} = \mathbf{A}\mathbf{B}$	$\mathbf{C} = \mathbf{AB}$	$\frac{\partial L}{\partial \mathbf{A}} = \frac{\partial L}{\partial \mathbf{C}}\mathbf{B}^T, \; \frac{\partial L}{\partial \mathbf{B}} = \mathbf{A}^T\frac{\partial L}{\partial \mathbf{C}}$

Notice that addition passes the incoming gradient unchanged to both inputs, multiplication scales it by the value of the other input, and ReLU acts as a gate (passes or blocks the gradient). See matrix calculus for more on matrix-level derivatives.
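The matrix-multiply rule is the least obvious entry in the table, so a numerical check may help. The sketch below uses plain nested lists and a scalar loss $L = \sum_{ij} G_{ij} C_{ij}$, so that $\frac{\partial L}{\partial \mathbf{C}} = \mathbf{G}$; it then evaluates $\frac{\partial L}{\partial \mathbf{A}} = \mathbf{G}\mathbf{B}^T$ together with the companion formula $\frac{\partial L}{\partial \mathbf{B}} = \mathbf{A}^T\mathbf{G}$. The matrices, the helper names, and the choice of loss are illustrative assumptions.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def weighted_sum(C, G):
    """Scalar loss L = sum_ij G_ij * C_ij, so dL/dC = G."""
    return sum(g * c for g_row, c_row in zip(G, C) for g, c in zip(g_row, c_row))

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # 2 x 3
B = [[1.0, 0.5], [-1.0, 2.0], [0.0, 3.0]]  # 3 x 2
G = [[1.0, -2.0], [0.5, 4.0]]              # upstream gradient dL/dC, 2 x 2

grad_A = matmul(G, transpose(B))  # dL/dA = (dL/dC) B^T, same shape as A
grad_B = matmul(transpose(A), G)  # dL/dB = A^T (dL/dC), same shape as B

def numeric_grad_A(i, j, h=1e-6):
    """Central finite difference of the loss with respect to A[i][j]."""
    plus = [row[:] for row in A]
    minus = [row[:] for row in A]
    plus[i][j] += h
    minus[i][j] -= h
    loss_plus = weighted_sum(matmul(plus, B), G)
    loss_minus = weighted_sum(matmul(minus, B), G)
    return (loss_plus - loss_minus) / (2 * h)

max_error = max(abs(numeric_grad_A(i, j) - grad_A[i][j])
                for i in range(2) for j in range(3))
print(max_error)
```

A useful habit when deriving such formulas is to check shapes: each gradient must have the same shape as the matrix it belongs to, which already rules out most wrong orderings of the factors.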

Vanishing and Exploding Gradients

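The chain rule multiplies local gradients together. For a network with $L$ layers (note that $L$ denotes the loss in $\partial L$ and the number of layers in the subscript and the product limit):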

\frac{\partial L}{\partial \mathbf{W}_1} = \frac{\partial L}{\partial \mathbf{a}_L} \cdot \prod_{\ell=2}^{L} \frac{\partial \mathbf{a}_\ell}{\partial \mathbf{a}_{\ell-1}} \cdot \frac{\partial \mathbf{a}_1}{\partial \mathbf{W}_1}

If the norm of each factor $\|\frac{\partial \mathbf{a}_\ell}{\partial \mathbf{a}_{\ell-1}}\|$ is consistently below 1, the product shrinks exponentially with depth (vanishing gradients). If it is consistently above 1, the product grows exponentially (exploding gradients).
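A scalar caricature shows how quickly this happens. The sketch below multiplies the same factor once per layer; the value 0.25 is the sigmoid derivative at zero used earlier in this article, while 1.5 and the depths 1, 10, and 50 are arbitrary illustrative choices. Real networks multiply Jacobian matrices rather than identical scalars, so this only conveys the trend.

```python
def gradient_scale(per_layer_factor, depth):
    """Magnitude of a product of `depth` identical scalar factors."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_factor
    return scale

depths = (1, 10, 50)
shrinking = [gradient_scale(0.25, n) for n in depths]  # factor below 1
growing = [gradient_scale(1.5, n) for n in depths]     # factor above 1

for n, small, large in zip(depths, shrinking, growing):
    print(n, small, large)
```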

Mitigations include:

ReLU (gradient is exactly 0 or 1, no shrinkage for active neurons)
Residual connections ( $\mathbf{a}_\ell = \mathbf{a}_{\ell-1} + f(\mathbf{a}_{\ell-1})$ , which adds an identity term to the gradient product)
Gradient clipping (cap the gradient norm to prevent explosions)
Careful initialization (Xavier/He initialization scales weights to preserve gradient magnitude)
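Of the mitigations listed, gradient clipping is the simplest to sketch. The version below clips by global L2 norm on a flat list of gradient values; the function name `clip_by_norm` and the threshold `max_norm = 1.0` are assumptions for the example, and deep learning frameworks ship their own utilities for this.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)  # already small enough, return a copy unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5, gets rescaled
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, left as is
print(clipped, unchanged)
```

Clipping rescales the whole vector by one factor, so the direction of the update is preserved and only its length is capped.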

Why This Matters for ML

The chain rule and computational graphs are the foundation of modern deep learning:

Backpropagation is the chain rule applied to computational graphs — nothing more, nothing less
Autodiff in PyTorch/TensorFlow mechanizes this, freeing practitioners from manual gradient derivation
Reverse mode is efficient because neural networks map many parameters to one scalar loss
Vanishing/exploding gradients are direct consequences of the multiplicative chain rule — understanding this guides architectural choices (ReLU, residual connections, normalization)
These gradients feed into gradient descent to update parameters

Summary

The chain rule for compositions: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ — rates of change multiply
The multivariable chain rule sums contributions from all paths: $\frac{\partial f}{\partial x} = \sum_i \frac{\partial f}{\partial z_i} \frac{\partial z_i}{\partial x}$
Computational graphs decompose functions into elementary operations, enabling systematic gradient computation
Backpropagation = reverse-mode autodiff = chain rule applied backwards through the graph
Reverse mode is efficient for many-inputs-to-one-output (the neural network training setting)
Vanishing/exploding gradients arise from multiplying many factors in the chain rule
Next: Taylor series and approximation show how derivatives yield local function models
