Why the Chain Rule is Everything
A neural network is a composition of functions — layer after layer of linear transformations and nonlinear activations. To train it, we need the derivative of the loss with respect to every parameter, which means differentiating through the entire composition.
The chain rule tells us exactly how to do this. It is the mathematical principle behind backpropagation, the algorithm that makes deep learning possible. Without the chain rule, we could not train networks with more than one layer.
This article builds on partial derivatives and gradients and connects directly to how frameworks like PyTorch and TensorFlow compute gradients.
The Single-Variable Chain Rule
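If $y = f(g(x))$, the derivative of the composition is:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
where $u = g(x)$. In function notation:
$$(f \circ g)'(x) = f'(g(x)) \, g'(x)$$
Intuition: If $y$ changes at rate $\frac{dy}{du}$ with respect to $u$, and $u$ changes at rate $\frac{du}{dx}$ with respect to $x$, then $y$ changes at rate $\frac{dy}{du} \cdot \frac{du}{dx}$ with respect to $x$. Rates of change multiply through compositions.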
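The multiplication of rates can be checked numerically. The sketch below assumes an illustrative composition $y = \sin(x^2)$ (chosen here for demonstration, not taken from the text) and uses hypothetical helper names. It compares the chain-rule derivative with a central finite difference.

```python
import math

def g(x):
    """Inner function u = g(x) = x^2 (illustrative choice)."""
    return x * x

def f(u):
    """Outer function y = f(u) = sin(u) (illustrative choice)."""
    return math.sin(u)

def chain_rule_derivative(x):
    """dy/dx = f'(g(x)) * g'(x) = cos(x^2) * 2x."""
    du_dx = 2.0 * x          # rate of u with respect to x
    dy_du = math.cos(g(x))   # rate of y with respect to u
    return dy_du * du_dx

def finite_difference(x, eps=1e-6):
    """Central-difference estimate of dy/dx, used only as a sanity check."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x0 = 1.3
analytic = chain_rule_derivative(x0)
numeric = finite_difference(x0)
print(f"chain rule: {analytic:.6f}, finite difference: {numeric:.6f}")
```

The two numbers should agree to several decimal places; any composition with known inner and outer derivatives could be substituted.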
Worked Examples
Example 1: . Let , so .
Example 2: . Let , so .
Extended Chain Rule
For longer compositions $y = f_n(f_{n-1}(\cdots f_1(x)))$, the chain rule extends naturally. Writing $u_0 = x$ and $u_k = f_k(u_{k-1})$:
$$\frac{dy}{dx} = \prod_{k=1}^{n} f_k'(u_{k-1})$$
Each link in the chain contributes a multiplicative factor. A neural network with $n$ layers is exactly this kind of long composition, and its gradient is a product of $n$ terms, which is why very deep networks can suffer from vanishing or exploding gradients.
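As a rough sketch of the extended rule, the snippet below walks a chain of (function, derivative) pairs and multiplies the local derivatives as it goes. The three links and the name `value_and_derivative` are illustrative assumptions, not part of the original text.

```python
import math

# Each link is a (function, derivative) pair, listed innermost first.
# These three links are illustrative choices: y = tanh(3x + 1)^2.
links = [
    (lambda x: 3.0 * x + 1.0, lambda x: 3.0),
    (math.tanh, lambda u: 1.0 - math.tanh(u) ** 2),
    (lambda u: u ** 2, lambda u: 2.0 * u),
]

def value_and_derivative(x, links):
    """Evaluate the composition and accumulate the product of local derivatives."""
    value, derivative = x, 1.0
    for fn, dfn in links:
        derivative *= dfn(value)  # local derivative evaluated at this link's input
        value = fn(value)         # output becomes the next link's input
    return value, derivative

y, dy_dx = value_and_derivative(0.2, links)
print(y, dy_dx)
```

Adding more links only adds more factors to the product, which is the structure that deep networks inherit.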
The Multivariable Chain Rule
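When intermediate quantities depend on multiple inputs, the chain rule involves summation. If $z$ depends on $u_1, \dots, u_m$, each of which depends on $x$, then:
$$\frac{\partial z}{\partial x} = \sum_{i=1}^{m} \frac{\partial z}{\partial u_i} \frac{\partial u_i}{\partial x}$$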
This summation is the mathematical reason why gradients from different paths through a network add up at shared parameters.
Worked Example
Let where and . Find :
Substituting back: .
Key insight: When a variable influences the output through multiple paths, the total derivative is the sum of contributions from each path. This is the fundamental reason backpropagation works — it systematically accounts for all paths through the computational graph.
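The sum over paths can be made concrete with a small sketch. It assumes a hypothetical function $z = u v$ with $u = x^2$ and $v = 3x + 1$ (chosen for illustration only), so that $x$ reaches $z$ through two paths.

```python
def total_derivative(x):
    """dz/dx for z = u * v with u = x^2 and v = 3x + 1 (illustrative choice)."""
    u, v = x ** 2, 3.0 * x + 1.0      # intermediate values
    dz_du, dz_dv = v, u               # partials of z = u * v
    du_dx, dv_dx = 2.0 * x, 3.0       # how each intermediate depends on x
    path_through_u = dz_du * du_dx
    path_through_v = dz_dv * dv_dx
    return path_through_u + path_through_v  # contributions from both paths add

print(total_derivative(2.0))  # 40.0
```

Expanding by hand gives $z = 3x^3 + x^2$, so $dz/dx = 9x^2 + 2x = 40$ at $x = 2$, matching the sum of the two path contributions.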
Computational Graphs
A computational graph represents a composite function as a directed acyclic graph (DAG). Each node performs a single operation, and edges carry values between operations.
Building a Graph
Consider the loss for a single training example with a linear model:
$$L = (wx + b - y)^2$$
The computational graph breaks this into elementary operations:
x, w --> [*] --> p = wx
p, b --> [+] --> q = wx + b
q, y --> [-] --> r = wx + b - y
r --> [^2] --> L = r^2
Each node caches its output during the forward pass and uses it to compute its local derivative during the backward pass.
Forward Pass
The forward pass evaluates the function from inputs to output. With , , , :
Backward Pass (Backpropagation)
The backward pass applies the chain rule in reverse, starting from the output and propagating gradients back to each input.
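Each node computes: (incoming gradient) × (local derivative).
Step 1. Start at the output:
$$\frac{\partial L}{\partial L} = 1$$
Step 2. Through the squaring node ($L = r^2$, local derivative $2r$):
$$\frac{\partial L}{\partial r} = 1 \cdot 2r = 2r$$
Step 3. Through the subtraction node ($r = q - y$):
$$\frac{\partial L}{\partial q} = \frac{\partial L}{\partial r} \cdot 1 = 2r$$
Step 4. Through the addition node ($q = p + b$):
$$\frac{\partial L}{\partial p} = \frac{\partial L}{\partial q} \cdot 1 = 2r, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial q} \cdot 1 = 2r$$
Step 5. Through the multiplication node ($p = wx$):
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial p} \cdot x = 2rx, \qquad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial p} \cdot w = 2rw$$
We can verify: $L = (wx + b - y)^2$, so $\frac{\partial L}{\partial w} = 2(wx + b - y)\,x = 2rx$. It matches.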
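The forward and backward passes through this graph can be sketched in a few lines. The input values below ($x = 2$, $w = 3$, $b = 1$, $y = 5$) are assumptions chosen for easy arithmetic, and `forward_backward` is a hypothetical helper name.

```python
def forward_backward(x, w, b, y):
    """Forward and backward pass for L = (w*x + b - y)^2, one node at a time."""
    # Forward pass: evaluate and keep every intermediate value.
    p = w * x
    q = p + b
    r = q - y
    loss = r ** 2

    # Backward pass: (incoming gradient) * (local derivative) at each node.
    dL_dL = 1.0
    dL_dr = dL_dL * 2.0 * r   # squaring node
    dL_dq = dL_dr * 1.0       # subtraction node, with respect to q
    dL_dp = dL_dq * 1.0       # addition node, with respect to p
    dL_db = dL_dq * 1.0       # addition node, with respect to b
    dL_dw = dL_dp * x         # multiplication node, with respect to w
    dL_dx = dL_dp * w         # multiplication node, with respect to x
    return loss, {"w": dL_dw, "b": dL_db, "x": dL_dx}

loss, grads = forward_backward(x=2.0, w=3.0, b=1.0, y=5.0)
print(loss, grads)  # 4.0 {'w': 8.0, 'b': 4.0, 'x': 12.0}
```

With these values $r = 2$, so the closed form $2(wx + b - y)\,x = 8$ agrees with the value propagated through the graph.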
Backpropagation Through a Neuron
A single neuron with sigmoid activation computes:
Forward pass (with , , , ):
Backward pass:
The negative gradients tell us to increase all three parameters to reduce the loss, which makes sense because the prediction is below the target.
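A minimal sketch of the same procedure in code follows. It assumes a single input $x$, weight $w$, bias $b$, pre-activation $z = wx + b$, activation $a = \sigma(z)$, and a squared-error loss $(a - y)^2$; the loss form and the numeric values are assumptions made here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward_backward(x, w, b, y):
    """One sigmoid neuron with an assumed squared-error loss (a - y)^2."""
    # Forward pass
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2

    # Backward pass
    dL_da = 2.0 * (a - y)       # loss node
    da_dz = a * (1.0 - a)       # sigmoid local derivative, reusing the cached output
    dL_dz = dL_da * da_dz
    grads = {"w": dL_dz * x, "b": dL_dz, "x": dL_dz * w}
    return loss, a, grads

# Hypothetical values: the prediction (about 0.62) is below the target 1.0.
loss, a, grads = neuron_forward_backward(x=1.0, w=0.5, b=0.0, y=1.0)
print(a, grads)
```

Because the prediction is below the target and both $x$ and $w$ are positive here, all three gradients come out negative, so a gradient descent step would increase each quantity.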
Automatic Differentiation
Modern deep learning frameworks do not require manual derivation of gradients. They use automatic differentiation (autodiff), which mechanically applies the chain rule to computational graphs.
Forward Mode vs Reverse Mode
There are two ways to propagate derivatives through a graph:
| | Forward mode | Reverse mode |
|---|---|---|
| Direction | Input → output | Output → input |
| Computes | Jacobian-vector product | Vector-Jacobian product |
| Passes for full Jacobian | One per input | One per output |
| Efficient when | Few inputs, many outputs | Many inputs, few outputs |
Key insight: Neural network training has many inputs (millions of parameters) and one output (scalar loss). This makes reverse mode autodiff — which is exactly backpropagation — the efficient choice. One backward pass gives the gradient with respect to all parameters simultaneously.
Forward mode would require a separate pass for each parameter — millions of passes versus one. This asymmetry is why backpropagation was such a breakthrough.
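To see why forward mode needs one pass per input, here is a small sketch using dual numbers, a common way to implement forward mode. The `Dual` class and the example loss $(wx + b - y)^2$ with hypothetical values are assumptions for illustration, not a description of any framework's internals.

```python
class Dual:
    """A value paired with its derivative along one chosen input direction."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    @staticmethod
    def lift(other):
        # Plain numbers are constants, so their tangent is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.tangent - other.tangent)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule carried along with the value.
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

def loss_fn(w, b, x, y):
    r = w * x + b - y
    return r * r

def forward_mode_gradient(w, b, x, y):
    """One full evaluation per parameter: seed that parameter's tangent with 1."""
    dL_dw = loss_fn(Dual(w, 1.0), Dual(b, 0.0), x, y).tangent
    dL_db = loss_fn(Dual(w, 0.0), Dual(b, 1.0), x, y).tangent
    return dL_dw, dL_db

print(forward_mode_gradient(3.0, 1.0, 2.0, 5.0))  # (8.0, 4.0)
```

Two parameters cost two evaluations of `loss_fn`; with millions of parameters the same approach would need millions of evaluations, whereas reverse mode recovers every partial derivative from a single backward sweep.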
How Frameworks Implement Autodiff
PyTorch builds the computational graph dynamically as operations execute, while TensorFlow 1.x builds it statically before execution (TensorFlow 2.x runs eagerly by default). Each operation registers:
- Its output tensor
- A reference to the backward function
- References to input tensors
When .backward() is called, the framework traverses the graph in reverse topological order, calling each backward function and accumulating gradients.
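The bookkeeping described above can be sketched with a toy scalar engine. This is a simplified illustration under stated assumptions, not how PyTorch or TensorFlow are actually implemented; the `Value` class and its attribute names are hypothetical, and all operands are assumed to be `Value` instances.

```python
class Value:
    """Scalar graph node: stores its output, its inputs, and a backward function."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents       # references to input nodes
        self.backward_fn = None      # pushes this node's gradient to its inputs

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out.backward_fn = backward_fn
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad -= out.grad
        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Depth-first post-order gives a topological order (inputs before outputs).
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        # Reverse topological order: each node runs after everything that consumes it.
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()

w, x, b, y = Value(3.0), Value(2.0), Value(1.0), Value(5.0)
r = w * x + b - y
loss = r * r
loss.backward()
print(w.grad, b.grad, x.grad)  # 8.0 4.0 12.0
```

Gradients are accumulated with `+=` so that a node used more than once (here `r` in `r * r`) receives the sum of the contributions from every path.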
Common Gradient Patterns
Certain operations appear so frequently that their local gradients are worth memorizing:
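| Operation | Forward | Local gradient (backward) |
|---|---|---|
| Addition | $z = x + y$ | $\frac{\partial z}{\partial x} = 1$, $\frac{\partial z}{\partial y} = 1$ |
| Multiplication | $z = xy$ | $\frac{\partial z}{\partial x} = y$, $\frac{\partial z}{\partial y} = x$ |
| ReLU | $z = \max(0, x)$ | $\frac{\partial z}{\partial x} = 1$ if $x > 0$, else $0$ |
| Sigmoid | $z = \sigma(x)$ | $\frac{\partial z}{\partial x} = \sigma(x)(1 - \sigma(x))$ |
| Matrix multiply | $Z = XW$ | $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} W^\top$, $\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Z}$ |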
Notice that addition passes the incoming gradient unchanged to both inputs, multiplication swaps the inputs and scales by them, and ReLU acts as a gate (passes or blocks the gradient). See matrix calculus for more on matrix-level derivatives.
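The scalar patterns can be written as small backward functions that take the upstream gradient and return the gradient for each input. The function names are hypothetical, and the matrix case is omitted to keep the sketch dependency-free.

```python
import math

def add_backward(upstream, x, y):
    """z = x + y: the gradient passes through unchanged to both inputs."""
    return upstream, upstream

def mul_backward(upstream, x, y):
    """z = x * y: each input receives the upstream gradient scaled by the other input."""
    return upstream * y, upstream * x

def relu_backward(upstream, x):
    """z = max(0, x): the gate is open only where the input was positive."""
    return upstream if x > 0 else 0.0

def sigmoid_backward(upstream, x):
    """z = sigmoid(x): local derivative is sigmoid(x) * (1 - sigmoid(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return upstream * s * (1.0 - s)

print(add_backward(2.0, 3.0, 4.0))   # (2.0, 2.0)
print(mul_backward(2.0, 3.0, 4.0))   # (8.0, 6.0)
print(relu_backward(2.0, -1.0))      # 0.0
print(sigmoid_backward(1.0, 0.0))    # 0.25
```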
Vanishing and Exploding Gradients
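The chain rule multiplies local gradients together. For a network with $n$ layers, where $h_k$ denotes the output of layer $k$ and $h_0$ the input:
$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_n} \prod_{k=1}^{n} \frac{\partial h_k}{\partial h_{k-1}}$$
If each factor has magnitude below 1, the product shrinks exponentially (vanishing gradients). If each factor has magnitude above 1, it grows exponentially (exploding gradients).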
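A quick numeric sketch shows how fast the product moves. It assumes every layer contributes the same scalar factor, which is a deliberate simplification; `gradient_product` is a hypothetical helper.

```python
def gradient_product(factor, depth):
    """Product of `depth` identical local-gradient factors (simplifying assumption)."""
    product = 1.0
    for _ in range(depth):
        product *= factor
    return product

# 0.25 is the largest value the sigmoid derivative can take (at input 0).
shrinking = gradient_product(0.25, 20)
# 1.5 is an arbitrary factor slightly above 1.
growing = gradient_product(1.5, 20)

print(f"20 layers, factor 0.25: {shrinking:.3e}")
print(f"20 layers, factor 1.5:  {growing:.3e}")
```

After only 20 layers the first product is below $10^{-12}$ while the second exceeds 3000, even though neither factor is far from 1.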
Mitigations include:
- ReLU (gradient is exactly 0 or 1, no shrinkage for active neurons)
- Residual connections ($h_k = h_{k-1} + F(h_{k-1})$, where $h_k$ is the output of layer $k$, which adds an identity term to the gradient product)
- Gradient clipping (cap the gradient norm to prevent explosions)
- Careful initialization (Xavier/He initialization scales weights to preserve gradient magnitude)
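Of these mitigations, gradient clipping is the simplest to sketch. The version below rescales a flat list of gradients so that its Euclidean norm does not exceed a threshold; `clip_by_norm` is a hypothetical helper written for illustration, not a framework API.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale `grads` so its Euclidean norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]  # same direction, capped length

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # original norm is 0.5
print(clipped, untouched)
```

Clipping by norm preserves the gradient's direction and only limits the step size, which is why it targets explosions without otherwise changing the update.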
Why This Matters for ML
The chain rule and computational graphs are the foundation of modern deep learning:
- Backpropagation is the chain rule applied to computational graphs — nothing more, nothing less
- Autodiff in PyTorch/TensorFlow mechanizes this, freeing practitioners from manual gradient derivation
- Reverse mode is efficient because neural networks map many parameters to one scalar loss
- Vanishing/exploding gradients are direct consequences of the multiplicative chain rule — understanding this guides architectural choices (ReLU, residual connections, normalization)
- These gradients feed into gradient descent to update parameters
Summary
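- The chain rule for compositions: $(f \circ g)'(x) = f'(g(x)) \, g'(x)$; rates of change multiply
- The multivariable chain rule sums contributions from all paths: $\frac{\partial z}{\partial x} = \sum_i \frac{\partial z}{\partial u_i} \frac{\partial u_i}{\partial x}$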
- Computational graphs decompose functions into elementary operations, enabling systematic gradient computation
- Backpropagation = reverse-mode autodiff = chain rule applied backwards through the graph
- Reverse mode is efficient for many-inputs-to-one-output (the neural network training setting)
- Vanishing/exploding gradients arise from multiplying many factors in the chain rule
- Next: Taylor series and approximation show how derivatives yield local function models