The Derivative as a Limit
The derivative of a function $f$ at a point $x$ measures the instantaneous rate of change: how fast $f(x)$ changes as $x$ changes by an infinitesimal amount. Building on the limits we just covered, the derivative is defined as:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

The fraction $\frac{f(x + h) - f(x)}{h}$ is the slope of the secant line through the points $(x, f(x))$ and $(x + h, f(x + h))$ on the curve. As $h \to 0$, the secant line becomes the tangent line, and its slope becomes the derivative.
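The limit can be watched numerically. The sketch below is illustrative only: the helper name `difference_quotient` and the choice of $f(x) = \sin x$ at $x = 1$ are assumptions made for this example, not part of the definition.

```python
import math

def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)  # known derivative of sin at x
errors = []
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    slope = difference_quotient(math.sin, x, h)
    errors.append(abs(slope - exact))
    print(f"h = {h:<7g} secant slope = {slope:.6f}  error = {errors[-1]:.2e}")
```

The error shrinks roughly in proportion to $h$, which is the secant slope approaching the tangent slope.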
Geometric Interpretation
The derivative $f'(a)$ is the slope of the tangent line to $y = f(x)$ at the point $(a, f(a))$. The tangent line equation is:

$$y = f(a) + f'(a)(x - a)$$

This is also the best linear approximation of $f$ near $a$, a fact that becomes central in Taylor series and gradient descent.
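A small sketch of the tangent line as a local approximation. The helper `tangent_line` and the choice of $e^x$ near $a = 0$ are assumptions made for illustration.

```python
import math

def tangent_line(f, df, a):
    """Return the linear approximation x -> f(a) + f'(a) * (x - a)."""
    return lambda x: f(a) + df(a) * (x - a)

# Illustrative choice: approximate exp near a = 0, where the tangent is 1 + x.
approx = tangent_line(math.exp, math.exp, 0.0)

gaps = []
for x in [0.5, 0.1, 0.01]:
    gap = abs(math.exp(x) - approx(x))
    gaps.append(gap)
    print(f"x = {x:<5} exp(x) = {math.exp(x):.6f}  tangent = {approx(x):.6f}  gap = {gap:.2e}")
```

The gap shrinks much faster than the distance to $a$, which is what "best linear approximation" means in practice.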
Worked Example
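Let us compute the derivative of $f(x) = x^2$ from the definition:

$$f'(x) = \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$

At $x = 3$, the slope is $f'(3) = 6$. The function is increasing there, and the rate of increase itself increases with $x$.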
Notation
Several notations for the derivative are used interchangeably:
| Notation | Read as | Context |
|---|---|---|
| $f'(x)$ | "f prime of x" | General calculus |
| $\frac{df}{dx}$ | "df dx" | Leibniz notation (emphasizes the variable) |
| $\frac{d}{dx} f$ | "d dx of f" | Operator notation |
| $\dot{x}$ | "x dot" | Physics (time derivatives) |
| $\nabla L$ | "grad L" | ML (gradient w.r.t. parameters) |
Differentiation Rules
Computing derivatives from the limit definition every time would be tedious. The following rules let us differentiate almost any function built from elementary pieces.
The Power Rule
For any real number $n$:

$$\frac{d}{dx} x^n = n x^{n-1}$$

This handles polynomials, roots ($\sqrt{x} = x^{1/2}$), and reciprocals ($1/x = x^{-1}$) in one stroke.
The Constant Multiple and Sum Rules
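$$\frac{d}{dx}\big[c\,f(x)\big] = c\,f'(x), \qquad \frac{d}{dx}\big[f(x) + g(x)\big] = f'(x) + g'(x)$$

Differentiation is linear: it distributes over addition and pulls out constants.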
The Product Rule
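$$\frac{d}{dx}\big[f(x)\,g(x)\big] = f'(x)\,g(x) + f(x)\,g'(x)$$

Example: $\frac{d}{dx}\big[x^2 e^x\big] = 2x\,e^x + x^2 e^x$.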
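The Quotient Rule

$$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2}, \qquad g(x) \neq 0$$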
The Chain Rule (Preview)
For compositions $h(x) = f(g(x))$:

$$h'(x) = f'(g(x))\,g'(x)$$
The chain rule is so important for ML that it gets its own article.
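The product, quotient, and chain rules can be sanity-checked against a finite-difference estimate. This is a minimal sketch: the helper `numeric_derivative` and the test functions $f(x) = \sin x$ and $g(x) = x^2 + 1$ are assumptions chosen for illustration.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f, df = math.sin, math.cos                        # f(x) = sin x, f'(x) = cos x
g, dg = (lambda x: x**2 + 1), (lambda x: 2 * x)   # g(x) = x^2 + 1, g'(x) = 2x

x = 0.7
checks = {
    "product":  (lambda t: f(t) * g(t), df(x) * g(x) + f(x) * dg(x)),
    "quotient": (lambda t: f(t) / g(t), (df(x) * g(x) - f(x) * dg(x)) / g(x) ** 2),
    "chain":    (lambda t: f(g(t)),     df(g(x)) * dg(x)),
}
results = {}
for name, (func, by_rule) in checks.items():
    estimate = numeric_derivative(func, x)
    results[name] = abs(estimate - by_rule)
    print(f"{name:<8} rule = {by_rule:.6f}  numeric = {estimate:.6f}")
```

Agreement to several decimal places does not prove a rule, but a mismatch is a quick way to catch an algebra slip in a hand derivation.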
Derivatives of Common Functions
These derivatives appear constantly in ML derivations. Memorize them:
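| $f(x)$ | $f'(x)$ | ML relevance |
|---|---|---|
| $x^n$ | $n x^{n-1}$ | Polynomial features |
| $e^x$ | $e^x$ | Softmax, attention |
| $\ln x$ | $\frac{1}{x}$ | Log-likelihood, cross-entropy |
| $a^x$ | $a^x \ln a$ | Exponential decay |
| $\sin x$ | $\cos x$ | Positional encoding |
| $\cos x$ | $-\sin x$ | Positional encoding |
| $\sigma(x)$ | $\sigma(x)(1 - \sigma(x))$ | Sigmoid activation |
| $\tanh x$ | $1 - \tanh^2 x$ | Tanh activation |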
The Exponential and Logarithm
The exponential function $e^x$ is its own derivative: $\frac{d}{dx} e^x = e^x$. Up to a constant factor, it is the only function with this property, which is why $e$ appears everywhere in probability (normal distribution, Poisson, softmax) and differential equations.
The natural logarithm has derivative $\frac{d}{dx} \ln x = \frac{1}{x}$ for $x > 0$. Since log-likelihoods are the foundation of maximum likelihood estimation, the derivative of $\ln x$ is arguably the most-used derivative in all of statistics.
The Sigmoid Function
The sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ maps any real number to $(0, 1)$, making it natural for probabilities. Its derivative has an elegant form:

$$\sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big)$$

Key insight: The sigmoid derivative is maximal at $x = 0$, where $\sigma'(0) = \frac{1}{4}$, and approaches zero for large $|x|$. This means gradients vanish for extreme inputs: the vanishing gradient problem that motivated ReLU and other modern activations.
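The saturation effect is easy to see numerically. The sketch below assumes the standard logistic sigmoid $\sigma(x) = 1/(1 + e^{-x})$ with derivative $\sigma(x)(1 - \sigma(x))$; the function names, the depth of 10, and the input 5.0 are illustrative choices, and the code is not hardened against overflow for very negative inputs.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative written in terms of the sigmoid itself: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:>4}  sigmoid = {sigmoid(x):.5f}  gradient = {sigmoid_grad(x):.2e}")

# Backpropagating through many saturated sigmoids multiplies small factors.
# The depth of 10 and the input 5.0 are arbitrary illustrative values.
shrink = sigmoid_grad(5.0) ** 10
print(f"product of 10 such gradients = {shrink:.2e}")
```

Even at its peak the gradient is only 0.25, so a product of many such factors shrinks quickly with depth.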
Higher-Order Derivatives
The second derivative $f''(x)$ is the derivative of the derivative: it measures how the rate of change itself changes. At a critical point, where $f'(x) = 0$, its sign classifies the point:
- $f''(x) > 0$: the function is concave up (curves upward like a bowl), so the point is a local minimum
- $f''(x) < 0$: the function is concave down (curves downward), so the point is a local maximum
- $f''(x) = 0$: inconclusive (could be an inflection point)
For a function of multiple variables, the second derivatives form the Hessian matrix, which captures curvature in every direction. We explore this in partial derivatives and gradients.
Differentiability and Smoothness
A function $f$ is differentiable at $a$ if $f'(a)$ exists, meaning the limit in the definition converges to a finite value.
Differentiable implies continuous: If $f'(a)$ exists, then $f$ is continuous at $a$. The converse is false: continuous functions can have sharp corners where the derivative does not exist.
Corners and Cusps
The absolute value $f(x) = |x|$ is continuous everywhere but not differentiable at $x = 0$. The left-hand derivative is $-1$, the right-hand derivative is $+1$, and they disagree.
This is exactly the situation with ReLU: $\mathrm{ReLU}(x) = \max(0, x)$. It has a corner at $x = 0$ where the derivative is undefined. In practice, deep learning frameworks assign $\mathrm{ReLU}'(0) = 0$ (or sometimes 1), a convention that works because the probability of any input being exactly zero is negligible.
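The disagreement between one-sided slopes at a corner can be checked directly. This is a plain-Python sketch, not any framework's implementation: `relu_grad` and its `at_zero` parameter are hypothetical names that make the convention at the corner explicit.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x, at_zero=0.0):
    """Derivative of ReLU, with an explicit convention for the corner at 0."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero  # the derivative is undefined here; this value is a convention

def one_sided_slopes(f, x, h=1e-6):
    """Left and right difference quotients; they differ at a corner."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

print(one_sided_slopes(abs, 0.0))   # slopes of |x| on each side of 0
print(one_sided_slopes(relu, 0.0))  # slopes of ReLU on each side of 0
print(relu_grad(0.0), relu_grad(0.0, at_zero=1.0))
```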
Subgradients
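For non-differentiable convex functions, the subgradient generalizes the derivative. A subgradient $g$ of a convex function $f$ at $x_0$ satisfies:

$$f(x) \geq f(x_0) + g\,(x - x_0) \quad \text{for all } x$$

The right-hand side describes a supporting line (a supporting hyperplane in higher dimensions): it touches the graph of $f$ at $x_0$ and never rises above it. Subgradients allow optimization of non-smooth functions like the L1 norm $\|w\|_1 = \sum_i |w_i|$, which is used in Lasso regularization.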
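The subgradient condition, $f(x) \geq f(x_0) + g\,(x - x_0)$ for all $x$, can be spot-checked on a grid. The helper `is_subgradient` is a hypothetical name, and passing on a finite grid is only a necessary check, not a proof.

```python
def is_subgradient(f, g, x0, test_points, tol=1e-12):
    """Check f(x) >= f(x0) + g * (x - x0) on a finite set of test points."""
    return all(f(x) >= f(x0) + g * (x - x0) - tol for x in test_points)

test_points = [i / 10 for i in range(-30, 31)]  # grid on [-3, 3]

# At the corner x0 = 0 of |x|, every slope in [-1, 1] is a subgradient.
for g in [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]:
    print(f"g = {g:>4}: {is_subgradient(abs, g, 0.0, test_points)}")
```

Away from the corner, the only slope that passes is the ordinary derivative, which is why subgradients are a generalization rather than a replacement.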
Implicit Differentiation
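Sometimes a relationship between $x$ and $y$ is defined implicitly by an equation $F(x, y) = 0$ rather than explicitly as $y = f(x)$.
Implicit differentiation differentiates both sides with respect to $x$, treating $y$ as a function of $x$ and applying the chain rule to every term that contains $y$:

$$\frac{d}{dx}\, g(y) = g'(y)\,\frac{dy}{dx}$$

Example: For the circle $x^2 + y^2 = r^2$, differentiate both sides:

$$2x + 2y\,\frac{dy}{dx} = 0 \quad\Longrightarrow\quad \frac{dy}{dx} = -\frac{x}{y} \qquad (y \neq 0)$$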
This technique becomes essential in constrained optimization, where we optimize along implicitly defined constraint surfaces.
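As a check on the technique, consider the circle $x^2 + y^2 = r^2$, for which implicit differentiation gives $\frac{dy}{dx} = -\frac{x}{y}$. The sketch below compares that slope with a finite-difference slope on the explicit upper branch; the radius $r = 2$ and the point $x = 1.2$ are arbitrary illustrative values.

```python
import math

r = 2.0  # illustrative radius

def upper_branch(x):
    """Explicit form of the upper half of the circle: y = sqrt(r^2 - x^2)."""
    return math.sqrt(r * r - x * x)

def implicit_slope(x, y):
    """dy/dx from differentiating x^2 + y^2 = r^2 implicitly: 2x + 2y y' = 0."""
    return -x / y

x = 1.2
y = upper_branch(x)
h = 1e-6
numeric = (upper_branch(x + h) - upper_branch(x - h)) / (2 * h)
print(f"implicit: {implicit_slope(x, y):.6f}  numeric: {numeric:.6f}")
```

The implicit formula needed no square root and works equally well on the lower branch, where the explicit form would have to change sign.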
Worked Example: Finding and Classifying Critical Points
A critical point is a point where $f'(x) = 0$ or $f'(x)$ is undefined. These are candidates for local minima and maxima.
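Let $f(x) = x^3 - 3x$.
Step 1: Find critical points.

$$f'(x) = 3x^2 - 3 = 3(x - 1)(x + 1)$$

Setting $f'(x) = 0$ gives $x = -1$ and $x = 1$.
Step 2: Classify with the second derivative test.

$$f''(x) = 6x, \qquad f''(-1) = -6 < 0 \ \text{(local maximum)}, \qquad f''(1) = 6 > 0 \ \text{(local minimum)}$$

Step 3: Evaluate.

$$f(-1) = 2, \qquad f(1) = -2$$

The local maximum is $f(-1) = 2$ and the local minimum is $f(1) = -2$.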
Key insight: This find-critical-points-then-classify procedure is the core idea behind all of optimization. In ML, the “function” is the loss, the “variable” is the parameter vector, and gradient descent automates the search for critical points.
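The find-then-classify procedure can be sketched in a few lines. The cubic $f(x) = x^3 - 3x$ is used here as a stand-in example (an assumption for illustration), and the helper names and tolerances are hypothetical.

```python
def f(x):
    return x**3 - 3 * x

def first_derivative(func, x, h=1e-5):
    """Central-difference estimate of the first derivative."""
    return (func(x + h) - func(x - h)) / (2 * h)

def second_derivative(func, x, h=1e-4):
    """Central-difference estimate of the second derivative."""
    return (func(x + h) - 2 * func(x) + func(x - h)) / (h * h)

def classify(func, x, tol=1e-3):
    """Second derivative test at a point already known to be critical."""
    curvature = second_derivative(func, x)
    if curvature > tol:
        return "local minimum"
    if curvature < -tol:
        return "local maximum"
    return "inconclusive"

for x in [-1.0, 1.0]:  # roots of f'(x) = 3x^2 - 3
    print(f"x = {x:+.0f}: f'(x) = {first_derivative(f, x):.1e}, "
          f"f(x) = {f(x):+.0f}, {classify(f, x)}")
```

Calling `classify` on $x^3$ at $0$ returns "inconclusive", matching the case where the second derivative vanishes.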
Why This Matters for ML
Derivatives are the computational engine of machine learning:
- Gradient = derivative of the loss with respect to model parameters. Training a neural network means computing $\frac{\partial L}{\partial \theta}$ for every parameter $\theta$.
- Gradient descent moves parameters in the direction of $-\nabla L$, the direction that decreases the loss fastest. This is formalized in our gradient descent article.
- Critical points of the loss function, where $\nabla L = 0$, are where gradient-based training comes to rest. The second derivative (Hessian) determines whether these are minima, maxima, or saddle points.
- Non-differentiable activations like ReLU work in practice thanks to subgradients and the measure-zero argument. Understanding differentiability helps you reason about when gradient-based training is valid.
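To make the gradient descent point concrete, here is a one-parameter sketch. The quadratic loss, the starting point, the learning rate, and the step count are all illustrative assumptions; real training differs in scale, not in the update rule.

```python
def loss(theta):
    """Toy quadratic loss with its minimum at theta = 3 (illustrative)."""
    return (theta - 3.0) ** 2

def loss_grad(theta):
    """dL/dtheta, computed by hand with the power and chain rules."""
    return 2.0 * (theta - 3.0)

theta = 0.0           # arbitrary starting point
learning_rate = 0.1   # arbitrary step size for this sketch
for step in range(100):
    theta -= learning_rate * loss_grad(theta)  # move against the derivative

print(f"theta = {theta:.6f}, loss = {loss(theta):.2e}, grad = {loss_grad(theta):.2e}")
```

The loop stops making progress exactly where the derivative is zero, which is the critical point of the loss.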
Next, we extend derivatives to functions of many variables with partial derivatives and gradients.
Summary
- The derivative measures the instantaneous rate of change
- Differentiation rules (power, product, quotient, chain) let us differentiate complex expressions efficiently
- $e^x$ and $\ln x$ are the most important functions in ML; know their derivatives cold
- The sigmoid derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ explains the vanishing gradient problem
- The second derivative measures curvature: at a critical point, positive means concave up (minimum), negative means concave down (maximum)
- Differentiability implies continuity, but not vice versa — ReLU is continuous but has a corner
- Subgradients generalize derivatives to non-smooth functions like the L1 norm
- Next: partial derivatives and gradients extend these ideas to multiple dimensions