Derivatives and Differentiation: Measuring Rates of Change

Calculus & Optimization Series 2 / 18

The Derivative as a Limit

The derivative of a function $f$ at a point $x$ measures the instantaneous rate of change — how fast $f(x)$ changes as $x$ changes by an infinitesimal amount. Building on the limits we just covered, the derivative is defined as:

f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

The fraction $\frac{f(x+h) - f(x)}{h}$ is the slope of a secant line connecting two points on the curve. As $h \to 0$, the secant line becomes the tangent line, and its slope becomes the derivative.

Geometric Interpretation

The derivative $f'(a)$ is the slope of the tangent line to $y = f(x)$ at the point $(a, f(a))$. The tangent line equation is:

y = f(a) + f'(a)(x - a)

This is also the best linear approximation of $f$ near $a$ — a fact that becomes central in Taylor series and gradient descent.
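A minimal Python sketch of the linear-approximation claim, using $f(x) = x^2$ at $a = 3$ and the value $f'(3) = 6$ that the worked example for this function derives. The helper name `tangent_line` and the chosen offsets are illustrative assumptions.

```python
def tangent_line(f, slope_at_a, a):
    """Return the function x -> f(a) + f'(a) * (x - a)."""
    def line(x):
        return f(a) + slope_at_a * (x - a)
    return line


def square(x):
    return x * x


# Tangent to y = x^2 at a = 3, where the slope is 6.
approx = tangent_line(square, slope_at_a=6.0, a=3.0)

# Gap between the curve and its tangent at a few offsets dx from a.
errors = {dx: abs(square(3.0 + dx) - approx(3.0 + dx)) for dx in (0.5, 0.1, 0.01)}
for dx, err in errors.items():
    print(f"dx = {dx:<5} |f - tangent| = {err:.6f}")
```

For this particular function the gap equals $(x - a)^2$ exactly, so shrinking the offset by a factor of 10 shrinks the error by a factor of 100.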

Worked Example

Let us compute the derivative of $f(x) = x^2$ from the definition:

\begin{aligned} f'(x) &= \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} \\[6pt] &= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} \\[6pt] &= \lim_{h \to 0} \frac{2xh + h^2}{h} \\[6pt] &= \lim_{h \to 0} (2x + h) \\[6pt] &= 2x \end{aligned}

At $x = 3$, the slope is $f'(3) = 6$. The function is increasing there, and because $f'(x) = 2x$ grows with $x$, the curve gets steeper as $x$ increases.
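The limit can also be watched numerically. The sketch below evaluates the difference quotient for $f(x) = x^2$ at $x = 3$ with shrinking $h$; the function name `difference_quotient` and the step sizes are assumptions chosen for illustration.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


# Shrink h and watch the secant slope at x = 3 approach f'(3) = 6.
step_sizes = [1.0, 0.1, 0.01, 0.001]
secant_slopes = [difference_quotient(square, 3.0, h) for h in step_sizes]

for h, slope in zip(step_sizes, secant_slopes):
    print(f"h = {h:<6} secant slope = {slope:.4f}")
```

Each secant slope matches $2x + h$ from the derivation above, up to floating-point rounding.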

Notation

Several notations for the derivative are used interchangeably:

Notation	Read as	Context
$f'(x)$	“f prime of x”	General calculus
$\frac{df}{dx}$	“df dx”	Leibniz notation (emphasizes the variable)
$\frac{d}{dx}f(x)$	“d dx of f”	Operator notation
$\dot{x}$	“x dot”	Physics (time derivatives)
$\nabla_\theta \mathcal{L}$	“grad L”	ML (gradient w.r.t. parameters)

Differentiation Rules

Computing derivatives from the limit definition every time would be tedious. The following rules let us differentiate almost any function built from elementary pieces.

The Power Rule

For any real number $n$, wherever both sides are defined:

\frac{d}{dx} x^n = n x^{n-1}

This handles polynomials, roots ($x^{1/2}$), and reciprocals ($x^{-1}$) in one stroke.
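Example: writing the square root and the reciprocal as powers (for $x > 0$) gives

\frac{d}{dx} \sqrt{x} = \frac{d}{dx} x^{1/2} = \frac{1}{2} x^{-1/2} = \frac{1}{2\sqrt{x}} \qquad \frac{d}{dx} \frac{1}{x} = \frac{d}{dx} x^{-1} = -x^{-2} = -\frac{1}{x^2}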

The Constant Multiple and Sum Rules

\frac{d}{dx}[c \cdot f(x)] = c \cdot f'(x) \qquad \frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)

Differentiation is linear — it distributes over addition and pulls out constants.

The Product Rule

\frac{d}{dx}[f(x) \cdot g(x)] = f'(x) \cdot g(x) + f(x) \cdot g'(x)

Example: $\frac{d}{dx}[x^2 \sin x] = 2x \sin x + x^2 \cos x$.
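One way to sanity-check a result like this is to compare it against a numerical derivative. The sketch below uses a central difference; the helper `central_difference`, the step size, and the test point $x = 1.3$ are assumptions made for illustration.

```python
import math


def central_difference(f, x, h=1e-5):
    """Numerical derivative estimate with error on the order of h**2."""
    return (f(x + h) - f(x - h)) / (2 * h)


def product(x):
    return x**2 * math.sin(x)


def product_rule_derivative(x):
    # f'(x) g(x) + f(x) g'(x) with f(x) = x^2 and g(x) = sin x
    return 2 * x * math.sin(x) + x**2 * math.cos(x)


x0 = 1.3
numeric = central_difference(product, x0)
analytic = product_rule_derivative(x0)
print(f"numeric = {numeric:.8f}, analytic = {analytic:.8f}")
```

Agreement to several decimal places does not prove the formula, but a mismatch is a reliable sign of an algebra slip.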

The Quotient Rule

\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x) g(x) - f(x) g'(x)}{[g(x)]^2}
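Example: with $f(x) = x$ and $g(x) = x^2 + 1$ (a function chosen here purely for illustration), the rule gives

\frac{d}{dx}\left[\frac{x}{x^2 + 1}\right] = \frac{1 \cdot (x^2 + 1) - x \cdot 2x}{(x^2 + 1)^2} = \frac{1 - x^2}{(x^2 + 1)^2}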

The Chain Rule (Preview)

For compositions $f(g(x))$ :

\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)
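Example: taking the outer function $f(u) = e^u$ and the inner function $g(x) = -x^2$ (an illustrative choice), we have $f'(u) = e^u$ and $g'(x) = -2x$, so

\frac{d}{dx} e^{-x^2} = e^{-x^2} \cdot (-2x) = -2x \, e^{-x^2}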

The chain rule is so important for ML that it gets its own article.

Derivatives of Common Functions

These derivatives appear constantly in ML derivations. Memorize them:

$f(x)$	$f'(x)$	ML relevance
$x^n$	$nx^{n-1}$	Polynomial features
$e^x$	$e^x$	Softmax, attention
$\ln x$	$1/x$	Log-likelihood, cross-entropy
$a^x$	$a^x \ln a$	Exponential decay
$\sin x$	$\cos x$	Positional encoding
$\cos x$	$-\sin x$	Positional encoding
$\sigma(x) = \frac{1}{1+e^{-x}}$	$\sigma(x)(1 - \sigma(x))$	Sigmoid activation
$\tanh(x)$	$1 - \tanh^2(x)$	Tanh activation

The Exponential and Logarithm

The exponential function $e^x$ is its own derivative: $\frac{d}{dx} e^x = e^x$. Up to a constant multiple, it is the only function with this property. This self-referential property is why $e^x$ appears everywhere in probability (normal distribution, Poisson, softmax) and differential equations.

The natural logarithm $\ln x$ has derivative $1/x$. Since log-likelihoods are the foundation of maximum likelihood estimation, the derivative of $\ln$ is arguably the most-used derivative in all of statistics.

The Sigmoid Function

The sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ maps any real number to $(0, 1)$, making it natural for probabilities. Its derivative has an elegant form:

\begin{aligned} \sigma'(x) &= \frac{e^{-x}}{(1 + e^{-x})^2} \\[6pt] &= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\[6pt] &= \sigma(x)(1 - \sigma(x)) \end{aligned}

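Key insight: The sigmoid derivative is maximal at $x = 0$, where $\sigma'(0) = 0.25$, and approaches zero as $|x|$ grows. Gradients therefore shrink for extreme inputs, and because backpropagation multiplies in one such factor (at most 0.25) per sigmoid layer, they can shrink rapidly with depth. This is the vanishing gradient problem that motivated ReLU and other modern activations.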
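The identity and the vanishing behaviour are easy to check numerically. This is a minimal sketch in plain Python; the naive `sigmoid` below is an illustrative assumption and would overflow for very negative inputs, which production code guards against.

```python
import math


def sigmoid(x):
    # Naive form; fine for moderate |x|, not numerically hardened.
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)


# Compare the closed form with a numerical estimate at several inputs.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    closed_form = sigmoid_derivative(x)
    numeric = central_difference(sigmoid, x)
    print(f"x = {x:>5}: closed form = {closed_form:.6f}, numeric = {numeric:.6f}")
```

The printed values peak at $x = 0$ and are already below $10^{-4}$ at $|x| = 10$.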

Higher-Order Derivatives

The second derivative $f''(x)$ is the derivative of the derivative — it measures how the rate of change itself changes.

$f''(x) > 0$: the function is concave up (curves upward like a bowl); if also $f'(x) = 0$, the point is a local minimum
$f''(x) < 0$: the function is concave down (curves downward); if also $f'(x) = 0$, the point is a local maximum
$f''(x) = 0$: the test is inconclusive (the point could be an inflection point, a minimum, or a maximum)

For a function of multiple variables, the second derivatives form the Hessian matrix, which captures curvature in every direction. We explore this in partial derivatives and gradients.
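To see why a zero second derivative settles nothing, the sketch below estimates $f''(0)$ for three functions that all have $f'(0) = 0$: $x^2$ (a minimum), $x^3$ (an inflection point), and $x^4$ (also a minimum). The helper `second_difference` and its step size are assumptions for illustration.

```python
def second_difference(f, x, h=1e-3):
    """Central estimate of f''(x) from three nearby function values."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


curvature_at_zero = {
    "x^2": second_difference(lambda x: x**2, 0.0),
    "x^3": second_difference(lambda x: x**3, 0.0),
    "x^4": second_difference(lambda x: x**4, 0.0),
}

for name, value in curvature_at_zero.items():
    print(f"f(x) = {name}: estimated f''(0) = {value:.6f}")
```

Only $x^2$ shows clearly positive curvature at the origin. Both $x^3$ and $x^4$ have second derivative zero there, yet one is an inflection point and the other a minimum.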

Differentiability and Smoothness

A function is differentiable at $a$ if $f'(a)$ exists — meaning the limit in the definition converges to a finite value.

Differentiable implies continuous: If $f'(a)$ exists, then $f$ is continuous at $a$. The converse is false: continuous functions can have sharp corners where the derivative does not exist.

Corners and Cusps

The absolute value $f(x) = |x|$ is continuous everywhere but not differentiable at $x = 0$. The left-hand derivative is $-1$, the right-hand derivative is $+1$, and they disagree.

This is exactly the situation with ReLU: $\text{ReLU}(x) = \max(0, x)$. It has a corner at $x = 0$ where the derivative is undefined. In practice, deep learning frameworks assign $\text{ReLU}'(0) = 0$ (or sometimes 1), a convention that works because the probability of any input being exactly zero is negligible.
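The disagreement between the two one-sided slopes can be computed directly. In the sketch below, `one_sided_slopes` and `relu_grad` are hypothetical helpers, and the `value_at_zero` parameter simply makes the convention at the corner explicit; it is not taken from any particular framework.

```python
def one_sided_slopes(f, x, h=1e-6):
    """Left and right difference quotients of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right


def relu(x):
    return max(0.0, x)


def relu_grad(x, value_at_zero=0.0):
    """Derivative of ReLU, with an explicit convention at the corner x = 0."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return value_at_zero


abs_slopes = one_sided_slopes(abs, 0.0)
relu_slopes = one_sided_slopes(relu, 0.0)
print("|x| at 0: ", abs_slopes)
print("ReLU at 0:", relu_slopes)
print("ReLU'(0) under the chosen convention:", relu_grad(0.0))
```

For $|x|$ the slopes are $-1$ and $+1$; for ReLU they are $0$ and $1$. Either value (or anything in between) is a defensible choice at the corner.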

Subgradients

For non-differentiable convex functions, the subgradient generalizes the derivative. A subgradient $g$ at $x$ satisfies:

f(y) \geq f(x) + g(y - x) \quad \forall \, y

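Geometrically, the line through $(x, f(x))$ with slope $g$ lies on or below the graph of $f$ everywhere, which makes it a supporting line (a supporting hyperplane in higher dimensions). Subgradients allow optimization of non-smooth functions like the L1 norm $\|w\|_1$, which is used in Lasso regularization.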
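For $f(x) = |x|$ at $x = 0$, the inequality reduces to $|y| \geq g \, y$, which holds exactly when $g \in [-1, 1]$. The sketch below tests a few candidate slopes on a finite grid; `is_subgradient`, the grid, and the candidates are illustrative assumptions, and passing a grid check is necessary but not a proof.

```python
def is_subgradient(f, x, g, test_points, tolerance=1e-12):
    """Check f(y) >= f(x) + g * (y - x) on a finite grid of y values."""
    return all(f(y) >= f(x) + g * (y - x) - tolerance for y in test_points)


# Grid of y values from -5.0 to 5.0 in steps of 0.1.
grid = [i / 10 for i in range(-50, 51)]

candidates = [-1.5, -1.0, -0.3, 0.0, 0.7, 1.0, 1.5]
valid = [g for g in candidates if is_subgradient(abs, 0.0, g, grid)]
print("slopes that pass at x = 0:", valid)
```

Every candidate inside $[-1, 1]$ passes and the two outside fail, matching the set of subgradients of $|x|$ at the origin.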

Implicit Differentiation

Sometimes a relationship between $x$ and $y$ is defined implicitly by an equation $F(x, y) = 0$ rather than explicitly as $y = f(x)$.

Implicit differentiation differentiates both sides with respect to $x$, treating $y$ as a function of $x$ and applying the chain rule.

Example: For the circle $x^2 + y^2 = 1$, differentiate both sides:

$2x + 2y \frac{dy}{dx} = 0 \implies \frac{dy}{dx} = -\frac{x}{y}$ (valid wherever $y \neq 0$)
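On the upper half of the circle, $y$ can be written explicitly as $\sqrt{1 - x^2}$, which gives a way to check the implicit result. The sketch below compares $-x/y$ with a numerical derivative of the explicit form at $x = 0.6$; the test point and helper names are assumptions for illustration.

```python
import math


def upper_semicircle(x):
    return math.sqrt(1.0 - x * x)


def central_difference(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 0.6
y0 = upper_semicircle(x0)

implicit_slope = -x0 / y0  # from dy/dx = -x / y
explicit_slope = central_difference(upper_semicircle, x0)

print(f"point on circle: ({x0}, {y0:.4f})")
print(f"implicit: {implicit_slope:.6f}, explicit numeric: {explicit_slope:.6f}")
```

At $(0.6, 0.8)$ both approaches give a slope close to $-0.75$.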

This technique becomes essential in constrained optimization, where we optimize along implicitly defined constraint surfaces.

Worked Example: Finding and Classifying Critical Points

A critical point is where $f'(x) = 0$ or $f'(x)$ is undefined. These are candidates for local minima and maxima.

Let $f(x) = x^3 - 3x + 2$ .

Step 1 — Find critical points:

f'(x) = 3x^2 - 3 = 3(x^2 - 1) = 3(x - 1)(x + 1)

Setting $f'(x) = 0$ gives $x = -1$ and $x = 1$.

Step 2 — Classify with the second derivative test:

f''(x) = 6x

\begin{aligned} f''(-1) &= -6 < 0 \implies \text{local maximum at } x = -1 \\[6pt] f''(1) &= 6 > 0 \implies \text{local minimum at } x = 1 \end{aligned}

Step 3 — Evaluate:

f(-1) = -1 + 3 + 2 = 4 \qquad f(1) = 1 - 3 + 2 = 0

The local maximum is $(-1, 4)$ and the local minimum is $(1, 0)$.
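The three steps translate directly into code. This is a small sketch with the derivatives entered by hand from the formulas above; the function names and the `classify` helper are assumptions for illustration.

```python
def f(x):
    return x**3 - 3 * x + 2


def f_prime(x):
    return 3 * x**2 - 3


def f_double_prime(x):
    return 6 * x


def classify(x):
    """Second derivative test at a point where f'(x) = 0."""
    curvature = f_double_prime(x)
    if curvature > 0:
        return "local minimum"
    if curvature < 0:
        return "local maximum"
    return "inconclusive"


critical_points = [-1.0, 1.0]  # roots of f'(x) = 3(x - 1)(x + 1)
report = {x: (f(x), classify(x)) for x in critical_points}

for x, (value, kind) in report.items():
    print(f"x = {x:+.0f}: f(x) = {value:.0f}, f'(x) = {f_prime(x):.0f}, {kind}")
```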

Key insight: This find-critical-points-then-classify procedure is the core idea behind much of optimization. In ML, the “function” is the loss, the “variable” is the parameter vector, and gradient descent searches for critical points iteratively instead of solving $f'(x) = 0$ by hand.

Why This Matters for ML

Derivatives are the computational engine of machine learning:

Gradient = derivative of the loss with respect to model parameters. Training a neural network means computing $\frac{\partial \mathcal{L}}{\partial \theta_i}$ for every parameter $\theta_i$.
Gradient descent moves parameters in the direction of $-\nabla \mathcal{L}$, the direction that decreases the loss fastest. This is formalized in our gradient descent article.
Critical points of the loss function, where $\nabla \mathcal{L} = 0$, are where gradient descent can come to rest, so a trained model ideally sits near one. The second derivative (Hessian) determines whether such a point is a minimum, a maximum, or a saddle point.
Non-differentiable activations like ReLU work in practice thanks to subgradients and the measure-zero argument. Understanding differentiability helps you reason about when gradient-based training is valid.
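To connect these points to the earlier worked example, here is a one-dimensional gradient descent sketch that treats $f(x) = x^3 - 3x + 2$ as the "loss". The starting point, learning rate, and step count are assumptions chosen so that the iteration converges; they are not recommendations.

```python
def loss(x):
    return x**3 - 3 * x + 2


def loss_gradient(x):
    return 3 * x**2 - 3


def gradient_descent(gradient, start, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient and return the visited points."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
        path.append(x)
    return path


path = gradient_descent(loss_gradient, start=2.0)
x_final = path[-1]
print(f"x after {len(path) - 1} steps: {x_final:.6f}")
print(f"loss = {loss(x_final):.6f}, gradient = {loss_gradient(x_final):.2e}")
```

From $x = 2$ the iterates settle at the local minimum $x = 1$ found analytically above. A start to the left of $x = -1$ would instead run off toward $-\infty$, since this cubic is unbounded below, which is a reminder that gradient descent only finds what the starting point and step size allow.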

Next, we extend derivatives to functions of many variables with partial derivatives and gradients.

Summary

The derivative $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$ measures the instantaneous rate of change
Differentiation rules (power, product, quotient, chain) let us differentiate complex expressions efficiently
$e^x$ and $\ln x$ are among the most important functions in ML; know their derivatives cold
The sigmoid derivative $\sigma(x)(1-\sigma(x))$ explains the vanishing gradient problem
The second derivative measures curvature: at a critical point, positive means concave up (local minimum) and negative means concave down (local maximum)
Differentiability implies continuity, but not vice versa — ReLU is continuous but has a corner
Subgradients generalize derivatives to non-smooth functions like the L1 norm
Next: partial derivatives and gradients extend these ideas to multiple dimensions
