Derivatives and Differentiation: Measuring Rates of Change

Master the derivative from first principles — definition, rules, common functions, and how differentiation drives machine learning optimization.

Calculus & Optimization March 7, 2026 8 min read

The Derivative as a Limit

The derivative of a function $f$ at a point $x$ measures the instantaneous rate of change: how fast $f(x)$ changes as $x$ changes by an infinitesimal amount. Building on the limits we just covered, the derivative is defined as:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

The fraction $\frac{f(x+h) - f(x)}{h}$ is the slope of a secant line connecting two points on the curve. As $h \to 0$, the secant line becomes the tangent line, and its slope becomes the derivative.
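The limit can be watched numerically. The following is a minimal sketch, assuming $f = \sin$ and the point $x = 1$ as illustrative choices (neither comes from the article); the helper name `difference_quotient` is likewise hypothetical. It prints the secant slope for shrinking $h$ next to its distance from the exact derivative $\cos(1)$.

```python
import math


def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


# Illustrative choice: f = sin at x = 1, whose exact derivative is cos(1).
x0 = 1.0
for h in (1.0, 0.1, 0.01, 0.001):
    slope = difference_quotient(math.sin, x0, h)
    error = abs(slope - math.cos(x0))
    print(f"h = {h:<6} secant slope = {slope:.6f}  error = {error:.6f}")
```

The error shrinks roughly in proportion to $h$. In floating-point arithmetic $h$ cannot be shrunk indefinitely, because the subtraction in the numerator eventually loses precision.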

Geometric Interpretation

The derivative $f'(a)$ is the slope of the tangent line to $y = f(x)$ at the point $(a, f(a))$. The tangent line equation is:

$$y = f(a) + f'(a)(x - a)$$

This is also the best linear approximation of $f$ near $a$, a fact that becomes central in Taylor series and gradient descent.
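To see how good the linear approximation is, here is a small sketch. It assumes $f = e^x$ and $a = 0$ as an illustrative example (so the tangent line is $y = 1 + x$), and the helper `tangent_line` is a hypothetical name introduced only for this demonstration.

```python
import math


def tangent_line(f, df, a):
    """Return the linear function x -> f(a) + f'(a) * (x - a)."""
    fa, slope = f(a), df(a)
    return lambda x: fa + slope * (x - a)


# Illustrative choice: f = exp at a = 0. Since exp is its own derivative,
# the tangent line is y = 1 + x.
approx = tangent_line(math.exp, math.exp, 0.0)

for x in (0.0, 0.01, 0.1, 1.0):
    error = abs(math.exp(x) - approx(x))
    print(f"x = {x:<5} exp(x) = {math.exp(x):.6f}  tangent = {approx(x):.6f}  error = {error:.6f}")
```

The approximation is exact at $a$ and degrades as $x$ moves away from it, which is why methods built on it take small steps.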

Worked Example

Let us compute the derivative of $f(x) = x^2$ from the definition:

$$\begin{aligned} f'(x) &= \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} \\[6pt] &= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} \\[6pt] &= \lim_{h \to 0} \frac{2xh + h^2}{h} \\[6pt] &= \lim_{h \to 0} (2x + h) \\[6pt] &= 2x \end{aligned}$$

At $x = 3$, the slope is $f'(3) = 6$. The function is increasing at that point, and because $f'(x) = 2x$ grows with $x$, the rate of increase itself increases.

Notation

Several notations for the derivative are used interchangeably:

| Notation | Read as | Context |
|---|---|---|
| $f'(x)$ | "f prime of x" | General calculus |
| $\frac{df}{dx}$ | "df dx" | Leibniz notation (emphasizes the variable) |
| $\frac{d}{dx}f(x)$ | "d dx of f" | Operator notation |
| $\dot{x}$ | "x dot" | Physics (time derivatives) |
| $\nabla_\theta \mathcal{L}$ | "grad L" | ML (gradient w.r.t. parameters) |

Differentiation Rules

Computing derivatives from the limit definition every time would be tedious. The following rules let us differentiate almost any function built from elementary pieces.

The Power Rule

For any real number $n$:

$$\frac{d}{dx} x^n = n x^{n-1}$$

This handles polynomials, roots ($x^{1/2}$), and reciprocals ($x^{-1}$) in one stroke.
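As an illustration, taking $n = 1/2$ and $n = -1$ in the power rule recovers the familiar derivatives of the square root (for $x > 0$) and the reciprocal (for $x \neq 0$):

$$\begin{aligned} \frac{d}{dx}\sqrt{x} &= \frac{d}{dx} x^{1/2} = \tfrac{1}{2} x^{-1/2} = \frac{1}{2\sqrt{x}} \\[6pt] \frac{d}{dx}\frac{1}{x} &= \frac{d}{dx} x^{-1} = -x^{-2} = -\frac{1}{x^2} \end{aligned}$$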

The Constant Multiple and Sum Rules

$$\frac{d}{dx}[c \cdot f(x)] = c \cdot f'(x) \qquad \frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)$$

Differentiation is linear — it distributes over addition and pulls out constants.

The Product Rule

$$\frac{d}{dx}[f(x) \cdot g(x)] = f'(x) \cdot g(x) + f(x) \cdot g'(x)$$

Example: $\frac{d}{dx}[x^2 \sin x] = 2x \sin x + x^2 \cos x$.

The Quotient Rule

$$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x) g(x) - f(x) g'(x)}{[g(x)]^2}$$
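A standard illustration, assuming the derivatives $(\sin x)' = \cos x$ and $(\cos x)' = -\sin x$, is the tangent function written as a quotient (valid wherever $\cos x \neq 0$):

$$\begin{aligned} \frac{d}{dx}\tan x &= \frac{d}{dx}\left[\frac{\sin x}{\cos x}\right] \\[6pt] &= \frac{\cos x \cdot \cos x - \sin x \cdot (-\sin x)}{\cos^2 x} \\[6pt] &= \frac{\cos^2 x + \sin^2 x}{\cos^2 x} = \frac{1}{\cos^2 x} \end{aligned}$$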

The Chain Rule (Preview)

For compositions $f(g(x))$:

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$

The chain rule is so important for ML that it gets its own article.
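For a quick taste, take the outer function $f(u) = \sin u$ and the inner function $g(x) = x^2$ (both chosen here purely for illustration). Then $f'(u) = \cos u$ and $g'(x) = 2x$, so:

$$\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x = 2x\cos(x^2)$$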

Derivatives of Common Functions

These derivatives appear constantly in ML derivations. Memorize them:

| $f(x)$ | $f'(x)$ | ML relevance |
|---|---|---|
| $x^n$ | $nx^{n-1}$ | Polynomial features |
| $e^x$ | $e^x$ | Softmax, attention |
| $\ln x$ | $1/x$ | Log-likelihood, cross-entropy |
| $a^x$ | $a^x \ln a$ | Exponential decay |
| $\sin x$ | $\cos x$ | Positional encoding |
| $\cos x$ | $-\sin x$ | Positional encoding |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1 - \sigma(x))$ | Sigmoid activation |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | Tanh activation |
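One way to build trust in a derivative table is to compare each entry against a numerical estimate. The sketch below does that with a central difference; the helper names, the test point $x = 0.7$, the exponent $n = 3$, and the base $a = 2$ are all illustrative assumptions rather than anything fixed by the article.

```python
import math


def central_difference(f, x, h=1e-5):
    """Symmetric difference quotient, accurate to O(h^2) for smooth f."""
    return (f(x + h) - f(x - h)) / (2 * h)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# (name, function, claimed derivative) for the rows of the table.
# The exponent n = 3 and base a = 2 are arbitrary illustrative values.
ROWS = [
    ("x^3", lambda x: x**3, lambda x: 3 * x**2),
    ("e^x", math.exp, math.exp),
    ("ln x", math.log, lambda x: 1 / x),
    ("2^x", lambda x: 2.0**x, lambda x: 2.0**x * math.log(2.0)),
    ("sin x", math.sin, math.cos),
    ("cos x", math.cos, lambda x: -math.sin(x)),
    ("sigmoid", sigmoid, lambda x: sigmoid(x) * (1 - sigmoid(x))),
    ("tanh", math.tanh, lambda x: 1 - math.tanh(x) ** 2),
]


def max_table_error(x=0.7):
    """Largest gap between numerical and claimed derivatives at the point x."""
    return max(abs(central_difference(f, x) - df(x)) for _, f, df in ROWS)


print(f"largest discrepancy at x = 0.7: {max_table_error():.2e}")
```

The same comparison, usually called a gradient check, is a common way to debug hand-written derivative code.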

The Exponential and Logarithm

Up to a constant multiple, the exponential function $e^x$ is the only function that is its own derivative: $\frac{d}{dx} e^x = e^x$. This self-referential property is why $e^x$ appears everywhere in probability (normal distribution, Poisson, softmax) and differential equations.

The natural logarithm $\ln x$ has derivative $1/x$ for $x > 0$. Since log-likelihoods are the foundation of maximum likelihood estimation, the derivative of $\ln$ is arguably the most-used derivative in all of statistics.

The Sigmoid Function

The sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ maps any real number to $(0, 1)$, making it natural for probabilities. Its derivative has an elegant form:

$$\begin{aligned} \sigma'(x) &= \frac{e^{-x}}{(1 + e^{-x})^2} \\[6pt] &= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\[6pt] &= \sigma(x)(1 - \sigma(x)) \end{aligned}$$

The last step uses $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$.

Key insight: The sigmoid derivative is maximal at $x = 0$, where $\sigma'(0) = 0.25$, and approaches zero for large $|x|$. This means gradients vanish for extreme inputs, which is the vanishing gradient problem that motivated ReLU and other modern activations.
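The numbers behind this insight are easy to reproduce. The sketch below is a plain-Python illustration with hypothetical helper names. It uses the naive formula, which is fine for the inputs shown but would overflow in `math.exp` for very negative `x` (roughly below -700).

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^{-x}); naive form, see the note above."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative written in terms of the output: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigma'(x) = {sigmoid_grad(x):.6f}")

# Rough illustration of vanishing gradients: the chain rule multiplies one
# such factor per sigmoid layer, and each factor is at most 0.25. Weights
# are ignored here, so this bounds only the activation-derivative factors.
print(f"bound on the product over 10 sigmoid layers: {0.25 ** 10:.2e}")
```

Writing the derivative in terms of the already-computed output `s` mirrors the closed form above and avoids evaluating the exponential a second time.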

Higher-Order Derivatives

The second derivative $f''(x)$ is the derivative of the derivative: it measures how the rate of change itself changes. Its sign describes the curvature, and at a critical point (where $f'(x) = 0$) it classifies the point:

  • $f''(x) > 0$: the function is concave up (curves upward like a bowl), so a critical point there is a local minimum
  • $f''(x) < 0$: the function is concave down (curves downward), so a critical point there is a local maximum
  • $f''(x) = 0$: inconclusive (could be an inflection point)
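The three cases can be illustrated numerically. This sketch estimates $f''(0)$ with a central second difference for four functions that all have a critical point at $x = 0$; the functions and the helper name `second_difference` are illustrative assumptions. Note that $x^4$ and $x^3$ both give a value near zero even though one has a minimum and the other an inflection point, which is what "inconclusive" means.

```python
def second_difference(f, x, h=1e-3):
    """Central estimate of f''(x): (f(x + h) - 2 f(x) + f(x - h)) / h^2."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


# Four illustrative functions, each with a critical point at x = 0.
cases = {
    "x^2  (minimum)": lambda x: x**2,
    "-x^2 (maximum)": lambda x: -(x**2),
    "x^4  (minimum, f'' = 0)": lambda x: x**4,
    "x^3  (inflection, f'' = 0)": lambda x: x**3,
}

for name, func in cases.items():
    print(f"{name:<28} f''(0) ~ {second_difference(func, 0.0):.6f}")
```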

For a function of multiple variables, the second derivatives form the Hessian matrix, which captures curvature in every direction. We explore this in partial derivatives and gradients.

Differentiability and Smoothness

A function is differentiable at $a$ if $f'(a)$ exists, meaning the limit in the definition converges to a finite value.

Differentiable implies continuous: If $f'(a)$ exists, then $f$ is continuous at $a$. The converse is false: continuous functions can have sharp corners where the derivative does not exist.

Corners and Cusps

The absolute value $f(x) = |x|$ is continuous everywhere but not differentiable at $x = 0$. The left-hand derivative is $-1$, the right-hand derivative is $+1$, and they disagree.

This is exactly the situation with ReLU: $\text{ReLU}(x) = \max(0, x)$. It has a corner at $x = 0$ where the derivative is undefined. In practice, deep learning frameworks assign $\text{ReLU}'(0) = 0$ (or sometimes 1), a convention that works because the probability of any input being exactly zero is negligible.
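The disagreement between the two sides shows up directly in one-sided difference quotients. The sketch below is plain Python with hypothetical helper names; `relu_grad` encodes just one of the conventions mentioned above (derivative 0 at the corner) and is not taken from any particular framework.

```python
def one_sided_slopes(f, x, h=1e-6):
    """Return (left, right) difference quotients of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """One possible convention: treat the derivative at exactly 0 as 0."""
    return 1.0 if x > 0 else 0.0


print("abs  at 0:", one_sided_slopes(abs, 0.0))   # the two sides disagree
print("relu at 0:", one_sided_slopes(relu, 0.0))  # the two sides disagree
print("relu at 1:", one_sided_slopes(relu, 1.0))  # the two sides agree
```

Away from the corner both slopes agree, so the convention chosen at $x = 0$ only matters for inputs that land exactly on it.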

Subgradients

For non-differentiable convex functions, the subgradient generalizes the derivative. A subgradient $g$ at $x$ satisfies:

$$f(y) \geq f(x) + g(y - x) \quad \forall \, y$$

Geometrically, the right-hand side is a line through $(x, f(x))$ that never rises above the graph of $f$: a supporting hyperplane. Subgradients allow optimization of non-smooth functions like the L1 norm $\|w\|_1$, which is used in Lasso regularization.
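The defining inequality can be tested directly on a grid of points. This is only a sketch: `is_subgradient` is a hypothetical helper, and passing on finitely many test points is evidence rather than proof, since the definition quantifies over every $y$. For $f(x) = |x|$ at the corner $x = 0$, every slope in $[-1, 1]$ passes and anything steeper fails.

```python
def is_subgradient(f, x, g, test_points, tol=1e-12):
    """Check f(y) >= f(x) + g * (y - x) on a finite set of test points."""
    return all(f(y) >= f(x) + g * (y - x) - tol for y in test_points)


ys = [k / 10.0 for k in range(-30, 31)]  # grid on [-3, 3]

# At the corner x = 0 of |x|, every slope in [-1, 1] is a subgradient.
for g in (-1.0, -0.5, 0.0, 0.5, 1.0, 1.5):
    ok = is_subgradient(abs, 0.0, g, ys)
    print(f"g = {g:+.1f}  subgradient of |x| at 0? {ok}")
```

At a point where the function is differentiable, such as $x = 2$, the only slope that passes is the ordinary derivative.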

Implicit Differentiation

Sometimes a relationship between $x$ and $y$ is defined implicitly by an equation $F(x, y) = 0$ rather than explicitly as $y = f(x)$.

Implicit differentiation differentiates both sides with respect to $x$, treating $y$ as a function of $x$ and applying the chain rule.

Example: For the circle $x^2 + y^2 = 1$, differentiate both sides:

$$2x + 2y \frac{dy}{dx} = 0 \implies \frac{dy}{dx} = -\frac{x}{y} \quad (y \neq 0)$$

This technique becomes essential in constrained optimization, where we optimize along implicitly defined constraint surfaces.
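The circle result can be checked against an explicit branch. The sketch below assumes the upper semicircle $y = \sqrt{1 - x^2}$ and the point $x = 0.6$ (so $y = 0.8$) as illustrative choices, and compares the implicit formula $-x/y$ with a central-difference estimate of the slope.

```python
import math


def upper_circle(x):
    """Explicit upper branch of x^2 + y^2 = 1, valid for -1 < x < 1."""
    return math.sqrt(1.0 - x * x)


def implicit_slope(x, y):
    """dy/dx = -x / y from implicit differentiation (requires y != 0)."""
    return -x / y


x0 = 0.6
y0 = upper_circle(x0)  # approximately 0.8
h = 1e-6
numeric = (upper_circle(x0 + h) - upper_circle(x0 - h)) / (2 * h)
print(f"implicit: {implicit_slope(x0, y0):.6f}  numeric: {numeric:.6f}")
```

The implicit formula also covers the lower branch without any extra work, which is the practical advantage over solving for $y$ first.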

Worked Example: Finding and Classifying Critical Points

A critical point is where $f'(x) = 0$ or $f'(x)$ is undefined. These are candidates for local minima and maxima.

Let $f(x) = x^3 - 3x + 2$.

Step 1 — Find critical points:

$$f'(x) = 3x^2 - 3 = 3(x^2 - 1) = 3(x - 1)(x + 1)$$

Setting $f'(x) = 0$ gives $x = -1$ and $x = 1$.

Step 2 — Classify with the second derivative test:

$$f''(x) = 6x$$

$$\begin{aligned} f''(-1) &= -6 < 0 \implies \text{local maximum at } x = -1 \\[6pt] f''(1) &= 6 > 0 \implies \text{local minimum at } x = 1 \end{aligned}$$

Step 3 — Evaluate:

$$f(-1) = -1 + 3 + 2 = 4 \qquad f(1) = 1 - 3 + 2 = 0$$

The local maximum is $(-1, 4)$ and the local minimum is $(1, 0)$.
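The three steps translate almost line for line into code. This is a minimal sketch with the derivatives of $f(x) = x^3 - 3x + 2$ written out by hand; the function names are hypothetical and the critical points are supplied rather than solved for.

```python
def f(x):
    return x**3 - 3 * x + 2


def df(x):
    return 3 * x**2 - 3


def d2f(x):
    return 6 * x


def classify(x):
    """Second derivative test at a critical point x (where df(x) == 0)."""
    curvature = d2f(x)
    if curvature > 0:
        return "local minimum"
    if curvature < 0:
        return "local maximum"
    return "inconclusive"


for x in (-1, 1):
    assert df(x) == 0  # confirm x really is a critical point
    print(f"x = {x:+d}: f(x) = {f(x)}, f''(x) = {d2f(x)}, {classify(x)}")
```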

Key insight: This find-critical-points-then-classify procedure is the core idea behind much of continuous optimization. In ML, the “function” is the loss, the “variable” is the parameter vector, and gradient descent automates the search for critical points.

Why This Matters for ML

Derivatives are the computational engine of machine learning:

  • Gradient = derivative of the loss with respect to model parameters. Training a neural network means computing $\frac{\partial \mathcal{L}}{\partial \theta_i}$ for every parameter $\theta_i$.

  • Gradient descent moves parameters in the direction of $-\nabla \mathcal{L}$, the direction that locally decreases the loss fastest. This is formalized in our gradient descent article.

  • Critical points of the loss function, where $\nabla \mathcal{L} = 0$, are where gradient descent stops moving, so trained models sit at or near them. The second derivative (Hessian) determines whether these are minima, maxima, or saddle points.

  • Non-differentiable activations like ReLU work in practice thanks to subgradients and the measure-zero argument. Understanding differentiability helps you reason about when gradient-based training is valid.
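To make the gradient descent bullet concrete, here is a one-dimensional sketch. It assumes a toy "loss" $x^3 - 3x + 2$, a starting point of $2.0$, a learning rate of $0.1$, and 100 steps, all chosen for illustration only; real training uses a vector of parameters and gradients computed by automatic differentiation.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the derivative: x <- x - learning_rate * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def loss(x):
    return x**3 - 3 * x + 2


def loss_grad(x):
    return 3 * x**2 - 3


x_final = gradient_descent(loss_grad, x0=2.0)
print(f"x = {x_final:.6f}, loss = {loss(x_final):.6f}, gradient = {loss_grad(x_final):.2e}")
```

From this starting point the iterates settle at the critical point $x = 1$, where the derivative vanishes. The outcome depends on the start: begun to the left of $x = -1$, the same loop runs away toward negative infinity, because this cubic is unbounded below.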

Next, we extend derivatives to functions of many variables with partial derivatives and gradients.

Summary

  • The derivative $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$ measures the instantaneous rate of change
  • Differentiation rules (power, product, quotient, chain) let us differentiate complex expressions efficiently
  • $e^x$ and $\ln x$ are among the most important functions in ML; know their derivatives cold
  • The sigmoid derivative $\sigma(x)(1-\sigma(x))$ explains the vanishing gradient problem
  • The second derivative measures curvature: at a critical point, positive means concave up (minimum), negative means concave down (maximum)
  • Differentiability implies continuity, but not vice versa — ReLU is continuous but has a corner
  • Subgradients generalize derivatives to non-smooth functions like the L1 norm
  • Next: partial derivatives and gradients extend these ideas to multiple dimensions
