Partial Derivatives and Gradients: Calculus in Multiple Dimensions

Learn partial derivatives, the gradient vector, directional derivatives, the Jacobian, and the Hessian — the multivariable toolkit for ML optimization.

Calculus & Optimization, March 7, 2026

From One Variable to Many

Real ML models do not have a single parameter; they have thousands or millions. A neural network’s loss is a function $\mathcal{L}(\theta_1, \theta_2, \ldots, \theta_n)$ of all its weights and biases. To optimize this loss, we need calculus that works in $\mathbb{R}^n$.

This article extends the derivative to functions of multiple variables, building toward the gradient — the single most important object in machine learning optimization.

Partial Derivatives

A partial derivative measures how a function changes when we vary one input while holding all others fixed.

For a function $f(x, y)$, the partial derivative with respect to $x$ is:

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, \, y) - f(x, \, y)}{h}$$

We simply differentiate with respect to $x$, treating $y$ as a constant.

Worked Example

Let $f(x, y) = x^2 y + 3xy^2 - 5y$.

$$\begin{aligned} \frac{\partial f}{\partial x} &= 2xy + 3y^2 \\[6pt] \frac{\partial f}{\partial y} &= x^2 + 6xy - 5 \end{aligned}$$

At the point $(1, 2)$:

$$\frac{\partial f}{\partial x}\bigg|_{(1,2)} = 2(1)(2) + 3(4) = 16 \qquad \frac{\partial f}{\partial y}\bigg|_{(1,2)} = 1 + 12 - 5 = 8$$

Geometric interpretation: $\frac{\partial f}{\partial x} = 16$ means that if we stand at $(1, 2)$ and walk in the $x$-direction, the function rises at a rate of 16 units per unit step. $\frac{\partial f}{\partial y} = 8$ means walking in the $y$-direction raises the function at rate 8.
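The two values above can be sanity-checked numerically. The following is a minimal sketch, not part of the original derivation: it uses a central difference (a symmetric variant of the limit definition) with a hypothetical step size `h`, and the helper names `partial_x` and `partial_y` are made up for illustration.

```python
def f(x, y):
    """The worked-example function f(x, y) = x^2 y + 3 x y^2 - 5 y."""
    return x**2 * y + 3 * x * y**2 - 5 * y


def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of df/dx: vary x, hold y fixed."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of df/dy: vary y, hold x fixed."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


fx = partial_x(f, 1.0, 2.0)
fy = partial_y(f, 1.0, 2.0)
print(f"df/dx at (1, 2) is approximately {fx:.6f}")  # analytic value: 16
print(f"df/dy at (1, 2) is approximately {fy:.6f}")  # analytic value: 8
```

Each helper changes exactly one argument and leaves the other untouched, which is the "hold all others fixed" part of the definition expressed in code.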

Higher-Order Partial Derivatives

We can differentiate partial derivatives again to get second-order partials:

$$\frac{\partial^2 f}{\partial x^2}, \quad \frac{\partial^2 f}{\partial y^2}, \quad \frac{\partial^2 f}{\partial x \, \partial y}, \quad \frac{\partial^2 f}{\partial y \, \partial x}$$

Clairaut’s theorem (symmetry of mixed partials): If the mixed partials are continuous, they are equal:

$$\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial^2 f}{\partial y \, \partial x}$$

This symmetry is why the Hessian matrix (defined below) is symmetric — a property that has deep consequences for optimization.
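As a quick check of the symmetry, here is a short sketch that reuses the function $f(x, y) = x^2 y + 3xy^2 - 5y$ from the worked example above and differentiates each first partial once more with respect to the other variable:

```latex
\begin{aligned}
\frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right)
  &= \frac{\partial}{\partial y}\left(2xy + 3y^2\right) = 2x + 6y \\[6pt]
\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right)
  &= \frac{\partial}{\partial x}\left(x^2 + 6xy - 5\right) = 2x + 6y
\end{aligned}
```

Both orders give $2x + 6y$, as the theorem predicts: $f$ is a polynomial, so its partial derivatives of every order are continuous.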

The Gradient Vector

The gradient collects all partial derivatives into a single vector:

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient is the multivariable generalization of the derivative. For $f: \mathbb{R}^n \to \mathbb{R}$, $\nabla f$ is a vector in $\mathbb{R}^n$.

The Gradient Points Uphill

The gradient $\nabla f(\mathbf{x})$ points in the direction of steepest ascent, the direction in which $f$ increases most rapidly. Its magnitude $\|\nabla f\|$ is the rate of that steepest increase.

Key insight: Gradient descent follows $-\nabla f$, the direction of steepest descent. This is why the gradient is the central object in ML optimization: it tells us which direction to move the parameters to reduce the loss fastest, locally, for a small enough step.

Worked Example

For $f(x, y) = x^2 + y^2$ (a paraboloid centered at the origin):

$$\nabla f = \begin{bmatrix} 2x \\ 2y \end{bmatrix}$$

At $(3, 4)$, $\nabla f = [6, 8]^T$, pointing directly away from the origin. The magnitude is $\|\nabla f\| = \sqrt{36 + 64} = 10$. Moving in the direction $-\nabla f = [-6, -8]^T$ decreases $f$ most rapidly, heading straight toward the minimum at the origin.

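At the minimum $(0, 0)$, $\nabla f = [0, 0]^T$. The gradient is zero at the optimum: this is the first-order necessary condition for a minimum of a differentiable function. It is not sufficient on its own, since maxima and saddle points also have zero gradient.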
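To see the paraboloid example in motion, here is a minimal gradient descent sketch starting from $(3, 4)$. The learning rate of 0.1 and the 50 steps are arbitrary illustrative choices, not values from the text.

```python
def f(x, y):
    """The paraboloid f(x, y) = x^2 + y^2."""
    return x**2 + y**2


def grad_f(x, y):
    """Analytic gradient [2x, 2y] of the paraboloid."""
    return 2 * x, 2 * y


def gradient_descent(x, y, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient; returns the final point."""
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


x_final, y_final = gradient_descent(3.0, 4.0)
print(f"start: f = {f(3.0, 4.0):.1f}")
print(f"end:   f = {f(x_final, y_final):.3e} at ({x_final:.2e}, {y_final:.2e})")
```

With this learning rate each step multiplies both coordinates by $1 - 0.1 \cdot 2 = 0.8$, so the iterates slide along the straight line from $(3, 4)$ toward the origin, and the gradient shrinks to zero as they approach it.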

Directional Derivatives

The partial derivatives measure the rate of change along the coordinate axes. But what about an arbitrary direction?

The directional derivative of $f$ at $\mathbf{x}$ in the direction of a unit vector $\mathbf{u}$ is:

$$D_\mathbf{u} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u} = \|\nabla f\| \cos \theta$$

where $\theta$ is the angle between $\nabla f$ and $\mathbf{u}$.

Key observations:

  • Maximum when $\mathbf{u}$ aligns with $\nabla f$ ($\theta = 0$): the function increases fastest in the gradient direction
  • Zero when $\mathbf{u}$ is perpendicular to $\nabla f$ ($\theta = 90°$): moving along a level set (contour line)
  • Minimum when $\mathbf{u}$ opposes $\nabla f$ ($\theta = 180°$): the steepest descent direction

Geometric interpretation: The gradient is perpendicular to level sets. In 2D, contour lines of $f(x, y) = c$ are curves where $f$ is constant. The gradient at any point on a contour line points perpendicular to it, toward higher values. This is why gradient descent cuts across contour lines.
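The three cases listed above can be checked on the paraboloid $f(x, y) = x^2 + y^2$ at $(3, 4)$, where $\nabla f = [6, 8]^T$. This is an illustrative sketch; the helper `directional_derivative` is a made-up name, and it normalizes the direction so callers do not have to pass a unit vector.

```python
import math


def grad_f(x, y):
    """Analytic gradient of f(x, y) = x^2 + y^2."""
    return 2.0 * x, 2.0 * y


def directional_derivative(grad, direction):
    """Dot product of the gradient with the normalized direction vector."""
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    return grad[0] * ux + grad[1] * uy


g = grad_f(3.0, 4.0)  # (6, 8), with norm 10

along = directional_derivative(g, (6.0, 8.0))       # aligned with the gradient
perp = directional_derivative(g, (-8.0, 6.0))       # tangent to the contour circle
against = directional_derivative(g, (-6.0, -8.0))   # opposite to the gradient

print(f"along the gradient:   {along:.4f}")    # expected: +||grad f|| = 10
print(f"along the contour:    {perp:.4f}")     # expected: 0
print(f"against the gradient: {against:.4f}")  # expected: -||grad f|| = -10
```

The perpendicular direction $(-8, 6)$ is tangent to the circle $x^2 + y^2 = 25$ through $(3, 4)$, which is why the rate of change along it vanishes.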

The Jacobian Matrix

When the function is vector-valued, $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, its partial derivatives are organized into the Jacobian matrix:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

Each row is the gradient of one output component. The Jacobian is the best linear approximation of $\mathbf{f}$ near a point:

$$\mathbf{f}(\mathbf{x} + \mathbf{h}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J} \mathbf{h}$$

The Jacobian appears throughout the chain rule and backpropagation, where we chain together Jacobians of successive layers. For a deeper treatment of matrix-level derivatives, see matrix calculus.
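The linear-approximation property can be illustrated with a small numerical experiment. The map below, $\mathbf{f}(x_1, x_2) = (x_1^2 x_2, \; x_1 + x_2^2)$, is a hypothetical example chosen only for this sketch, and `numerical_jacobian` is an illustrative central-difference helper rather than a library function.

```python
def f(x):
    """Hypothetical map from R^2 to R^2, chosen only for illustration."""
    x1, x2 = x
    return [x1**2 * x2, x1 + x2**2]


def numerical_jacobian(func, x, h=1e-6):
    """Central-difference Jacobian: entry [i][j] approximates d f_i / d x_j."""
    m, n = len(func(x)), len(x)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac


x0 = [1.0, 2.0]
J = numerical_jacobian(f, x0)  # analytic Jacobian at (1, 2): [[4, 1], [1, 4]]

# Compare f(x0 + step) with the linear prediction f(x0) + J @ step.
step = [0.01, -0.02]
f0 = f(x0)
linear = [f0[i] + sum(J[i][j] * step[j] for j in range(2)) for i in range(2)]
exact = f([x0[0] + step[0], x0[1] + step[1]])

print("Jacobian:", J)
print("linear prediction:", linear)
print("exact value:      ", exact)
```

Row `i` of `J` is the gradient of output component `i`, and for a small step the linear prediction and the exact value should agree closely, with an error that shrinks quadratically as the step gets smaller.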

The Hessian Matrix

For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the $n \times n$ matrix of all second-order partial derivatives:

$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

By Clairaut’s theorem, $\mathbf{H}$ is symmetric (if second partials are continuous).

What the Hessian Tells Us

The Hessian captures the curvature of $f$ in every direction. At a critical point ($\nabla f = \mathbf{0}$):

| Hessian property | Conclusion |
| --- | --- |
| Positive definite (all eigenvalues $> 0$) | Local minimum |
| Negative definite (all eigenvalues $< 0$) | Local maximum |
| Indefinite (eigenvalues of mixed signs) | Saddle point |
| Singular (some eigenvalues $= 0$, the rest of one sign) | Inconclusive |
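The second-derivative test in the table can be sketched for the two-variable case, where the eigenvalues of a symmetric Hessian $\begin{bmatrix} a & b \\ b & d \end{bmatrix}$ have a closed form. The function names and the tolerance `tol` are assumptions made for this illustration; for larger matrices one would normally use a numerical eigenvalue routine instead.

```python
import math


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


def classify_critical_point(a, b, d, tol=1e-12):
    """Apply the eigenvalue test to a 2x2 Hessian at a critical point."""
    lo, hi = eigenvalues_2x2_symmetric(a, b, d)
    if lo > tol:
        return "local minimum"
    if hi < -tol:
        return "local maximum"
    if lo < -tol and hi > tol:
        return "saddle point"
    return "inconclusive"


# Hessians at the origin for four simple functions.
print(classify_critical_point(2.0, 0.0, 2.0))    # f = x^2 + y^2
print(classify_critical_point(-2.0, 0.0, -2.0))  # f = -x^2 - y^2
print(classify_critical_point(2.0, 0.0, -2.0))   # f = x^2 - y^2
print(classify_critical_point(2.0, 0.0, 0.0))    # f = x^2 + y^4
```

The last case shows why a zero eigenvalue is inconclusive: $x^2 + y^4$ does have a minimum at the origin, while $x^2 - y^4$ has the same Hessian there and does not, so the second-order information alone cannot decide.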

Key insight: In high-dimensional spaces (neural networks with millions of parameters), saddle points are expected to vastly outnumber local minima. At a saddle point, some Hessian eigenvalues are positive (the function curves up) and others are negative (it curves down). A heuristic argument: if the sign of each of the $n$ eigenvalues were an independent coin flip, the probability that all of them share the same sign would shrink exponentially with $n$.

The Condition Number

For a positive definite Hessian, the condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ (the ratio of the largest to the smallest eigenvalue) measures how elongated the loss surface is. A large condition number means the curvature varies dramatically across directions, making optimization with a single learning rate difficult: a step size small enough to be stable along the steepest direction makes slow progress along the flattest one. This motivates adaptive methods that use different effective learning rates for different parameters.
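A small experiment makes the effect of conditioning concrete. The sketch below assumes a hypothetical quadratic $f(x, y) = \tfrac{1}{2}(\lambda_{\max} x^2 + \lambda_{\min} y^2)$, whose Hessian is diagonal with eigenvalues $\lambda_{\max}$ and $\lambda_{\min}$, and runs plain gradient descent with the single learning rate $1/\lambda_{\max}$. The starting point, tolerance, and function name are illustrative choices.

```python
import math


def steps_to_converge(lam_min, lam_max, tol=1e-6, max_steps=100_000):
    """Count gradient descent steps until the gradient norm drops below tol.

    Minimizes f(x, y) = 0.5 * (lam_max * x**2 + lam_min * y**2) from (1, 1)
    using one shared learning rate, 1 / lam_max.
    """
    x, y = 1.0, 1.0
    learning_rate = 1.0 / lam_max
    for step in range(max_steps):
        gx, gy = lam_max * x, lam_min * y
        if math.hypot(gx, gy) < tol:
            return step
        x -= learning_rate * gx
        y -= learning_rate * gy
    return max_steps


for kappa in (1.0, 10.0, 100.0):
    steps = steps_to_converge(lam_min=1.0, lam_max=kappa)
    print(f"condition number {kappa:6.1f}: {steps} steps")
```

Along the high-curvature direction the chosen learning rate is ideal, but along the low-curvature direction each step only shrinks the error by a factor of $1 - 1/\kappa$, so the step count grows roughly in proportion to $\kappa$.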

Worked Example: Gradient of MSE Loss

Consider a simple linear model $\hat{y} = w_1 x + w_2$ with MSE loss:

$$\mathcal{L}(w_1, w_2) = \frac{1}{n}\sum_{i=1}^{n} (y_i - w_1 x_i - w_2)^2$$

The partial derivatives are:

$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial w_1} &= \frac{1}{n}\sum_{i=1}^{n} 2(y_i - w_1 x_i - w_2)(-x_i) \\[6pt] &= -\frac{2}{n}\sum_{i=1}^{n} x_i(y_i - w_1 x_i - w_2) \\[10pt] \frac{\partial \mathcal{L}}{\partial w_2} &= \frac{1}{n}\sum_{i=1}^{n} 2(y_i - w_1 x_i - w_2)(-1) \\[6pt] &= -\frac{2}{n}\sum_{i=1}^{n} (y_i - w_1 x_i - w_2) \end{aligned}$$

The gradient vector is:

$$\nabla \mathcal{L} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial w_1} \\ \frac{\partial \mathcal{L}}{\partial w_2} \end{bmatrix} = -\frac{2}{n} \begin{bmatrix} \sum_{i=1}^{n} x_i (y_i - w_1 x_i - w_2) \\ \sum_{i=1}^{n} (y_i - w_1 x_i - w_2) \end{bmatrix}$$

Setting $\nabla \mathcal{L} = \mathbf{0}$ gives the normal equations, a linear system in $w_1$ and $w_2$ whose solution is the closed-form answer for linear regression. When closed-form solutions are not available (as in neural networks), we follow $-\nabla \mathcal{L}$ iteratively using gradient descent.
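The iterative route can be sketched directly from the two partial derivatives above. The data points, learning rate, and step count below are made up for illustration; the data lie exactly on the line $y = 2x + 1$ so the expected answer is known in advance.

```python
def mse_gradient(w1, w2, xs, ys):
    """Gradient of the MSE loss for y_hat = w1 * x + w2."""
    n = len(xs)
    residuals = [y - w1 * x - w2 for x, y in zip(xs, ys)]
    grad_w1 = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
    grad_w2 = -2.0 / n * sum(residuals)
    return grad_w1, grad_w2


def fit(xs, ys, learning_rate=0.05, steps=5000):
    """Plain gradient descent on (w1, w2), starting from zero."""
    w1, w2 = 0.0, 0.0
    for _ in range(steps):
        g1, g2 = mse_gradient(w1, w2, xs, ys)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2


# Made-up data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w1, w2 = fit(xs, ys)
print(f"w1 = {w1:.6f}, w2 = {w2:.6f}")  # should approach slope 2 and intercept 1
print("gradient at the fit:", mse_gradient(w1, w2, xs, ys))
```

At the fitted parameters the gradient is numerically zero, which is the same condition the normal equations impose; gradient descent simply reaches it by repeated small steps instead of by solving a linear system.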

Why This Matters for ML

Partial derivatives and gradients are the computational backbone of deep learning:

  • Every optimizer (SGD, Adam, RMSProp) is driven by the gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$
  • Backpropagation computes gradients efficiently using the chain rule applied to computational graphs
  • The Hessian determines whether a critical point is a minimum, maximum, or saddle point — and its condition number predicts how hard optimization will be
  • The Jacobian connects layers in a neural network and appears in normalizing flows and other generative models
  • Directional derivatives explain why gradient descent takes the steepest path and why preconditioning (rescaling directions) can accelerate convergence

Summary

  • Partial derivatives measure how a function changes with respect to one variable, holding others fixed
  • The gradient $\nabla f$ collects all partial derivatives into a vector pointing in the direction of steepest ascent
  • Directional derivatives give the rate of change in any direction: $D_\mathbf{u} f = \nabla f \cdot \mathbf{u}$
  • The Jacobian generalizes the gradient to vector-valued functions — each row is a gradient
  • The Hessian captures second-order curvature; its eigenvalues classify critical points and determine optimization difficulty
  • The gradient of the MSE loss gives the update direction for linear regression
  • Next: the chain rule and computational graphs show how to compute gradients through compositions of functions

