Partial Derivatives and Gradients: Calculus in Multiple Dimensions

Calculus & Optimization Series 3 / 18

From One Variable to Many

Real ML models do not have a single parameter — they have thousands or millions. A neural network’s loss is a function $\mathcal{L}(\theta_1, \theta_2, \ldots, \theta_n)$ of all its weights and biases. To optimize this loss, we need calculus that works in $\mathbb{R}^n$ .

This article extends the derivative to functions of multiple variables, building toward the gradient — the single most important object in machine learning optimization.

Partial Derivatives

A partial derivative measures how a function changes when we vary one input while holding all others fixed.

For a function $f(x, y)$ , the partial derivative with respect to $x$ is:

\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, \, y) - f(x, \, y)}{h}

We simply differentiate with respect to $x$ , treating $y$ as a constant.

Worked Example

Let $f(x, y) = x^2 y + 3xy^2 - 5y$ .

\begin{aligned} \frac{\partial f}{\partial x} &= 2xy + 3y^2 \\[6pt] \frac{\partial f}{\partial y} &= x^2 + 6xy - 5 \end{aligned}

At the point $(1, 2)$ :

\frac{\partial f}{\partial x}\bigg|_{(1,2)} = 2(1)(2) + 3(4) = 16 \qquad \frac{\partial f}{\partial y}\bigg|_{(1,2)} = 1 + 12 - 5 = 8

Geometric interpretation: $\frac{\partial f}{\partial x} = 16$ means that at $(1, 2)$ the function rises at an instantaneous rate of 16 units per unit of $x$ when we move in the $x$-direction with $y$ held fixed. Likewise, $\frac{\partial f}{\partial y} = 8$ means the function rises at a rate of 8 units per unit of $y$ when we move in the $y$-direction with $x$ held fixed.
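The values 16 and 8 can be sanity-checked numerically. The sketch below is a minimal illustration, not part of the original derivation: it uses a central difference (a symmetric variant of the limit definition above) with a step `h` chosen for illustration, and the helper names `partial_x` and `partial_y` are hypothetical.

```python
def f(x, y):
    """The worked-example function f(x, y) = x^2 y + 3 x y^2 - 5 y."""
    return x**2 * y + 3 * x * y**2 - 5 * y


def partial_x(func, x, y, h=1e-5):
    """Central-difference estimate of df/dx: vary x, hold y fixed."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-5):
    """Central-difference estimate of df/dy: vary y, hold x fixed."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


fx = partial_x(f, 1.0, 2.0)
fy = partial_y(f, 1.0, 2.0)
print(fx, fy)  # should be close to the analytic values 16 and 8
```

Each helper perturbs exactly one argument, which is the numerical counterpart of "treating the other variable as a constant".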

Higher-Order Partial Derivatives

We can differentiate partial derivatives again to get second-order partials:

\frac{\partial^2 f}{\partial x^2}, \quad \frac{\partial^2 f}{\partial y^2}, \quad \frac{\partial^2 f}{\partial x \, \partial y}, \quad \frac{\partial^2 f}{\partial y \, \partial x}

Clairaut’s theorem (symmetry of mixed partials): If both mixed partials exist and are continuous in a neighborhood of a point, they are equal at that point:

\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial^2 f}{\partial y \, \partial x}

This symmetry is why the Hessian matrix (defined below) is symmetric — a property that has deep consequences for optimization.
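As a quick check of this symmetry on the earlier worked example, the sketch below takes the two analytic first partials of $f(x, y) = x^2 y + 3xy^2 - 5y$ and differentiates each one numerically with respect to the other variable. The step `h` and the variable names are illustrative assumptions; both results should approximate $2x + 6y$, which is 14 at $(1, 2)$.

```python
def f_x(x, y):
    """Analytic df/dx from the worked example: 2xy + 3y^2."""
    return 2 * x * y + 3 * y**2


def f_y(x, y):
    """Analytic df/dy from the worked example: x^2 + 6xy - 5."""
    return x**2 + 6 * x * y - 5


h = 1e-5
x0, y0 = 1.0, 2.0

# Differentiate f_x with respect to y, and f_y with respect to x.
d_fx_dy = (f_x(x0, y0 + h) - f_x(x0, y0 - h)) / (2 * h)
d_fy_dx = (f_y(x0 + h, y0) - f_y(x0 - h, y0)) / (2 * h)

print(d_fx_dy, d_fy_dx)  # both should be close to 2*x0 + 6*y0 = 14
```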

The Gradient Vector

The gradient collects all partial derivatives into a single vector:

\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

The gradient is the multivariable generalization of the derivative. For $f: \mathbb{R}^n \to \mathbb{R}$ , $\nabla f$ is a vector in $\mathbb{R}^n$ .

The Gradient Points Uphill

The gradient $\nabla f(\mathbf{x})$ points in the direction of steepest ascent — the direction in which $f$ increases most rapidly. Its magnitude $\|\nabla f\|$ is the rate of that steepest increase.

Key insight: Gradient descent follows $-\nabla f$, the direction of steepest descent. This is why the gradient is the central object in ML optimization: for a sufficiently small step, it tells us which direction to move the parameters to reduce the loss fastest.

Worked Example

For $f(x, y) = x^2 + y^2$ (a paraboloid centered at the origin):

\nabla f = \begin{bmatrix} 2x \\ 2y \end{bmatrix}

At $(3, 4)$ , $\nabla f = [6, 8]^T$ , pointing directly away from the origin. The magnitude is $\|\nabla f\| = \sqrt{36 + 64} = 10$ . Moving in the direction $-\nabla f = [-6, -8]^T$ decreases $f$ most rapidly — heading straight toward the minimum at the origin.

At the minimum $(0, 0)$, $\nabla f = [0, 0]^T$. A zero gradient is a necessary condition for a minimum of a differentiable function, but it is not sufficient: maxima and saddle points also have zero gradient.
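The numbers in this example are easy to reproduce. The following sketch evaluates the analytic gradient of $f(x, y) = x^2 + y^2$ at $(3, 4)$ and takes one step against it; the step size `step_size = 0.1` is an arbitrary illustrative choice, not a recommendation.

```python
import math


def f(x, y):
    return x**2 + y**2


def grad_f(x, y):
    """Analytic gradient of f(x, y) = x^2 + y^2."""
    return (2 * x, 2 * y)


x, y = 3.0, 4.0
gx, gy = grad_f(x, y)
grad_norm = math.hypot(gx, gy)

# One small step against the gradient; step_size is an illustrative choice.
step_size = 0.1
x_new, y_new = x - step_size * gx, y - step_size * gy

print((gx, gy), grad_norm)       # gradient (6.0, 8.0) with norm 10.0
print(f(x, y), f(x_new, y_new))  # f is smaller after the step
```

Evaluating `grad_f(0.0, 0.0)` returns `(0.0, 0.0)`, matching the zero gradient at the minimum.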

Directional Derivatives

The partial derivatives measure the rate of change along the coordinate axes. But what about an arbitrary direction?

The directional derivative of $f$ at $\mathbf{x}$ in the direction of a unit vector $\mathbf{u}$ is:

D_\mathbf{u} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u} = \|\nabla f\| \cos \theta

where $\theta$ is the angle between $\nabla f$ and $\mathbf{u}$ .

Key observations:

Maximum when $\mathbf{u}$ aligns with $\nabla f$ ( $\theta = 0$ ): the function increases fastest in the gradient direction
Zero when $\mathbf{u}$ is perpendicular to $\nabla f$ ( $\theta = 90°$ ): moving along a level set (contour line)
Minimum when $\mathbf{u}$ opposes $\nabla f$ ( $\theta = 180°$ ): the steepest descent direction

Geometric interpretation: The gradient is perpendicular to level sets. In 2D, contour lines of $f(x, y) = c$ are curves where $f$ is constant. The gradient at any point on a contour line points perpendicular to it, toward higher values. This is why each gradient descent step crosses the contour line at its current point at a right angle instead of running along it.
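The three observations can be checked with the gradient $[6, 8]^T$ of $f(x, y) = x^2 + y^2$ at $(3, 4)$. This is a minimal sketch; the function name `directional_derivative` and the three test directions are illustrative choices.

```python
import math


def directional_derivative(grad, u):
    """Dot product of the gradient with a unit vector u."""
    return sum(g * ui for g, ui in zip(grad, u))


grad = (6.0, 8.0)         # gradient of x^2 + y^2 at (3, 4)
norm = math.hypot(*grad)  # 10.0

along = (grad[0] / norm, grad[1] / norm)  # theta = 0
perp = (-along[1], along[0])              # theta = 90 degrees
against = (-along[0], -along[1])          # theta = 180 degrees

results = {
    "along": directional_derivative(grad, along),
    "perpendicular": directional_derivative(grad, perp),
    "against": directional_derivative(grad, against),
}
for name, value in results.items():
    print(name, value)  # roughly 10, 0, and -10
```

Up to floating-point rounding, the three values are $\|\nabla f\|$, $0$, and $-\|\nabla f\|$, as the $\cos \theta$ formula predicts.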

The Jacobian Matrix

When the function is vector-valued — $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ — partial derivatives are organized into the Jacobian matrix:

\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Each row is the (transposed) gradient of one output component. The Jacobian gives the best linear approximation of $\mathbf{f}$ near a point:

\mathbf{f}(\mathbf{x} + \mathbf{h}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J} \mathbf{h}

The Jacobian appears throughout the chain rule and backpropagation, where we chain together Jacobians of successive layers. For a deeper treatment of matrix-level derivatives, see matrix calculus.
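To make the linear-approximation property concrete, the sketch below builds a Jacobian by central differences and compares $\mathbf{f}(\mathbf{x} + \mathbf{h})$ with $\mathbf{f}(\mathbf{x}) + \mathbf{J}\mathbf{h}$. The map `f` is a hypothetical example chosen for illustration (it does not appear in the article), and `numerical_jacobian` is an illustrative helper, not a library function.

```python
import math


def f(v):
    """Hypothetical example map from R^2 to R^2 (not from the article)."""
    x, y = v
    return [x**2 * y, x + math.sin(y)]


def numerical_jacobian(func, v, h=1e-6):
    """Central-difference Jacobian: J[i][j] approximates d f_i / d x_j."""
    m = len(func(v))
    n = len(v)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        v_plus = list(v)
        v_minus = list(v)
        v_plus[j] += h
        v_minus[j] -= h
        f_plus, f_minus = func(v_plus), func(v_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J


x0 = [1.0, 2.0]
J = numerical_jacobian(f, x0)

# Compare f(x0 + step) with the linear prediction f(x0) + J @ step.
step = [0.01, -0.02]
exact = f([x0[0] + step[0], x0[1] + step[1]])
linear = [f(x0)[i] + sum(J[i][j] * step[j] for j in range(2)) for i in range(2)]

print(J)              # analytic value: [[2xy, x^2], [1, cos y]] = [[4, 1], [1, cos 2]]
print(exact, linear)  # close for a small step
```

Row `i` of `J` holds the partial derivatives of output `i`, so the layout matches the matrix above: one row per output, one column per input.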

The Hessian Matrix

For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ , the Hessian is the $n \times n$ matrix of all second-order partial derivatives:

\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}

By Clairaut’s theorem, $\mathbf{H}$ is symmetric (if second partials are continuous).

What the Hessian Tells Us

The Hessian captures the curvature of $f$ in every direction. At a critical point ( $\nabla f = \mathbf{0}$ ):

Hessian property	Conclusion
Positive definite (all eigenvalues $> 0$)	Local minimum
Negative definite (all eigenvalues $< 0$)	Local maximum
Indefinite (eigenvalues of both signs)	Saddle point
Semidefinite but singular (some eigenvalues $= 0$, the rest of one sign)	Inconclusive

Key insight: In high-dimensional spaces (neural networks with millions of parameters), saddle points are generally expected to vastly outnumber local minima. At a saddle point, some Hessian eigenvalues are positive (the function curves up in those directions) and others are negative (it curves down). Heuristically, if each eigenvalue's sign were independent and equally likely to be positive or negative, the probability that all $n$ share the same sign would be $2^{1-n}$, which shrinks exponentially with dimension.
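The classification rules can be sketched for the two-variable case, where the eigenvalues of a symmetric $2 \times 2$ Hessian have a closed form. The function names, the tolerance `tol`, and the four example Hessians (hand-computed at the origin for simple illustrative functions) are assumptions made for this sketch.

```python
import math


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius


def classify_critical_point(a, b, d, tol=1e-12):
    """Apply the second-derivative test to a 2x2 Hessian at a critical point."""
    lo, hi = eigenvalues_2x2_symmetric(a, b, d)
    if lo > tol:
        return "local minimum"
    if hi < -tol:
        return "local maximum"
    if lo < -tol and hi > tol:
        return "saddle point"
    return "inconclusive"


# Hessians at the origin for a few illustrative functions (hand-computed):
print(classify_critical_point(2, 0, 2))    # x^2 + y^2   -> local minimum
print(classify_critical_point(-2, 0, -2))  # -x^2 - y^2  -> local maximum
print(classify_critical_point(2, 0, -2))   # x^2 - y^2   -> saddle point
print(classify_critical_point(2, 0, 0))    # x^2 + y^4   -> inconclusive
```

The last case shows why a zero eigenvalue is inconclusive: $x^2 + y^4$ does have a minimum at the origin, but the Hessian alone cannot establish it.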

The Condition Number

The condition number of the Hessian, $\kappa = \lambda_{\max} / \lambda_{\min}$ (defined here for a positive definite Hessian as the ratio of its largest to its smallest eigenvalue), measures how elongated the level sets of the loss are near a minimum. A large condition number means the curvature differs sharply across directions: a learning rate small enough to be stable along the steepest direction makes slow progress along the flattest one. This motivates adaptive methods that use different effective learning rates for different parameters.
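The effect of $\kappa$ on plain gradient descent can be seen on a toy quadratic. The sketch below assumes the loss $f(x, y) = \tfrac{1}{2}(\lambda_{\min} x^2 + \lambda_{\max} y^2)$, whose Hessian is diagonal with those two eigenvalues, and a single learning rate of $1/\lambda_{\max}$; the function name, starting point, and tolerance are all illustrative choices.

```python
def gd_iterations(lam_min, lam_max, tol=1e-6, max_iter=100_000):
    """Count gradient-descent steps on f(x, y) = 0.5 * (lam_min*x^2 + lam_max*y^2).

    The Hessian is diag(lam_min, lam_max), so kappa = lam_max / lam_min.
    The step size 1 / lam_max is an illustrative choice that keeps the
    steep direction stable.
    """
    x, y = 1.0, 1.0
    lr = 1.0 / lam_max
    for k in range(max_iter):
        if max(abs(x), abs(y)) < tol:
            return k
        x -= lr * lam_min * x  # df/dx = lam_min * x
        y -= lr * lam_max * y  # df/dy = lam_max * y
    return max_iter


for lam_max in (1.0, 10.0, 100.0):
    print("kappa =", lam_max, "iterations =", gd_iterations(1.0, lam_max))
```

With $\kappa = 1$ a single step reaches the minimum, while larger $\kappa$ needs many more steps because the flat direction shrinks by only a factor of $1 - 1/\kappa$ per step.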

Worked Example: Gradient of MSE Loss

Consider a simple linear model $\hat{y} = w_1 x + w_2$ with MSE loss:

\mathcal{L}(w_1, w_2) = \frac{1}{n}\sum_{i=1}^{n} (y_i - w_1 x_i - w_2)^2

The partial derivatives are:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial w_1} &= \frac{1}{n}\sum_{i=1}^{n} 2(y_i - w_1 x_i - w_2)(-x_i) \\[6pt] &= -\frac{2}{n}\sum_{i=1}^{n} x_i(y_i - w_1 x_i - w_2) \\[10pt] \frac{\partial \mathcal{L}}{\partial w_2} &= \frac{1}{n}\sum_{i=1}^{n} 2(y_i - w_1 x_i - w_2)(-1) \\[6pt] &= -\frac{2}{n}\sum_{i=1}^{n} (y_i - w_1 x_i - w_2) \end{aligned}

The gradient vector is:

\nabla \mathcal{L} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial w_1} \\ \frac{\partial \mathcal{L}}{\partial w_2} \end{bmatrix} = -\frac{2}{n} \begin{bmatrix} \sum x_i (y_i - w_1 x_i - w_2) \\ \sum (y_i - w_1 x_i - w_2) \end{bmatrix}

Setting $\nabla \mathcal{L} = \mathbf{0}$ gives a pair of linear equations in $w_1$ and $w_2$, the normal equations, whose solution is the closed-form least-squares fit for linear regression. When closed-form solutions are not available (as in neural networks), we follow $-\nabla \mathcal{L}$ iteratively using gradient descent.
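The gradient formulas translate directly into code. The sketch below implements them for the two-parameter model and runs gradient descent on a small made-up dataset that lies exactly on $y = 2x + 1$, so the minimizer is known in advance; the data, the learning rate `lr`, and the iteration count are illustrative assumptions.

```python
def mse_loss(w1, w2, xs, ys):
    """Mean squared error of the model y_hat = w1 * x + w2."""
    n = len(xs)
    return sum((y - w1 * x - w2) ** 2 for x, y in zip(xs, ys)) / n


def mse_gradient(w1, w2, xs, ys):
    """Gradient from the derivation above: -(2/n) times sums over residuals."""
    n = len(xs)
    residuals = [y - w1 * x - w2 for x, y in zip(xs, ys)]
    g1 = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
    g2 = -2.0 / n * sum(residuals)
    return g1, g2


# Made-up data lying exactly on y = 2x + 1, so the minimizer is known.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w1, w2 = 0.0, 0.0
lr = 0.05  # illustrative learning rate
for _ in range(2000):
    g1, g2 = mse_gradient(w1, w2, xs, ys)
    w1, w2 = w1 - lr * g1, w2 - lr * g2

print(w1, w2)                    # approaches (2, 1)
print(mse_loss(w1, w2, xs, ys))  # approaches 0
```

Each update moves the parameters along $-\nabla \mathcal{L}$, and on this dataset the iterates approach the same solution that the normal equations would give.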

Why This Matters for ML

Partial derivatives and gradients are the computational backbone of deep learning:

Every optimizer (SGD, Adam, RMSProp) is driven by the gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$
Backpropagation computes gradients efficiently using the chain rule applied to computational graphs
The Hessian determines whether a critical point is a minimum, maximum, or saddle point — and its condition number predicts how hard optimization will be
The Jacobian connects layers in a neural network and appears in normalizing flows and other generative models
Directional derivatives explain why gradient descent takes the steepest path and why preconditioning (rescaling directions) can accelerate convergence

Summary

Partial derivatives measure how a function changes with respect to one variable, holding others fixed
The gradient $\nabla f$ collects all partial derivatives into a vector pointing in the direction of steepest ascent
Directional derivatives give the rate of change in any direction: $D_\mathbf{u} f = \nabla f \cdot \mathbf{u}$
The Jacobian generalizes the gradient to vector-valued functions — each row is a gradient
The Hessian captures second-order curvature; its eigenvalues classify critical points and determine optimization difficulty
The negative gradient of the MSE loss gives the gradient descent update direction for linear regression
Next: the chain rule and computational graphs show how to compute gradients through compositions of functions

