From One Variable to Many
Real ML models do not have a single parameter; they have thousands or millions. A neural network’s loss is a function of all its weights and biases. To optimize this loss, we need calculus that works in $\mathbb{R}^n$.
This article extends the derivative to functions of multiple variables, building toward the gradient — the single most important object in machine learning optimization.
Partial Derivatives
A partial derivative measures how a function changes when we vary one input while holding all others fixed.
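For a function $f(x, y)$, the partial derivative with respect to $x$ is:

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}$$

We simply differentiate with respect to $x$, treating $y$ as a constant.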
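The limit definition can be approximated numerically by nudging one input while keeping the others fixed. The sketch below is only illustrative: the helper `partial_derivative` and the example function `g` are hypothetical names chosen here, and `g` is not the function from the worked example that follows.

```python
def partial_derivative(f, point, index, h=1e-5):
    """Central-difference estimate of the partial derivative of f at `point`
    with respect to coordinate `index`, holding the other coordinates fixed."""
    forward = list(point)
    backward = list(point)
    forward[index] += h
    backward[index] -= h
    return (f(*forward) - f(*backward)) / (2 * h)


# Hypothetical example function, chosen only for illustration.
def g(x, y):
    return x**2 * y + y**3


# Analytically: dg/dx = 2xy and dg/dy = x^2 + 3y^2.
dg_dx = partial_derivative(g, (1.0, 2.0), index=0)  # close to 2*1*2 = 4
dg_dy = partial_derivative(g, (1.0, 2.0), index=1)  # close to 1 + 3*4 = 13
```

Comparing a finite-difference estimate like this against a hand-derived formula is a common way to catch mistakes in analytic derivatives.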
Worked Example
Let .
At the point :
Geometric interpretation: $\partial f / \partial x = 16$ means that if we stand at that point and walk in the $x$-direction, the function rises at a rate of 16 units per unit step. Likewise, $\partial f / \partial y = 8$ means walking in the $y$-direction raises the function at rate 8.
Higher-Order Partial Derivatives
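We can differentiate partial derivatives again to get second-order partials:

$$\frac{\partial^2 f}{\partial x^2}, \quad \frac{\partial^2 f}{\partial y^2}, \quad \frac{\partial^2 f}{\partial x \, \partial y}, \quad \frac{\partial^2 f}{\partial y \, \partial x}$$

Clairaut’s theorem (symmetry of mixed partials): If the mixed partials are continuous, they are equal:

$$\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial^2 f}{\partial y \, \partial x}$$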
This symmetry is why the Hessian matrix (defined below) is symmetric — a property that has deep consequences for optimization.
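As a quick check of this symmetry, take the illustrative function $g(x, y) = x^2 y + y^3$ (an assumption chosen here for demonstration, not a function from the article). Differentiating in either order gives the same mixed partial:

$$\frac{\partial g}{\partial x} = 2xy, \qquad \frac{\partial g}{\partial y} = x^2 + 3y^2$$

$$\frac{\partial}{\partial y}\left(\frac{\partial g}{\partial x}\right) = 2x, \qquad \frac{\partial}{\partial x}\left(\frac{\partial g}{\partial y}\right) = 2x$$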
The Gradient Vector
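The gradient collects all partial derivatives into a single vector:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)^\top$$

The gradient is the multivariable generalization of the derivative. For $f: \mathbb{R}^n \to \mathbb{R}$, $\nabla f$ is a vector in $\mathbb{R}^n$.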
The Gradient Points Uphill
The gradient $\nabla f$ points in the direction of steepest ascent: the direction in which $f$ increases most rapidly. Its magnitude $\|\nabla f\|$ is the rate of that steepest increase.
Key insight: Gradient descent follows $-\nabla f$, the direction of steepest descent. This is why the gradient is the central object in ML optimization: it tells us exactly which direction to move parameters to reduce the loss fastest.
Worked Example
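For $f(x, y) = x^2 + y^2$ (a paraboloid centered at the origin):

$$\nabla f = (2x, 2y)$$

At any point $(x, y)$ other than the origin, $\nabla f = 2(x, y)$, pointing directly away from the origin. The magnitude is $\|\nabla f\| = 2\sqrt{x^2 + y^2}$. Moving in the $-\nabla f$ direction decreases $f$ most rapidly, heading straight toward the minimum at the origin.
At the minimum $(0, 0)$, $\nabla f = (0, 0)$. The gradient is zero at the optimum; for a differentiable function this is a necessary condition for a minimum.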
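A minimal sketch of this behavior, assuming the paraboloid $f(x, y) = x^2 + y^2$ and a hypothetical starting point $(3, 4)$; the learning rate and step count are arbitrary illustrative choices:

```python
import math


def f(x, y):
    # Assumed paraboloid centered at the origin.
    return x**2 + y**2


def grad_f(x, y):
    # Analytic gradient: (df/dx, df/dy) = (2x, 2y).
    return (2 * x, 2 * y)


def gradient_descent(start, learning_rate=0.1, steps=50):
    """Repeatedly step along -grad f, starting from `start`."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


start = (3.0, 4.0)                 # hypothetical starting point
gx, gy = grad_f(*start)            # (6.0, 8.0): points away from the origin
grad_norm = math.hypot(gx, gy)     # rate of steepest increase at the start
x_final, y_final = gradient_descent(start)
```

With this step size each iteration shrinks both coordinates by the same factor, so the iterates move along a straight line toward the origin, where the gradient vanishes.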
Directional Derivatives
The partial derivatives measure the rate of change along the coordinate axes. But what about an arbitrary direction?
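The directional derivative of $f$ at $\mathbf{x}$ in the direction of a unit vector $\mathbf{u}$ is:

$$D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u} = \|\nabla f(\mathbf{x})\| \cos\theta$$

where $\theta$ is the angle between $\nabla f(\mathbf{x})$ and $\mathbf{u}$.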
Key observations:
- Maximum when $\mathbf{u}$ aligns with $\nabla f$ ($\theta = 0$): the function increases fastest in the gradient direction
- Zero when $\mathbf{u}$ is perpendicular to $\nabla f$ ($\theta = \pi/2$): moving along a level set (contour line)
- Minimum when $\mathbf{u}$ opposes $\nabla f$ ($\theta = \pi$): the steepest descent direction
Geometric interpretation: The gradient is perpendicular to level sets. In 2D, contour lines of $f$ are curves where $f$ is constant. The gradient at any point on a contour line points perpendicular to it, toward higher values. This is why gradient descent cuts across contour lines.
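The three cases above can be checked with a few lines of code. This sketch assumes the illustrative function $f(x, y) = x^2 + y^2$ evaluated at the hypothetical point $(3, 4)$; `directional_derivative` is a helper name introduced here.

```python
import math


def grad_f(x, y):
    # Gradient of the assumed function f(x, y) = x^2 + y^2.
    return (2 * x, 2 * y)


def directional_derivative(grad, direction):
    """Dot product of the gradient with the normalized direction vector."""
    norm = math.hypot(*direction)
    unit = [d / norm for d in direction]
    return sum(g_i * u_i for g_i, u_i in zip(grad, unit))


g = grad_f(3.0, 4.0)                                  # (6, 8), with norm 10
along = directional_derivative(g, g)                  # theta = 0
across = directional_derivative(g, (-g[1], g[0]))     # theta = pi/2
against = directional_derivative(g, (-g[0], -g[1]))   # theta = pi
```

Here `along` equals the gradient norm, `across` is zero up to rounding (a step along the contour), and `against` is the negative of the norm.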
The Jacobian Matrix
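When the function is vector-valued, $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, partial derivatives are organized into the $m \times n$ Jacobian matrix:

$$J = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

Each row is the gradient of one output component. The Jacobian is the best linear approximation of $\mathbf{f}$ near a point:

$$\mathbf{f}(\mathbf{x} + \boldsymbol{\delta}) \approx \mathbf{f}(\mathbf{x}) + J(\mathbf{x}) \, \boldsymbol{\delta}$$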
The Jacobian appears throughout the chain rule and backpropagation, where we chain together Jacobians of successive layers. For a deeper treatment of matrix-level derivatives, see matrix calculus.
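To make the linear-approximation role of the Jacobian concrete, the sketch below builds a finite-difference Jacobian and uses it to predict the effect of a small input change. The map `F` (polar to Cartesian coordinates) and the helper names are assumptions chosen for illustration.

```python
import math


def F(v):
    # Hypothetical map from R^2 to R^2: polar (r, t) to Cartesian (x, y).
    r, t = v
    return [r * math.cos(t), r * math.sin(t)]


def jacobian(func, v, h=1e-6):
    """Finite-difference Jacobian: entry [i][j] approximates dF_i / dv_j."""
    m = len(func(v))
    J = [[0.0] * len(v) for _ in range(m)]
    for j in range(len(v)):
        plus = list(v)
        minus = list(v)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = func(plus), func(minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J


def linearize(func, v, delta):
    """First-order prediction F(v) + J(v) @ delta."""
    J = jacobian(func, v)
    base = func(v)
    return [base[i] + sum(J[i][j] * delta[j] for j in range(len(v)))
            for i in range(len(base))]


v0 = [2.0, 0.5]
delta = [0.01, -0.02]
J0 = jacobian(F, v0)
predicted = linearize(F, v0, delta)
actual = F([v0[0] + delta[0], v0[1] + delta[1]])
```

The gap between `predicted` and `actual` shrinks quadratically as `delta` gets smaller, which is what "best linear approximation" means in practice.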
The Hessian Matrix
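For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the $n \times n$ matrix of all second-order partial derivatives:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$

By Clairaut’s theorem, $H$ is symmetric (if second partials are continuous).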
What the Hessian Tells Us
The Hessian captures the curvature of $f$ in every direction. At a critical point ($\nabla f = \mathbf{0}$):
| Hessian property | Conclusion |
|---|---|
| Positive definite (all eigenvalues $> 0$) | Local minimum |
| Negative definite (all eigenvalues $< 0$) | Local maximum |
| Indefinite (mixed signs) | Saddle point |
| Singular (some eigenvalues $= 0$) | Inconclusive |
Key insight: In high-dimensional spaces (neural networks with millions of parameters), saddle points vastly outnumber local minima. At a saddle point, some Hessian eigenvalues are positive (the function curves up) and others are negative (curves down). Under the common heuristic that each eigenvalue's sign behaves roughly like an independent coin flip, the probability that all eigenvalues share the same sign shrinks exponentially with dimension.
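The second-derivative test in the table can be sketched for the two-variable case, where the eigenvalues of a symmetric Hessian have a closed form. The function names and the example Hessians below are illustrative assumptions.

```python
import math


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius


def classify_critical_point(a, b, d, tol=1e-12):
    """Second-derivative test using the Hessian [[a, b], [b, d]]."""
    lo, hi = eigenvalues_2x2_symmetric(a, b, d)
    if abs(lo) <= tol or abs(hi) <= tol:
        return "inconclusive"      # singular Hessian
    if lo > 0:
        return "local minimum"     # positive definite
    if hi < 0:
        return "local maximum"     # negative definite
    return "saddle point"          # indefinite


# Hessians at the origin for four hypothetical functions:
bowl = classify_critical_point(2, 0, 2)       # x^2 + y^2
dome = classify_critical_point(-2, 0, -2)     # -(x^2 + y^2)
saddle = classify_critical_point(2, 0, -2)    # x^2 - y^2
flat = classify_critical_point(2, 0, 0)       # x^2 + y^4
```

The last case shows why a singular Hessian is inconclusive: $x^2 + y^4$ does have a minimum at the origin, but the second-order information alone cannot confirm it.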
The Condition Number
The condition number of the Hessian, $\kappa = \lambda_{\max} / \lambda_{\min}$ (the ratio of its largest to smallest eigenvalue, for a positive definite Hessian), measures how elongated the loss surface is. A large condition number means the curvature varies dramatically across directions, making optimization with a single learning rate difficult. This motivates adaptive methods that use different effective learning rates for different parameters.
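A small experiment illustrates the effect. The sketch below runs plain gradient descent on a quadratic whose Hessian is diagonal, so the condition number is simply the ratio of the largest to the smallest curvature. The function name, tolerance, and step-size rule are assumptions made for this example.

```python
def steps_to_converge(curvatures, tol=1e-6, max_steps=100_000):
    """Gradient descent on f(x) = 0.5 * sum(c_i * x_i^2), whose Hessian is
    diag(curvatures). Returns the number of steps until every |x_i| < tol."""
    # Assumed step-size rule: 1 / (largest curvature), which keeps every
    # coordinate stable but is slow for the flattest direction.
    learning_rate = 1.0 / max(curvatures)
    x = [1.0] * len(curvatures)
    for step in range(1, max_steps + 1):
        # Gradient component i is c_i * x_i.
        x = [xi - learning_rate * c * xi for xi, c in zip(x, curvatures)]
        if all(abs(xi) < tol for xi in x):
            return step
    return max_steps


well_conditioned = steps_to_converge([1.0, 1.0])     # condition number 1
ill_conditioned = steps_to_converge([100.0, 1.0])    # condition number 100
```

With a single learning rate set by the steepest direction, the flat direction shrinks by only a factor of $1 - 1/\kappa$ per step, so the ill-conditioned problem needs far more iterations.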
Worked Example: Gradient of MSE Loss
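Consider a simple linear model $\hat{y} = wx + b$ with MSE loss:

$$L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)^2$$

The partial derivatives are:

$$\frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i) \, x_i, \qquad \frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i)$$

The gradient vector is:

$$\nabla L = \left( \frac{\partial L}{\partial w}, \frac{\partial L}{\partial b} \right)^\top$$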
Setting $\nabla L = \mathbf{0}$ and solving gives the normal equations, whose solution is the closed-form answer for linear regression. When closed-form solutions are not available (as in neural networks), we follow $-\nabla L$ iteratively using gradient descent.
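Both routes can be compared on a toy problem. The sketch below assumes a one-feature model $\hat{y} = wx + b$ with the loss averaged over $n$ points; the data, learning rate, and function names are hypothetical choices for illustration.

```python
def mse_gradient(w, b, xs, ys):
    """Gradient (dL/dw, dL/db) of L = (1/n) * sum((w*x + b - y)^2)."""
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    dw = (2 / n) * sum(r * x for r, x in zip(residuals, xs))
    db = (2 / n) * sum(residuals)
    return dw, db


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Follow the negative gradient from (w, b) = (0, 0)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = mse_gradient(w, b, xs, ys)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


def fit_closed_form(xs, ys):
    """Solve grad L = 0 directly (normal equations for a single feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    return w, mean_y - w * mean_x


# Hypothetical toy data lying near the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
w_gd, b_gd = fit_gradient_descent(xs, ys)
w_cf, b_cf = fit_closed_form(xs, ys)
```

On this data the iterative and closed-form estimates agree to many decimal places, and the gradient evaluated at the closed-form solution is zero up to rounding.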
Why This Matters for ML
Partial derivatives and gradients are the computational backbone of deep learning:
- Every optimizer (SGD, Adam, RMSProp) is driven by the gradient
- Backpropagation computes gradients efficiently using the chain rule applied to computational graphs
- The Hessian determines whether a critical point is a minimum, maximum, or saddle point — and its condition number predicts how hard optimization will be
- The Jacobian connects layers in a neural network and appears in normalizing flows and other generative models
- Directional derivatives explain why gradient descent takes the steepest path and why preconditioning (rescaling directions) can accelerate convergence
Summary
- Partial derivatives measure how a function changes with respect to one variable, holding others fixed
- The gradient collects all partial derivatives into a vector pointing in the direction of steepest ascent
- Directional derivatives give the rate of change in any direction: $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$
- The Jacobian generalizes the gradient to vector-valued functions — each row is a gradient
- The Hessian captures second-order curvature; its eigenvalues classify critical points and determine optimization difficulty
- The gradient of the MSE loss gives the update direction for linear regression
- Next: the chain rule and computational graphs show how to compute gradients through compositions of functions