- 01 Limits and Continuity: The Foundation of Calculus 02 Derivatives and Differentiation: Measuring Rates of Change 03 Partial Derivatives and Gradients: Calculus in Multiple Dimensions 04 The Chain Rule and Computational Graphs: The Engine Behind Backpropagation 05 Taylor Series and Approximation: Local Models of Complex Functions 06 Gradient Descent: The Workhorse of Machine Learning Optimization 07 Stochastic Gradient Descent: Trading Precision for Speed 08 Adaptive Learning Rate Methods: From AdaGrad to Adam 09 Constrained Optimization: Lagrange Multipliers and KKT Conditions 10 Convexity and Convergence Theory: When Optimization Succeeds 11 Integration and Expectation: The Continuous Side of Probability 12 Calculus of Variations: Optimizing Over Functions 13 Second-Order and Natural Gradient Methods 14 Numerical Stability in Optimization: Making Training Work in Practice 15 Non-Smooth Optimization and Proximal Methods 16 Optimization Landscape of Neural Networks: Why Deep Learning Works 17 Implicit Differentiation and Differentiable Programming 18 Min-Max Optimization: Games, GANs, and Adversarial Training
The Idea of Local Approximation
Complex functions can be approximated by simpler ones near a given point. This is not just a mathematical curiosity — it is the foundation of all gradient-based optimization. Gradient descent uses a linear approximation. Newton’s method uses a quadratic approximation. Understanding these approximations reveals why optimization algorithms behave the way they do.
This article connects the derivatives and chain rule we have developed to the optimization algorithms that follow.
Taylor Series in One Variable
The Taylor series of a function centered at is the infinite polynomial:
Expanded:
When , this is called the Maclaurin series.
Key Expansions
These Taylor series appear throughout ML derivations:
Key insight: for small . This approximation is used constantly — from the definition of the Poisson distribution to the analysis of learning rate effects in optimization.
Taylor’s Theorem and the Remainder
The -th order Taylor polynomial uses only the first terms. The error of this approximation is bounded by Taylor’s theorem:
where the Lagrange remainder is:
for some between and . The error is proportional to — the approximation gets better as we move closer to or use more terms.
Linear Approximation (First-Order Taylor)
Truncating the Taylor series after the first-order term gives:
This is the tangent line at — the best linear approximation of near .
Multivariable Linearization
For :
This first-order approximation is exactly what gradient descent uses. At each step, we approximate the loss as a linear function of the parameters and move in the direction that decreases this linear model.
Key insight: Gradient descent trusts the linear approximation within a small region around the current point. The learning rate controls the size of this region. If the learning rate is too large, we step beyond where the linear approximation is accurate, and the algorithm can diverge.
How Good is the Linear Approximation?
For a function with bounded second derivative , the linearization error is:
The error grows quadratically with distance from . This is why gradient descent with a small step size converges but a large step size can overshoot — the linear approximation breaks down far from the current point.
Quadratic Approximation (Second-Order Taylor)
Including the second-order term gives:
This approximation captures curvature — whether the function bends up or down.
Multivariable Quadratic Approximation
For :
where is the Hessian matrix at . The quadratic term (where ) models how the function curves in every direction.
This quadratic model is what second-order optimization methods minimize at each step.
Newton’s Method
If we have the quadratic approximation, why not minimize it directly? Setting the derivative of the quadratic model to zero:
Solving for :
This gives the Newton’s method update rule:
Newton vs Gradient Descent
| Property | Gradient Descent | Newton’s Method |
|---|---|---|
| Approximation | First-order (linear) | Second-order (quadratic) |
| Update | ||
| Step size | Requires tuning | No learning rate needed |
| Convergence | Linear: | Quadratic: error squares each step |
| Per-step cost | (invert Hessian) | |
| Memory | (store Hessian) |
Newton’s method converges much faster — quadratic convergence means the number of correct digits roughly doubles each iteration. But the cost is prohibitive for large . For a neural network with parameters, storing the Hessian requires entries — impossibly large.
Key insight: Gradient descent is a first-order method (uses only gradients). Newton’s method is a second-order method (uses gradients and curvature). The trade-off between convergence speed and per-step cost is the central tension in optimization, and most practical optimizers live somewhere in between.
Geometric Interpretation
Gradient descent steps in the steepest direction with a fixed step size. Newton’s method steps to the minimum of the local quadratic model. On an elongated valley (ill-conditioned Hessian), gradient descent zigzags back and forth, while Newton’s method jumps directly to the bottom in one step.
Quasi-Newton Methods
Quasi-Newton methods approximate the Hessian without computing it explicitly, getting some of Newton’s convergence benefit at lower cost.
BFGS
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm maintains an approximation and updates it using gradient information from successive steps. It achieves superlinear convergence — faster than gradient descent, slower than Newton.
The update uses only gradient evaluations (no second derivatives):
where and .
L-BFGS
Limited-memory BFGS (L-BFGS) stores only the last pairs instead of the full matrix. This reduces memory from to , making it practical for moderately large problems. L-BFGS is the default optimizer in many classical ML libraries (e.g., scikit-learn’s logistic regression).
Common Taylor Approximations for ML
These first-order approximations are used constantly in derivations and analyses:
| Function | Approximation (near ) | Where it appears |
|---|---|---|
| Learning rate analysis, Poisson | ||
| Log-loss simplification | ||
| Binomial approximation | ||
| Sigmoid linearization | ||
| Tanh near origin | ||
| Norm approximations |
Worked Example: Predicting Loss After a Gradient Step
Suppose our loss function at the current parameters has:
- Value:
- Gradient norm:
- Maximum Hessian eigenvalue:
With learning rate , the gradient descent update is .
Using the second-order Taylor approximation:
The loss decreases by about 0.024. The first term () is the decrease from the gradient step. The second term () is the “overshoot” from the curvature. For convergence, we need the decrease to dominate the overshoot, which requires .
Why This Matters for ML
Taylor approximations are the theoretical lens through which all optimization methods are understood:
- Gradient descent minimizes the first-order Taylor approximation within a trust region controlled by the learning rate
- Newton’s method minimizes the second-order Taylor approximation, achieving faster convergence at higher computational cost
- Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian cheaply, balancing speed and cost
- Learning rate bounds ( for -smooth functions) come directly from ensuring the Taylor approximation is valid
- Adaptive methods like Adam can be interpreted as diagonal approximations to Newton’s method
- The quality of the local approximation explains why gradient descent converges slowly on ill-conditioned problems
Summary
- The Taylor series approximates a function as a polynomial using its derivatives at a point
- Linear approximation (first-order Taylor) = tangent line = what gradient descent uses
- Quadratic approximation (second-order Taylor) captures curvature via the Hessian = what Newton’s method uses
- Newton’s method converges quadratically but costs per step — too expensive for deep learning
- Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian cheaply using gradient history
- The learning rate must be small enough that the linear approximation remains valid:
- These ideas directly motivate gradient descent and its variants
References
- Stewart, J. (2015). Calculus: Early Transcendentals (8th ed.). Cengage Learning. Chapter 11.
- Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 2-3, 6-7.
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 9. stanford.edu
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 4. deeplearningbook.org