Taylor Series and Approximation: Local Models of Complex Functions

Calculus & Optimization Series 5 / 18

The Idea of Local Approximation

Complex functions can be approximated by simpler ones near a given point. This is not just a mathematical curiosity — it is the foundation of all gradient-based optimization. Gradient descent uses a linear approximation. Newton’s method uses a quadratic approximation. Understanding these approximations reveals why optimization algorithms behave the way they do.

This article connects the derivatives and chain rule we have developed to the optimization algorithms that follow.

Taylor Series in One Variable

The Taylor series of a function $f$ centered at $a$ is the infinite polynomial:

f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n

Expanded:

f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots

When $a = 0$ , this is called the Maclaurin series.

Key Expansions

These Taylor series appear throughout ML derivations:

\begin{aligned} e^x &= 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \\[6pt] \ln(1 + x) &= x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots \quad \text{for } |x| < 1 \\[6pt] \frac{1}{1 - x} &= 1 + x + x^2 + x^3 + \cdots \quad \text{for } |x| < 1 \\[6pt] \sin x &= x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots \\[6pt] \cos x &= 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots \end{aligned}

Key insight: $e^x \approx 1 + x$ for small $x$ . This approximation is used constantly — from the definition of the Poisson distribution to the analysis of learning rate effects in optimization.

Taylor’s Theorem and the Remainder

The $k$ -th order Taylor polynomial $T_k(x)$ uses only the first $k+1$ terms. The error of this approximation is bounded by Taylor’s theorem:

f(x) = T_k(x) + R_k(x)

where the Lagrange remainder is:

R_k(x) = \frac{f^{(k+1)}(c)}{(k+1)!}(x - a)^{k+1}

for some $c$ between $a$ and $x$ . The error is proportional to $(x - a)^{k+1}$ — the approximation gets better as we move closer to $a$ or use more terms.

Linear Approximation (First-Order Taylor)

Truncating the Taylor series after the first-order term gives:

f(x) \approx f(a) + f'(a)(x - a)

This is the tangent line at $a$ — the best linear approximation of $f$ near $a$ .

Multivariable Linearization

For $f: \mathbb{R}^n \to \mathbb{R}$ :

f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x} - \mathbf{a})

This first-order approximation is exactly what gradient descent uses. At each step, we approximate the loss as a linear function of the parameters and move in the direction that decreases this linear model.

Key insight: Gradient descent trusts the linear approximation within a small region around the current point. The learning rate controls the size of this region. If the learning rate is too large, we step beyond where the linear approximation is accurate, and the algorithm can diverge.

How Good is the Linear Approximation?

For a function with bounded second derivative $|f''| \leq M$ , the linearization error is:

|f(x) - T_1(x)| \leq \frac{M}{2}(x - a)^2

The error grows quadratically with distance from $a$ . This is why gradient descent with a small step size converges but a large step size can overshoot — the linear approximation breaks down far from the current point.

Quadratic Approximation (Second-Order Taylor)

Including the second-order term gives:

f(x) \approx f(a) + f'(a)(x - a) + \frac{1}{2}f''(a)(x - a)^2

This approximation captures curvature — whether the function bends up or down.

Multivariable Quadratic Approximation

For $f: \mathbb{R}^n \to \mathbb{R}$ :

f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x} - \mathbf{a}) + \frac{1}{2}(\mathbf{x} - \mathbf{a})^T \mathbf{H}(\mathbf{a})(\mathbf{x} - \mathbf{a})

where $\mathbf{H}(\mathbf{a})$ is the Hessian matrix at $\mathbf{a}$ . The quadratic term $\frac{1}{2}\mathbf{d}^T \mathbf{H} \mathbf{d}$ (where $\mathbf{d} = \mathbf{x} - \mathbf{a}$ ) models how the function curves in every direction.

This quadratic model is what second-order optimization methods minimize at each step.

Newton’s Method

If we have the quadratic approximation, why not minimize it directly? Setting the derivative of the quadratic model to zero:

\nabla f(\mathbf{a}) + \mathbf{H}(\mathbf{a})(\mathbf{x} - \mathbf{a}) = \mathbf{0}

Solving for $\mathbf{x}$ :

\mathbf{x}^* = \mathbf{a} - \mathbf{H}(\mathbf{a})^{-1} \nabla f(\mathbf{a})

This gives the Newton’s method update rule:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \mathbf{H}_t^{-1} \nabla f(\boldsymbol{\theta}_t)

Newton vs Gradient Descent

Property	Gradient Descent	Newton’s Method
Approximation	First-order (linear)	Second-order (quadratic)
Update	$-\alpha \nabla f$	$-\mathbf{H}^{-1} \nabla f$
Step size	Requires tuning $\alpha$	No learning rate needed
Convergence	Linear: $O(1/t)$	Quadratic: error squares each step
Per-step cost	$O(n)$	$O(n^3)$ (invert Hessian)
Memory	$O(n)$	$O(n^2)$ (store Hessian)

Newton’s method converges much faster — quadratic convergence means the number of correct digits roughly doubles each iteration. But the cost is prohibitive for large $n$ . For a neural network with $n = 10^8$ parameters, storing the Hessian requires $10^{16}$ entries — impossibly large.

Key insight: Gradient descent is a first-order method (uses only gradients). Newton’s method is a second-order method (uses gradients and curvature). The trade-off between convergence speed and per-step cost is the central tension in optimization, and most practical optimizers live somewhere in between.

Geometric Interpretation

Gradient descent steps in the steepest direction with a fixed step size. Newton’s method steps to the minimum of the local quadratic model. On an elongated valley (ill-conditioned Hessian), gradient descent zigzags back and forth, while Newton’s method jumps directly to the bottom in one step.

Quasi-Newton Methods

Quasi-Newton methods approximate the Hessian without computing it explicitly, getting some of Newton’s convergence benefit at lower cost.

BFGS

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm maintains an approximation $\mathbf{B}_t \approx \mathbf{H}_t$ and updates it using gradient information from successive steps. It achieves superlinear convergence — faster than gradient descent, slower than Newton.

The update uses only gradient evaluations (no second derivatives):

\mathbf{B}_{t+1} = \mathbf{B}_t + \frac{\mathbf{y}\mathbf{y}^T}{\mathbf{y}^T\mathbf{s}} - \frac{\mathbf{B}_t \mathbf{s} \mathbf{s}^T \mathbf{B}_t}{\mathbf{s}^T \mathbf{B}_t \mathbf{s}}

where $\mathbf{s} = \boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t$ and $\mathbf{y} = \nabla f_{t+1} - \nabla f_t$ .

L-BFGS

Limited-memory BFGS (L-BFGS) stores only the last $m$ pairs $(\mathbf{s}_i, \mathbf{y}_i)$ instead of the full $n \times n$ matrix. This reduces memory from $O(n^2)$ to $O(mn)$ , making it practical for moderately large problems. L-BFGS is the default optimizer in many classical ML libraries (e.g., scikit-learn’s logistic regression).

Common Taylor Approximations for ML

These first-order approximations are used constantly in derivations and analyses:

Function	Approximation (near $x = 0$ )	Where it appears
$e^x$	$1 + x$	Learning rate analysis, Poisson
$\ln(1 + x)$	$x$	Log-loss simplification
$(1 + x)^n$	$1 + nx$	Binomial approximation
$\sigma(x)$	$\frac{1}{2} + \frac{x}{4}$	Sigmoid linearization
$\tanh(x)$	$x$	Tanh near origin
$\sqrt{1 + x}$	$1 + \frac{x}{2}$	Norm approximations

Worked Example: Predicting Loss After a Gradient Step

Suppose our loss function at the current parameters $\boldsymbol{\theta}_t$ has:

Value: $\mathcal{L}(\boldsymbol{\theta}_t) = 2.5$
Gradient norm: $\|\nabla \mathcal{L}\| = 0.8$
Maximum Hessian eigenvalue: $\lambda_{\max} = 10$

With learning rate $\alpha = 0.05$ , the gradient descent update is $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - 0.05 \nabla \mathcal{L}$ .

Using the second-order Taylor approximation:

\begin{aligned} \mathcal{L}(\boldsymbol{\theta}_{t+1}) &\approx \mathcal{L}(\boldsymbol{\theta}_t) + \nabla \mathcal{L}^T(-\alpha \nabla \mathcal{L}) + \frac{1}{2}(-\alpha \nabla \mathcal{L})^T \mathbf{H} (-\alpha \nabla \mathcal{L}) \\[6pt] &= \mathcal{L}(\boldsymbol{\theta}_t) - \alpha \|\nabla \mathcal{L}\|^2 + \frac{\alpha^2}{2} \nabla \mathcal{L}^T \mathbf{H} \nabla \mathcal{L} \\[6pt] &\leq 2.5 - 0.05(0.64) + \frac{0.0025}{2}(10)(0.64) \\[6pt] &= 2.5 - 0.032 + 0.008 \\[6pt] &= 2.476 \end{aligned}

The loss decreases by about 0.024. The first term ( $-0.032$ ) is the decrease from the gradient step. The second term ( $+0.008$ ) is the “overshoot” from the curvature. For convergence, we need the decrease to dominate the overshoot, which requires $\alpha < 2/\lambda_{\max} = 0.2$ .

Why This Matters for ML

Taylor approximations are the theoretical lens through which all optimization methods are understood:

Gradient descent minimizes the first-order Taylor approximation within a trust region controlled by the learning rate
Newton’s method minimizes the second-order Taylor approximation, achieving faster convergence at higher computational cost
Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian cheaply, balancing speed and cost
Learning rate bounds ( $\alpha < 2/L$ for $L$ -smooth functions) come directly from ensuring the Taylor approximation is valid
Adaptive methods like Adam can be interpreted as diagonal approximations to Newton’s method
The quality of the local approximation explains why gradient descent converges slowly on ill-conditioned problems

Summary

The Taylor series approximates a function as a polynomial using its derivatives at a point
Linear approximation (first-order Taylor) = tangent line = what gradient descent uses
Quadratic approximation (second-order Taylor) captures curvature via the Hessian = what Newton’s method uses
Newton’s method converges quadratically but costs $O(n^3)$ per step — too expensive for deep learning
Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian cheaply using gradient history
The learning rate must be small enough that the linear approximation remains valid: $\alpha < 2/\lambda_{\max}(\mathbf{H})$
These ideas directly motivate gradient descent and its variants

References

Stewart, J. (2015). Calculus: Early Transcendentals (8th ed.). Cengage Learning. Chapter 11.
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 2-3, 6-7.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 9. stanford.edu
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 4. deeplearningbook.org