Taylor Series and Approximation: Local Models of Complex Functions

Understand Taylor expansions, linearization, quadratic approximation, and Newton's method — the math connecting derivatives to optimization.

Calculus & Optimization March 7, 2026 8 min read

The Idea of Local Approximation

Complex functions can be approximated by simpler ones near a given point. This is not just a mathematical curiosity — it is the foundation of all gradient-based optimization. Gradient descent uses a linear approximation. Newton’s method uses a quadratic approximation. Understanding these approximations reveals why optimization algorithms behave the way they do.

This article connects the derivatives and chain rule we have developed to the optimization algorithms that follow.

Taylor Series in One Variable

The Taylor series of a function ff centered at aa is the infinite polynomial:

f(x)=n=0f(n)(a)n!(xa)nf(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n

Expanded:

f(x)=f(a)+f(a)(xa)+f(a)2!(xa)2+f(a)3!(xa)3+f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots

When a=0a = 0, this is called the Maclaurin series.

Key Expansions

These Taylor series appear throughout ML derivations:

ex=1+x+x22!+x33!+ln(1+x)=xx22+x33for x<111x=1+x+x2+x3+for x<1sinx=xx33!+x55!cosx=1x22!+x44!\begin{aligned} e^x &= 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \\[6pt] \ln(1 + x) &= x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots \quad \text{for } |x| < 1 \\[6pt] \frac{1}{1 - x} &= 1 + x + x^2 + x^3 + \cdots \quad \text{for } |x| < 1 \\[6pt] \sin x &= x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots \\[6pt] \cos x &= 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots \end{aligned}

Key insight: ex1+xe^x \approx 1 + x for small xx. This approximation is used constantly — from the definition of the Poisson distribution to the analysis of learning rate effects in optimization.

Taylor’s Theorem and the Remainder

The kk-th order Taylor polynomial Tk(x)T_k(x) uses only the first k+1k+1 terms. The error of this approximation is bounded by Taylor’s theorem:

f(x)=Tk(x)+Rk(x)f(x) = T_k(x) + R_k(x)

where the Lagrange remainder is:

Rk(x)=f(k+1)(c)(k+1)!(xa)k+1R_k(x) = \frac{f^{(k+1)}(c)}{(k+1)!}(x - a)^{k+1}

for some cc between aa and xx. The error is proportional to (xa)k+1(x - a)^{k+1} — the approximation gets better as we move closer to aa or use more terms.

Linear Approximation (First-Order Taylor)

Truncating the Taylor series after the first-order term gives:

f(x)f(a)+f(a)(xa)f(x) \approx f(a) + f'(a)(x - a)

This is the tangent line at aa — the best linear approximation of ff near aa.

Multivariable Linearization

For f:RnRf: \mathbb{R}^n \to \mathbb{R}:

f(x)f(a)+f(a)T(xa)f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x} - \mathbf{a})

This first-order approximation is exactly what gradient descent uses. At each step, we approximate the loss as a linear function of the parameters and move in the direction that decreases this linear model.

Key insight: Gradient descent trusts the linear approximation within a small region around the current point. The learning rate controls the size of this region. If the learning rate is too large, we step beyond where the linear approximation is accurate, and the algorithm can diverge.

How Good is the Linear Approximation?

For a function with bounded second derivative fM|f''| \leq M, the linearization error is:

f(x)T1(x)M2(xa)2|f(x) - T_1(x)| \leq \frac{M}{2}(x - a)^2

The error grows quadratically with distance from aa. This is why gradient descent with a small step size converges but a large step size can overshoot — the linear approximation breaks down far from the current point.

Quadratic Approximation (Second-Order Taylor)

Including the second-order term gives:

f(x)f(a)+f(a)(xa)+12f(a)(xa)2f(x) \approx f(a) + f'(a)(x - a) + \frac{1}{2}f''(a)(x - a)^2

This approximation captures curvature — whether the function bends up or down.

Multivariable Quadratic Approximation

For f:RnRf: \mathbb{R}^n \to \mathbb{R}:

f(x)f(a)+f(a)T(xa)+12(xa)TH(a)(xa)f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x} - \mathbf{a}) + \frac{1}{2}(\mathbf{x} - \mathbf{a})^T \mathbf{H}(\mathbf{a})(\mathbf{x} - \mathbf{a})

where H(a)\mathbf{H}(\mathbf{a}) is the Hessian matrix at a\mathbf{a}. The quadratic term 12dTHd\frac{1}{2}\mathbf{d}^T \mathbf{H} \mathbf{d} (where d=xa\mathbf{d} = \mathbf{x} - \mathbf{a}) models how the function curves in every direction.

This quadratic model is what second-order optimization methods minimize at each step.

Newton’s Method

If we have the quadratic approximation, why not minimize it directly? Setting the derivative of the quadratic model to zero:

f(a)+H(a)(xa)=0\nabla f(\mathbf{a}) + \mathbf{H}(\mathbf{a})(\mathbf{x} - \mathbf{a}) = \mathbf{0}

Solving for x\mathbf{x}:

x=aH(a)1f(a)\mathbf{x}^* = \mathbf{a} - \mathbf{H}(\mathbf{a})^{-1} \nabla f(\mathbf{a})

This gives the Newton’s method update rule:

θt+1=θtHt1f(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \mathbf{H}_t^{-1} \nabla f(\boldsymbol{\theta}_t)

Newton vs Gradient Descent

PropertyGradient DescentNewton’s Method
ApproximationFirst-order (linear)Second-order (quadratic)
Updateαf-\alpha \nabla fH1f-\mathbf{H}^{-1} \nabla f
Step sizeRequires tuning α\alphaNo learning rate needed
ConvergenceLinear: O(1/t)O(1/t)Quadratic: error squares each step
Per-step costO(n)O(n)O(n3)O(n^3) (invert Hessian)
MemoryO(n)O(n)O(n2)O(n^2) (store Hessian)

Newton’s method converges much faster — quadratic convergence means the number of correct digits roughly doubles each iteration. But the cost is prohibitive for large nn. For a neural network with n=108n = 10^8 parameters, storing the Hessian requires 101610^{16} entries — impossibly large.

Key insight: Gradient descent is a first-order method (uses only gradients). Newton’s method is a second-order method (uses gradients and curvature). The trade-off between convergence speed and per-step cost is the central tension in optimization, and most practical optimizers live somewhere in between.

Geometric Interpretation

Gradient descent steps in the steepest direction with a fixed step size. Newton’s method steps to the minimum of the local quadratic model. On an elongated valley (ill-conditioned Hessian), gradient descent zigzags back and forth, while Newton’s method jumps directly to the bottom in one step.

Quasi-Newton Methods

Quasi-Newton methods approximate the Hessian without computing it explicitly, getting some of Newton’s convergence benefit at lower cost.

BFGS

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm maintains an approximation BtHt\mathbf{B}_t \approx \mathbf{H}_t and updates it using gradient information from successive steps. It achieves superlinear convergence — faster than gradient descent, slower than Newton.

The update uses only gradient evaluations (no second derivatives):

Bt+1=Bt+yyTyTsBtssTBtsTBts\mathbf{B}_{t+1} = \mathbf{B}_t + \frac{\mathbf{y}\mathbf{y}^T}{\mathbf{y}^T\mathbf{s}} - \frac{\mathbf{B}_t \mathbf{s} \mathbf{s}^T \mathbf{B}_t}{\mathbf{s}^T \mathbf{B}_t \mathbf{s}}

where s=θt+1θt\mathbf{s} = \boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t and y=ft+1ft\mathbf{y} = \nabla f_{t+1} - \nabla f_t.

L-BFGS

Limited-memory BFGS (L-BFGS) stores only the last mm pairs (si,yi)(\mathbf{s}_i, \mathbf{y}_i) instead of the full n×nn \times n matrix. This reduces memory from O(n2)O(n^2) to O(mn)O(mn), making it practical for moderately large problems. L-BFGS is the default optimizer in many classical ML libraries (e.g., scikit-learn’s logistic regression).

Common Taylor Approximations for ML

These first-order approximations are used constantly in derivations and analyses:

FunctionApproximation (near x=0x = 0)Where it appears
exe^x1+x1 + xLearning rate analysis, Poisson
ln(1+x)\ln(1 + x)xxLog-loss simplification
(1+x)n(1 + x)^n1+nx1 + nxBinomial approximation
σ(x)\sigma(x)12+x4\frac{1}{2} + \frac{x}{4}Sigmoid linearization
tanh(x)\tanh(x)xxTanh near origin
1+x\sqrt{1 + x}1+x21 + \frac{x}{2}Norm approximations

Worked Example: Predicting Loss After a Gradient Step

Suppose our loss function at the current parameters θt\boldsymbol{\theta}_t has:

  • Value: L(θt)=2.5\mathcal{L}(\boldsymbol{\theta}_t) = 2.5
  • Gradient norm: L=0.8\|\nabla \mathcal{L}\| = 0.8
  • Maximum Hessian eigenvalue: λmax=10\lambda_{\max} = 10

With learning rate α=0.05\alpha = 0.05, the gradient descent update is θt+1=θt0.05L\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - 0.05 \nabla \mathcal{L}.

Using the second-order Taylor approximation:

L(θt+1)L(θt)+LT(αL)+12(αL)TH(αL)=L(θt)αL2+α22LTHL2.50.05(0.64)+0.00252(10)(0.64)=2.50.032+0.008=2.476\begin{aligned} \mathcal{L}(\boldsymbol{\theta}_{t+1}) &\approx \mathcal{L}(\boldsymbol{\theta}_t) + \nabla \mathcal{L}^T(-\alpha \nabla \mathcal{L}) + \frac{1}{2}(-\alpha \nabla \mathcal{L})^T \mathbf{H} (-\alpha \nabla \mathcal{L}) \\[6pt] &= \mathcal{L}(\boldsymbol{\theta}_t) - \alpha \|\nabla \mathcal{L}\|^2 + \frac{\alpha^2}{2} \nabla \mathcal{L}^T \mathbf{H} \nabla \mathcal{L} \\[6pt] &\leq 2.5 - 0.05(0.64) + \frac{0.0025}{2}(10)(0.64) \\[6pt] &= 2.5 - 0.032 + 0.008 \\[6pt] &= 2.476 \end{aligned}

The loss decreases by about 0.024. The first term (0.032-0.032) is the decrease from the gradient step. The second term (+0.008+0.008) is the “overshoot” from the curvature. For convergence, we need the decrease to dominate the overshoot, which requires α<2/λmax=0.2\alpha < 2/\lambda_{\max} = 0.2.

Why This Matters for ML

Taylor approximations are the theoretical lens through which all optimization methods are understood:

  • Gradient descent minimizes the first-order Taylor approximation within a trust region controlled by the learning rate
  • Newton’s method minimizes the second-order Taylor approximation, achieving faster convergence at higher computational cost
  • Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian cheaply, balancing speed and cost
  • Learning rate bounds (α<2/L\alpha < 2/L for LL-smooth functions) come directly from ensuring the Taylor approximation is valid
  • Adaptive methods like Adam can be interpreted as diagonal approximations to Newton’s method
  • The quality of the local approximation explains why gradient descent converges slowly on ill-conditioned problems

Summary

  • The Taylor series approximates a function as a polynomial using its derivatives at a point
  • Linear approximation (first-order Taylor) = tangent line = what gradient descent uses
  • Quadratic approximation (second-order Taylor) captures curvature via the Hessian = what Newton’s method uses
  • Newton’s method converges quadratically but costs O(n3)O(n^3) per step — too expensive for deep learning
  • Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian cheaply using gradient history
  • The learning rate must be small enough that the linear approximation remains valid: α<2/λmax(H)\alpha < 2/\lambda_{\max}(\mathbf{H})
  • These ideas directly motivate gradient descent and its variants

References

  • Stewart, J. (2015). Calculus: Early Transcendentals (8th ed.). Cengage Learning. Chapter 11.
  • Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 2-3, 6-7.
  • Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 9. stanford.edu
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 4. deeplearningbook.org

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay