Gradient Descent: The Workhorse of Machine Learning Optimization

Master gradient descent from first principles — the algorithm, learning rate selection, convergence analysis, and local minima in loss landscapes.

Calculus & Optimization March 7, 2026 8 min read

The Optimization Problem

Training a machine learning model means finding parameters θ\boldsymbol{\theta} that minimize a loss function:

θ=argminθL(θ)\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})

For linear regression, we can solve this analytically (normal equations). For neural networks, the loss is a complicated nonlinear function of millions of parameters — no closed-form solution exists. We need an iterative algorithm.

Gradient descent is that algorithm. It uses the first-order Taylor approximation to make a small improvement at each step, and repeats until convergence.

The Gradient Descent Algorithm

The update rule is remarkably simple:

θt+1=θtαθL(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)

where α>0\alpha > 0 is the learning rate (step size) and θL\nabla_{\boldsymbol{\theta}} \mathcal{L} is the gradient of the loss.

Intuition: Think of the loss surface as a hilly landscape and the parameters as your position. The gradient points uphill. Taking a step in the L-\nabla \mathcal{L} direction is like walking downhill along the steepest slope. The learning rate controls how big a step you take.

The full algorithm:

import numpy as np

def gradient_descent(grad_fn, theta_init, lr=0.01, n_steps=1000):
    theta = theta_init.copy()
    for t in range(n_steps):
        grad = grad_fn(theta)
        theta = theta - lr * grad
    return theta

At each iteration:

  1. Compute the gradient L(θt)\nabla \mathcal{L}(\boldsymbol{\theta}_t)
  2. Take a step in the negative gradient direction
  3. Repeat until the gradient is approximately zero

The Learning Rate

The learning rate α\alpha is the most important hyperparameter in gradient descent. It controls the trade-off between speed and stability.

Too Large: Divergence

If α\alpha is too large, the algorithm overshoots the minimum and can diverge:

Step 0: θ=5.0,f(θ)=25.0Step 1: θ=51.5(10)=10,f(θ)=100Step 2: θ=101.5(20)=20,f(θ)=400\begin{aligned} &\text{Step 0: } \theta = 5.0, \quad f(\theta) = 25.0 \\[4pt] &\text{Step 1: } \theta = 5 - 1.5(10) = -10, \quad f(\theta) = 100 \\[4pt] &\text{Step 2: } \theta = -10 - 1.5(-20) = 20, \quad f(\theta) = 400 \end{aligned}

The values explode because the step overshoots further each time.

Too Small: Slow Convergence

If α\alpha is very small, the algorithm converges but painfully slowly. For f(θ)=θ2f(\theta) = \theta^2 with α=0.01\alpha = 0.01:

θt+1=θt0.012θt=0.98θt\theta_{t+1} = \theta_t - 0.01 \cdot 2\theta_t = 0.98\theta_t

After tt steps, θt=0.98tθ0\theta_t = 0.98^t \theta_0. Reaching θt<0.01θ0\theta_t < 0.01\theta_0 requires t>228t > 228 steps.

Just Right

For f(θ)=θ2f(\theta) = \theta^2, the optimal learning rate is α=0.5\alpha = 0.5:

θt+1=θt0.52θt=0\theta_{t+1} = \theta_t - 0.5 \cdot 2\theta_t = 0

One step to the exact minimum. In general, the optimal rate depends on the curvature (second derivative), which motivates second-order methods and adaptive optimizers.

Key insight: The maximum stable learning rate is α<2/L\alpha < 2/L, where LL is the Lipschitz constant of the gradient (the maximum curvature). Larger curvature demands smaller steps. This is why loss functions with high curvature are harder to optimize.

Batch Gradient Descent

In ML, the loss is typically an average over the training set:

L(θ)=1ni=1n(θ;xi,yi)\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i)

Batch gradient descent computes the full gradient using all nn samples:

L(θ)=1ni=1nθ(θ;xi,yi)\nabla \mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i)

This gives the exact gradient direction at each step.

ProsCons
Deterministic — same path every runExpensive: O(n)O(n) per update
Smooth convergenceMemory-intensive for large nn
Guaranteed descent per stepCannot exploit data redundancy

For datasets with millions of samples, computing the full gradient at every step is wasteful — many samples provide redundant information. This motivates stochastic gradient descent.

Convergence Analysis

Convex Functions

For convex functions with LL-Lipschitz gradients (f(x)f(y)Lxy\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x} - \mathbf{y}\|) and learning rate α=1/L\alpha = 1/L:

f(θt)f(θ)Lθ0θ22tf(\boldsymbol{\theta}_t) - f(\boldsymbol{\theta}^*) \leq \frac{L\|\boldsymbol{\theta}_0 - \boldsymbol{\theta}^*\|^2}{2t}

This means gradient descent converges at rate O(1/t)O(1/t) — to halve the error, you need twice as many steps.

Strongly Convex Functions

If ff is also strongly convex with parameter μ\mu (minimum curvature), convergence is exponentially faster:

f(θt)f(θ)(1μL)t[f(θ0)f(θ)]f(\boldsymbol{\theta}_t) - f(\boldsymbol{\theta}^*) \leq \left(1 - \frac{\mu}{L}\right)^t [f(\boldsymbol{\theta}_0) - f(\boldsymbol{\theta}^*)]

The ratio κ=L/μ\kappa = L/\mu is the condition number. A large κ\kappa means the loss surface is elongated (some directions have high curvature, others low), and gradient descent zigzags slowly. A well-conditioned problem (κ\kappa close to 1) converges quickly.

Key insight: The condition number κ=L/μ\kappa = L/\mu determines how many iterations gradient descent needs. Ill-conditioned problems (large κ\kappa) cause zigzagging. This is the fundamental motivation for preconditioning and adaptive learning rates.

Sufficient Decrease (Armijo Condition)

In practice, we often want a formal guarantee that each step reduces the loss. The Armijo condition requires:

L(θαL)L(θ)cαL2\mathcal{L}(\boldsymbol{\theta} - \alpha \nabla \mathcal{L}) \leq \mathcal{L}(\boldsymbol{\theta}) - c \cdot \alpha \|\nabla \mathcal{L}\|^2

where c(0,1)c \in (0, 1) is a small constant (typically 10410^{-4}). Line search methods find the largest α\alpha satisfying this condition.

Local Minima and Saddle Points

For non-convex functions (like neural network losses), the landscape is complex.

Critical Points

A critical point is where L=0\nabla \mathcal{L} = \mathbf{0}. Critical points can be:

  • Local minimum: loss increases in all directions (Hessian positive definite)
  • Local maximum: loss decreases in all directions (Hessian negative definite)
  • Saddle point: loss increases in some directions and decreases in others (Hessian indefinite)

Saddle Points Dominate in High Dimensions

For a random function in nn dimensions, a critical point is a local minimum only if all nn Hessian eigenvalues are positive. If each eigenvalue is equally likely to be positive or negative, the probability is (1/2)n(1/2)^n — vanishingly small for large nn.

Key insight: In the high-dimensional parameter spaces of neural networks, most critical points are saddle points, not local minima. Gradient descent naturally escapes saddle points because the gradient is nonzero in the directions of negative curvature. The real challenge is navigating flat regions where the gradient is small but nonzero.

Local Minima are Often Good Enough

Empirical evidence suggests that local minima in deep networks tend to have similar loss values to the global minimum. The concern is not getting stuck in bad local minima but rather the speed of convergence through flat regions and narrow valleys.

Worked Example: Linear Regression

Consider fitting y^=w1x+w2\hat{y} = w_1 x + w_2 with MSE loss over 4 data points: (1,2),(2,4),(3,5),(4,4)(1, 2), (2, 4), (3, 5), (4, 4).

L(w1,w2)=14i=14(yiw1xiw2)2\mathcal{L}(w_1, w_2) = \frac{1}{4}\sum_{i=1}^{4}(y_i - w_1 x_i - w_2)^2

The gradients:

Lw1=12i=14xi(yiw1xiw2)Lw2=12i=14(yiw1xiw2)\begin{aligned} \frac{\partial \mathcal{L}}{\partial w_1} &= -\frac{1}{2}\sum_{i=1}^{4} x_i(y_i - w_1 x_i - w_2) \\[6pt] \frac{\partial \mathcal{L}}{\partial w_2} &= -\frac{1}{2}\sum_{i=1}^{4} (y_i - w_1 x_i - w_2) \end{aligned}
import numpy as np

X = np.array([1, 2, 3, 4])
y = np.array([2, 4, 5, 4])

w1, w2 = 0.0, 0.0  # initial parameters
lr = 0.01

for step in range(500):
    pred = w1 * X + w2
    residual = y - pred
    grad_w1 = -2 * np.mean(X * residual)
    grad_w2 = -2 * np.mean(residual)
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

print(f"w1 = {w1:.4f}, w2 = {w2:.4f}")
# w1 ≈ 0.8, w2 ≈ 1.5

The parameters converge to the least-squares solution. With a larger learning rate, convergence is faster but risks instability. With a smaller rate, it is slower but more stable.

Gradient Descent Variants Preview

Batch gradient descent is the foundation, but practical training uses several improvements:

  • Stochastic Gradient Descent (SGD): Uses a random subset of data per step — much faster for large datasets
  • Momentum: Accumulates velocity to smooth out oscillations and accelerate through flat regions
  • Adaptive methods: Adam, RMSProp, AdaGrad — automatically adjust learning rates per parameter

These build on the foundation established here, addressing the practical limitations of vanilla gradient descent.

Why This Matters for ML

Gradient descent is the algorithm that trains virtually every neural network and many classical ML models:

  • It is the training loop: forward pass \to compute loss \to backward pass \to update parameters
  • Learning rate is the single most impactful hyperparameter — it determines whether training converges, how fast, and to what quality solution
  • Convergence rate depends on the condition number κ\kappa of the loss surface — ill-conditioned problems need more iterations or smarter optimizers
  • Non-convexity in deep learning means multiple local minima and saddle points, but empirically gradient descent finds good solutions
  • Every variant (SGD, Adam, etc.) is a modification of this core algorithm

Summary

  • Gradient descent iterates θt+1=θtαL\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \nabla \mathcal{L} to minimize the loss
  • The learning rate α\alpha must satisfy α<2/L\alpha < 2/L where LL is the gradient Lipschitz constant
  • Batch gradient descent uses the full dataset — exact but expensive for large nn
  • Convergence is O(1/t)O(1/t) for convex functions and O((1μ/L)t)O((1 - \mu/L)^t) for strongly convex
  • The condition number κ=L/μ\kappa = L/\mu determines convergence speed — ill-conditioned problems zigzag
  • In high dimensions, saddle points vastly outnumber local minima
  • Next: stochastic gradient descent trades exact gradients for speed

References

  • Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 2-3.
  • Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 9. stanford.edu
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8. deeplearningbook.org
  • Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2), 223-311.
  • Ruder, S. (2016). An Overview of Gradient Descent Optimization Algorithms. arXiv:1609.04747

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay