Gradient Descent: The Workhorse of Machine Learning Optimization

Calculus & Optimization Series 6 / 18

The Optimization Problem

Training a machine learning model means finding parameters $\boldsymbol{\theta}$ that minimize a loss function:

\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})

For linear regression, we can solve this analytically (normal equations). For neural networks, the loss is a complicated nonlinear function of millions of parameters — no closed-form solution exists. We need an iterative algorithm.

Gradient descent is that algorithm. It uses the first-order Taylor approximation to make a small improvement at each step, and repeats until convergence.

The Gradient Descent Algorithm

The update rule is remarkably simple:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)

where $\alpha > 0$ is the learning rate (step size) and $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ is the gradient of the loss.

Intuition: Think of the loss surface as a hilly landscape and the parameters as your position. The gradient points uphill. Taking a step in the $-\nabla \mathcal{L}$ direction is like walking downhill along the steepest slope. The learning rate controls how big a step you take.

The full algorithm:

import numpy as np

def gradient_descent(grad_fn, theta_init, lr=0.01, n_steps=1000):
    theta = theta_init.copy()
    for t in range(n_steps):
        grad = grad_fn(theta)
        theta = theta - lr * grad
    return theta

At each iteration:

Compute the gradient $\nabla \mathcal{L}(\boldsymbol{\theta}_t)$
Take a step in the negative gradient direction
Repeat until the gradient is approximately zero

The Learning Rate

The learning rate $\alpha$ is the most important hyperparameter in gradient descent. It controls the trade-off between speed and stability.

Too Large: Divergence

If $\alpha$ is too large, the algorithm overshoots the minimum and can diverge:

\begin{aligned} &\text{Step 0: } \theta = 5.0, \quad f(\theta) = 25.0 \\[4pt] &\text{Step 1: } \theta = 5 - 1.5(10) = -10, \quad f(\theta) = 100 \\[4pt] &\text{Step 2: } \theta = -10 - 1.5(-20) = 20, \quad f(\theta) = 400 \end{aligned}

The values explode because the step overshoots further each time.

Too Small: Slow Convergence

If $\alpha$ is very small, the algorithm converges but painfully slowly. For $f(\theta) = \theta^2$ with $\alpha = 0.01$ :

\theta_{t+1} = \theta_t - 0.01 \cdot 2\theta_t = 0.98\theta_t

After $t$ steps, $\theta_t = 0.98^t \theta_0$ . Reaching $\theta_t < 0.01\theta_0$ requires $t > 228$ steps.

Just Right

For $f(\theta) = \theta^2$ , the optimal learning rate is $\alpha = 0.5$ :

\theta_{t+1} = \theta_t - 0.5 \cdot 2\theta_t = 0

One step to the exact minimum. In general, the optimal rate depends on the curvature (second derivative), which motivates second-order methods and adaptive optimizers.

Key insight: The maximum stable learning rate is $\alpha < 2/L$ , where $L$ is the Lipschitz constant of the gradient (the maximum curvature). Larger curvature demands smaller steps. This is why loss functions with high curvature are harder to optimize.

Batch Gradient Descent

In ML, the loss is typically an average over the training set:

\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i)

Batch gradient descent computes the full gradient using all $n$ samples:

\nabla \mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i)

This gives the exact gradient direction at each step.

Pros	Cons
Deterministic — same path every run	Expensive: $O(n)$ per update
Smooth convergence	Memory-intensive for large $n$
Guaranteed descent per step	Cannot exploit data redundancy

For datasets with millions of samples, computing the full gradient at every step is wasteful — many samples provide redundant information. This motivates stochastic gradient descent.

Convergence Analysis

Convex Functions

For convex functions with $L$ -Lipschitz gradients ( $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x} - \mathbf{y}\|$ ) and learning rate $\alpha = 1/L$ :

f(\boldsymbol{\theta}_t) - f(\boldsymbol{\theta}^*) \leq \frac{L\|\boldsymbol{\theta}_0 - \boldsymbol{\theta}^*\|^2}{2t}

This means gradient descent converges at rate $O(1/t)$ — to halve the error, you need twice as many steps.

Strongly Convex Functions

If $f$ is also strongly convex with parameter $\mu$ (minimum curvature), convergence is exponentially faster:

f(\boldsymbol{\theta}_t) - f(\boldsymbol{\theta}^*) \leq \left(1 - \frac{\mu}{L}\right)^t [f(\boldsymbol{\theta}_0) - f(\boldsymbol{\theta}^*)]

The ratio $\kappa = L/\mu$ is the condition number. A large $\kappa$ means the loss surface is elongated (some directions have high curvature, others low), and gradient descent zigzags slowly. A well-conditioned problem ( $\kappa$ close to 1) converges quickly.

Key insight: The condition number $\kappa = L/\mu$ determines how many iterations gradient descent needs. Ill-conditioned problems (large $\kappa$ ) cause zigzagging. This is the fundamental motivation for preconditioning and adaptive learning rates.

Sufficient Decrease (Armijo Condition)

In practice, we often want a formal guarantee that each step reduces the loss. The Armijo condition requires:

\mathcal{L}(\boldsymbol{\theta} - \alpha \nabla \mathcal{L}) \leq \mathcal{L}(\boldsymbol{\theta}) - c \cdot \alpha \|\nabla \mathcal{L}\|^2

where $c \in (0, 1)$ is a small constant (typically $10^{-4}$ ). Line search methods find the largest $\alpha$ satisfying this condition.

Local Minima and Saddle Points

For non-convex functions (like neural network losses), the landscape is complex.

Critical Points

A critical point is where $\nabla \mathcal{L} = \mathbf{0}$ . Critical points can be:

Local minimum: loss increases in all directions (Hessian positive definite)
Local maximum: loss decreases in all directions (Hessian negative definite)
Saddle point: loss increases in some directions and decreases in others (Hessian indefinite)

Saddle Points Dominate in High Dimensions

For a random function in $n$ dimensions, a critical point is a local minimum only if all $n$ Hessian eigenvalues are positive. If each eigenvalue is equally likely to be positive or negative, the probability is $(1/2)^n$ — vanishingly small for large $n$ .

Key insight: In the high-dimensional parameter spaces of neural networks, most critical points are saddle points, not local minima. Gradient descent naturally escapes saddle points because the gradient is nonzero in the directions of negative curvature. The real challenge is navigating flat regions where the gradient is small but nonzero.

Local Minima are Often Good Enough

Empirical evidence suggests that local minima in deep networks tend to have similar loss values to the global minimum. The concern is not getting stuck in bad local minima but rather the speed of convergence through flat regions and narrow valleys.

Worked Example: Linear Regression

Consider fitting $\hat{y} = w_1 x + w_2$ with MSE loss over 4 data points: $(1, 2), (2, 4), (3, 5), (4, 4)$ .

\mathcal{L}(w_1, w_2) = \frac{1}{4}\sum_{i=1}^{4}(y_i - w_1 x_i - w_2)^2

The gradients:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial w_1} &= -\frac{1}{2}\sum_{i=1}^{4} x_i(y_i - w_1 x_i - w_2) \\[6pt] \frac{\partial \mathcal{L}}{\partial w_2} &= -\frac{1}{2}\sum_{i=1}^{4} (y_i - w_1 x_i - w_2) \end{aligned}

import numpy as np

X = np.array([1, 2, 3, 4])
y = np.array([2, 4, 5, 4])

w1, w2 = 0.0, 0.0  # initial parameters
lr = 0.01

for step in range(500):
    pred = w1 * X + w2
    residual = y - pred
    grad_w1 = -2 * np.mean(X * residual)
    grad_w2 = -2 * np.mean(residual)
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

print(f"w1 = {w1:.4f}, w2 = {w2:.4f}")
# w1 ≈ 0.8, w2 ≈ 1.5

The parameters converge to the least-squares solution. With a larger learning rate, convergence is faster but risks instability. With a smaller rate, it is slower but more stable.

Gradient Descent Variants Preview

Batch gradient descent is the foundation, but practical training uses several improvements:

Stochastic Gradient Descent (SGD): Uses a random subset of data per step — much faster for large datasets
Momentum: Accumulates velocity to smooth out oscillations and accelerate through flat regions
Adaptive methods: Adam, RMSProp, AdaGrad — automatically adjust learning rates per parameter

These build on the foundation established here, addressing the practical limitations of vanilla gradient descent.

Why This Matters for ML

Gradient descent is the algorithm that trains virtually every neural network and many classical ML models:

It is the training loop: forward pass $\to$ compute loss $\to$ backward pass $\to$ update parameters
Learning rate is the single most impactful hyperparameter — it determines whether training converges, how fast, and to what quality solution
Convergence rate depends on the condition number $\kappa$ of the loss surface — ill-conditioned problems need more iterations or smarter optimizers
Non-convexity in deep learning means multiple local minima and saddle points, but empirically gradient descent finds good solutions
Every variant (SGD, Adam, etc.) is a modification of this core algorithm

Summary

Gradient descent iterates $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \nabla \mathcal{L}$ to minimize the loss
The learning rate $\alpha$ must satisfy $\alpha < 2/L$ where $L$ is the gradient Lipschitz constant
Batch gradient descent uses the full dataset — exact but expensive for large $n$
Convergence is $O(1/t)$ for convex functions and $O((1 - \mu/L)^t)$ for strongly convex
The condition number $\kappa = L/\mu$ determines convergence speed — ill-conditioned problems zigzag
In high dimensions, saddle points vastly outnumber local minima
Next: stochastic gradient descent trades exact gradients for speed

References

Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 2-3.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 9. stanford.edu
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8. deeplearningbook.org
Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2), 223-311.
Ruder, S. (2016). An Overview of Gradient Descent Optimization Algorithms. arXiv:1609.04747