Stochastic Gradient Descent: Trading Precision for Speed

Calculus & Optimization Series 7 / 18

The Problem with Batch Gradient Descent

Batch gradient descent computes the gradient using every training example. For ImageNet (1.2 million images) or a language model corpus (billions of tokens), a single gradient computation is enormously expensive. Worse, much of the computation is redundant — many samples provide similar gradient information.

We need a way to make progress without processing the entire dataset at every step.

Stochastic Gradient Descent

Stochastic gradient descent (SGD) replaces the full gradient with a gradient computed from a single randomly sampled training example:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}_t; \mathbf{x}_i, y_i)

where $(\mathbf{x}_i, y_i)$ is drawn uniformly at random from the training set.

The single-sample gradient is an unbiased estimator of the true gradient:

\mathbb{E}_i\left[\nabla \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i)\right] = \nabla \mathcal{L}(\boldsymbol{\theta})

On average, SGD points in the right direction. But any individual step is noisy.

Key insight: SGD trades a precise but expensive gradient for a cheap but noisy estimate. The noise is not a bug — it is a feature. SGD noise helps escape sharp local minima and acts as implicit regularization, often leading to solutions that generalize better.

Noise as a Feature

The stochastic noise in SGD has several beneficial effects:

Escaping sharp minima: Sharp minima (high curvature) are unstable under gradient noise — SGD tends to find flatter minima that generalize better
Implicit regularization: The noise prevents the model from perfectly fitting individual training examples, acting like a regularizer
Exploration: Random fluctuations explore more of the loss landscape than deterministic descent

Mini-Batch Gradient Descent

In practice, we compromise between full-batch and single-sample by using a mini-batch of $B$ samples:

\nabla \mathcal{L} \approx \frac{1}{B}\sum_{j=1}^{B} \nabla \ell(\boldsymbol{\theta}; \mathbf{x}_j, y_j)

The mini-batch gradient has variance inversely proportional to $B$ :

\text{Var}\left[\frac{1}{B}\sum_{j=1}^{B} \nabla \ell_j\right] = \frac{\sigma^2}{B}

where $\sigma^2$ is the per-sample gradient variance. Doubling the batch size halves the variance.

Choosing Batch Size

Batch size	Variance	GPU utilization	Generalization
1 (pure SGD)	High	Poor	Good (lots of noise)
32-128	Moderate	Good	Good
256-1024	Low	Excellent	Usually good
Full dataset	Zero	Depends on size	Can overfit

Common choices are 32, 64, 128, or 256. The sweet spot depends on the hardware (GPU memory), the dataset, and the model.

Key insight: There are diminishing returns to larger batches. Going from $B = 1$ to $B = 32$ reduces variance by 32x. Going from $B = 32$ to $B = 1024$ only reduces it by another 32x, while using 32x more computation per step. Smaller batches make more parameter updates per epoch, which often leads to faster wall-clock convergence.

Learning Rate Schedules

With a fixed learning rate, SGD oscillates around the minimum rather than converging to it (the noise prevents settling). Learning rate schedules reduce $\alpha$ over time to allow convergence.

Constant Learning Rate

The simplest option. Works for getting into the right neighborhood but cannot converge precisely.

Step Decay

Reduce the learning rate by a factor at fixed intervals:

\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t / s \rfloor}

where $\gamma = 0.1$ and $s$ is the step interval. This is the classic schedule for training CNNs on ImageNet — reduce by 10x at epochs 30, 60, and 90.

Exponential Decay

\alpha_t = \alpha_0 \cdot e^{-\lambda t}

Smooth, continuous reduction. The hyperparameter $\lambda$ controls the decay speed.

Cosine Annealing

\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{t \pi}{T}\right)\right)

The learning rate follows a cosine curve from $\alpha_{\max}$ to $\alpha_{\min}$ over $T$ steps. This spends more time at low learning rates (fine-tuning) than at high rates (exploration).

Warmup

Start with a very small learning rate, linearly ramp up to the target rate over the first few hundred or thousand steps, then apply one of the above schedules.

\alpha_t = \begin{cases} \frac{t}{T_w} \alpha_{\max} & t < T_w \\[4pt] \text{schedule}(\alpha_{\max}, t - T_w) & t \geq T_w \end{cases}

Warmup is critical for:

Large batch training: Prevents early divergence when the gradient estimate has high variance at initialization
Transformers: The Adam optimizer can produce very large updates early in training when the second moment estimates are not yet reliable

Momentum

Vanilla SGD oscillates in directions of high curvature and moves slowly in directions of low curvature. Momentum fixes this by accumulating a velocity vector:

\begin{aligned} \mathbf{v}_{t+1} &= \beta \mathbf{v}_t + \nabla \mathcal{L}(\boldsymbol{\theta}_t) \\[6pt] \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \alpha \mathbf{v}_{t+1} \end{aligned}

The parameter $\beta \in [0, 1)$ controls how much past gradients influence the current step. Typical value: $\beta = 0.9$ .

Intuition: Imagine a heavy ball rolling down a hilly loss surface. Without momentum, it stops and changes direction at every bump. With momentum, it builds up speed in consistent downhill directions and rolls through small bumps. In narrow valleys, oscillations are dampened while forward progress accelerates.

Why Momentum Helps

Consider a loss surface that is steep in one direction and gentle in another (an elongated valley). Without momentum:

Large gradients in the steep direction cause oscillation
Small gradients in the gentle direction cause slow progress

With momentum:

Oscillating gradients in the steep direction cancel out over time (positive and negative contributions average)
Consistent gradients in the gentle direction accumulate, building speed

The effective step size in the consistent direction is amplified by a factor of $1/(1 - \beta)$ . With $\beta = 0.9$ , this is a 10x speedup.

Nesterov Accelerated Gradient (NAG)

Nesterov momentum makes a subtle but important change: it computes the gradient at the projected position rather than the current position:

\begin{aligned} \mathbf{v}_{t+1} &= \beta \mathbf{v}_t + \nabla \mathcal{L}(\boldsymbol{\theta}_t - \alpha \beta \mathbf{v}_t) \\[6pt] \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \alpha \mathbf{v}_{t+1} \end{aligned}

The “look-ahead” gradient allows the algorithm to correct its course before overshooting. Nesterov momentum has provably better convergence rates for convex functions: $O(1/t^2)$ versus $O(1/t)$ for vanilla gradient descent.

Epochs, Iterations, and Shuffling

An epoch is one complete pass through the training dataset
An iteration is one parameter update (one mini-batch)
Iterations per epoch: $\lceil n / B \rceil$ where $n$ is the dataset size and $B$ is batch size

Shuffling the data at the start of each epoch is important. Without shuffling, the model sees data in the same order every epoch, which can create systematic biases in the gradient estimates. Random shuffling ensures each mini-batch is a representative sample.

Worked Example: SGD vs Batch GD

Minimizing $f(\theta) = \frac{1}{n}\sum_{i=1}^{n}(\theta - y_i)^2$ where $y_i$ are data points with mean $\bar{y} = 3$ and some variance:

import numpy as np

np.random.seed(42)
y = np.random.normal(3.0, 1.0, size=1000)

# Batch gradient descent
theta_batch = 0.0
lr = 0.1
for step in range(50):
    grad = 2 * (theta_batch - y.mean())
    theta_batch -= lr * grad
# theta_batch converges smoothly to ~3.0

# SGD with mini-batch
theta_sgd = 0.0
batch_size = 32
for step in range(50):
    idx = np.random.choice(len(y), batch_size)
    grad = 2 * (theta_sgd - y[idx].mean())
    theta_sgd -= lr * grad
# theta_sgd converges noisily to ~3.0

Batch GD takes a smooth path to the optimum. SGD takes a noisy path but each step is 30x cheaper ( $32/1000$ of the computation). Over a fixed compute budget, SGD typically wins.

Variance Reduction (Brief)

For convex problems, advanced variants of SGD can achieve faster convergence by reducing gradient variance:

SVRG (Stochastic Variance-Reduced Gradient): Periodically computes the full gradient and uses it to correct mini-batch estimates
SAGA: Maintains a table of per-sample gradients, updating one at a time

These methods achieve linear convergence for strongly convex problems, matching full-batch performance with stochastic cost. However, they are less commonly used in deep learning, where the non-convex landscape makes the theoretical advantages less relevant.

Why This Matters for ML

SGD and its variants are how neural networks are actually trained:

Mini-batch SGD with momentum is the default training algorithm for CNNs, RNNs, and many other architectures
Batch size affects generalization, training speed, and GPU utilization — it is a key hyperparameter
Learning rate schedules (cosine annealing, warmup) can significantly improve final model quality
Momentum ( $\beta = 0.9$ ) is nearly universal — it accelerates convergence with minimal additional cost
The noise in SGD acts as implicit regularization, helping models generalize
Next: adaptive learning rate methods take this further by adjusting the learning rate per parameter

Summary

SGD uses a random subset of data per step — cheap, noisy, but unbiased
Mini-batch SGD balances variance reduction and computational cost (typical $B = 32$ to $256$ )
Learning rate schedules (step decay, cosine annealing, warmup) reduce $\alpha$ over time for convergence
Momentum ( $\beta = 0.9$ ) accumulates velocity to dampen oscillations and accelerate through valleys
Nesterov momentum improves on standard momentum with a look-ahead gradient
SGD noise is beneficial: it escapes sharp minima and acts as implicit regularization
Shuffling data each epoch prevents systematic biases in gradient estimates
Next: adaptive methods automatically tune per-parameter learning rates

References

Robbins, H., & Monro, S. (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400-407.
Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT, 177-186.
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the Importance of Initialization and Momentum in Deep Learning. ICML.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8. deeplearningbook.org
Smith, S. L., Kindermans, P.-J., Ying, C., & Le, Q. V. (2018). Don’t Decay the Learning Rate, Increase the Batch Size. ICLR.
Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR.