Stochastic Gradient Descent: Trading Precision for Speed

Learn SGD, mini-batch methods, momentum, Nesterov acceleration, and learning rate schedules — the practical optimizers that train modern ML models.

Calculus & Optimization March 7, 2026 9 min read

The Problem with Batch Gradient Descent

Batch gradient descent computes the gradient using every training example. For ImageNet (1.2 million images) or a language model corpus (billions of tokens), a single gradient computation is enormously expensive. Worse, much of the computation is redundant — many samples provide similar gradient information.

We need a way to make progress without processing the entire dataset at every step.

Stochastic Gradient Descent

Stochastic gradient descent (SGD) replaces the full gradient with a gradient computed from a single randomly sampled training example:

θt+1=θtαθ(θt;xi,yi)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}_t; \mathbf{x}_i, y_i)

where (xi,yi)(\mathbf{x}_i, y_i) is drawn uniformly at random from the training set.

The single-sample gradient is an unbiased estimator of the true gradient:

Ei[(θ;xi,yi)]=L(θ)\mathbb{E}_i\left[\nabla \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i)\right] = \nabla \mathcal{L}(\boldsymbol{\theta})

On average, SGD points in the right direction. But any individual step is noisy.

Key insight: SGD trades a precise but expensive gradient for a cheap but noisy estimate. The noise is not a bug — it is a feature. SGD noise helps escape sharp local minima and acts as implicit regularization, often leading to solutions that generalize better.

Noise as a Feature

The stochastic noise in SGD has several beneficial effects:

  • Escaping sharp minima: Sharp minima (high curvature) are unstable under gradient noise — SGD tends to find flatter minima that generalize better
  • Implicit regularization: The noise prevents the model from perfectly fitting individual training examples, acting like a regularizer
  • Exploration: Random fluctuations explore more of the loss landscape than deterministic descent

Mini-Batch Gradient Descent

In practice, we compromise between full-batch and single-sample by using a mini-batch of BB samples:

L1Bj=1B(θ;xj,yj)\nabla \mathcal{L} \approx \frac{1}{B}\sum_{j=1}^{B} \nabla \ell(\boldsymbol{\theta}; \mathbf{x}_j, y_j)

The mini-batch gradient has variance inversely proportional to BB:

Var[1Bj=1Bj]=σ2B\text{Var}\left[\frac{1}{B}\sum_{j=1}^{B} \nabla \ell_j\right] = \frac{\sigma^2}{B}

where σ2\sigma^2 is the per-sample gradient variance. Doubling the batch size halves the variance.

Choosing Batch Size

Batch sizeVarianceGPU utilizationGeneralization
1 (pure SGD)HighPoorGood (lots of noise)
32-128ModerateGoodGood
256-1024LowExcellentUsually good
Full datasetZeroDepends on sizeCan overfit

Common choices are 32, 64, 128, or 256. The sweet spot depends on the hardware (GPU memory), the dataset, and the model.

Key insight: There are diminishing returns to larger batches. Going from B=1B = 1 to B=32B = 32 reduces variance by 32x. Going from B=32B = 32 to B=1024B = 1024 only reduces it by another 32x, while using 32x more computation per step. Smaller batches make more parameter updates per epoch, which often leads to faster wall-clock convergence.

Learning Rate Schedules

With a fixed learning rate, SGD oscillates around the minimum rather than converging to it (the noise prevents settling). Learning rate schedules reduce α\alpha over time to allow convergence.

Constant Learning Rate

The simplest option. Works for getting into the right neighborhood but cannot converge precisely.

Step Decay

Reduce the learning rate by a factor at fixed intervals:

αt=α0γt/s\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t / s \rfloor}

where γ=0.1\gamma = 0.1 and ss is the step interval. This is the classic schedule for training CNNs on ImageNet — reduce by 10x at epochs 30, 60, and 90.

Exponential Decay

αt=α0eλt\alpha_t = \alpha_0 \cdot e^{-\lambda t}

Smooth, continuous reduction. The hyperparameter λ\lambda controls the decay speed.

Cosine Annealing

αt=αmin+12(αmaxαmin)(1+cos(tπT))\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{t \pi}{T}\right)\right)

The learning rate follows a cosine curve from αmax\alpha_{\max} to αmin\alpha_{\min} over TT steps. This spends more time at low learning rates (fine-tuning) than at high rates (exploration).

Warmup

Start with a very small learning rate, linearly ramp up to the target rate over the first few hundred or thousand steps, then apply one of the above schedules.

αt={tTwαmaxt<Twschedule(αmax,tTw)tTw\alpha_t = \begin{cases} \frac{t}{T_w} \alpha_{\max} & t < T_w \\[4pt] \text{schedule}(\alpha_{\max}, t - T_w) & t \geq T_w \end{cases}

Warmup is critical for:

  • Large batch training: Prevents early divergence when the gradient estimate has high variance at initialization
  • Transformers: The Adam optimizer can produce very large updates early in training when the second moment estimates are not yet reliable

Momentum

Vanilla SGD oscillates in directions of high curvature and moves slowly in directions of low curvature. Momentum fixes this by accumulating a velocity vector:

vt+1=βvt+L(θt)θt+1=θtαvt+1\begin{aligned} \mathbf{v}_{t+1} &= \beta \mathbf{v}_t + \nabla \mathcal{L}(\boldsymbol{\theta}_t) \\[6pt] \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \alpha \mathbf{v}_{t+1} \end{aligned}

The parameter β[0,1)\beta \in [0, 1) controls how much past gradients influence the current step. Typical value: β=0.9\beta = 0.9.

Intuition: Imagine a heavy ball rolling down a hilly loss surface. Without momentum, it stops and changes direction at every bump. With momentum, it builds up speed in consistent downhill directions and rolls through small bumps. In narrow valleys, oscillations are dampened while forward progress accelerates.

Why Momentum Helps

Consider a loss surface that is steep in one direction and gentle in another (an elongated valley). Without momentum:

  • Large gradients in the steep direction cause oscillation
  • Small gradients in the gentle direction cause slow progress

With momentum:

  • Oscillating gradients in the steep direction cancel out over time (positive and negative contributions average)
  • Consistent gradients in the gentle direction accumulate, building speed

The effective step size in the consistent direction is amplified by a factor of 1/(1β)1/(1 - \beta). With β=0.9\beta = 0.9, this is a 10x speedup.

Nesterov Accelerated Gradient (NAG)

Nesterov momentum makes a subtle but important change: it computes the gradient at the projected position rather than the current position:

vt+1=βvt+L(θtαβvt)θt+1=θtαvt+1\begin{aligned} \mathbf{v}_{t+1} &= \beta \mathbf{v}_t + \nabla \mathcal{L}(\boldsymbol{\theta}_t - \alpha \beta \mathbf{v}_t) \\[6pt] \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \alpha \mathbf{v}_{t+1} \end{aligned}

The “look-ahead” gradient allows the algorithm to correct its course before overshooting. Nesterov momentum has provably better convergence rates for convex functions: O(1/t2)O(1/t^2) versus O(1/t)O(1/t) for vanilla gradient descent.

Epochs, Iterations, and Shuffling

  • An epoch is one complete pass through the training dataset
  • An iteration is one parameter update (one mini-batch)
  • Iterations per epoch: n/B\lceil n / B \rceil where nn is the dataset size and BB is batch size

Shuffling the data at the start of each epoch is important. Without shuffling, the model sees data in the same order every epoch, which can create systematic biases in the gradient estimates. Random shuffling ensures each mini-batch is a representative sample.

Worked Example: SGD vs Batch GD

Minimizing f(θ)=1ni=1n(θyi)2f(\theta) = \frac{1}{n}\sum_{i=1}^{n}(\theta - y_i)^2 where yiy_i are data points with mean yˉ=3\bar{y} = 3 and some variance:

import numpy as np

np.random.seed(42)
y = np.random.normal(3.0, 1.0, size=1000)

# Batch gradient descent
theta_batch = 0.0
lr = 0.1
for step in range(50):
    grad = 2 * (theta_batch - y.mean())
    theta_batch -= lr * grad
# theta_batch converges smoothly to ~3.0

# SGD with mini-batch
theta_sgd = 0.0
batch_size = 32
for step in range(50):
    idx = np.random.choice(len(y), batch_size)
    grad = 2 * (theta_sgd - y[idx].mean())
    theta_sgd -= lr * grad
# theta_sgd converges noisily to ~3.0

Batch GD takes a smooth path to the optimum. SGD takes a noisy path but each step is 30x cheaper (32/100032/1000 of the computation). Over a fixed compute budget, SGD typically wins.

Variance Reduction (Brief)

For convex problems, advanced variants of SGD can achieve faster convergence by reducing gradient variance:

  • SVRG (Stochastic Variance-Reduced Gradient): Periodically computes the full gradient and uses it to correct mini-batch estimates
  • SAGA: Maintains a table of per-sample gradients, updating one at a time

These methods achieve linear convergence for strongly convex problems, matching full-batch performance with stochastic cost. However, they are less commonly used in deep learning, where the non-convex landscape makes the theoretical advantages less relevant.

Why This Matters for ML

SGD and its variants are how neural networks are actually trained:

  • Mini-batch SGD with momentum is the default training algorithm for CNNs, RNNs, and many other architectures
  • Batch size affects generalization, training speed, and GPU utilization — it is a key hyperparameter
  • Learning rate schedules (cosine annealing, warmup) can significantly improve final model quality
  • Momentum (β=0.9\beta = 0.9) is nearly universal — it accelerates convergence with minimal additional cost
  • The noise in SGD acts as implicit regularization, helping models generalize
  • Next: adaptive learning rate methods take this further by adjusting the learning rate per parameter

Summary

  • SGD uses a random subset of data per step — cheap, noisy, but unbiased
  • Mini-batch SGD balances variance reduction and computational cost (typical B=32B = 32 to 256256)
  • Learning rate schedules (step decay, cosine annealing, warmup) reduce α\alpha over time for convergence
  • Momentum (β=0.9\beta = 0.9) accumulates velocity to dampen oscillations and accelerate through valleys
  • Nesterov momentum improves on standard momentum with a look-ahead gradient
  • SGD noise is beneficial: it escapes sharp minima and acts as implicit regularization
  • Shuffling data each epoch prevents systematic biases in gradient estimates
  • Next: adaptive methods automatically tune per-parameter learning rates

References

  • Robbins, H., & Monro, S. (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400-407.
  • Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT, 177-186.
  • Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the Importance of Initialization and Momentum in Deep Learning. ICML.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8. deeplearningbook.org
  • Smith, S. L., Kindermans, P.-J., Ying, C., & Le, Q. V. (2018). Don’t Decay the Learning Rate, Increase the Batch Size. ICLR.
  • Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR.

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay