- 01 Limits and Continuity: The Foundation of Calculus 02 Derivatives and Differentiation: Measuring Rates of Change 03 Partial Derivatives and Gradients: Calculus in Multiple Dimensions 04 The Chain Rule and Computational Graphs: The Engine Behind Backpropagation 05 Taylor Series and Approximation: Local Models of Complex Functions 06 Gradient Descent: The Workhorse of Machine Learning Optimization 07 Stochastic Gradient Descent: Trading Precision for Speed 08 Adaptive Learning Rate Methods: From AdaGrad to Adam 09 Constrained Optimization: Lagrange Multipliers and KKT Conditions 10 Convexity and Convergence Theory: When Optimization Succeeds 11 Integration and Expectation: The Continuous Side of Probability 12 Calculus of Variations: Optimizing Over Functions 13 Second-Order and Natural Gradient Methods 14 Numerical Stability in Optimization: Making Training Work in Practice 15 Non-Smooth Optimization and Proximal Methods 16 Optimization Landscape of Neural Networks: Why Deep Learning Works 17 Implicit Differentiation and Differentiable Programming 18 Min-Max Optimization: Games, GANs, and Adversarial Training
The Problem with Batch Gradient Descent
Batch gradient descent computes the gradient using every training example. For ImageNet (1.2 million images) or a language model corpus (billions of tokens), a single gradient computation is enormously expensive. Worse, much of the computation is redundant — many samples provide similar gradient information.
We need a way to make progress without processing the entire dataset at every step.
Stochastic Gradient Descent
Stochastic gradient descent (SGD) replaces the full gradient with a gradient computed from a single randomly sampled training example:
where is drawn uniformly at random from the training set.
The single-sample gradient is an unbiased estimator of the true gradient:
On average, SGD points in the right direction. But any individual step is noisy.
Key insight: SGD trades a precise but expensive gradient for a cheap but noisy estimate. The noise is not a bug — it is a feature. SGD noise helps escape sharp local minima and acts as implicit regularization, often leading to solutions that generalize better.
Noise as a Feature
The stochastic noise in SGD has several beneficial effects:
- Escaping sharp minima: Sharp minima (high curvature) are unstable under gradient noise — SGD tends to find flatter minima that generalize better
- Implicit regularization: The noise prevents the model from perfectly fitting individual training examples, acting like a regularizer
- Exploration: Random fluctuations explore more of the loss landscape than deterministic descent
Mini-Batch Gradient Descent
In practice, we compromise between full-batch and single-sample by using a mini-batch of samples:
The mini-batch gradient has variance inversely proportional to :
where is the per-sample gradient variance. Doubling the batch size halves the variance.
Choosing Batch Size
| Batch size | Variance | GPU utilization | Generalization |
|---|---|---|---|
| 1 (pure SGD) | High | Poor | Good (lots of noise) |
| 32-128 | Moderate | Good | Good |
| 256-1024 | Low | Excellent | Usually good |
| Full dataset | Zero | Depends on size | Can overfit |
Common choices are 32, 64, 128, or 256. The sweet spot depends on the hardware (GPU memory), the dataset, and the model.
Key insight: There are diminishing returns to larger batches. Going from to reduces variance by 32x. Going from to only reduces it by another 32x, while using 32x more computation per step. Smaller batches make more parameter updates per epoch, which often leads to faster wall-clock convergence.
Learning Rate Schedules
With a fixed learning rate, SGD oscillates around the minimum rather than converging to it (the noise prevents settling). Learning rate schedules reduce over time to allow convergence.
Constant Learning Rate
The simplest option. Works for getting into the right neighborhood but cannot converge precisely.
Step Decay
Reduce the learning rate by a factor at fixed intervals:
where and is the step interval. This is the classic schedule for training CNNs on ImageNet — reduce by 10x at epochs 30, 60, and 90.
Exponential Decay
Smooth, continuous reduction. The hyperparameter controls the decay speed.
Cosine Annealing
The learning rate follows a cosine curve from to over steps. This spends more time at low learning rates (fine-tuning) than at high rates (exploration).
Warmup
Start with a very small learning rate, linearly ramp up to the target rate over the first few hundred or thousand steps, then apply one of the above schedules.
Warmup is critical for:
- Large batch training: Prevents early divergence when the gradient estimate has high variance at initialization
- Transformers: The Adam optimizer can produce very large updates early in training when the second moment estimates are not yet reliable
Momentum
Vanilla SGD oscillates in directions of high curvature and moves slowly in directions of low curvature. Momentum fixes this by accumulating a velocity vector:
The parameter controls how much past gradients influence the current step. Typical value: .
Intuition: Imagine a heavy ball rolling down a hilly loss surface. Without momentum, it stops and changes direction at every bump. With momentum, it builds up speed in consistent downhill directions and rolls through small bumps. In narrow valleys, oscillations are dampened while forward progress accelerates.
Why Momentum Helps
Consider a loss surface that is steep in one direction and gentle in another (an elongated valley). Without momentum:
- Large gradients in the steep direction cause oscillation
- Small gradients in the gentle direction cause slow progress
With momentum:
- Oscillating gradients in the steep direction cancel out over time (positive and negative contributions average)
- Consistent gradients in the gentle direction accumulate, building speed
The effective step size in the consistent direction is amplified by a factor of . With , this is a 10x speedup.
Nesterov Accelerated Gradient (NAG)
Nesterov momentum makes a subtle but important change: it computes the gradient at the projected position rather than the current position:
The “look-ahead” gradient allows the algorithm to correct its course before overshooting. Nesterov momentum has provably better convergence rates for convex functions: versus for vanilla gradient descent.
Epochs, Iterations, and Shuffling
- An epoch is one complete pass through the training dataset
- An iteration is one parameter update (one mini-batch)
- Iterations per epoch: where is the dataset size and is batch size
Shuffling the data at the start of each epoch is important. Without shuffling, the model sees data in the same order every epoch, which can create systematic biases in the gradient estimates. Random shuffling ensures each mini-batch is a representative sample.
Worked Example: SGD vs Batch GD
Minimizing where are data points with mean and some variance:
import numpy as np
np.random.seed(42)
y = np.random.normal(3.0, 1.0, size=1000)
# Batch gradient descent
theta_batch = 0.0
lr = 0.1
for step in range(50):
grad = 2 * (theta_batch - y.mean())
theta_batch -= lr * grad
# theta_batch converges smoothly to ~3.0
# SGD with mini-batch
theta_sgd = 0.0
batch_size = 32
for step in range(50):
idx = np.random.choice(len(y), batch_size)
grad = 2 * (theta_sgd - y[idx].mean())
theta_sgd -= lr * grad
# theta_sgd converges noisily to ~3.0
Batch GD takes a smooth path to the optimum. SGD takes a noisy path but each step is 30x cheaper ( of the computation). Over a fixed compute budget, SGD typically wins.
Variance Reduction (Brief)
For convex problems, advanced variants of SGD can achieve faster convergence by reducing gradient variance:
- SVRG (Stochastic Variance-Reduced Gradient): Periodically computes the full gradient and uses it to correct mini-batch estimates
- SAGA: Maintains a table of per-sample gradients, updating one at a time
These methods achieve linear convergence for strongly convex problems, matching full-batch performance with stochastic cost. However, they are less commonly used in deep learning, where the non-convex landscape makes the theoretical advantages less relevant.
Why This Matters for ML
SGD and its variants are how neural networks are actually trained:
- Mini-batch SGD with momentum is the default training algorithm for CNNs, RNNs, and many other architectures
- Batch size affects generalization, training speed, and GPU utilization — it is a key hyperparameter
- Learning rate schedules (cosine annealing, warmup) can significantly improve final model quality
- Momentum () is nearly universal — it accelerates convergence with minimal additional cost
- The noise in SGD acts as implicit regularization, helping models generalize
- Next: adaptive learning rate methods take this further by adjusting the learning rate per parameter
Summary
- SGD uses a random subset of data per step — cheap, noisy, but unbiased
- Mini-batch SGD balances variance reduction and computational cost (typical to )
- Learning rate schedules (step decay, cosine annealing, warmup) reduce over time for convergence
- Momentum () accumulates velocity to dampen oscillations and accelerate through valleys
- Nesterov momentum improves on standard momentum with a look-ahead gradient
- SGD noise is beneficial: it escapes sharp minima and acts as implicit regularization
- Shuffling data each epoch prevents systematic biases in gradient estimates
- Next: adaptive methods automatically tune per-parameter learning rates
References
- Robbins, H., & Monro, S. (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400-407.
- Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT, 177-186.
- Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the Importance of Initialization and Momentum in Deep Learning. ICML.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8. deeplearningbook.org
- Smith, S. L., Kindermans, P.-J., Ying, C., & Le, Q. V. (2018). Don’t Decay the Learning Rate, Increase the Batch Size. ICLR.
- Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR.