The Problem with a Single Learning Rate
SGD with momentum uses the same learning rate for every parameter. But different parameters face very different optimization landscapes:
- Embedding layers in NLP have sparse gradients — most entries are zero for any given mini-batch, but when a gradient arrives, it can be large
- Batch normalization parameters see gradients on a very different scale than weight matrices
- Deep vs shallow layers may need different step sizes due to the gradient magnitude varying across depth
A single learning rate $\eta$ cannot serve all of these needs well. Adaptive methods maintain a separate effective learning rate for each parameter, automatically adjusting based on gradient history.
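To make the limitation concrete, here is a minimal sketch on a hypothetical two-parameter quadratic (not from the original text) whose curvatures differ by a factor of 100. The names `curvature` and `quad_grad` are illustrative assumptions. The steep coordinate caps the usable learning rate, so the flat coordinate makes slow progress.

```python
import numpy as np

# Hypothetical toy loss: f(x, y) = 0.5 * (100 * x**2 + y**2)
curvature = np.array([100.0, 1.0])

def quad_grad(theta):
    """Gradient of the toy quadratic: each coordinate is scaled by its curvature."""
    return curvature * theta

# One shared learning rate, chosen just under the stability limit 2 / 100
# that the steep coordinate imposes.
eta = 0.019
theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = theta - eta * quad_grad(theta)

# The steep coordinate is essentially solved; the flat one is still far from 0.
print(theta)
```

A per-parameter step size would let the flat coordinate take much larger steps without destabilizing the steep one.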
AdaGrad
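AdaGrad (Adaptive Gradient) accumulates the squared gradients for each parameter and scales the learning rate inversely:
$$G_t = G_{t-1} + g_t \odot g_t$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t$$
where $g_t$ is the gradient at step $t$, $\eta$ is the base learning rate, $\odot$ is element-wise multiplication, and $\epsilon$ prevents division by zero. The square root and the division are also applied element-wise.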
How it adapts: Parameters that have received large gradients in the past get smaller learning rates. Parameters that have received small (or sparse) gradients maintain larger learning rates.
Key insight: AdaGrad is excellent for sparse data (NLP, recommender systems). Rare features get large learning rates because their accumulated gradient sum is small. Frequent features get small learning rates. This automatic scaling eliminates the need to manually tune learning rates for sparse vs dense features.
The problem: The accumulated sum only grows, so the learning rate monotonically decreases. Eventually, the effective learning rate becomes so small that the model stops learning — even if it has not converged.
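A minimal NumPy sketch of the AdaGrad update may help here. The function name `adagrad_step`, the base rate `eta=0.1`, and the synthetic gradients are assumptions for illustration. One coordinate receives a gradient on every step and the other on only 1% of steps.

```python
import numpy as np

def adagrad_step(theta, G, g, eta=0.1, eps=1e-8):
    """One AdaGrad update; returns the new parameters and accumulator."""
    G = G + g * g                                  # accumulate squared gradients
    theta = theta - eta / (np.sqrt(G) + eps) * g   # per-parameter scaled step
    return theta, G

theta = np.zeros(2)
G = np.zeros(2)
for t in range(1000):
    # Coordinate 0: dense gradient. Coordinate 1: sparse, nonzero every 100th step.
    g = np.array([1.0, 1.0 if t % 100 == 0 else 0.0])
    theta, G = adagrad_step(theta, G, g)

effective_lr = 0.1 / (np.sqrt(G) + 1e-8)
print("accumulators:", G)
print("effective learning rates:", effective_lr)
```

The sparse coordinate keeps a larger effective learning rate, while the dense coordinate's rate keeps shrinking as its accumulator grows. This is the monotone decay described above.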
RMSProp
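RMSProp (Root Mean Square Propagation) fixes AdaGrad’s diminishing learning rate by using an exponential moving average instead of a cumulative sum:
$$v_t = \beta v_{t-1} + (1 - \beta)\, g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$$
Here $g_t^2$ denotes the element-wise square $g_t \odot g_t$. The decay factor $\beta$ (typically 0.9, the value suggested in Hinton's lecture; some libraries default to 0.99) controls the window of gradient history. Old gradients are exponentially forgotten, allowing the learning rate to increase again if recent gradients are small.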
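The recovery of the effective learning rate can be sketched as follows. The function name `rmsprop_step`, the assumed decay of 0.9, and the synthetic gradient sequence (large for 100 steps, then small) are illustrative choices, not part of the original text.

```python
import numpy as np

def rmsprop_step(theta, v, g, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new parameters and moving average."""
    v = beta * v + (1.0 - beta) * g * g            # moving average of squared gradients
    theta = theta - eta / (np.sqrt(v) + eps) * g
    return theta, v

theta, v = np.zeros(1), np.zeros(1)
history = []                                       # effective learning rate per step
for t in range(200):
    g = np.array([10.0]) if t < 100 else np.array([0.1])
    theta, v = rmsprop_step(theta, v, g)
    history.append(0.01 / (np.sqrt(v[0]) + 1e-8))

print("effective lr after large gradients:", history[99])
print("effective lr after small gradients:", history[199])
```

With AdaGrad's cumulative sum, the second number could never exceed the first. With the moving average, it does.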
Historical note: RMSProp was proposed by Geoffrey Hinton in his Coursera lectures (2012) and was never formally published in a paper. Despite this, it became one of the most widely used optimizers, demonstrating how practical impact can precede formal publication in ML.
Adam (Adaptive Moment Estimation)
Adam combines the best of momentum (first moment) and RMSProp (second moment):
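First moment (mean of gradients, like momentum):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
Second moment (mean of squared gradients, like RMSProp):
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
Bias correction (compensates for zero initialization):
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Update:
$$\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
All squares, square roots, and divisions are element-wise.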
Default Hyperparameters
The original paper recommends:
- $\beta_1 = 0.9$ (momentum decay)
- $\beta_2 = 0.999$ (squared gradient decay)
- $\epsilon = 10^{-8}$ (numerical stability)
- $\eta = 10^{-3}$ (learning rate)
These defaults work remarkably well across a wide range of problems, which is a major reason for Adam’s popularity.
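As a minimal sketch, the Adam steps described above can be collected into a single function using the paper's default hyperparameters. The function name `adam_step` and the toy objective (a sum of squares) are assumptions for illustration.

```python
import numpy as np

def adam_step(theta, m, v, g, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * g          # first moment
    v = beta2 * v + (1.0 - beta2) * g * g      # second moment
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem (assumed): minimize f(theta) = sum(theta**2), gradient 2 * theta.
theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, 2.0 * theta, t)

print(theta)
```

Passing `t` explicitly keeps the function stateless, so the bias correction is easy to inspect in isolation.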
Why Bias Correction Matters
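At initialization, $m_0 = 0$ and $v_0 = 0$. After one step with $\beta_1 = 0.9$:
$$m_1 = 0.9 \cdot 0 + 0.1\, g_1 = 0.1\, g_1$$
This is biased toward zero: the true gradient is $g_1$, but the estimate is only $0.1\, g_1$. Unrolling the recursion gives $m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$. Taking the expectation, and assuming the gradient distribution is roughly stationary:
$$\mathbb{E}[m_t] = (1 - \beta_1^t)\, \mathbb{E}[g_t]$$
Dividing by $(1 - \beta_1^t)$ removes this bias. The correction is significant early in training and becomes negligible as $t$ grows (since $\beta_1^t \to 0$).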
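A short numerical check, assuming a constant gradient of 2.0 so the true mean is known, shows how the raw average starts near zero while the corrected estimate recovers the true value from the first step. The variable names are illustrative.

```python
beta1 = 0.9
g = 2.0   # assumed constant gradient, so the true mean gradient is 2.0

m = 0.0   # zero initialization, as in Adam
rows = []
for t in range(1, 51):
    m = beta1 * m + (1 - beta1) * g       # raw moving average
    m_hat = m / (1 - beta1 ** t)          # bias-corrected estimate
    rows.append((t, m, m_hat))

for t, m_t, m_hat_t in rows:
    if t in (1, 2, 10, 50):
        print(f"t={t:2d}  m_t={m_t:.4f}  m_hat_t={m_hat_t:.4f}")
```

By step 50 the raw average is already close to 2.0, which is why the correction matters mainly at the start of training.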
Key insight: Adam adapts the step per parameter using both the direction (first moment) and magnitude (second moment) of recent gradients. Parameters with large gradients get smaller effective learning rates, and parameters with small gradients get relatively larger ones. When gradients are noisy and flip sign, the first moment partially cancels while the second moment does not, so the step shrinks. This automatic scaling makes Adam less sensitive to the initial learning rate choice than SGD.
Interpreting Adam’s Update
The Adam update can be rewritten as:
$$\Delta\theta_t = -\eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
The numerator $\hat{m}_t$ is the smoothed gradient direction. The denominator $\sqrt{\hat{v}_t}$ normalizes by the gradient’s typical magnitude. When successive gradients agree in sign, the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is approximately a sign operation: the update magnitude is roughly $\eta$ regardless of gradient scale, with direction determined by the sign of the smoothed gradient. This makes Adam robust to gradient scale variations across parameters.
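The scale invariance can be checked with a small sketch. It assumes a constant scalar gradient and Adam's default hyperparameters, and the helper name `adam_update_size` is hypothetical. Gradients that differ by six orders of magnitude produce nearly the same step.

```python
import math

def adam_update_size(grad_scale, steps=10, eta=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the Adam step after `steps` updates with a constant gradient."""
    m = v = 0.0
    step = 0.0
    for t in range(1, steps + 1):
        g = grad_scale
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        step = eta * m_hat / (math.sqrt(v_hat) + eps)
    return step

tiny = adam_update_size(1e-3)
huge = adam_update_size(1e3)
print(tiny, huge)   # both close to eta
```

Plain SGD with the same learning rate would take steps that differ by the full factor of one million.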
AdamW (Decoupled Weight Decay)
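Standard Adam applies weight decay through the gradient:
$$g_t = \nabla_\theta L(\theta_t) + \lambda \theta_t$$
But the adaptive scaling then modifies the effective weight decay per parameter, which is not what we want. AdamW decouples weight decay from the gradient-based update:
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$
where $\lambda$ is the weight decay coefficient and $\hat{m}_t$, $\hat{v}_t$ are computed from the loss gradient alone.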
The weight decay acts directly on the parameters, independent of the adaptive learning rate. This simple change significantly improves generalization and is now the standard version of Adam used in practice.
Key distinction: In Adam, L2 regularization and weight decay produce different results because the adaptive scaling interferes with regularization. AdamW applies weight decay separately, restoring the correct regularization behavior. Always prefer AdamW over Adam with L2 regularization.
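The difference is easiest to see when the loss gradient is zero, so that only the decay term acts. The sketch below uses hypothetical function names and assumed values $\lambda = 0.01$, $\eta = 10^{-3}$, and compares one step of each variant.

```python
import numpy as np

def adam_l2_step(theta, m, v, grad, t, eta=1e-3, lam=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: the decay term passes through the adaptive scaling."""
    g = grad + lam * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, m, v, grad, t, eta=1e-3, lam=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: moments use the loss gradient only; decay acts directly on theta."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta), m, v

theta0 = np.array([1.0])
zero_grad = np.array([0.0])   # isolate the effect of weight decay
theta_l2, _, _ = adam_l2_step(theta0, np.zeros(1), np.zeros(1), zero_grad, t=1)
theta_w, _, _ = adamw_step(theta0, np.zeros(1), np.zeros(1), zero_grad, t=1)

print("Adam + L2 shrinks theta by:", theta0 - theta_l2)
print("AdamW shrinks theta by:    ", theta0 - theta_w)
```

With L2 regularization the normalization rescales the decay to a step of roughly $\eta$, whatever $\lambda$ is. AdamW shrinks the weight by $\eta \lambda \theta$, as intended.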
Other Adam Variants
AMSGrad
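Adam can sometimes fail to converge because the second moment estimate can decrease, causing the learning rate to increase at the wrong time. AMSGrad fixes this by maintaining the maximum of past second moments:
$$v_t^{\max} = \max\left(v_{t-1}^{\max},\, v_t\right)$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t^{\max}} + \epsilon}\, m_t$$
This guarantees a non-increasing effective learning rate for every parameter, which restores the convergence guarantee in the online convex setting. In practice, the improvement over Adam is often marginal.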
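A minimal sketch of the AMSGrad step follows. It omits bias correction, which is an assumption here since some library implementations include it, and the function name `amsgrad_step` is hypothetical. The demo feeds one large gradient followed by small ones, so the moving average decays while the running maximum stays put.

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, g, eta=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update; v_max is the running element-wise maximum of v."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_max = np.maximum(v_max, v)                 # never allowed to decrease
    theta = theta - eta * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max

theta = np.zeros(1)
m, v, v_max = np.zeros(1), np.zeros(1), np.zeros(1)
v_max_history = []
for t in range(51):
    g = np.array([10.0]) if t == 0 else np.array([0.1])
    theta, m, v, v_max = amsgrad_step(theta, m, v, v_max, g)
    v_max_history.append(v_max[0])

print("final v:    ", v[0])
print("final v_max:", v_max[0])
```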
RAdam (Rectified Adam)
RAdam rectifies the adaptive learning rate based on its variance. Early in training, the second moment estimate rests on only a few gradients, so the variance of the adaptive learning rate is high. During this phase RAdam switches off or damps the adaptive scaling, effectively acting like SGD with momentum. As training progresses and the variance decreases, it transitions to full Adam behavior.
This often removes the need for learning rate warmup, which is otherwise essential for Adam on some tasks.
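The transition can be sketched with the rectification factor from the RAdam paper (Liu et al., 2020), reproduced here from memory as an assumption rather than taken from the text above. The function name `radam_rectification` is hypothetical. It returns `None` while the variance is treated as intractable, in which case the update falls back to momentum without adaptive scaling. Otherwise it returns a factor in (0, 1] that multiplies the adaptive step.

```python
import math

def radam_rectification(t, beta2=0.999):
    """Rectification factor r_t at step t, or None while the variance is intractable."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return None   # too few samples: skip the adaptive scaling
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )

for t in (1, 10, 100, 1000, 100000):
    print(t, radam_rectification(t))
```

The factor grows from a small value toward 1 as $t$ increases, which behaves like a built-in warmup on the adaptive term.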
Comparison of Optimizers
| Method | Update rule (simplified) | Pros | Cons | Best for |
|---|---|---|---|---|
| SGD | $\theta \leftarrow \theta - \eta g$ | Simple, good generalization | Slow, sensitive to $\eta$ | With tuning: vision |
| SGD+Momentum | $u \leftarrow \beta u + g$, $\theta \leftarrow \theta - \eta u$ | Faster, dampens oscillations | Still needs schedule | Vision benchmarks |
| AdaGrad | $\theta \leftarrow \theta - \eta g / \sqrt{\sum g^2}$ | Great for sparse data | Learning rate dies | Sparse NLP features |
| RMSProp | $\theta \leftarrow \theta - \eta g / \sqrt{\mathrm{EMA}(g^2)}$ | Fixes AdaGrad decay | Less studied theory | RNNs |
| Adam | $\theta \leftarrow \theta - \eta \hat{m} / \sqrt{\hat{v}}$ | Robust, fast, low tuning | Can generalize worse | Default choice |
| AdamW | Adam + decoupled decay | Best regularization | Slightly more complex | Transformers, LLMs |
Practical Guidelines
Which Optimizer to Choose?
- Start with AdamW at Adam's default hyperparameters ($\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$). It works well on most tasks with minimal tuning.
- For vision models where you can afford hyperparameter tuning, SGD with momentum plus cosine annealing or step decay often achieves slightly better final accuracy.
- For transformers and LLMs, AdamW with warmup + cosine decay is the standard recipe.
- For sparse data (embeddings, recommender systems), AdaGrad or Adam handles the sparsity automatically.
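The warmup plus cosine decay recipe mentioned for transformers can be sketched as a plain function of the step count. The function name, the linear warmup shape, and the example values (1000 total steps, 100 warmup steps, peak rate $10^{-3}$) are assumptions. Real training code would typically use the scheduler provided by its framework.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, total_steps=1000, warmup_steps=100, peak_lr=1e-3))
```

The returned value would be assigned to the optimizer's learning rate before each update.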
Common Hyperparameter Settings
| Optimizer | Learning rate | Other |
|---|---|---|
| SGD+Momentum | to | , weight decay |
| Adam | to | , |
| AdamW | to | Same + weight decay to |
Debugging Training Issues
- Loss not decreasing: Learning rate likely too high. Reduce by 3-10x.
- Loss decreasing very slowly: Learning rate too low, or optimizer stuck in a flat region. Try warmup.
- Loss oscillates wildly: Reduce learning rate or increase batch size.
- Loss plateaus then drops: Normal for step-decay schedules. For smooth schedules, consider cosine annealing.
- NaN loss: Gradient explosion. Add gradient clipping (`torch.nn.utils.clip_grad_norm_`).
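To show what clipping by norm does, here is a NumPy sketch of the idea. The function `clip_by_global_norm` is a hypothetical stand-in written for illustration, not the PyTorch implementation. It rescales all gradients by a common factor whenever their combined L2 norm exceeds `max_norm`.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([0.0])]        # combined norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g * g) for g in clipped))
print(norm_before, norm_after)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its length is capped.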
Worked Example: Comparing Optimizers
```python
import numpy as np

def rosenbrock_grad(x, y):
    """Gradient of f(x,y) = (1-x)^2 + 100(y-x^2)^2"""
    dx = -2*(1-x) - 400*x*(y - x**2)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

# SGD (plain gradient descent here, since the gradient is exact)
theta = np.array([-1.0, 1.0])
for t in range(5000):
    theta -= 0.001 * rosenbrock_grad(*theta)
print(f"SGD: {theta}")  # still far from (1,1)

# Adam
theta = np.array([-1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 5001):
    g = rosenbrock_grad(*theta)
    m = 0.9*m + 0.1*g              # first moment (beta1 = 0.9)
    v = 0.999*v + 0.001*g**2       # second moment (beta2 = 0.999)
    m_hat = m / (1 - 0.9**t)       # bias correction
    v_hat = v / (1 - 0.999**t)
    theta -= 0.001 * m_hat / (np.sqrt(v_hat) + 1e-8)
print(f"Adam: {theta}")  # much closer to (1,1)
```
On the Rosenbrock function, a notoriously ill-conditioned test problem, Adam converges significantly faster than plain SGD because its per-parameter scaling handles the vastly different curvatures in the $x$ and $y$ directions.
Why This Matters for ML
Adaptive optimizers are the engines of modern deep learning:
- Adam/AdamW is the default optimizer for most tasks, especially NLP and generative models
- Per-parameter learning rates handle the heterogeneous loss landscape of deep networks automatically
- Understanding these methods helps diagnose training issues: is the problem the learning rate, the optimizer, or the model?
- The choice between SGD and Adam reflects a fundamental trade-off between tuning effort and final performance
- These optimizers build on all the calculus we have covered: gradients, the chain rule, and gradient descent
Summary
- AdaGrad scales learning rates by accumulated squared gradients — great for sparse data, but learning rate decays to zero
- RMSProp fixes AdaGrad with exponential moving averages of squared gradients
- Adam combines momentum (first moment) with RMSProp (second moment) plus bias correction
- AdamW decouples weight decay from the adaptive update — the standard choice for transformers
- Adam’s defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 10^{-3}$) work well across most problems
- SGD+momentum can achieve better final performance with careful tuning, especially in vision
- Next: constrained optimization, where Lagrange multipliers and the KKT conditions handle problems with explicit constraints
References
- Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR, 12, 2121-2159.
- Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR. arXiv:1412.6980
- Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR. arXiv:1711.05101
- Reddi, S. J., Kale, S., & Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8. deeplearningbook.org
- Ruder, S. (2016). An Overview of Gradient Descent Optimization Algorithms. arXiv:1609.04747