Adaptive Learning Rate Methods: From AdaGrad to Adam

Calculus & Optimization Series 8 / 18

The Problem with a Single Learning Rate

SGD with momentum uses the same learning rate for every parameter. But different parameters face very different optimization landscapes:

Embedding layers in NLP have sparse gradients — most entries are zero for any given mini-batch, but when a gradient arrives, it can be large
Batch normalization parameters see gradients on a very different scale than weight matrices
Deep and shallow layers may need different step sizes because gradient magnitude varies with depth

A single $\alpha$ cannot serve all of these needs well. Adaptive methods maintain a separate effective learning rate for each parameter, automatically adjusting based on gradient history.

AdaGrad

AdaGrad (Adaptive Gradient) accumulates the squared gradients for each parameter and scales the learning rate inversely:

\begin{aligned} \mathbf{G}_t &= \mathbf{G}_{t-1} + \nabla \mathcal{L}_t \odot \nabla \mathcal{L}_t \\[6pt] \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \frac{\alpha}{\sqrt{\mathbf{G}_t + \epsilon}} \odot \nabla \mathcal{L}_t \end{aligned}

where $\odot$ is element-wise multiplication and $\epsilon \approx 10^{-8}$ prevents division by zero.

How it adapts: Parameters that have received large gradients in the past get smaller learning rates. Parameters that have received small (or sparse) gradients maintain larger learning rates.

Key insight: AdaGrad is well suited to sparse data (NLP, recommender systems). Rare features get large learning rates because their accumulated gradient sum is small. Frequent features get small learning rates. This automatic scaling largely removes the need to manually tune separate learning rates for sparse and dense features.

The problem: The accumulated sum $\mathbf{G}_t$ never decreases, so each parameter's effective learning rate can only shrink. Eventually it becomes so small that the model effectively stops learning, even if it has not converged.
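The following is a minimal sketch of the accumulation, assuming two hypothetical parameters: one that receives a gradient on every step and one "rare feature" that receives a gradient only on every 100th step. The gradient values, `alpha`, and the step count are illustrative choices, not values from the article.

```python
import numpy as np

alpha, eps = 0.1, 1e-8
G = np.zeros(2)            # accumulated squared gradients, one entry per parameter
history = []               # effective learning rate after each step

for t in range(1, 1001):
    # Hypothetical gradients: parameter 0 gets a gradient on every step,
    # parameter 1 (a "rare feature") only on every 100th step.
    g = np.array([1.0, 1.0 if t % 100 == 0 else 0.0])
    G += g * g
    history.append(alpha / np.sqrt(G + eps))

history = np.array(history)
print("effective learning rates after 1000 steps:", history[-1])
```

After 1000 steps the rare parameter has accumulated 10 squared gradients against 1000 for the frequent one, so its effective learning rate is about 10 times larger. Both columns of `history` only ever shrink, which is the decay problem described above.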

RMSProp

RMSProp (Root Mean Square Propagation) fixes AdaGrad’s diminishing learning rate by using an exponential moving average instead of a cumulative sum:

\begin{aligned} \mathbf{v}_t &= \beta \mathbf{v}_{t-1} + (1 - \beta) \nabla \mathcal{L}_t \odot \nabla \mathcal{L}_t \\[6pt] \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \frac{\alpha}{\sqrt{\mathbf{v}_t + \epsilon}} \odot \nabla \mathcal{L}_t \end{aligned}

The decay factor $\beta$ (typically 0.9) controls the window of gradient history, roughly the last $1/(1-\beta)$ steps. Old gradients are exponentially forgotten, allowing the effective learning rate to increase again if recent gradients are small.
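To see the difference from AdaGrad, here is a small sketch with a hypothetical scalar gradient that is large for 100 steps and then small. The values of `alpha`, `beta`, and the gradients are illustrative assumptions.

```python
import math

alpha, beta, eps = 0.01, 0.9, 1e-8
G = 0.0   # AdaGrad: cumulative sum of squared gradients
v = 0.0   # RMSProp: exponential moving average of squared gradients

adagrad_lr, rmsprop_lr = [], []
for t in range(1, 201):
    # Hypothetical gradient: large during the first 100 steps, small afterwards.
    g = 10.0 if t <= 100 else 0.1
    G = G + g * g
    v = beta * v + (1 - beta) * g * g
    adagrad_lr.append(alpha / math.sqrt(G + eps))
    rmsprop_lr.append(alpha / math.sqrt(v + eps))

print("AdaGrad effective rate at step 100 and 200:", adagrad_lr[99], adagrad_lr[-1])
print("RMSProp effective rate at step 100 and 200:", rmsprop_lr[99], rmsprop_lr[-1])
```

The AdaGrad rate stays pinned near its step-100 value because the large early gradients never leave the sum, while the RMSProp rate recovers once those gradients fall out of the averaging window.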

Historical note: RMSProp was proposed by Geoffrey Hinton in his Coursera lectures (2012) and was never formally published in a paper. Despite this, it became one of the most widely used optimizers, demonstrating how practical impact can precede formal publication in ML.

Adam (Adaptive Moment Estimation)

Adam combines the best of momentum (first moment) and RMSProp (second moment):

First moment (mean of gradients — like momentum):

\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \nabla \mathcal{L}_t

Second moment (mean of squared gradients — like RMSProp):

\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \nabla \mathcal{L}_t \odot \nabla \mathcal{L}_t

Bias correction (compensates for zero initialization):

\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t} \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}

Update:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \odot \hat{\mathbf{m}}_t
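The four equations above can be collected into a single update function. This is a minimal NumPy sketch, not a library implementation; the name `adam_step` and the example gradient are hypothetical, and the default arguments are the values recommended in the original Adam paper.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
g = np.array([0.5, -300.0])   # hypothetical gradient with very different scales
theta_new, m, v = adam_step(theta, g, m, v, t=1)
print(theta - theta_new)      # first step: close to alpha * sign(g) in each entry
```

On the very first step the bias-corrected moments are exactly $g$ and $g^2$, so both parameters move by about $\alpha$ even though their gradients differ by a factor of 600.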

Default Hyperparameters

The original paper recommends:

$\beta_1 = 0.9$ (momentum decay)
$\beta_2 = 0.999$ (squared gradient decay)
$\epsilon = 10^{-8}$ (numerical stability)
$\alpha = 0.001$ (learning rate)

These defaults work remarkably well across a wide range of problems, which is a major reason for Adam’s popularity.

Why Bias Correction Matters

At initialization, $\mathbf{m}_0 = \mathbf{0}$ and $\mathbf{v}_0 = \mathbf{0}$ . After one step with $\beta_1 = 0.9$ :

\mathbf{m}_1 = 0.1 \cdot \nabla \mathcal{L}_1

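This is biased toward zero: the true gradient is $\nabla \mathcal{L}_1$ , but the estimate is only $0.1 \cdot \nabla \mathcal{L}_1$ . More generally, if the gradient distribution is roughly stationary, unrolling the recursion and taking the expectation gives: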

\mathbb{E}[\mathbf{m}_t] = (1 - \beta_1^t) \cdot \mathbb{E}[\nabla \mathcal{L}]

Dividing by $(1 - \beta_1^t)$ removes this bias. The correction is significant early in training and becomes negligible as $t$ grows (since $\beta_1^t \to 0$ ).
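A quick numerical check makes the factor concrete. Unrolling the recursion gives $\mathbf{m}_t = (1-\beta_1)\sum_{i=1}^{t} \beta_1^{t-i} \nabla \mathcal{L}_i$, and the geometric weights sum to $1 - \beta_1^t$. The sketch below assumes a constant scalar gradient (an illustrative simplification), so the "true" mean is known exactly.

```python
beta1 = 0.9
g = 2.0          # constant gradient, so the true mean is exactly g
m = 0.0
history = []     # (t, raw estimate m, corrected estimate m_hat)

for t in range(1, 51):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)
    history.append((t, m, m_hat))
    if t in (1, 2, 5, 10, 50):
        print(f"t={t:2d}  m={m:.4f}  m_hat={m_hat:.4f}  1-beta1^t={1 - beta1 ** t:.4f}")
```

The raw estimate `m` starts at a tenth of the gradient and creeps upward, while `m_hat` matches `g` at every step.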

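Key insight: Adam adapts the step per parameter using both the direction (first moment) and magnitude (second moment) of recent gradients. Dividing by $\sqrt{\hat{\mathbf{v}}_t}$ cancels the overall gradient scale, so parameters with large gradients do not automatically take large steps. Gradients that point consistently in one direction produce steps close to $\alpha$, while noisy gradients that keep changing sign largely cancel in the first moment and produce smaller steps. This normalization usually makes Adam less sensitive to the learning rate choice than SGD.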

Interpreting Adam’s Update

The Adam update can be rewritten as:

\Delta \theta_i \propto \frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}}

The numerator is the smoothed gradient direction. The denominator normalizes by the gradient’s typical magnitude. When recent gradients agree in sign, the ratio behaves approximately like a sign operation: the update magnitude is roughly $\alpha$ regardless of gradient scale, with direction determined by the sign of the smoothed gradient. When recent gradients disagree in sign, the ratio, and with it the step, shrinks toward zero. This makes Adam robust to gradient scale variations across parameters.
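The scale-invariance claim can be checked numerically. The sketch below feeds Adam two synthetic scalar gradient streams, one with a consistent sign and one that is pure zero-mean noise, and repeats each with the gradients multiplied by 1000. The streams, the seed, and the helper name `adam_updates` are assumptions made for illustration.

```python
import numpy as np

def adam_updates(grads, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of Adam parameter updates for scalar gradients."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        updates.append(-alpha * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)

rng = np.random.default_rng(0)
streams = {
    "consistent": 1.0 + 0.1 * rng.standard_normal(500),  # mostly the same sign
    "noisy": rng.standard_normal(500),                    # zero-mean noise
}

results = {}
for name, g in streams.items():
    small = adam_updates(g)
    large = adam_updates(1000.0 * g)          # same stream, 1000x the scale
    mean_step = np.abs(small[-100:]).mean()   # typical late step size
    max_gap = np.abs(small - large).max()     # effect of rescaling the gradients
    results[name] = (mean_step, max_gap)
    print(f"{name:10s}  mean |step| = {mean_step:.2e}  max gap after rescaling = {max_gap:.2e}")
```

Rescaling the gradients changes the updates only through $\epsilon$, and the consistent stream takes steps of about $\alpha$ while the noisy stream takes noticeably smaller ones.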

AdamW (Decoupled Weight Decay)

With standard Adam, weight decay is usually implemented as L2 regularization, which adds the penalty to the gradient:

\nabla \mathcal{L}_{\text{reg}} = \nabla \mathcal{L} + \lambda \boldsymbol{\theta}

But the adaptive scaling then modifies the effective weight decay per parameter, which is not what we want. AdamW decouples weight decay from the gradient-based update:

\boldsymbol{\theta}_{t+1} = (1 - \alpha\lambda)\boldsymbol{\theta}_t - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \odot \hat{\mathbf{m}}_t

The weight decay $\lambda$ acts directly on the parameters, independent of the adaptive learning rate. Loshchilov and Hutter report that this simple change improves generalization, and AdamW is now the version of Adam most commonly used in practice.

Key distinction: In Adam, L2 regularization and weight decay produce different results because the adaptive scaling interferes with regularization. AdamW applies weight decay separately, restoring the correct regularization behavior. Always prefer AdamW over Adam with L2 regularization.
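The interference is easiest to see on a toy case. The sketch below assumes two parameters that both start at 1.0: the first sees large alternating loss gradients, the second sees no loss gradient at all. The gradient pattern, `lam`, and the function name `run` are illustrative choices.

```python
import numpy as np

def run(decoupled, steps=1000, alpha=1e-3, lam=0.01,
        beta1=0.9, beta2=0.999, eps=1e-8):
    theta = np.array([1.0, 1.0])
    m, v = np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        # Hypothetical loss gradients: parameter 0 sees large alternating
        # gradients (-10, +10, ...), parameter 1 sees none.
        g = np.array([10.0 * (-1) ** t, 0.0])
        if not decoupled:
            g = g + lam * theta                 # L2 penalty folded into the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        if decoupled:
            theta = (1 - alpha * lam) * theta   # AdamW: decay applied to theta directly
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta_l2 = run(decoupled=False)
theta_w = run(decoupled=True)
print("Adam + L2:", theta_l2)
print("AdamW:    ", theta_w)
```

With L2 folded into the gradient, the penalty on the first parameter is divided by a large $\sqrt{\hat{v}}$ and almost disappears, while the second parameter is normalized into steps of nearly $\alpha$ and shrinks quickly regardless of $\lambda$. With decoupled decay both parameters shrink by the same factor $(1 - \alpha\lambda)$ per step.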

Other Adam Variants

AMSGrad

Adam can sometimes fail to converge because the second moment estimate $\hat{\mathbf{v}}_t$ can decrease, causing the learning rate to increase at the wrong time. AMSGrad fixes this by maintaining the maximum of past second moments:

\hat{\mathbf{v}}_t^{\text{AMS}} = \max(\hat{\mathbf{v}}_{t-1}^{\text{AMS}}, \hat{\mathbf{v}}_t)

This guarantees a non-increasing effective learning rate, which is the property the convergence analysis of Reddi et al. relies on. In practice, the improvement over Adam is often marginal.
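A short sketch of the running maximum, following the bias-corrected form written above. It tracks only the denominator of the update for a hypothetical scalar gradient with a single large spike; the gradient values and step count are assumptions for illustration.

```python
import math

beta2 = 0.999
v = 0.0
v_hat_max = 0.0
adam_denominator, amsgrad_denominator = [], []

for t in range(1, 5001):
    g = 100.0 if t == 10 else 0.1          # hypothetical: one rare, large gradient
    v = beta2 * v + (1 - beta2) * g * g
    v_hat = v / (1 - beta2 ** t)
    v_hat_max = max(v_hat_max, v_hat)      # AMSGrad: the estimate never shrinks
    adam_denominator.append(math.sqrt(v_hat))
    amsgrad_denominator.append(math.sqrt(v_hat_max))

print("Adam denominator at step 10 and 5000:   ", adam_denominator[9], adam_denominator[-1])
print("AMSGrad denominator at step 10 and 5000:", amsgrad_denominator[9], amsgrad_denominator[-1])
```

Adam gradually forgets the spike, so its effective learning rate climbs back up, whereas AMSGrad keeps the larger denominator for the rest of training.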

RAdam (Rectified Adam)

RAdam rectifies the adaptive learning rate term based on its estimated variance. When the variance is high (early in training, when the second moment estimate rests on very few gradients), RAdam switches off or down-weights the adaptive scaling, effectively acting like SGD with momentum. As training progresses and the variance decreases, it transitions to full Adam behavior.

This reduces the need for learning rate warmup, which is otherwise important for Adam on some tasks.
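The sketch below computes the rectification factor in the form it is usually stated for RAdam. The formula and the threshold of 4 are not given in this article, so treat them as assumptions to check against the RAdam paper; the function name is hypothetical.

```python
import math

def radam_rectification(t, beta2=0.999):
    """Return the rectification factor r_t, or None while adaptive scaling is off."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        # Too few gradients seen: take a plain momentum step instead.
        return None
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )

for t in (1, 4, 5, 10, 100, 1000, 10000):
    print(t, radam_rectification(t))
```

For the first few steps the function returns `None`, after which the factor starts small and rises toward 1, which plays a role similar to a built-in warmup on the adaptive term.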

Comparison of Optimizers

Method	Update rule (simplified)	Pros	Cons	Best for
SGD	$\theta - \alpha \nabla \mathcal{L}$	Simple, good generalization	Slow, sensitive to $\alpha$	With tuning: vision
SGD+Momentum	$\theta - \alpha(\beta v + \nabla \mathcal{L})$	Faster, dampens oscillations	Still needs schedule	Vision benchmarks
AdaGrad	$\theta - \frac{\alpha}{\sqrt{G}} \nabla \mathcal{L}$	Great for sparse data	Learning rate dies	Sparse NLP features
RMSProp	$\theta - \frac{\alpha}{\sqrt{v}} \nabla \mathcal{L}$	Fixes AdaGrad decay	Less studied theory	RNNs
Adam	$\theta - \frac{\alpha}{\sqrt{\hat{v}}} \hat{m}$	Robust, fast, low tuning	Can generalize worse	Default choice
AdamW	Adam + decoupled decay	Best regularization	Slightly more complex	Transformers, LLMs

Practical Guidelines

Which Optimizer to Choose?

Start with AdamW ( $\alpha = 10^{-3}$ , $\beta_1 = 0.9$ , $\beta_2 = 0.999$ ). It works well on most tasks with minimal tuning.
For vision models where you can afford hyperparameter tuning, SGD with momentum ( $\alpha = 0.1$ , $\beta = 0.9$ ) plus cosine annealing or step decay often achieves slightly better final accuracy.
For transformers and LLMs, AdamW with warmup + cosine decay is the standard recipe.
For sparse data (embeddings, recommender systems), AdaGrad or Adam handles the sparsity automatically.
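The warmup + cosine decay recipe mentioned above for transformers can be sketched as a function of the step count. This is one common form (linear warmup, then a half cosine), written in plain Python; the name `lr_at` and the default `warmup_steps`, `peak_lr`, and `min_lr` values are hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 10_000
for step in (0, 500, 999, 1000, 5500, 10_000):
    print(step, lr_at(step, total))
```

In a training loop the returned value would be assigned to the optimizer's learning rate before each step.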

Common Hyperparameter Settings

Optimizer	Learning rate	Other
SGD+Momentum	$0.01$ to $0.1$	$\beta = 0.9$ , weight decay $= 10^{-4}$
Adam	$10^{-4}$ to $10^{-3}$	$\beta_1 = 0.9$ , $\beta_2 = 0.999$
AdamW	$10^{-4}$ to $10^{-3}$	Same + weight decay $= 0.01$ to $0.1$

Debugging Training Issues

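Loss not decreasing or diverging: The learning rate is often too high. Try reducing it by 3-10x.
Loss decreasing very slowly: The learning rate may be too low, or the optimizer may be stuck in a flat region. Try increasing the learning rate by 3-10x; warmup helps with instability early in training rather than with slow progress.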
Loss oscillates wildly: Reduce learning rate or increase batch size.
Loss plateaus then drops: Normal for step-decay schedules. For smooth schedules, consider cosine annealing.
NaN loss: Gradient explosion. Add gradient clipping (torch.nn.utils.clip_grad_norm_).
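Gradient clipping by global norm, the idea behind `torch.nn.utils.clip_grad_norm_`, can be sketched in NumPy as follows. This is an illustration of the technique, not PyTorch's implementation; the function name and the example gradients are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]     # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
print(norm_before, norm_after)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is capped.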

Worked Example: Comparing Optimizers

import numpy as np

def rosenbrock_grad(x, y):
    """Gradient of f(x,y) = (1-x)^2 + 100(y-x^2)^2"""
    dx = -2*(1-x) - 400*x*(y - x**2)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

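# Plain gradient descent (the gradient here is exact, so there is no sampling noise)
theta = np.array([-1.0, 1.0])
for _ in range(5000):
    theta -= 0.001 * rosenbrock_grad(*theta)
print(f"SGD: {theta}")  # has not yet converged to the minimum at (1, 1)

# Adam with the default hyperparameters
theta = np.array([-1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)       # first and second moment estimates
for t in range(1, 5001):
    g = rosenbrock_grad(*theta)
    m = 0.9*m + 0.1*g                 # first moment (beta1 = 0.9)
    v = 0.999*v + 0.001*g**2          # second moment (beta2 = 0.999)
    m_hat = m / (1 - 0.9**t)          # bias correction
    v_hat = v / (1 - 0.999**t)
    theta -= 0.001 * m_hat / (np.sqrt(v_hat) + 1e-8)
print(f"Adam: {theta}")  # compare its distance to (1, 1) with the SGD result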

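On the Rosenbrock function, a notoriously ill-conditioned test problem, plain gradient descent crawls along the curved valley because a single step size must stay small enough for the steep direction. Adam's per-parameter scaling is designed for this kind of mismatch in curvature between the $x$ and $y$ directions, so it typically makes faster progress toward $(1, 1)$ at the same nominal learning rate. The exact gap depends on the learning rate and the number of steps.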

Why This Matters for ML

Adaptive optimizers are the engines of modern deep learning:

Adam/AdamW is the default optimizer for most tasks, especially NLP and generative models
Per-parameter learning rates handle the heterogeneous loss landscape of deep networks automatically
Understanding these methods helps diagnose training issues: is the problem the learning rate, the optimizer, or the model?
The choice between SGD and Adam reflects a fundamental trade-off between tuning effort and final performance
These optimizers build on all the calculus we have covered: gradients, the chain rule, and gradient descent

Summary

AdaGrad scales learning rates by accumulated squared gradients — great for sparse data, but learning rate decays to zero
RMSProp fixes AdaGrad with exponential moving averages of squared gradients
Adam combines momentum (first moment) with RMSProp (second moment) plus bias correction
AdamW decouples weight decay from the adaptive update — the standard choice for transformers
Adam’s defaults ( $\alpha = 10^{-3}$ , $\beta_1 = 0.9$ , $\beta_2 = 0.999$ ) work well across most problems
SGD+momentum can achieve better final performance with careful tuning, especially in vision
Next: constrained optimization, where the parameters must also satisfy equality or inequality constraints

References

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR, 12, 2121-2159.
Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR. arXiv:1412.6980
Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR. arXiv:1711.05101
Reddi, S. J., Kale, S., & Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8. deeplearningbook.org
Ruder, S. (2016). An Overview of Gradient Descent Optimization Algorithms. arXiv:1609.04747