Adaptive Learning Rate Methods: From AdaGrad to Adam

Understand AdaGrad, RMSProp, Adam, and AdamW — adaptive optimizers that tune per-parameter learning rates for faster, more robust training.

Calculus & Optimization March 7, 2026 10 min read

The Problem with a Single Learning Rate

SGD with momentum uses the same learning rate for every parameter. But different parameters face very different optimization landscapes:

  • Embedding layers in NLP have sparse gradients — most entries are zero for any given mini-batch, but when a gradient arrives, it can be large
  • Batch normalization parameters see gradients on a very different scale than weight matrices
  • Deep and shallow layers may need different step sizes because gradient magnitudes vary with depth

A single $\alpha$ cannot serve all of these needs well. Adaptive methods maintain a separate effective learning rate for each parameter, automatically adjusting based on gradient history.

AdaGrad

AdaGrad (Adaptive Gradient) accumulates the squared gradients for each parameter and scales the learning rate inversely:

$$
\begin{aligned}
\mathbf{G}_t &= \mathbf{G}_{t-1} + \nabla \mathcal{L}_t \odot \nabla \mathcal{L}_t \\[6pt]
\boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \frac{\alpha}{\sqrt{\mathbf{G}_t + \epsilon}} \odot \nabla \mathcal{L}_t
\end{aligned}
$$

where $\odot$ is element-wise multiplication and $\epsilon \approx 10^{-8}$ prevents division by zero.

How it adapts: Parameters that have received large gradients in the past get smaller learning rates. Parameters that have received small (or sparse) gradients maintain larger learning rates.

Key insight: AdaGrad is excellent for sparse data (NLP, recommender systems). Rare features get large learning rates because their accumulated gradient sum is small. Frequent features get small learning rates. This automatic scaling eliminates the need to manually tune learning rates for sparse vs dense features.

The problem: The accumulated sum $\mathbf{G}_t$ only grows, so the effective learning rate monotonically decreases. Eventually it becomes so small that the model stops learning, even if it has not converged.
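To make both the benefit and the problem concrete, here is a minimal sketch that tracks AdaGrad's effective learning rate for two parameters. The gradient stream is made up for illustration (one "dense" parameter that receives a gradient every step, one "sparse" parameter that receives one every tenth step), and the function name and `alpha = 0.1` are assumptions, not values from the text.

```python
import numpy as np

def adagrad_effective_lr(grads, alpha=0.1, eps=1e-8):
    """Return alpha / sqrt(G_t + eps) per parameter after each step.

    grads: array of shape (steps, n_params), one gradient per row.
    """
    G = np.zeros(grads.shape[1])              # running sum of squared gradients
    history = []
    for g in grads:
        G += g * g                            # G_t = G_{t-1} + g * g (element-wise)
        history.append(alpha / np.sqrt(G + eps))
    return np.array(history)

# Hypothetical gradient stream (illustration only).
steps = 100
grads = np.zeros((steps, 2))
grads[:, 0] = 1.0       # dense parameter: gradient on every step
grads[::10, 1] = 1.0    # sparse parameter: gradient on every 10th step

lr = adagrad_effective_lr(grads)
print("dense  parameter, effective lr first/last:", lr[0, 0], lr[-1, 0])
print("sparse parameter, effective lr first/last:", lr[0, 1], lr[-1, 1])
```

The sparse parameter keeps a larger effective rate than the dense one, and neither rate ever increases, which is the decay problem described above.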

RMSProp

RMSProp (Root Mean Square Propagation) fixes AdaGrad’s diminishing learning rate by using an exponential moving average instead of a cumulative sum:

$$
\begin{aligned}
\mathbf{v}_t &= \beta \mathbf{v}_{t-1} + (1 - \beta) \nabla \mathcal{L}_t \odot \nabla \mathcal{L}_t \\[6pt]
\boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \frac{\alpha}{\sqrt{\mathbf{v}_t + \epsilon}} \odot \nabla \mathcal{L}_t
\end{aligned}
$$

The decay factor $\beta$ (typically around 0.9) controls the window of gradient history. Old gradients are exponentially forgotten, allowing the effective learning rate to increase again if recent gradients are small.
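The recovery of the learning rate can be checked with a short sketch. The gradient sequence (50 large gradients followed by 50 small ones), the function name, and the values `alpha = 0.001` and `beta = 0.9` are assumptions chosen for illustration.

```python
import numpy as np

def rmsprop_effective_lr(grad_values, alpha=0.001, beta=0.9, eps=1e-8):
    """Return alpha / sqrt(v_t + eps) after each step for a single parameter."""
    v = 0.0
    out = []
    for g in grad_values:
        v = beta * v + (1.0 - beta) * g * g   # exponential moving average of g^2
        out.append(alpha / np.sqrt(v + eps))
    return np.array(out)

# Hypothetical stream: large gradients first, then small ones.
g = np.concatenate([np.full(50, 10.0), np.full(50, 0.1)])
lr = rmsprop_effective_lr(g)

print("effective lr at the end of the large-gradient phase:", lr[49])
print("effective lr at the end of the small-gradient phase:", lr[-1])
```

Because old squared gradients are forgotten, the effective rate rises again in the second phase, something AdaGrad's cumulative sum cannot do.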

Historical note: RMSProp was proposed by Geoffrey Hinton in his Coursera lectures (2012) and was never formally published in a paper. Despite this, it became one of the most widely used optimizers, demonstrating how practical impact can precede formal publication in ML.

Adam (Adaptive Moment Estimation)

Adam combines the best of momentum (first moment) and RMSProp (second moment):

First moment (mean of gradients — like momentum):

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \nabla \mathcal{L}_t$$

Second moment (mean of squared gradients — like RMSProp):

$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \nabla \mathcal{L}_t \odot \nabla \mathcal{L}_t$$

Bias correction (compensates for zero initialization):

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t} \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$

Update:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \odot \hat{\mathbf{m}}_t$$
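The four equations above can be collected into a single step function. This is a minimal NumPy sketch rather than a reference implementation; the name `adam_step` and its signature are illustrative, and the default arguments are the values this article attributes to the original paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: a single step from zero-initialized moments.
theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
grad = np.array([2.0, -0.5])          # made-up gradient for illustration
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so each coordinate moves by almost exactly `alpha` in the direction opposite to its gradient, whatever the gradient's size.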

Default Hyperparameters

The original paper recommends:

  • $\beta_1 = 0.9$ (momentum decay)
  • $\beta_2 = 0.999$ (squared gradient decay)
  • $\epsilon = 10^{-8}$ (numerical stability)
  • $\alpha = 0.001$ (learning rate)

These defaults work remarkably well across a wide range of problems, which is a major reason for Adam’s popularity.

Why Bias Correction Matters

At initialization, $\mathbf{m}_0 = \mathbf{0}$ and $\mathbf{v}_0 = \mathbf{0}$. After one step with $\beta_1 = 0.9$:

$$\mathbf{m}_1 = 0.1 \cdot \nabla \mathcal{L}_1$$

This is biased toward zero: the true gradient is $\nabla \mathcal{L}_1$, but the estimate is only $0.1 \cdot \nabla \mathcal{L}_1$. Taking the expectation, and assuming the gradient distribution is roughly stationary:

$$\mathbb{E}[\mathbf{m}_t] = (1 - \beta_1^t) \cdot \mathbb{E}[\nabla \mathcal{L}]$$

Dividing by $(1 - \beta_1^t)$ removes this bias. The correction is significant early in training and becomes negligible as $t$ grows (since $\beta_1^t \to 0$).
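A quick numerical check of this argument, assuming for simplicity a constant gradient of 1.0 (so the "true" first moment is exactly 1.0). The raw average starts at 0.1 and climbs slowly, while the corrected estimate is right from the first step.

```python
beta1 = 0.9
g = 1.0          # assumption: a constant gradient, so the true mean is 1.0
m = 0.0          # zero initialization, as in Adam

rows = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g       # raw (biased) first moment
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    rows.append((t, m, m_hat))
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}")
```

With a constant gradient, $\mathbf{m}_t = (1 - \beta_1^t)\,g$ holds exactly, so the division recovers $g$ at every step. With real, varying gradients the correction is only exact in expectation.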

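Key insight: Adam adapts the step per parameter using both the smoothed direction (first moment) and the typical magnitude (second moment) of recent gradients. Parameters with large gradients get smaller effective learning rates, and parameters with small gradients get relatively larger ones. When gradients are noisy and frequently change sign, the first moment averages toward zero while the second moment does not, so the step shrinks automatically. This normalization is a large part of why Adam is usually less sensitive to the initial learning rate than SGD.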

Interpreting Adam’s Update

The Adam update can be rewritten as:

$$\Delta \theta_i \propto \frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}}$$

The numerator is the smoothed gradient direction. The denominator normalizes by the gradient’s typical magnitude. When gradients are consistent, the ratio behaves approximately like a sign operation: the update magnitude is roughly $\alpha$ regardless of gradient scale, with direction determined by the sign of the smoothed gradient. This makes Adam robust to differences in gradient scale across parameters.
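The scale robustness can be checked directly: feed Adam a gradient sequence and the same sequence multiplied by 1000, then compare the resulting updates. The sequence below is synthetic, and the helper name `adam_updates` is an assumption for this sketch.

```python
import numpy as np

def adam_updates(grads, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of Adam updates for a single scalar parameter."""
    m, v = 0.0, 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        out.append(-alpha * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(out)

# Synthetic positive gradients between 0.5 and 1.5 (illustration only).
grads = 1.0 + 0.5 * np.sin(np.arange(200))

small = adam_updates(grads)
large = adam_updates(1000.0 * grads)

print("largest difference between the two update sequences:",
      np.max(np.abs(small - large)))
print("largest update magnitude:", np.max(np.abs(small)))
```

The two update sequences agree up to the tiny effect of $\epsilon$, whereas plain SGD's updates would differ by a factor of 1000.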

AdamW (Decoupled Weight Decay)

With standard Adam, weight decay is usually implemented as L2 regularization, which adds the penalty's gradient to the loss gradient:

$$\nabla \mathcal{L}_{\text{reg}} = \nabla \mathcal{L} + \lambda \boldsymbol{\theta}$$

But the adaptive scaling then modifies the effective weight decay per parameter, which is not what we want. AdamW decouples weight decay from the gradient-based update:

$$\boldsymbol{\theta}_{t+1} = (1 - \alpha\lambda)\boldsymbol{\theta}_t - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \odot \hat{\mathbf{m}}_t$$

The weight decay $\lambda$ acts directly on the parameters, independent of the adaptive learning rate. This simple change often improves generalization, and AdamW is now the version of Adam most commonly used in practice.

Key distinction: In Adam, L2 regularization and weight decay produce different results because the adaptive scaling interferes with regularization. AdamW applies weight decay separately, restoring the correct regularization behavior. Always prefer AdamW over Adam with L2 regularization.
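One way to see the interference is an artificial setup in which the loss gradient is zero, so only the regularizer moves the weights. The sketch below makes that assumption explicitly. The function name, initial weights, and the two $\lambda$ values are made up for illustration.

```python
import numpy as np

def run_decay_only(theta0, lam, decoupled, steps=100,
                   alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam with L2 in the gradient, or AdamW, on a flat loss."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        loss_grad = np.zeros_like(theta)   # assumption: the loss itself is flat
        # L2 regularization folds lam * theta into the gradient; AdamW does not.
        g = loss_grad if decoupled else loss_grad + lam * theta
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        step = alpha * m_hat / (np.sqrt(v_hat) + eps)
        if decoupled:
            theta = (1 - alpha * lam) * theta - step   # AdamW update
        else:
            theta = theta - step                       # Adam + L2 update
    return theta

theta0 = [1.0, 0.5]
for lam in (0.01, 0.1):
    print(f"lambda={lam}:",
          "Adam+L2 ->", run_decay_only(theta0, lam, decoupled=False),
          "| AdamW ->", run_decay_only(theta0, lam, decoupled=True))
```

In this setup the Adam+L2 result is nearly the same for both values of $\lambda$, because the adaptive normalization divides the penalty's scale back out. AdamW shrinks every weight by the factor $(1 - \alpha\lambda)$ per step, so changing $\lambda$ changes the amount of decay as intended.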

Other Adam Variants

AMSGrad

Adam can sometimes fail to converge because the second moment estimate $\hat{\mathbf{v}}_t$ can decrease, causing the effective learning rate to increase at the wrong time. AMSGrad fixes this by maintaining the maximum of past second moments:

$$\hat{\mathbf{v}}_t^{\text{AMS}} = \max(\hat{\mathbf{v}}_{t-1}^{\text{AMS}}, \hat{\mathbf{v}}_t)$$

This guarantees a non-increasing effective learning rate, which restores the convergence guarantee analyzed by Reddi et al. (2018). In practice, the improvement over Adam is often marginal.
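A small sketch of the difference, following the formula above (which takes the maximum over bias-corrected estimates). The gradient stream, one large spike followed by small gradients, is made up for illustration, as is the function name.

```python
import numpy as np

def second_moment_traces(grads, beta2=0.999):
    """Return (Adam v_hat trace, AMSGrad running-max trace) for one parameter."""
    v = 0.0
    v_max = 0.0
    adam_trace, ams_trace = [], []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = v / (1 - beta2 ** t)      # Adam's bias-corrected estimate
        v_max = max(v_max, v_hat)         # AMSGrad keeps the largest value seen
        adam_trace.append(v_hat)
        ams_trace.append(v_max)
    return np.array(adam_trace), np.array(ams_trace)

# Hypothetical stream: one large gradient, then many small ones.
grads = np.concatenate([[10.0], np.full(199, 0.1)])
adam_v, ams_v = second_moment_traces(grads)

print("Adam    v_hat first/last:", adam_v[0], adam_v[-1])
print("AMSGrad v_hat first/last:", ams_v[0], ams_v[-1])
```

Adam's estimate falls after the spike, so its effective learning rate rises again. AMSGrad's denominator never shrinks, so its effective learning rate never rises.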

RAdam (Rectified Adam)

RAdam rectifies the adaptive learning rate term based on its estimated variance. Early in training, when the second moment estimate rests on only a few gradients and its variance is high, RAdam switches off the adaptive scaling and takes plain SGD-with-momentum steps. As more gradients accumulate and the variance becomes tractable, it applies a rectification factor that grows toward 1, recovering standard Adam behavior.

This builds a warmup-like effect into the optimizer and reduces the need for a hand-tuned learning rate warmup, which is otherwise important for Adam on some tasks.
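The following sketch computes a rectification factor of the form commonly stated for RAdam. The formula and the threshold of 4 are written from memory of the RAdam paper and are not taken from this article, so treat them as assumptions to verify against the paper or a library implementation. The function name is illustrative.

```python
import math

def radam_rectification(t, beta2=0.999):
    """Return the rectification factor r_t, or None when the adaptive term is skipped.

    Assumed formulas (to be checked against the RAdam paper):
      rho_inf = 2 / (1 - beta2) - 1
      rho_t   = rho_inf - 2 * t * beta2**t / (1 - beta2**t)
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None     # variance not tractable yet: use a momentum-only step
    num = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    den = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(num / den)

for t in (1, 10, 100, 1000, 100000):
    print(t, radam_rectification(t))
```

Under these assumptions the factor is undefined for the first few steps, small shortly afterwards, and approaches 1 late in training, which is the gradual hand-over to Adam described above.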

Comparison of Optimizers

| Method | Update rule (simplified) | Pros | Cons | Best for |
| --- | --- | --- | --- | --- |
| SGD | $\theta - \alpha \nabla \mathcal{L}$ | Simple, good generalization | Slow, sensitive to $\alpha$ | With tuning: vision |
| SGD+Momentum | $\theta - \alpha(\beta v + \nabla \mathcal{L})$ | Faster, dampens oscillations | Still needs schedule | Vision benchmarks |
| AdaGrad | $\theta - \frac{\alpha}{\sqrt{G}} \nabla \mathcal{L}$ | Great for sparse data | Learning rate dies | Sparse NLP features |
| RMSProp | $\theta - \frac{\alpha}{\sqrt{v}} \nabla \mathcal{L}$ | Fixes AdaGrad decay | Less studied theory | RNNs |
| Adam | $\theta - \frac{\alpha}{\sqrt{\hat{v}}} \hat{m}$ | Robust, fast, low tuning | Can generalize worse | Default choice |
| AdamW | Adam + decoupled decay | Best regularization | Slightly more complex | Transformers, LLMs |

Practical Guidelines

Which Optimizer to Choose?

  1. Start with AdamW ($\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$). It works well on most tasks with minimal tuning.

  2. For vision models where you can afford hyperparameter tuning, SGD with momentum ($\alpha = 0.1$, $\beta = 0.9$) plus cosine annealing or step decay often achieves slightly better final accuracy.

  3. For transformers and LLMs, AdamW with warmup + cosine decay is the standard recipe.

  4. For sparse data (embeddings, recommender systems), AdaGrad or Adam handles the sparsity automatically.
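The "warmup + cosine decay" recipe in item 3 can be written as a small schedule function. This is one common formulation (linear warmup to a peak, then half-cosine decay to a floor), not the only one. The function name, argument names, and the example values are assumptions for illustration.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Usage with made-up values: 1000 total steps, 100 warmup steps, peak 1e-3.
for step in (0, 50, 99, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, total_steps=1000, warmup_steps=100, peak_lr=1e-3))
```

The returned value would be assigned to the optimizer's learning rate before each step; how that is done depends on the framework in use.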

Common Hyperparameter Settings

| Optimizer | Learning rate | Other |
| --- | --- | --- |
| SGD+Momentum | $0.01$ to $0.1$ | $\beta = 0.9$, weight decay $= 10^{-4}$ |
| Adam | $10^{-4}$ to $10^{-3}$ | $\beta_1 = 0.9$, $\beta_2 = 0.999$ |
| AdamW | $10^{-4}$ to $10^{-3}$ | Same as Adam, plus weight decay $= 0.01$ to $0.1$ |

Debugging Training Issues

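  • Loss not decreasing: The learning rate is often too high; try reducing it by 3-10x. If the loss is flat from the very first steps, also check whether the rate is far too low.
  • Loss decreasing very slowly: Learning rate too low, or optimizer stuck in a flat region. Try a higher learning rate, adding warmup if the higher rate is unstable early on.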
  • Loss oscillates wildly: Reduce learning rate or increase batch size.
  • Loss plateaus then drops: Normal for step-decay schedules. For smooth schedules, consider cosine annealing.
  • NaN loss: Gradient explosion. Add gradient clipping (torch.nn.utils.clip_grad_norm_).
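For readers not using PyTorch, the idea behind global-norm gradient clipping can be sketched in NumPy. This illustrates the technique and is not PyTorch's implementation. The function name and the small `1e-6` stabilizer are assumptions for this example.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    # Scale down only when the norm exceeds the threshold; never scale up.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# Usage with made-up gradients for two parameter tensors.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g * g) for g in clipped)))
print("norm before:", norm_before, "norm after:", norm_after)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is capped.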

Worked Example: Comparing Optimizers

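import numpy as np

def rosenbrock_grad(x, y):
    """Gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimized at (1, 1)."""
    dx = -2*(1 - x) - 400*x*(y - x**2)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

target = np.array([1.0, 1.0])   # global minimum
lr, steps = 0.001, 5000         # same step size and budget for both optimizers

# SGD: one shared learning rate for both coordinates
theta = np.array([-1.0, 1.0])
for t in range(steps):
    theta -= lr * rosenbrock_grad(*theta)
print(f"SGD:  {theta}, distance to minimum: {np.linalg.norm(theta - target):.4f}")

# Adam: per-coordinate steps from bias-corrected moment estimates
theta = np.array([-1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, steps + 1):
    g = rosenbrock_grad(*theta)
    m = 0.9*m + 0.1*g                  # first moment (beta1 = 0.9)
    v = 0.999*v + 0.001*g**2           # second moment (beta2 = 0.999)
    m_hat = m / (1 - 0.9**t)           # bias correction
    v_hat = v / (1 - 0.999**t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
print(f"Adam: {theta}, distance to minimum: {np.linalg.norm(theta - target):.4f}")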

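The Rosenbrock function is a notoriously ill-conditioned test problem: a narrow, curved valley with very different curvatures in the $x$ and $y$ directions. Plain SGD's single step size must stay small enough for the steep direction, which slows progress along the flat valley floor, whereas Adam's per-parameter scaling normalizes the step in each coordinate. How large the gap is depends on the learning rate and the step budget, so compare the two final iterates against the minimum at $(1, 1)$.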

Why This Matters for ML

Adaptive optimizers are the engines of modern deep learning:

  • Adam/AdamW is the default optimizer for most tasks, especially NLP and generative models
  • Per-parameter learning rates handle the heterogeneous loss landscape of deep networks automatically
  • Understanding these methods helps diagnose training issues: is the problem the learning rate, the optimizer, or the model?
  • The choice between SGD and Adam reflects a fundamental trade-off between tuning effort and final performance
  • These optimizers build on all the calculus we have covered: gradients, the chain rule, and gradient descent

Summary

  • AdaGrad scales learning rates by accumulated squared gradients — great for sparse data, but learning rate decays to zero
  • RMSProp fixes AdaGrad with exponential moving averages of squared gradients
  • Adam combines momentum (first moment) with RMSProp (second moment) plus bias correction
  • AdamW decouples weight decay from the adaptive update — the standard choice for transformers
  • Adam’s defaults ($\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) work well across most problems
  • SGD+momentum can achieve better final performance with careful tuning, especially in vision
  • Next: constrained optimization, where the parameters must also satisfy equality or inequality constraints

References

  • Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR, 12, 2121-2159.
  • Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR. arXiv:1412.6980
  • Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR. arXiv:1711.05101
  • Reddi, S. J., Kale, S., & Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8. deeplearningbook.org
  • Ruder, S. (2016). An Overview of Gradient Descent Optimization Algorithms. arXiv:1609.04747
