The Gap Between Theory and Practice
The optimization algorithms we have covered — gradient descent, SGD, Adam — are mathematically elegant. But computers use finite-precision arithmetic, and this introduces errors that can silently corrupt training or cause spectacular failures.
Understanding numerical stability is the difference between a model that trains and one that produces NaN losses.
Floating-Point Arithmetic
Computers represent real numbers in floating-point format:
$$x = (-1)^s \times m \times 2^e$$
where $s$ is the sign bit, $m$ is the mantissa (significand), and $e$ is the exponent. The precision is finite:
| Format | Bits | Mantissa bits | Range (approx.) | Precision |
|---|---|---|---|---|
| FP64 (double) | 64 | 52 | $\pm 1.8 \times 10^{308}$ | ~16 decimal digits |
| FP32 (float) | 32 | 23 | $\pm 3.4 \times 10^{38}$ | ~7 decimal digits |
| FP16 (half) | 16 | 10 | $\pm 6.5 \times 10^{4}$ | ~3 decimal digits |
| BF16 (bfloat16) | 16 | 7 | $\pm 3.4 \times 10^{38}$ | ~2 decimal digits |
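The limits in the table can be checked directly. The following is a minimal sketch using NumPy's `np.finfo`; NumPy has no native bfloat16 type, so BF16 is left out, and the variable names are illustrative.

```python
import numpy as np

# Inspect the limits NumPy reports for each IEEE format it supports.
# (NumPy has no native bfloat16, so BF16 is omitted here.)
for dtype in (np.float64, np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{info.dtype}: max={float(info.max):.3e}, "
          f"smallest normal={float(info.tiny):.3e}, eps={float(info.eps):.3e}")

# Machine epsilon is the gap between 1.0 and the next representable number,
# so adding half of it to 1.0 is rounded away.
one = np.float32(1.0)
half_eps = np.float32(np.finfo(np.float32).eps / 2)
absorbed = (one + half_eps) == one
print("1 + eps/2 == 1 in FP32:", absorbed)
```

The last check shows why small updates can vanish: an increment below the precision of the value it is added to simply disappears.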
Sources of Error
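- Overflow: Result exceeds the maximum representable value (about $3.4 \times 10^{38}$ in FP32) and becomes infinity. Example: $e^{100} \approx 2.7 \times 10^{43}$ evaluates to inf in FP32.
- Underflow: Result is smaller in magnitude than the minimum representable positive value (about $1.4 \times 10^{-45}$ in FP32, counting subnormals) and becomes zero. Example: $e^{-200} \approx 1.4 \times 10^{-87}$ evaluates to 0 in FP32.
- Catastrophic cancellation: Subtracting nearly equal numbers destroys significant digits. Example: $(1 + 10^{-15}) - 1$ in FP64 returns about $1.11 \times 10^{-15}$, so roughly 15 of the 16 available digits are lost.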
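The three error sources can be reproduced in a few lines. This is a small sketch with NumPy; the specific inputs (100, -200, and $10^{-15}$) are illustrative choices, not special thresholds.

```python
import numpy as np

with np.errstate(over="ignore", under="ignore"):
    # e^100 is about 2.7e43, above the FP32 maximum of about 3.4e38
    overflowed = np.exp(np.float32(100.0))
    # e^-200 is about 1.4e-87, below the smallest FP32 subnormal
    underflowed = np.exp(np.float32(-200.0))

# Catastrophic cancellation in FP64: the exact answer is 1e-15
cancelled = (1.0 + 1e-15) - 1.0
rel_error = abs(cancelled - 1e-15) / 1e-15

print("overflow:", overflowed)
print("underflow:", underflowed)
print("cancellation result:", cancelled, "relative error:", rel_error)
```

Note that none of these operations raises an exception by default; the bad values simply propagate, which is why such errors are easy to miss.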
The Log-Sum-Exp Trick
Computing softmax probabilities $p_i = e^{z_i} / \sum_j e^{z_j}$ directly is numerically dangerous. If any $z_i$ is large (e.g., 1000), $e^{z_i}$ overflows. If all $z_i$ are very negative, the denominator underflows to zero.
The log-sum-exp (LSE) trick subtracts the maximum value before exponentiating:
$$\log \sum_i e^{z_i} = c + \log \sum_i e^{z_i - c}$$
where $c = \max_i z_i$. Since $z_i - c \le 0$, no exponential overflows. And the largest term is $e^{c - c} = 1$, so the sum is at least 1 and cannot underflow to zero.
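```python
import numpy as np

def log_sum_exp(z):
    # Shift by the maximum so the largest exponent is exp(0) = 1
    c = z.max()
    return c + np.log(np.sum(np.exp(z - c)))

def softmax(z):
    # The shift cancels in the ratio, so the result is unchanged
    c = z.max()
    exp_z = np.exp(z - c)
    return exp_z / exp_z.sum()
```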
Key insight: The log-sum-exp trick is not optional; it is required for numerical correctness. Mainstream deep learning frameworks implement softmax this way internally. Whenever you see $\log \sum_i e^{z_i}$ in a derivation, this trick should be applied in implementation.
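To see the failure mode concretely, the sketch below compares a naive softmax with the max-shifted version on large logits. The function names `softmax_naive` and `softmax_stable` and the input values are illustrative.

```python
import numpy as np

def softmax_naive(z):
    exp_z = np.exp(z)              # overflows to inf for large z
    return exp_z / exp_z.sum()

def softmax_stable(z):
    exp_z = np.exp(z - z.max())    # largest exponent is exactly 0
    return exp_z / exp_z.sum()

z = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(z)       # inf / inf gives nan

stable = softmax_stable(z)
print("naive: ", naive)
print("stable:", stable)
```

Because softmax depends only on differences between logits, the stable result equals the softmax of `[0, 1, 2]`.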
Cross-Entropy with LogSoftmax
Computing cross-entropy loss from probabilities introduces another instability: the loss involves $\log p_i$, where $p_i$ might be very close to zero.
The solution is to compute log-softmax directly:
$$\log p_i = z_i - c - \log \sum_j e^{z_j - c}, \quad c = \max_j z_j$$
This avoids ever computing $p_i$ explicitly. PyTorch’s F.cross_entropy takes raw logits (not probabilities) precisely for this reason.
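A minimal NumPy sketch of the idea follows, for a single example with an integer class label. The helper names are hypothetical and this is not how any particular framework implements it; it only illustrates why working from logits is safer than taking the log of a probability.

```python
import numpy as np

def log_softmax(z):
    shifted = z - z.max()
    return shifted - np.log(np.sum(np.exp(shifted)))

def cross_entropy_from_logits(z, target):
    # Never forms the probability itself
    return -log_softmax(z)[target]

def cross_entropy_from_probs(z, target):
    # Forms probabilities first, then takes the log
    p = np.exp(z - z.max())
    p /= p.sum()
    return -np.log(p[target])

z = np.array([0.0, 800.0])   # the model is confidently wrong about class 0

stable_loss = cross_entropy_from_logits(z, target=0)
with np.errstate(divide="ignore"):
    naive_loss = cross_entropy_from_probs(z, target=0)

print("from logits:", stable_loss)
print("from probabilities:", naive_loss)
```

In the probability-based version, the target probability underflows to exactly zero, so its log is infinite; the logit-based version returns the finite loss.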
Gradient Clipping
When gradients become very large (exploding gradients), the parameter update overshoots catastrophically. Gradient clipping caps the gradient magnitude.
Gradient Norm Clipping
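The most common approach clips the global gradient norm:
$$g \leftarrow g \cdot \min\left(1, \frac{\tau}{\|g\|_2}\right)$$
This preserves the gradient direction but limits its magnitude to the threshold $\tau$. In PyTorch:
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```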
Gradient Value Clipping
An alternative clips each gradient element independently to a range $[-\tau, \tau]$ for a chosen threshold $\tau$. This changes the gradient direction, which can be problematic, but is simpler.
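The difference between the two clipping styles is easy to see on a toy gradient. The sketch below is a NumPy illustration with hypothetical helper names, operating on a list of per-parameter gradient arrays; the small constant in the denominator is an assumption added to avoid division by zero.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # One norm over all parameters, then one shared scale factor
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

def clip_by_value(grads, clip_value):
    # Each element is clamped independently
    return [np.clip(g, -clip_value, clip_value) for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm is 13

norm_clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
value_clipped = clip_by_value(grads, clip_value=1.0)

print("norm before:", norm_before)
print("norm-clipped:", norm_clipped)     # still proportional to 3 : 4 : 12
print("value-clipped:", value_clipped)   # all ones, direction has changed
```

Norm clipping keeps the ratio between components, whereas value clipping flattens every large component to the same magnitude.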
When to Clip
- RNNs/LSTMs: Long sequences amplify gradients through time. Clipping the global norm at roughly 1.0 to 5.0 is standard.
- Transformers: Gradient clipping at a max norm of 1.0 is nearly universal in LLM training.
- GAN training: Clipping stabilizes the discriminator/generator interplay.
- Very deep networks: Without residual connections, gradients can explode through many layers.
Key insight: Gradient clipping is a safety net, not a solution. If you need aggressive clipping (a very small max norm), something else is likely wrong: initialization, learning rate, or architecture. But mild clipping (a max norm around 1.0) is a good default practice that prevents rare catastrophic updates without affecting normal training.
Mixed Precision Training
Mixed precision uses FP16 (or BF16) for most computations and FP32 for critical accumulations, often achieving a 2-3x speedup on GPUs with hardware support for reduced precision.
The Challenge
FP16 has only 3 decimal digits of precision and a maximum value of 65504. Gradients can easily underflow (small gradients become zero) or weights can overflow.
Loss Scaling
Loss scaling multiplies the loss by a large factor $S$ before the backward pass. By the chain rule this scales all gradients by $S$, lifting small gradients out of the underflow zone. After gradient computation, gradients are divided by $S$ before the optimizer step.
Dynamic loss scaling automatically adjusts $S$: increase it when no overflow is detected, decrease it when overflow occurs.
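The effect of scaling can be sketched on a single number using NumPy's float16. The gradient value, the scale $S = 2^{16}$, and the `update_scale` helper are illustrative assumptions; practical implementations usually grow the scale only after a run of overflow-free steps rather than after every step.

```python
import numpy as np

true_grad = np.float32(1e-8)   # a small gradient, computed in FP32

# Without scaling: below the smallest FP16 subnormal (about 6e-8), so it becomes 0
unscaled_fp16 = np.float16(true_grad)

# With scaling: multiply by S before the FP16 cast, divide by S afterwards in FP32
S = np.float32(65536.0)        # 2**16, an illustrative choice
scaled_fp16 = np.float16(true_grad * S)
recovered = np.float32(scaled_fp16) / S

def update_scale(scale, found_overflow, growth=2.0, backoff=0.5):
    """One step of a simplified dynamic loss-scaling rule (hypothetical helper)."""
    return scale * backoff if found_overflow else scale * growth

print("FP16 without scaling:", unscaled_fp16)
print("recovered after scaling:", recovered)
print("scale after overflow:", update_scale(1024.0, found_overflow=True))
```

Using a power of two for $S$ means the multiplication and division themselves introduce no rounding error; only the FP16 cast does.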
BFloat16 vs Float16
| Property | FP16 | BF16 |
|---|---|---|
| Exponent bits | 5 | 8 |
| Mantissa bits | 10 | 7 |
| Range | $\pm 6.5 \times 10^{4}$ | $\pm 3.4 \times 10^{38}$ |
| Precision | Higher | Lower |
| Loss scaling needed? | Yes | Usually not |
BF16 has the same range as FP32 (8 exponent bits), which eliminates most overflow/underflow issues. It is becoming the preferred format for LLM training.
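The range-versus-precision trade-off can be illustrated without BF16 hardware. The sketch below emulates BF16 by keeping only the top 16 bits of an FP32 value; this truncates instead of rounding to nearest, so it is a rough model, and `truncate_to_bf16` is a hypothetical helper.

```python
import numpy as np

def truncate_to_bf16(x):
    """Emulate BF16 by zeroing the low 16 bits of each FP32 value.

    Truncation is an assumption made for simplicity; real hardware
    typically rounds to nearest.
    """
    x32 = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x32.view(np.uint32) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

values = np.array([70000.0, 1e-10, 1.001], dtype=np.float32)

with np.errstate(over="ignore", under="ignore"):
    as_fp16 = values.astype(np.float16)   # large value -> inf, tiny value -> 0

as_bf16 = truncate_to_bf16(values)        # range survives, digits are coarser

print("FP16:", as_fp16)
print("BF16 (emulated):", as_bf16)
```

FP16 loses the large and tiny values entirely but keeps 1.001 to about three digits, while the emulated BF16 keeps all three magnitudes and collapses 1.001 to 1.0.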
Numerical Issues in Specific Operations
Log of Small Probabilities
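Computing $\log(1 - p)$ when $p \approx 0$ loses precision: $1 - p$ rounds to a value next to 1, and most digits of $p$ are discarded before the logarithm is taken. Use log1p:
```python
import numpy as np

p = 1e-20  # illustrative tiny probability
# Bad: 1 - p rounds to 1.0 when p ≈ 0, so the result is 0.0
result = np.log(1 - p)
# Good: numerically stable
result = np.log1p(-p)
```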
Similarly, expm1(x) computes $e^x - 1$ accurately for small $|x|$.
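A side-by-side comparison makes the loss of precision visible. This is a short sketch in FP64 with an illustrative input of $10^{-20}$.

```python
import numpy as np

p = 1e-20
naive_log = np.log(1.0 - p)    # 1.0 - p rounds to exactly 1.0
stable_log = np.log1p(-p)      # keeps the leading term -p

x = 1e-20
naive_exp = np.exp(x) - 1.0    # exp(x) rounds to exactly 1.0
stable_exp = np.expm1(x)       # keeps the leading term x

print("log(1 - p):", naive_log, " log1p(-p):", stable_log)
print("exp(x) - 1:", naive_exp, " expm1(x):", stable_exp)
```

The naive forms return exactly zero, a 100% relative error, while the dedicated functions return the correct first-order value.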
Sigmoid and Tanh Saturation
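The sigmoid $\sigma(x) = 1 / (1 + e^{-x})$ saturates for large $|x|$:
- $x \gg 0$: $\sigma(x) \approx 1$, gradient $\sigma'(x) \approx 0$
- $x \ll 0$: $\sigma(x) \approx 0$, gradient $\sigma'(x) \approx 0$
This is the vanishing gradient problem for sigmoid/tanh networks. The numerical manifestation is that FP16 rounds $\sigma(x)$ to exactly 0 or 1 for moderate $|x|$, making the gradient $\sigma(x)(1 - \sigma(x))$ exactly zero.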
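The saturation effect can be checked with NumPy's float16. The sketch below evaluates the sigmoid and its gradient at $x = 10$ in FP16 and FP64; the input value and helper names are illustrative.

```python
import numpy as np

def sigmoid(x):
    one = x.dtype.type(1)            # keep the arithmetic in the input's dtype
    return one / (one + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (x.dtype.type(1) - s)

x16 = np.array([10.0], dtype=np.float16)
x64 = np.array([10.0], dtype=np.float64)

print("FP16 sigmoid:", sigmoid(x16), "gradient:", sigmoid_grad(x16))
print("FP64 sigmoid:", sigmoid(x64), "gradient:", sigmoid_grad(x64))
```

In FP64 the gradient is small but nonzero (about $4.5 \times 10^{-5}$); in FP16 the sigmoid rounds to exactly 1, so the gradient is exactly 0 and no learning signal passes through.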
Batch Normalization Numerical Issues
Batch normalization divides by $\sqrt{\sigma_B^2 + \epsilon}$, where $\sigma_B^2$ is the batch variance. If the batch size is very small, the variance estimate can be near zero, causing division instability. The $\epsilon$ term (typically $10^{-5}$) prevents this.
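A tiny batch with one constant feature shows what the stabilizing term buys. This is a simplified sketch of the normalization step only (no learned scale or shift), with a hypothetical `batch_norm` helper.

```python
import numpy as np

def batch_norm(x, eps):
    # Normalize each feature (column) using batch statistics
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

# Two samples; the second feature is constant, so its variance is exactly 0
x = np.array([[1.0, 5.0],
              [3.0, 5.0]])

with np.errstate(invalid="ignore", divide="ignore"):
    without_eps = batch_norm(x, eps=0.0)   # second column is 0/0

with_eps = batch_norm(x, eps=1e-5)         # second column is 0

print("eps = 0:\n", without_eps)
print("eps = 1e-5:\n", with_eps)
```

Without the stabilizer the constant feature becomes NaN and would contaminate every later layer; with it, the feature is mapped to zero.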
Attention Score Overflow
In transformers, attention scores $QK^\top$ can overflow in FP16 before the softmax. The $1/\sqrt{d_k}$ scaling prevents this for typical dimensions, but very large $d_k$ or unnormalized queries/keys can still cause issues.
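The sketch below illustrates the overflow on a single query-key pair. The dimension of 128 and the entry value of 30 are assumptions chosen to stand in for unnormalized activations; the dot product is accumulated in FP32 and only the final score is cast to FP16.

```python
import numpy as np

d_k = 128
q = np.full(d_k, 30.0, dtype=np.float32)   # unnormalized query (illustrative)
k = np.full(d_k, 30.0, dtype=np.float32)   # unnormalized key (illustrative)

raw_score = np.dot(q, k)                               # 128 * 900 = 115200
scaled_score = raw_score / np.sqrt(np.float32(d_k))    # about 10182

with np.errstate(over="ignore"):
    raw_fp16 = np.float16(raw_score)         # above 65504, becomes inf
    scaled_fp16 = np.float16(scaled_score)   # representable in FP16

print("raw score in FP16:", raw_fp16)
print("scaled score in FP16:", scaled_fp16)
```

The scaled score fits, but it is still very large, which is why normalizing queries and keys matters in addition to the scaling.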
Gradient Checkpointing
Gradient checkpointing (activation recomputation) is not strictly a stability technique but addresses the memory bottleneck of storing all activations for backpropagation.
Instead of storing all intermediate activations during the forward pass, only store activations at selected checkpoints. During the backward pass, recompute intermediate activations from the nearest checkpoint.
Trade-off: reduces activation memory from $O(n)$ to $O(\sqrt{n})$ for $n$ layers, at the cost of one extra forward pass (roughly 2x forward compute in exchange for a $\sqrt{n}$-fold memory reduction).
This enables training much deeper or larger models on limited GPU memory.
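The recomputation scheme can be sketched on a toy chain of layers. The example below assumes each layer is $x \mapsto \tanh(w x)$ with a scalar weight and the loss is the sum of the final output; all function names are hypothetical, and real frameworks handle this inside their autograd machinery.

```python
import numpy as np

def layer_forward(x, w):
    return np.tanh(w * x)

def layer_backward(x, w, grad_out):
    # d/dx tanh(w x) = w * (1 - tanh(w x)^2)
    y = np.tanh(w * x)
    return grad_out * w * (1.0 - y ** 2)

def backward_full(x0, weights):
    """Baseline: store every activation (O(n) memory)."""
    acts = [x0]
    for w in weights:
        acts.append(layer_forward(acts[-1], w))
    grad = np.ones_like(x0)                      # loss = sum of final output
    for i in reversed(range(len(weights))):
        grad = layer_backward(acts[i], weights[i], grad)
    return grad

def backward_checkpointed(x0, weights, segment):
    """Store only every `segment`-th activation and recompute the rest."""
    n = len(weights)
    checkpoints = {0: x0}
    x = x0
    for i, w in enumerate(weights):              # forward pass, sparse storage
        x = layer_forward(x, w)
        if (i + 1) % segment == 0 and (i + 1) < n:
            checkpoints[i + 1] = x
    grad = np.ones_like(x0)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, n)
        acts = [checkpoints[start]]              # recompute inputs of this segment
        for i in range(start, end - 1):
            acts.append(layer_forward(acts[-1], weights[i]))
        for i in reversed(range(start, end)):
            grad = layer_backward(acts[i - start], weights[i], grad)
    return grad

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
weights = list(rng.normal(size=10))

full = backward_full(x0, weights)
ckpt = backward_checkpointed(x0, weights, segment=3)
print("max difference:", np.max(np.abs(full - ckpt)))
```

Both functions produce the same gradient; the checkpointed version holds at most the stored checkpoints plus one segment of activations at a time, and choosing a segment length near $\sqrt{n}$ balances the two.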
Debugging Numerical Issues
Common Symptoms and Causes
| Symptom | Likely cause | Fix |
|---|---|---|
| NaN loss | Gradient explosion, log(0), 0/0 | Gradient clipping, log-sum-exp, check data |
| Loss stuck at constant | Saturated activations, dead ReLU | Check initialization, use LeakyReLU |
| Loss oscillates wildly | Learning rate too high | Reduce LR, add warmup |
| Loss decreases then explodes | Gradient explosion at specific input | Gradient clipping, check for outliers |
| Very slow convergence | Vanishing gradients, tiny LR | Residual connections, increase LR |
| Inf values in weights | Unbounded optimization, no regularization | Add weight decay, clip gradients |
Debugging Checklist
- Check for NaN/Inf: assert torch.isfinite(loss).all() catches both; add such assertions early
- Monitor gradient norms: Log $\|\nabla_\theta L\|$ per layer to detect explosion/vanishing
- Histogram weights and gradients: Look for distribution collapse (all near zero) or explosion
- Test with FP64: If training works in FP64 but fails in FP16/FP32, it is a precision issue
- Reduce learning rate: Many “mysterious” training failures are simply learning rate too high
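The first two checklist items can be sketched framework-independently. The example below assumes gradients are available as a dictionary of named NumPy arrays; the layer names and helper functions are hypothetical.

```python
import numpy as np

def find_non_finite(named_arrays):
    """Return the names of arrays that contain NaN or Inf."""
    return [name for name, a in named_arrays.items()
            if not np.all(np.isfinite(a))]

def gradient_norms(named_grads):
    """Per-layer L2 norms, suitable for logging every few steps."""
    return {name: float(np.linalg.norm(g)) for name, g in named_grads.items()}

grads = {
    "layer1.weight": np.array([1e-9, -2e-9]),   # suspiciously small (vanishing)
    "layer2.weight": np.array([3.0, 4.0]),      # healthy
    "layer3.weight": np.array([np.inf, 1.0]),   # overflowed
}

bad = find_non_finite(grads)
norms = gradient_norms(grads)

print("non-finite gradients in:", bad)
for name, value in norms.items():
    print(f"{name}: {value:.3e}")
```

Logging per-layer norms rather than a single global norm helps localize where the problem originates.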
Key insight: Most numerical issues in practice are caused by one of three things: (1) overflow in exponentials (fix with log-sum-exp), (2) gradient explosion (fix with clipping), or (3) loss of precision in FP16 (fix with loss scaling or BF16). Knowing these three patterns resolves the large majority of numerically caused training failures.
Why This Matters for ML
Numerical stability is what separates a paper’s algorithm from a working implementation:
- Log-sum-exp is required for any computation involving softmax or log-probabilities — it is non-negotiable
- Gradient clipping is standard practice for RNNs, transformers, and GANs — without it, training is fragile
- Mixed precision (FP16/BF16) training can roughly double throughput and substantially cut memory use, which makes it essential for modern large models
- Gradient checkpointing enables training models that would not fit in GPU memory otherwise
- Understanding floating-point limitations explains many “mysterious” training failures and helps debug them systematically
Summary
- Computers use finite-precision arithmetic — overflow, underflow, and cancellation are real threats
- The log-sum-exp trick prevents overflow in softmax and cross-entropy — always use it
- Gradient clipping (norm or value) prevents catastrophic updates from exploding gradients
- Mixed precision (FP16/BF16 + FP32) can roughly double speed, with loss scaling for numerical safety
- BFloat16 matches FP32 range, reducing the need for loss scaling
- Use log1p, expm1, and fused operations (F.cross_entropy with logits) to avoid cancellation
- Gradient checkpointing trades compute for memory, enabling larger models
- Monitor gradient norms and check for NaN/Inf early to catch problems before they cascade
- Next: non-smooth optimization handles functions that are not differentiable everywhere