- 01 Limits and Continuity: The Foundation of Calculus 02 Derivatives and Differentiation: Measuring Rates of Change 03 Partial Derivatives and Gradients: Calculus in Multiple Dimensions 04 The Chain Rule and Computational Graphs: The Engine Behind Backpropagation 05 Taylor Series and Approximation: Local Models of Complex Functions 06 Gradient Descent: The Workhorse of Machine Learning Optimization 07 Stochastic Gradient Descent: Trading Precision for Speed 08 Adaptive Learning Rate Methods: From AdaGrad to Adam 09 Constrained Optimization: Lagrange Multipliers and KKT Conditions 10 Convexity and Convergence Theory: When Optimization Succeeds 11 Integration and Expectation: The Continuous Side of Probability 12 Calculus of Variations: Optimizing Over Functions 13 Second-Order and Natural Gradient Methods 14 Numerical Stability in Optimization: Making Training Work in Practice 15 Non-Smooth Optimization and Proximal Methods 16 Optimization Landscape of Neural Networks: Why Deep Learning Works 17 Implicit Differentiation and Differentiable Programming 18 Min-Max Optimization: Games, GANs, and Adversarial Training
Two Players, One Loss Function
Standard optimization has one objective and one player: minimize . Min-max optimization has two players with opposing goals:
Player 1 (the minimizer) chooses to make small. Player 2 (the maximizer) chooses to make large. The solution is a saddle point where neither player can improve unilaterally.
This structure appears throughout modern AI:
- GANs: Generator minimizes, discriminator maximizes
- Adversarial training: Model minimizes loss under worst-case perturbations
- RLHF: Policy maximizes reward while staying close to a reference
- Robust optimization: Minimize the worst-case loss over uncertainty sets
- Game theory: Nash equilibria are saddle points of payoff functions
Saddle Points and Nash Equilibria
A point is a saddle point of if:
This means minimizes and maximizes . Neither player benefits from deviating — this is a Nash equilibrium in game-theoretic terms.
Minimax vs Maximin
In general:
This is weak duality from constrained optimization. When equality holds (strong duality), a saddle point exists and the order of min and max does not matter.
For convex-concave functions ( convex in , concave in ), the minimax theorem (von Neumann, 1928) guarantees strong duality and the existence of a saddle point.
Gradient Descent-Ascent (GDA)
The simplest min-max algorithm alternates gradient steps:
Note the signs: descends (minimizes), ascends (maximizes).
The Instability Problem
Unlike standard gradient descent, GDA can diverge even on simple problems. Consider :
GDA gives and . The iterates spiral outward from the saddle point — they rotate rather than converge.
Key insight: Min-max optimization is fundamentally harder than minimization. The interaction between the two players creates rotational dynamics that standard gradient methods cannot handle. The gradient field around a saddle point is not a gradient of any function — it has a rotational (non-conservative) component that causes cycling.
Simultaneous vs Alternating Updates
- Simultaneous GDA: Both players update at the same time using gradients at the current point. Simpler but more prone to oscillation.
- Alternating GDA: One player updates, then the other uses the updated parameters. Slightly more stable but still oscillates on many problems.
Stabilizing Min-Max Optimization
Extra-Gradient Method
The extra-gradient method (Korpelevich, 1976) takes a “look-ahead” step before the actual update:
The extra gradient evaluation at the look-ahead point corrects the rotational component, enabling convergence. This is the min-max analog of Nesterov momentum.
Optimistic Gradient Descent-Ascent (OGDA)
OGDA approximates the extra-gradient method using the previous gradient instead of a look-ahead:
OGDA uses the gradient difference to estimate the rotational component and cancel it. It requires only one gradient evaluation per step (same cost as GDA).
Gradient Penalty
Adding a regularization term that penalizes the gradient norm:
This dampens the maximizer’s updates when the gradient is large, stabilizing training.
Generative Adversarial Networks (GANs)
GANs are the most prominent application of min-max optimization in deep learning. The generator and discriminator play a two-player game:
The discriminator maximizes its ability to distinguish real from fake. The generator minimizes the discriminator’s success — producing increasingly realistic samples.
Training Challenges
GAN training is notoriously unstable because:
- Non-convex-concave: Neither the generator’s nor the discriminator’s objective is convex
- Mode collapse: The generator produces limited variety, cycling through a few modes
- Vanishing gradients: When the discriminator is too strong, the generator gets no useful gradient signal
- Oscillation: The generator and discriminator chase each other without converging
Practical Stabilization Techniques
| Technique | How it helps |
|---|---|
| Spectral normalization | Constrains discriminator Lipschitz constant |
| Gradient penalty (WGAN-GP) | Enforces Lipschitz constraint via penalty |
| Two time-scale updates | Train discriminator more often or with larger LR |
| Progressive growing | Start small, gradually increase resolution |
| Non-saturating loss | Replace with for better gradients |
Wasserstein GAN (WGAN)
WGAN replaces the JS divergence with the Wasserstein distance, yielding:
where constrains the discriminator (called “critic”) to be 1-Lipschitz. This provides more meaningful gradients and more stable training.
Key insight: WGAN connects GANs to optimal transport theory. The Wasserstein distance provides gradients even when the distributions have non-overlapping support (where JS divergence fails). The Lipschitz constraint on the critic is enforced via gradient penalty or spectral normalization.
Adversarial Training and Robustness
Adversarial training makes models robust to worst-case input perturbations:
The inner maximization finds the worst-case perturbation (the adversarial example). The outer minimization trains the model to be robust against it.
PGD Adversarial Training
Projected Gradient Descent (PGD) approximately solves the inner maximization:
where projects back onto the -ball. This is typically run for 7-20 steps during training.
The outer minimization uses standard SGD or Adam on the adversarial loss.
The Robustness-Accuracy Trade-off
Adversarial training typically reduces clean accuracy while improving robustness. This trade-off is fundamental: a model that is invariant to perturbations within an -ball cannot use fine-grained features that vary within that ball.
RLHF and Constrained Policy Optimization
Reinforcement Learning from Human Feedback (RLHF) uses a min-max-like structure to align language models:
This maximizes reward while staying close to a reference policy . The KL constraint prevents the policy from degenerating to exploit the reward model.
The optimal solution has the form:
Direct Preference Optimization (DPO) reformulates this as a classification loss, avoiding the explicit reward model and the min-max structure entirely — a clever reparameterization that simplifies training.
Convergence Theory for Min-Max
Convex-Concave Games
For convex in and concave in :
- GDA converges with small enough learning rates
- Extra-gradient converges at rate
- Optimistic GDA also converges at
Non-Convex-Non-Concave Games
For general non-convex non-concave problems (like GANs):
- No general convergence guarantees exist
- GDA can cycle, diverge, or converge to limit cycles rather than fixed points
- Two time-scale updates (where the maximizer is “faster”) can help in specific settings
- The field is actively developing new algorithms and theory
Key insight: Min-max optimization is an active research frontier. Unlike minimization, where we have strong convergence theory for convex problems and good practical algorithms for non-convex ones, min-max optimization lacks both in the general non-convex case. GAN training works largely through empirical tricks rather than theoretical guarantees.
Comparison of Min-Max Methods
| Method | Cost per step | Stability | Theory |
|---|---|---|---|
| Simultaneous GDA | 1 gradient eval | Poor | Diverges on bilinear games |
| Alternating GDA | 1 gradient eval | Slightly better | Converges for some games |
| Extra-gradient | 2 gradient evals | Good | for convex-concave |
| OGDA | 1 gradient eval | Good | for convex-concave |
| GDA + gradient penalty | 1 gradient + penalty | Good for GANs | Empirical |
| Two time-scale GDA | 1 gradient, different LRs | Moderate | Theory for specific settings |
Why This Matters for ML
Min-max optimization powers some of the most impactful applications in modern AI:
- GANs generate realistic images, video, and audio through adversarial training
- Adversarial robustness protects models against worst-case attacks
- RLHF aligns language models with human preferences
- Distributionally robust optimization ensures models perform well under distribution shift
- Understanding min-max dynamics explains why GAN training is hard and guides the choice of stabilization techniques
- This extends all the optimization theory we have built to the multi-agent setting
Summary
- Min-max optimization involves two players with opposing objectives:
- The solution is a saddle point / Nash equilibrium where neither player can improve
- GDA is simple but can diverge due to rotational dynamics around saddle points
- Extra-gradient and OGDA correct the rotation, achieving convergence for convex-concave games
- GANs are min-max games between generator and discriminator — notoriously unstable without tricks like spectral normalization and gradient penalty
- Adversarial training solves an inner maximization (find worst-case attack) and outer minimization (train robust model)
- RLHF maximizes reward subject to a KL constraint from a reference policy
- Non-convex min-max optimization lacks strong theory — an active research frontier
- This completes the expanded calculus and optimization series, from limits to adversarial games
References
- Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS.
- Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. ICML. arXiv:1701.07875
- Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR. arXiv:1706.06083
- Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS. arXiv:2305.18290
- Daskalakis, C., Ilyas, A., Syrgkanis, V., & Zeng, H. (2018). Training GANs with Optimism. ICLR.
- Razaviyayn, M., Huang, T., Lu, S., Nouiehed, M., Sanjabi, M., & Hong, M. (2020). Nonconvex Min-Max Optimization. arXiv:2002.08394