Min-Max Optimization: Games, GANs, and Adversarial Training

Master min-max optimization for GANs, adversarial robustness, and RLHF — two-player games where one player minimizes while the other maximizes.

Calculus & Optimization March 7, 2026 9 min read

Two Players, One Loss Function

Standard optimization has one objective and one player: minimize L(θ)\mathcal{L}(\boldsymbol{\theta}). Min-max optimization has two players with opposing goals:

minθmaxϕF(θ,ϕ)\min_{\boldsymbol{\theta}} \max_{\boldsymbol{\phi}} F(\boldsymbol{\theta}, \boldsymbol{\phi})

Player 1 (the minimizer) chooses θ\boldsymbol{\theta} to make FF small. Player 2 (the maximizer) chooses ϕ\boldsymbol{\phi} to make FF large. The solution is a saddle point (θ,ϕ)(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*) where neither player can improve unilaterally.

This structure appears throughout modern AI:

  • GANs: Generator minimizes, discriminator maximizes
  • Adversarial training: Model minimizes loss under worst-case perturbations
  • RLHF: Policy maximizes reward while staying close to a reference
  • Robust optimization: Minimize the worst-case loss over uncertainty sets
  • Game theory: Nash equilibria are saddle points of payoff functions

Saddle Points and Nash Equilibria

A point (θ,ϕ)(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*) is a saddle point of FF if:

F(θ,ϕ)F(θ,ϕ)F(θ,ϕ)θ,ϕF(\boldsymbol{\theta}^*, \boldsymbol{\phi}) \leq F(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*) \leq F(\boldsymbol{\theta}, \boldsymbol{\phi}^*) \quad \forall \, \boldsymbol{\theta}, \boldsymbol{\phi}

This means θ\boldsymbol{\theta}^* minimizes F(,ϕ)F(\cdot, \boldsymbol{\phi}^*) and ϕ\boldsymbol{\phi}^* maximizes F(θ,)F(\boldsymbol{\theta}^*, \cdot). Neither player benefits from deviating — this is a Nash equilibrium in game-theoretic terms.

Minimax vs Maximin

In general:

maxϕminθF(θ,ϕ)minθmaxϕF(θ,ϕ)\max_{\boldsymbol{\phi}} \min_{\boldsymbol{\theta}} F(\boldsymbol{\theta}, \boldsymbol{\phi}) \leq \min_{\boldsymbol{\theta}} \max_{\boldsymbol{\phi}} F(\boldsymbol{\theta}, \boldsymbol{\phi})

This is weak duality from constrained optimization. When equality holds (strong duality), a saddle point exists and the order of min and max does not matter.

For convex-concave functions (FF convex in θ\boldsymbol{\theta}, concave in ϕ\boldsymbol{\phi}), the minimax theorem (von Neumann, 1928) guarantees strong duality and the existence of a saddle point.

Gradient Descent-Ascent (GDA)

The simplest min-max algorithm alternates gradient steps:

θt+1=θtαθF(θt,ϕt)ϕt+1=ϕt+βϕF(θt,ϕt)\begin{aligned} \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} F(\boldsymbol{\theta}_t, \boldsymbol{\phi}_t) \\[6pt] \boldsymbol{\phi}_{t+1} &= \boldsymbol{\phi}_t + \beta \nabla_{\boldsymbol{\phi}} F(\boldsymbol{\theta}_t, \boldsymbol{\phi}_t) \end{aligned}

Note the signs: θ\boldsymbol{\theta} descends (minimizes), ϕ\boldsymbol{\phi} ascends (maximizes).

The Instability Problem

Unlike standard gradient descent, GDA can diverge even on simple problems. Consider F(θ,ϕ)=θϕF(\theta, \phi) = \theta \phi:

θF=ϕ,ϕF=θ\nabla_\theta F = \phi, \quad \nabla_\phi F = \theta

GDA gives θt+1=θtαϕt\theta_{t+1} = \theta_t - \alpha\phi_t and ϕt+1=ϕt+αθt\phi_{t+1} = \phi_t + \alpha\theta_t. The iterates spiral outward from the saddle point (0,0)(0, 0) — they rotate rather than converge.

Key insight: Min-max optimization is fundamentally harder than minimization. The interaction between the two players creates rotational dynamics that standard gradient methods cannot handle. The gradient field around a saddle point is not a gradient of any function — it has a rotational (non-conservative) component that causes cycling.

Simultaneous vs Alternating Updates

  • Simultaneous GDA: Both players update at the same time using gradients at the current point. Simpler but more prone to oscillation.
  • Alternating GDA: One player updates, then the other uses the updated parameters. Slightly more stable but still oscillates on many problems.

Stabilizing Min-Max Optimization

Extra-Gradient Method

The extra-gradient method (Korpelevich, 1976) takes a “look-ahead” step before the actual update:

θˉ=θtαθF(θt,ϕt)ϕˉ=ϕt+αϕF(θt,ϕt)θt+1=θtαθF(θˉ,ϕˉ)ϕt+1=ϕt+αϕF(θˉ,ϕˉ)\begin{aligned} \bar{\boldsymbol{\theta}} &= \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} F(\boldsymbol{\theta}_t, \boldsymbol{\phi}_t) \\[4pt] \bar{\boldsymbol{\phi}} &= \boldsymbol{\phi}_t + \alpha \nabla_{\boldsymbol{\phi}} F(\boldsymbol{\theta}_t, \boldsymbol{\phi}_t) \\[6pt] \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} F(\bar{\boldsymbol{\theta}}, \bar{\boldsymbol{\phi}}) \\[4pt] \boldsymbol{\phi}_{t+1} &= \boldsymbol{\phi}_t + \alpha \nabla_{\boldsymbol{\phi}} F(\bar{\boldsymbol{\theta}}, \bar{\boldsymbol{\phi}}) \end{aligned}

The extra gradient evaluation at the look-ahead point corrects the rotational component, enabling convergence. This is the min-max analog of Nesterov momentum.

Optimistic Gradient Descent-Ascent (OGDA)

OGDA approximates the extra-gradient method using the previous gradient instead of a look-ahead:

θt+1=θt2αθFt+αθFt1ϕt+1=ϕt+2αϕFtαϕFt1\begin{aligned} \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - 2\alpha \nabla_{\boldsymbol{\theta}} F_t + \alpha \nabla_{\boldsymbol{\theta}} F_{t-1} \\[4pt] \boldsymbol{\phi}_{t+1} &= \boldsymbol{\phi}_t + 2\alpha \nabla_{\boldsymbol{\phi}} F_t - \alpha \nabla_{\boldsymbol{\phi}} F_{t-1} \end{aligned}

OGDA uses the gradient difference FtFt1\nabla F_t - \nabla F_{t-1} to estimate the rotational component and cancel it. It requires only one gradient evaluation per step (same cost as GDA).

Gradient Penalty

Adding a regularization term that penalizes the gradient norm:

F~(θ,ϕ)=F(θ,ϕ)+γ2ϕF2\tilde{F}(\boldsymbol{\theta}, \boldsymbol{\phi}) = F(\boldsymbol{\theta}, \boldsymbol{\phi}) + \frac{\gamma}{2}\|\nabla_{\boldsymbol{\phi}} F\|^2

This dampens the maximizer’s updates when the gradient is large, stabilizing training.

Generative Adversarial Networks (GANs)

GANs are the most prominent application of min-max optimization in deep learning. The generator GθG_{\boldsymbol{\theta}} and discriminator DϕD_{\boldsymbol{\phi}} play a two-player game:

minθmaxϕExpdata[logDϕ(x)]+Ezp(z)[log(1Dϕ(Gθ(z)))]\min_{\boldsymbol{\theta}} \max_{\boldsymbol{\phi}} \, \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D_{\boldsymbol{\phi}}(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D_{\boldsymbol{\phi}}(G_{\boldsymbol{\theta}}(\mathbf{z})))]

The discriminator maximizes its ability to distinguish real from fake. The generator minimizes the discriminator’s success — producing increasingly realistic samples.

Training Challenges

GAN training is notoriously unstable because:

  1. Non-convex-concave: Neither the generator’s nor the discriminator’s objective is convex
  2. Mode collapse: The generator produces limited variety, cycling through a few modes
  3. Vanishing gradients: When the discriminator is too strong, the generator gets no useful gradient signal
  4. Oscillation: The generator and discriminator chase each other without converging

Practical Stabilization Techniques

TechniqueHow it helps
Spectral normalizationConstrains discriminator Lipschitz constant
Gradient penalty (WGAN-GP)Enforces Lipschitz constraint via penalty
Two time-scale updatesTrain discriminator more often or with larger LR
Progressive growingStart small, gradually increase resolution
Non-saturating lossReplace log(1D(G(z)))\log(1 - D(G(z))) with log(D(G(z)))-\log(D(G(z))) for better gradients

Wasserstein GAN (WGAN)

WGAN replaces the JS divergence with the Wasserstein distance, yielding:

minθmaxϕFLEx[Dϕ(x)]Ez[Dϕ(Gθ(z))]\min_{\boldsymbol{\theta}} \max_{\boldsymbol{\phi} \in \mathcal{F}_L} \, \mathbb{E}_{\mathbf{x}}[D_{\boldsymbol{\phi}}(\mathbf{x})] - \mathbb{E}_{\mathbf{z}}[D_{\boldsymbol{\phi}}(G_{\boldsymbol{\theta}}(\mathbf{z}))]

where FL\mathcal{F}_L constrains the discriminator (called “critic”) to be 1-Lipschitz. This provides more meaningful gradients and more stable training.

Key insight: WGAN connects GANs to optimal transport theory. The Wasserstein distance provides gradients even when the distributions have non-overlapping support (where JS divergence fails). The Lipschitz constraint on the critic is enforced via gradient penalty or spectral normalization.

Adversarial Training and Robustness

Adversarial training makes models robust to worst-case input perturbations:

minθE(x,y)[maxδϵL(θ;x+δ,y)]\min_{\boldsymbol{\theta}} \mathbb{E}_{(\mathbf{x}, y)}\left[\max_{\|\boldsymbol{\delta}\| \leq \epsilon} \mathcal{L}(\boldsymbol{\theta}; \mathbf{x} + \boldsymbol{\delta}, y)\right]

The inner maximization finds the worst-case perturbation δ\boldsymbol{\delta} (the adversarial example). The outer minimization trains the model to be robust against it.

PGD Adversarial Training

Projected Gradient Descent (PGD) approximately solves the inner maximization:

δk+1=Πδϵ(δk+αsign(δL))\boldsymbol{\delta}_{k+1} = \Pi_{\|\boldsymbol{\delta}\| \leq \epsilon}\left(\boldsymbol{\delta}_k + \alpha \, \text{sign}(\nabla_{\boldsymbol{\delta}} \mathcal{L})\right)

where Π\Pi projects back onto the ϵ\epsilon-ball. This is typically run for 7-20 steps during training.

The outer minimization uses standard SGD or Adam on the adversarial loss.

The Robustness-Accuracy Trade-off

Adversarial training typically reduces clean accuracy while improving robustness. This trade-off is fundamental: a model that is invariant to perturbations within an ϵ\epsilon-ball cannot use fine-grained features that vary within that ball.

RLHF and Constrained Policy Optimization

Reinforcement Learning from Human Feedback (RLHF) uses a min-max-like structure to align language models:

maxπExπ[r(x)]βDKL(ππref)\max_{\pi} \mathbb{E}_{\mathbf{x} \sim \pi}[r(\mathbf{x})] - \beta \, D_{KL}(\pi \| \pi_{\text{ref}})

This maximizes reward while staying close to a reference policy πref\pi_{\text{ref}}. The KL constraint prevents the policy from degenerating to exploit the reward model.

The optimal solution has the form:

π(x)πref(x)exp(r(x)/β)\pi^*(\mathbf{x}) \propto \pi_{\text{ref}}(\mathbf{x}) \exp(r(\mathbf{x}) / \beta)

Direct Preference Optimization (DPO) reformulates this as a classification loss, avoiding the explicit reward model and the min-max structure entirely — a clever reparameterization that simplifies training.

Convergence Theory for Min-Max

Convex-Concave Games

For FF convex in θ\boldsymbol{\theta} and concave in ϕ\boldsymbol{\phi}:

  • GDA converges with small enough learning rates
  • Extra-gradient converges at rate O(1/t)O(1/t)
  • Optimistic GDA also converges at O(1/t)O(1/t)

Non-Convex-Non-Concave Games

For general non-convex non-concave problems (like GANs):

  • No general convergence guarantees exist
  • GDA can cycle, diverge, or converge to limit cycles rather than fixed points
  • Two time-scale updates (where the maximizer is “faster”) can help in specific settings
  • The field is actively developing new algorithms and theory

Key insight: Min-max optimization is an active research frontier. Unlike minimization, where we have strong convergence theory for convex problems and good practical algorithms for non-convex ones, min-max optimization lacks both in the general non-convex case. GAN training works largely through empirical tricks rather than theoretical guarantees.

Comparison of Min-Max Methods

MethodCost per stepStabilityTheory
Simultaneous GDA1 gradient evalPoorDiverges on bilinear games
Alternating GDA1 gradient evalSlightly betterConverges for some games
Extra-gradient2 gradient evalsGoodO(1/t)O(1/t) for convex-concave
OGDA1 gradient evalGoodO(1/t)O(1/t) for convex-concave
GDA + gradient penalty1 gradient + penaltyGood for GANsEmpirical
Two time-scale GDA1 gradient, different LRsModerateTheory for specific settings

Why This Matters for ML

Min-max optimization powers some of the most impactful applications in modern AI:

  • GANs generate realistic images, video, and audio through adversarial training
  • Adversarial robustness protects models against worst-case attacks
  • RLHF aligns language models with human preferences
  • Distributionally robust optimization ensures models perform well under distribution shift
  • Understanding min-max dynamics explains why GAN training is hard and guides the choice of stabilization techniques
  • This extends all the optimization theory we have built to the multi-agent setting

Summary

  • Min-max optimization involves two players with opposing objectives: minθmaxϕF(θ,ϕ)\min_\theta \max_\phi F(\theta, \phi)
  • The solution is a saddle point / Nash equilibrium where neither player can improve
  • GDA is simple but can diverge due to rotational dynamics around saddle points
  • Extra-gradient and OGDA correct the rotation, achieving O(1/t)O(1/t) convergence for convex-concave games
  • GANs are min-max games between generator and discriminator — notoriously unstable without tricks like spectral normalization and gradient penalty
  • Adversarial training solves an inner maximization (find worst-case attack) and outer minimization (train robust model)
  • RLHF maximizes reward subject to a KL constraint from a reference policy
  • Non-convex min-max optimization lacks strong theory — an active research frontier
  • This completes the expanded calculus and optimization series, from limits to adversarial games

References

  • Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS.
  • Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. ICML. arXiv:1701.07875
  • Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR. arXiv:1706.06083
  • Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS. arXiv:2305.18290
  • Daskalakis, C., Ilyas, A., Syrgkanis, V., & Zeng, H. (2018). Training GANs with Optimism. ICLR.
  • Razaviyayn, M., Huang, T., Lu, S., Nouiehed, M., Sanjabi, M., & Hong, M. (2020). Nonconvex Min-Max Optimization. arXiv:2002.08394

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay