Min-Max Optimization: Games, GANs, and Adversarial Training

Calculus & Optimization Series 18 / 18

Two Players, One Loss Function

Standard optimization has one objective and one player: minimize $\mathcal{L}(\boldsymbol{\theta})$ . Min-max optimization has two players with opposing goals:

\min_{\boldsymbol{\theta}} \max_{\boldsymbol{\phi}} F(\boldsymbol{\theta}, \boldsymbol{\phi})

Player 1 (the minimizer) chooses $\boldsymbol{\theta}$ to make $F$ small. Player 2 (the maximizer) chooses $\boldsymbol{\phi}$ to make $F$ large. The solution is a saddle point $(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*)$ where neither player can improve unilaterally.

This structure appears throughout modern AI:

GANs: Generator minimizes, discriminator maximizes
Adversarial training: Model minimizes loss under worst-case perturbations
RLHF: Policy maximizes reward while staying close to a reference
Robust optimization: Minimize the worst-case loss over uncertainty sets
Game theory: Nash equilibria are saddle points of payoff functions

Saddle Points and Nash Equilibria

A point $(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*)$ is a saddle point of $F$ if:

F(\boldsymbol{\theta}^*, \boldsymbol{\phi}) \leq F(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*) \leq F(\boldsymbol{\theta}, \boldsymbol{\phi}^*) \quad \forall \, \boldsymbol{\theta}, \boldsymbol{\phi}

This means $\boldsymbol{\theta}^*$ minimizes $F(\cdot, \boldsymbol{\phi}^*)$ and $\boldsymbol{\phi}^*$ maximizes $F(\boldsymbol{\theta}^*, \cdot)$ . Neither player benefits from deviating — this is a Nash equilibrium in game-theoretic terms.

Minimax vs Maximin

In general:

\max_{\boldsymbol{\phi}} \min_{\boldsymbol{\theta}} F(\boldsymbol{\theta}, \boldsymbol{\phi}) \leq \min_{\boldsymbol{\theta}} \max_{\boldsymbol{\phi}} F(\boldsymbol{\theta}, \boldsymbol{\phi})

This is weak duality from constrained optimization. When equality holds (strong duality), a saddle point exists and the order of min and max does not matter.

For convex-concave functions ( $F$ convex in $\boldsymbol{\theta}$ , concave in $\boldsymbol{\phi}$ ), the minimax theorem (von Neumann, 1928) guarantees strong duality and the existence of a saddle point.

Gradient Descent-Ascent (GDA)

The simplest min-max algorithm alternates gradient steps:

\begin{aligned} \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} F(\boldsymbol{\theta}_t, \boldsymbol{\phi}_t) \\[6pt] \boldsymbol{\phi}_{t+1} &= \boldsymbol{\phi}_t + \beta \nabla_{\boldsymbol{\phi}} F(\boldsymbol{\theta}_t, \boldsymbol{\phi}_t) \end{aligned}

Note the signs: $\boldsymbol{\theta}$ descends (minimizes), $\boldsymbol{\phi}$ ascends (maximizes).

The Instability Problem

Unlike standard gradient descent, GDA can diverge even on simple problems. Consider $F(\theta, \phi) = \theta \phi$ :

\nabla_\theta F = \phi, \quad \nabla_\phi F = \theta

GDA gives $\theta_{t+1} = \theta_t - \alpha\phi_t$ and $\phi_{t+1} = \phi_t + \alpha\theta_t$ . The iterates spiral outward from the saddle point $(0, 0)$ — they rotate rather than converge.

Key insight: Min-max optimization is fundamentally harder than minimization. The interaction between the two players creates rotational dynamics that standard gradient methods cannot handle. The gradient field around a saddle point is not a gradient of any function — it has a rotational (non-conservative) component that causes cycling.

Simultaneous vs Alternating Updates

Simultaneous GDA: Both players update at the same time using gradients at the current point. Simpler but more prone to oscillation.
Alternating GDA: One player updates, then the other uses the updated parameters. Slightly more stable but still oscillates on many problems.

Stabilizing Min-Max Optimization

Extra-Gradient Method

The extra-gradient method (Korpelevich, 1976) takes a “look-ahead” step before the actual update:

\begin{aligned} \bar{\boldsymbol{\theta}} &= \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} F(\boldsymbol{\theta}_t, \boldsymbol{\phi}_t) \\[4pt] \bar{\boldsymbol{\phi}} &= \boldsymbol{\phi}_t + \alpha \nabla_{\boldsymbol{\phi}} F(\boldsymbol{\theta}_t, \boldsymbol{\phi}_t) \\[6pt] \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \alpha \nabla_{\boldsymbol{\theta}} F(\bar{\boldsymbol{\theta}}, \bar{\boldsymbol{\phi}}) \\[4pt] \boldsymbol{\phi}_{t+1} &= \boldsymbol{\phi}_t + \alpha \nabla_{\boldsymbol{\phi}} F(\bar{\boldsymbol{\theta}}, \bar{\boldsymbol{\phi}}) \end{aligned}

The extra gradient evaluation at the look-ahead point corrects the rotational component, enabling convergence. This is the min-max analog of Nesterov momentum.

Optimistic Gradient Descent-Ascent (OGDA)

OGDA approximates the extra-gradient method using the previous gradient instead of a look-ahead:

\begin{aligned} \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - 2\alpha \nabla_{\boldsymbol{\theta}} F_t + \alpha \nabla_{\boldsymbol{\theta}} F_{t-1} \\[4pt] \boldsymbol{\phi}_{t+1} &= \boldsymbol{\phi}_t + 2\alpha \nabla_{\boldsymbol{\phi}} F_t - \alpha \nabla_{\boldsymbol{\phi}} F_{t-1} \end{aligned}

OGDA uses the gradient difference $\nabla F_t - \nabla F_{t-1}$ to estimate the rotational component and cancel it. It requires only one gradient evaluation per step (same cost as GDA).

Gradient Penalty

Adding a regularization term that penalizes the gradient norm:

\tilde{F}(\boldsymbol{\theta}, \boldsymbol{\phi}) = F(\boldsymbol{\theta}, \boldsymbol{\phi}) + \frac{\gamma}{2}\|\nabla_{\boldsymbol{\phi}} F\|^2

This dampens the maximizer’s updates when the gradient is large, stabilizing training.

Generative Adversarial Networks (GANs)

GANs are the most prominent application of min-max optimization in deep learning. The generator $G_{\boldsymbol{\theta}}$ and discriminator $D_{\boldsymbol{\phi}}$ play a two-player game:

\min_{\boldsymbol{\theta}} \max_{\boldsymbol{\phi}} \, \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D_{\boldsymbol{\phi}}(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D_{\boldsymbol{\phi}}(G_{\boldsymbol{\theta}}(\mathbf{z})))]

The discriminator maximizes its ability to distinguish real from fake. The generator minimizes the discriminator’s success — producing increasingly realistic samples.

Training Challenges

GAN training is notoriously unstable because:

Non-convex-concave: Neither the generator’s nor the discriminator’s objective is convex
Mode collapse: The generator produces limited variety, cycling through a few modes
Vanishing gradients: When the discriminator is too strong, the generator gets no useful gradient signal
Oscillation: The generator and discriminator chase each other without converging

Practical Stabilization Techniques

Technique	How it helps
Spectral normalization	Constrains discriminator Lipschitz constant
Gradient penalty (WGAN-GP)	Enforces Lipschitz constraint via penalty
Two time-scale updates	Train discriminator more often or with larger LR
Progressive growing	Start small, gradually increase resolution
Non-saturating loss	Replace $\log(1 - D(G(z)))$ with $-\log(D(G(z)))$ for better gradients

Wasserstein GAN (WGAN)

WGAN replaces the JS divergence with the Wasserstein distance, yielding:

\min_{\boldsymbol{\theta}} \max_{\boldsymbol{\phi} \in \mathcal{F}_L} \, \mathbb{E}_{\mathbf{x}}[D_{\boldsymbol{\phi}}(\mathbf{x})] - \mathbb{E}_{\mathbf{z}}[D_{\boldsymbol{\phi}}(G_{\boldsymbol{\theta}}(\mathbf{z}))]

where $\mathcal{F}_L$ constrains the discriminator (called “critic”) to be 1-Lipschitz. This provides more meaningful gradients and more stable training.

Key insight: WGAN connects GANs to optimal transport theory. The Wasserstein distance provides gradients even when the distributions have non-overlapping support (where JS divergence fails). The Lipschitz constraint on the critic is enforced via gradient penalty or spectral normalization.

Adversarial Training and Robustness

Adversarial training makes models robust to worst-case input perturbations:

\min_{\boldsymbol{\theta}} \mathbb{E}_{(\mathbf{x}, y)}\left[\max_{\|\boldsymbol{\delta}\| \leq \epsilon} \mathcal{L}(\boldsymbol{\theta}; \mathbf{x} + \boldsymbol{\delta}, y)\right]

The inner maximization finds the worst-case perturbation $\boldsymbol{\delta}$ (the adversarial example). The outer minimization trains the model to be robust against it.

PGD Adversarial Training

Projected Gradient Descent (PGD) approximately solves the inner maximization:

\boldsymbol{\delta}_{k+1} = \Pi_{\|\boldsymbol{\delta}\| \leq \epsilon}\left(\boldsymbol{\delta}_k + \alpha \, \text{sign}(\nabla_{\boldsymbol{\delta}} \mathcal{L})\right)

where $\Pi$ projects back onto the $\epsilon$ -ball. This is typically run for 7-20 steps during training.

The outer minimization uses standard SGD or Adam on the adversarial loss.

The Robustness-Accuracy Trade-off

Adversarial training typically reduces clean accuracy while improving robustness. This trade-off is fundamental: a model that is invariant to perturbations within an $\epsilon$ -ball cannot use fine-grained features that vary within that ball.

RLHF and Constrained Policy Optimization

Reinforcement Learning from Human Feedback (RLHF) uses a min-max-like structure to align language models:

\max_{\pi} \mathbb{E}_{\mathbf{x} \sim \pi}[r(\mathbf{x})] - \beta \, D_{KL}(\pi \| \pi_{\text{ref}})

This maximizes reward while staying close to a reference policy $\pi_{\text{ref}}$ . The KL constraint prevents the policy from degenerating to exploit the reward model.

The optimal solution has the form:

\pi^*(\mathbf{x}) \propto \pi_{\text{ref}}(\mathbf{x}) \exp(r(\mathbf{x}) / \beta)

Direct Preference Optimization (DPO) reformulates this as a classification loss, avoiding the explicit reward model and the min-max structure entirely — a clever reparameterization that simplifies training.

Convergence Theory for Min-Max

Convex-Concave Games

For $F$ convex in $\boldsymbol{\theta}$ and concave in $\boldsymbol{\phi}$ :

GDA converges with small enough learning rates
Extra-gradient converges at rate $O(1/t)$
Optimistic GDA also converges at $O(1/t)$

Non-Convex-Non-Concave Games

For general non-convex non-concave problems (like GANs):

No general convergence guarantees exist
GDA can cycle, diverge, or converge to limit cycles rather than fixed points
Two time-scale updates (where the maximizer is “faster”) can help in specific settings
The field is actively developing new algorithms and theory

Key insight: Min-max optimization is an active research frontier. Unlike minimization, where we have strong convergence theory for convex problems and good practical algorithms for non-convex ones, min-max optimization lacks both in the general non-convex case. GAN training works largely through empirical tricks rather than theoretical guarantees.

Comparison of Min-Max Methods

Method	Cost per step	Stability	Theory
Simultaneous GDA	1 gradient eval	Poor	Diverges on bilinear games
Alternating GDA	1 gradient eval	Slightly better	Converges for some games
Extra-gradient	2 gradient evals	Good	$O(1/t)$ for convex-concave
OGDA	1 gradient eval	Good	$O(1/t)$ for convex-concave
GDA + gradient penalty	1 gradient + penalty	Good for GANs	Empirical
Two time-scale GDA	1 gradient, different LRs	Moderate	Theory for specific settings

Why This Matters for ML

Min-max optimization powers some of the most impactful applications in modern AI:

GANs generate realistic images, video, and audio through adversarial training
Adversarial robustness protects models against worst-case attacks
RLHF aligns language models with human preferences
Distributionally robust optimization ensures models perform well under distribution shift
Understanding min-max dynamics explains why GAN training is hard and guides the choice of stabilization techniques
This extends all the optimization theory we have built to the multi-agent setting

Summary

Min-max optimization involves two players with opposing objectives: $\min_\theta \max_\phi F(\theta, \phi)$
The solution is a saddle point / Nash equilibrium where neither player can improve
GDA is simple but can diverge due to rotational dynamics around saddle points
Extra-gradient and OGDA correct the rotation, achieving $O(1/t)$ convergence for convex-concave games
GANs are min-max games between generator and discriminator — notoriously unstable without tricks like spectral normalization and gradient penalty
Adversarial training solves an inner maximization (find worst-case attack) and outer minimization (train robust model)
RLHF maximizes reward subject to a KL constraint from a reference policy
Non-convex min-max optimization lacks strong theory — an active research frontier
This completes the expanded calculus and optimization series, from limits to adversarial games

References

Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. ICML. arXiv:1701.07875
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR. arXiv:1706.06083
Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS. arXiv:2305.18290
Daskalakis, C., Ilyas, A., Syrgkanis, V., & Zeng, H. (2018). Training GANs with Optimism. ICLR.
Razaviyayn, M., Huang, T., Lu, S., Nouiehed, M., Sanjabi, M., & Hong, M. (2020). Nonconvex Min-Max Optimization. arXiv:2002.08394