Constrained Optimization: Lagrange Multipliers and KKT Conditions

Calculus & Optimization Series 9 / 18

Optimization with Constraints

So far, we have minimized functions freely — gradient descent moves in whatever direction reduces the loss, without restrictions. But many ML problems come with constraints:

SVMs: Maximize the margin subject to classification constraints
Regularization: Minimize the loss subject to $\|\mathbf{w}\|^2 \leq c$
Probability distributions: Outputs must be non-negative and sum to 1
Fairness: Predictions must satisfy demographic parity or equal opportunity
Resource allocation: Total budget or capacity limits

Constrained optimization provides the mathematical tools to handle all of these.

Equality Constraints: Lagrange Multipliers

The Setup

We want to solve:

$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g(\mathbf{x}) = 0$$

where $f$ is the objective and $g(\mathbf{x}) = 0$ is the constraint.

The Key Idea

At the constrained minimum, you cannot decrease $f$ while staying on the constraint surface. This happens when the gradient of $f$ is parallel to the gradient of $g$ — any component of $\nabla f$ along the constraint surface would allow further improvement.

Geometric interpretation: The constraint $g(\mathbf{x}) = 0$ defines a surface. Level sets of $f$ are curves (in 2D) or surfaces (in higher dimensions). At the constrained optimum, a level set of $f$ is tangent to the constraint surface. Their normals ( $\nabla f$ and $\nabla g$ ) must be parallel.

This condition $\nabla f = -\lambda \nabla g$ for some scalar $\lambda$ is the essence of Lagrange multipliers.

The Lagrangian

We form the Lagrangian:

$$\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda \, g(\mathbf{x})$$

The Lagrange multiplier $\lambda$ is a new variable that enforces the constraint. The necessary conditions for a constrained optimum are:

$$\nabla_{\mathbf{x}} \mathcal{L} = \nabla f + \lambda \nabla g = \mathbf{0}$$

$$\frac{\partial \mathcal{L}}{\partial \lambda} = g(\mathbf{x}) = 0$$

The first condition says the gradients are parallel. The second says the constraint is satisfied. Together, these are a system of $n + 1$ equations in $n + 1$ unknowns ( $\mathbf{x}$ and $\lambda$ ).

Worked Example: Optimizing on a Circle

Minimize $f(x, y) = x + 2y$ subject to $x^2 + y^2 = 1$ .

Constraint function: $g(x, y) = x^2 + y^2 - 1 = 0$ .

Lagrangian: $\mathcal{L} = x + 2y + \lambda(x^2 + y^2 - 1)$ .

Setting partial derivatives to zero:

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial x} &= 1 + 2\lambda x = 0 \implies x = -\frac{1}{2\lambda} \\[6pt]
\frac{\partial \mathcal{L}}{\partial y} &= 2 + 2\lambda y = 0 \implies y = -\frac{1}{\lambda} \\[6pt]
\frac{\partial \mathcal{L}}{\partial \lambda} &= x^2 + y^2 - 1 = 0
\end{aligned}
$$

Substituting into the constraint:

$$\frac{1}{4\lambda^2} + \frac{1}{\lambda^2} = 1 \implies \frac{5}{4\lambda^2} = 1 \implies \lambda^2 = \frac{5}{4} \implies \lambda = \pm\frac{\sqrt{5}}{2}$$

For $\lambda = \frac{\sqrt{5}}{2}$ : $x = -\frac{1}{\sqrt{5}}$ , $y = -\frac{2}{\sqrt{5}}$ , $f = -\sqrt{5}$ (minimum).

For $\lambda = -\frac{\sqrt{5}}{2}$ : $x = \frac{1}{\sqrt{5}}$ , $y = \frac{2}{\sqrt{5}}$ , $f = \sqrt{5}$ (maximum).
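The algebra above can be checked numerically. The following is a minimal sketch using only the Python standard library: it plugs both roots of $\lambda$ back into the three stationarity equations and compares the result with a brute-force scan around the circle. The helper names (`stationary_points`, `brute_force_min`) are made up for this illustration.

```python
import math

def f(x, y):
    return x + 2 * y

def stationary_points():
    """Return (lam, x, y) for both roots lam = +sqrt(5)/2 and lam = -sqrt(5)/2."""
    points = []
    for lam in (math.sqrt(5) / 2, -math.sqrt(5) / 2):
        x = -1 / (2 * lam)   # from 1 + 2*lam*x = 0
        y = -1 / lam         # from 2 + 2*lam*y = 0
        points.append((lam, x, y))
    return points

def brute_force_min(n=100_000):
    """Scan the circle x = cos t, y = sin t and return the point with the smallest f."""
    best = min(range(n), key=lambda k: f(math.cos(2 * math.pi * k / n),
                                         math.sin(2 * math.pi * k / n)))
    t = 2 * math.pi * best / n
    return math.cos(t), math.sin(t), f(math.cos(t), math.sin(t))

for lam, x, y in stationary_points():
    # Residuals of the three equations dL/dx = 0, dL/dy = 0, g = 0
    residual = (1 + 2 * lam * x, 2 + 2 * lam * y, x * x + y * y - 1)
    print(f"lambda={lam:+.4f}  x={x:+.4f}  y={y:+.4f}  f={f(x, y):+.4f}  residual={residual}")

print("brute force:", brute_force_min())
```

The brute-force point should agree with the $\lambda = \frac{\sqrt{5}}{2}$ solution up to the resolution of the angular grid.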

Multiple Equality Constraints

With $k$ constraints $g_1(\mathbf{x}) = 0, \ldots, g_k(\mathbf{x}) = 0$ , we introduce $k$ multipliers:

$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{i=1}^{k} \lambda_i \, g_i(\mathbf{x})$$
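As a small illustration with two constraints, consider a hypothetical problem that is not from the text above: minimize $\frac{1}{2}(x^2 + y^2 + z^2)$ subject to $x + y + z = 1$ and $x - z = 1$. Because the objective is quadratic and the constraints are linear, the stationarity and feasibility conditions reduce to a linear system in the two multipliers. The sketch below solves it exactly with `fractions.Fraction`; the matrix `A`, vector `b`, and helper `solve_2x2` are names chosen for this example.

```python
from fractions import Fraction as F

# Hypothetical problem (an assumption for illustration):
#   minimize 0.5 * (x^2 + y^2 + z^2)
#   subject to g1 = x + y + z - 1 = 0 and g2 = x - z - 1 = 0
A = [[F(1), F(1), F(1)],    # gradient of g1
     [F(1), F(0), F(-1)]]   # gradient of g2
b = [F(1), F(1)]

def solve_2x2(m, rhs):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(rhs[0] * m[1][1] - m[0][1] * rhs[1]) / det,
            (m[0][0] * rhs[1] - rhs[0] * m[1][0]) / det]

# Stationarity: x + A^T lam = 0, so x = -A^T lam.
# Feasibility:  A x = b, so (A A^T) lam = -b.
AAt = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(2)] for i in range(2)]
lam = solve_2x2(AAt, [-b[0], -b[1]])
x = [-(A[0][k] * lam[0] + A[1][k] * lam[1]) for k in range(3)]

print("lambda =", lam)
print("x      =", x)
```

Each multiplier pairs with one constraint, and the gradient of the objective at the solution is a linear combination of the two constraint gradients.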

Interpreting the Multiplier

The Lagrange multiplier $\lambda^*$ has a concrete meaning: it is the sensitivity of the optimal objective value to the constraint. If we relax the constraint $g(\mathbf{x}) = 0$ to $g(\mathbf{x}) = \epsilon$, the optimal value changes by approximately $-\lambda^* \epsilon$ under the convention $\mathcal{L} = f + \lambda g$ used here (texts that write $\mathcal{L} = f - \lambda g$ get $+\lambda^* \epsilon$).

This is called the shadow price in economics: $\lambda^*$ tells you the value of slightly relaxing the constraint.
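The sensitivity interpretation can be tested on the circle example with a finite difference. This sketch assumes the relaxed constraint $x^2 + y^2 - 1 = \epsilon$, for which the minimum of $x + 2y$ on a circle of radius $r = \sqrt{1 + \epsilon}$ is $-\sqrt{5}\,r$. With the Lagrangian written as $\mathcal{L} = f + \lambda g$, the measured slope comes out as $-\lambda^*$; a convention with $\mathcal{L} = f - \lambda g$ would report the opposite sign.

```python
import math

def optimal_value(eps):
    """Minimum of x + 2y on the relaxed circle x^2 + y^2 - 1 = eps."""
    # On a circle of radius r, the minimum of x + 2y is -sqrt(5) * r.
    return -math.sqrt(5) * math.sqrt(1 + eps)

lam_star = math.sqrt(5) / 2   # multiplier at the minimum of the worked example
h = 1e-6
# Central finite difference of the optimal value with respect to eps
slope = (optimal_value(h) - optimal_value(-h)) / (2 * h)

print("d(optimal value)/d(eps) ~", slope)
print("-lambda*                =", -lam_star)
```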

Inequality Constraints: KKT Conditions

The Setup

The general constrained optimization problem:

$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, m$$

Inequality constraints add a new subtlety: a constraint can be active ( $g_i(\mathbf{x}) = 0$ , the solution sits on the boundary) or inactive ( $g_i(\mathbf{x}) < 0$ , the solution is in the interior and the constraint is irrelevant).

The Karush-Kuhn-Tucker (KKT) Conditions

The KKT conditions are necessary conditions for optimality, provided the constraints satisfy a mild regularity assumption (a constraint qualification). At a constrained minimum $\mathbf{x}^*$ with multipliers $\mu_i^*$:

Stationarity: $\nabla f(\mathbf{x}^*) + \sum_{i=1}^{m} \mu_i^* \nabla g_i(\mathbf{x}^*) = \mathbf{0}$
Primal feasibility: $g_i(\mathbf{x}^*) \leq 0$ for all $i$
Dual feasibility: $\mu_i^* \geq 0$ for all $i$
Complementary slackness: $\mu_i^* g_i(\mathbf{x}^*) = 0$ for all $i$

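The first two conditions are natural extensions of Lagrange multipliers. Dual feasibility and complementary slackness are new: the sign restriction $\mu_i^* \geq 0$ reflects that an inequality constraint can only push the solution in one direction, and complementary slackness is the crucial condition that decides which constraints matter.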

Understanding Complementary Slackness

For each constraint $i$ , either:

$g_i(\mathbf{x}^*) = 0$ (the constraint is active / binding), or
$\mu_i^* = 0$ (the multiplier is zero — the constraint does not influence the solution)

Key insight: Inactive constraints ( $g_i < 0$ ) are effectively absent from the problem. Only active constraints ( $g_i = 0$ ) shape the solution. Complementary slackness formalizes this: if the constraint is not tight, its multiplier must be zero, meaning it has no influence. If the multiplier is positive, the constraint must be tight.

Worked Example

Minimize $f(x) = (x - 3)^2$ subject to $x \leq 2$ .

Constraint: $g(x) = x - 2 \leq 0$ . KKT conditions:

Stationarity: $2(x - 3) + \mu = 0$
Primal feasibility: $x \leq 2$
Dual feasibility: $\mu \geq 0$
Complementary slackness: $\mu(x - 2) = 0$

Case 1: $\mu = 0$ (constraint inactive). Then $2(x - 3) = 0 \implies x = 3$ . But $x = 3 > 2$ violates primal feasibility. Rejected.

Case 2: $x = 2$ (constraint active). Then $2(2 - 3) + \mu = 0 \implies \mu = 2 > 0$ . This satisfies all conditions.

Solution: $x^* = 2$ , $\mu^* = 2$ . The unconstrained minimum is at $x = 3$ , but the constraint pushes the solution to $x = 2$ .
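The case analysis above amounts to checking four residuals for each candidate pair $(x, \mu)$. A minimal sketch of that check follows; the function names `kkt_residuals` and `is_kkt_point` are invented for this illustration.

```python
def kkt_residuals(x, mu):
    """KKT residuals for min (x - 3)^2 subject to g(x) = x - 2 <= 0."""
    g = x - 2
    return {
        "stationarity": 2 * (x - 3) + mu,   # should be 0
        "primal_violation": max(0.0, g),    # should be 0
        "dual_violation": max(0.0, -mu),    # should be 0
        "comp_slackness": mu * g,           # should be 0
    }

def is_kkt_point(x, mu, tol=1e-9):
    return all(abs(v) <= tol for v in kkt_residuals(x, mu).values())

candidates = {
    "case 1 (mu = 0, x = 3)": (3.0, 0.0),
    "case 2 (x = 2, mu = 2)": (2.0, 2.0),
}
for name, (x, mu) in candidates.items():
    print(name, is_kkt_point(x, mu), kkt_residuals(x, mu))
```

Case 1 fails only on primal feasibility, and case 2 passes all four checks, matching the hand derivation.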

Lagrangian Duality

The Dual Problem

Starting from the Lagrangian $\mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum \mu_i g_i(\mathbf{x})$ , we define:

Primal problem: $\min_{\mathbf{x}} \max_{\boldsymbol{\mu} \geq 0} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu})$

Dual problem: $\max_{\boldsymbol{\mu} \geq 0} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu})$

The dual function $d(\boldsymbol{\mu}) = \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu})$ gives a lower bound on the optimal primal value for any $\boldsymbol{\mu} \geq 0$ .

Weak and Strong Duality

Weak duality: $d(\boldsymbol{\mu}^*) \leq f(\mathbf{x}^*)$ always holds. The dual optimal value never exceeds the primal optimal value.

Strong duality: $d(\boldsymbol{\mu}^*) = f(\mathbf{x}^*)$, so the gap is zero. This holds for convex problems (convex objective $f$ and convex constraints $g_i$) that satisfy Slater’s condition: there exists a strictly feasible point ( $g_i(\mathbf{x}) < 0$ for all $i$ ).

The duality gap $f(\mathbf{x}^*) - d(\boldsymbol{\mu}^*)$ measures how tight the bound is. When strong duality holds, the gap is zero and we can solve the dual instead of the primal if that is easier.
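To make the dual concrete, the sketch below reuses the one-dimensional problem from the KKT worked example, $\min (x - 3)^2$ subject to $x \leq 2$. The inner minimization over $x$ has the closed form $x = 3 - \mu/2$, so the dual function is $d(\mu) = \mu - \mu^2/4$. The grid search over $\mu$ is only an illustrative stand-in for a proper solver.

```python
def lagrangian(x, mu):
    return (x - 3) ** 2 + mu * (x - 2)

def dual(mu):
    """d(mu) = min over x of the Lagrangian; the minimizer is x = 3 - mu / 2."""
    return lagrangian(3 - mu / 2, mu)

primal_opt = (2 - 3) ** 2                    # f(x*) at x* = 2
grid = [k / 1000 for k in range(0, 6001)]    # mu in [0, 6]
mu_best = max(grid, key=dual)

print("best mu on grid:", mu_best)
print("dual value     :", dual(mu_best))
print("primal optimum :", primal_opt)
```

Every $d(\mu)$ on the grid stays at or below the primal optimum (weak duality), and the maximum over $\mu$ closes the gap because this problem is convex with a strictly feasible point.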

Applications in Machine Learning

Support Vector Machines

The SVM is the quintessential constrained optimization problem in ML. The primal formulation (hard-margin SVM):

$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall \, i$$

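Introducing a multiplier $\alpha_i \geq 0$ for each margin constraint and using the stationarity conditions to eliminate $\mathbf{w}$ and $b$ gives the dual: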

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.} \quad \alpha_i \geq 0, \; \sum_i \alpha_i y_i = 0$$

The dual form reveals that the solution depends only on dot products $\mathbf{x}_i^T\mathbf{x}_j$ , enabling the kernel trick — replacing dot products with kernel functions to handle nonlinear boundaries.

Complementary slackness gives us support vectors: only the training points where $\alpha_i > 0$ (those sitting on the margin boundary) influence the decision boundary. All other points can be removed without changing the solution.
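A tiny example can show these pieces fitting together. The sketch below does not solve the dual; it takes a hypothetical three-point dataset and a hand-picked candidate $\boldsymbol{\alpha}$ (both assumptions for illustration), recovers $\mathbf{w}$ and $b$ from stationarity, and then checks feasibility, complementary slackness, and a zero duality gap, which together certify optimality.

```python
# Hypothetical toy data: two positive points and one negative point in 2D.
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [1.0, -1.0, 1.0]
alpha = [0.25, 0.25, 0.0]   # hand-picked candidate dual solution (an assumption)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Stationarity of the Lagrangian gives w = sum_i alpha_i * y_i * x_i.
w = tuple(sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2))

# Recover b from any support vector (alpha_i > 0): y_i * (w.x_i + b) = 1.
sv = next(i for i, a in enumerate(alpha) if a > 0)
b = y[sv] - dot(w, X[sv])

margins = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]
primal = 0.5 * dot(w, w)
dual = sum(alpha) - 0.5 * sum(
    alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
    for i in range(3) for j in range(3)
)

print("w =", w, "b =", b)
print("margins =", margins)
print("sum alpha_i y_i =", sum(a * yi for a, yi in zip(alpha, y)))
print("comp. slackness =", [a * (m - 1) for a, m in zip(alpha, margins)])
print("primal =", primal, "dual =", dual)
```

The third point lies well outside the margin, so its multiplier is zero and dropping it from `X` would leave `w` and `b` unchanged.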

Regularization as a Constraint

L2 regularization adds $\lambda\|\mathbf{w}\|^2$ to the training loss (in this subsection $\mathcal{L}(\mathbf{w})$ denotes that loss, not the Lagrangian):

$$\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) + \lambda\|\mathbf{w}\|^2$$

By the duality we just developed, this is equivalent to the constrained form:

$$\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) \quad \text{subject to} \quad \|\mathbf{w}\|^2 \leq c$$

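for the specific value $c = \|\mathbf{w}_\lambda\|^2$, where $\mathbf{w}_\lambda$ is the solution of the penalized problem. For a convex loss, the Lagrange multiplier of the norm constraint is then exactly the regularization strength $\lambda$.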

Key insight: Regularization and constrained optimization are two views of the same problem. Adding a penalty $\lambda\|\mathbf{w}\|^2$ to the loss is equivalent to constraining the norm $\|\mathbf{w}\|^2 \leq c$ . The regularization coefficient $\lambda$ is the Lagrange multiplier of the norm constraint. This duality gives a deeper understanding of why regularization prevents overfitting — it constrains the model’s capacity.
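The equivalence can be seen in a one-dimensional sketch. Assume a hypothetical loss $(w - a)^2$ with made-up values `a = 2.0` and `lam = 0.5`. The penalized problem has the closed-form solution $w = a / (1 + \lambda)$; setting $c$ to the square of that solution, the constrained problem returns the same $w$, and the multiplier recovered from stationarity equals $\lambda$.

```python
import math

a, lam = 2.0, 0.5   # hypothetical target value and regularization strength

# Penalized form: minimize (w - a)^2 + lam * w^2, which gives w = a / (1 + lam)
w_pen = a / (1 + lam)

# Constrained form: minimize (w - a)^2 subject to w^2 <= c, with c taken from w_pen
c = w_pen ** 2
w_con = math.copysign(min(abs(a), math.sqrt(c)), a)   # project a onto [-sqrt(c), sqrt(c)]

# Multiplier from stationarity of (w - a)^2 + mu * (w^2 - c): 2(w - a) + 2*mu*w = 0
mu = (a - w_con) / w_con

print("penalized solution  :", w_pen)
print("constrained solution:", w_con)
print("recovered multiplier:", mu, "(regularization strength was", lam, ")")
```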

Probability Constraints

Deriving the softmax function involves constrained optimization. Given logits $z_1, \ldots, z_K$, we choose the distribution $\mathbf{p}$ that maximizes the expected logit plus the entropy $H(\mathbf{p}) = -\sum_k p_k \log p_k$, subject to the probabilities being non-negative and summing to 1:

$$\max_{\mathbf{p}} \sum_{k} z_k p_k + H(\mathbf{p}) \quad \text{s.t.} \quad \sum_k p_k = 1, \; p_k \geq 0$$

The solution is the familiar softmax:

$$p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$$

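The optimal value of the objective is exactly the log-partition function $\log \sum_j e^{z_j}$. The Lagrange multiplier for the sum-to-one constraint equals the log-partition function up to sign and an additive constant, both of which depend on how the Lagrangian is written.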
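A numerical sanity check of this result, as a sketch with hypothetical logits: the objective evaluated at the softmax should equal $\log \sum_j e^{z_j}$, and no other point on the probability simplex should score higher. The random search below is only a spot check, not a proof.

```python
import math
import random

def softmax(z):
    m = max(z)                                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def objective(p, z):
    """Expected logit plus entropy: sum_k z_k p_k - sum_k p_k log p_k."""
    return sum(zk * pk - (pk * math.log(pk) if pk > 0 else 0.0) for zk, pk in zip(z, p))

def log_partition(z):
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

z = [2.0, 0.5, -1.0]                          # hypothetical logits
p_star = softmax(z)

rng = random.Random(0)                        # fixed seed for repeatability

def random_simplex_point(k):
    raw = [rng.random() + 1e-12 for _ in range(k)]
    s = sum(raw)
    return [r / s for r in raw]

best_random = max(objective(random_simplex_point(len(z)), z) for _ in range(10_000))

print("objective at softmax:", objective(p_star, z))
print("log-partition       :", log_partition(z))
print("best random point   :", best_random)
```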

Penalty Methods (Brief)

When analytical KKT solutions are not tractable, penalty methods convert constrained problems to unconstrained ones by adding a penalty for constraint violations:

$$\min_{\mathbf{x}} f(\mathbf{x}) + \rho \sum_i \max(0, g_i(\mathbf{x}))^2$$

As $\rho \to \infty$ , the solution approaches the constrained optimum. The augmented Lagrangian method combines explicit multipliers with penalties, achieving faster convergence without requiring $\rho \to \infty$ .
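The quadratic penalty can be tried on the earlier one-dimensional problem, $\min (x - 3)^2$ subject to $x \leq 2$. The sketch below runs plain gradient descent on the penalized objective for increasing $\rho$; the step size and iteration count are choices made for this illustration. For this problem the penalized minimizer has the closed form $x = (3 + 2\rho)/(1 + \rho)$, so the violation shrinks like $1/(1 + \rho)$ and the quantity $2\rho \max(0, x - 2)$ approaches the KKT multiplier $\mu^* = 2$.

```python
def penalized_minimizer(rho, steps=5_000):
    """Gradient descent on (x - 3)^2 + rho * max(0, x - 2)^2."""
    x = 0.0
    lr = 1.0 / (2.0 * (1.0 + rho))   # inverse of the largest curvature, keeps steps stable
    for _ in range(steps):
        grad = 2.0 * (x - 3.0) + 2.0 * rho * max(0.0, x - 2.0)
        x -= lr * grad
    return x

for rho in (1.0, 10.0, 100.0, 1000.0):
    x = penalized_minimizer(rho)
    mu_estimate = 2.0 * rho * max(0.0, x - 2.0)   # penalty-based multiplier estimate
    print(f"rho={rho:7.1f}  x={x:.6f}  violation={x - 2:.6f}  mu~{mu_estimate:.4f}")
```

The iterates stay slightly infeasible for every finite $\rho$, which is the usual behaviour of a pure penalty method and the motivation for augmented Lagrangian variants.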

Why This Matters for ML

Constrained optimization is woven throughout machine learning:

SVMs are solved via their dual, enabling the kernel trick
Regularization (L1, L2) is equivalent to norm constraints via Lagrangian duality
Softmax arises from constrained entropy maximization
Fairness constraints limit bias in model predictions
KKT conditions explain why only support vectors matter in SVMs (complementary slackness) and why Lasso produces sparse solutions (the optimality conditions for the L1 term hold many coefficients at exactly zero)

Summary

Lagrange multipliers turn an equality-constrained problem into a system of stationarity equations in $\mathbf{x}$ and the multiplier variables
At the optimum, $\nabla f$ is parallel to $\nabla g$ — you cannot improve the objective while staying on the constraint
KKT conditions extend Lagrange multipliers to inequality constraints, adding dual feasibility and complementary slackness
Complementary slackness: inactive constraints have zero multipliers — they do not influence the solution
Strong duality (for convex problems satisfying Slater’s condition) lets us solve the dual instead of the primal
SVMs are a direct application: the dual reveals support vectors and enables the kernel trick
Regularization is equivalent to constrained optimization through Lagrangian duality
Next: convexity and convergence, which provide the theoretical guarantees behind these methods

References

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 5. stanford.edu
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 12, 17.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 7.
Bertsekas, D. P. (2016). Nonlinear Programming (3rd ed.). Athena Scientific.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 4.5. deeplearningbook.org
