Constrained Optimization: Lagrange Multipliers and KKT Conditions

Master Lagrange multipliers, KKT conditions, and duality — the tools for optimization with equality and inequality constraints in ML.

Calculus & Optimization March 7, 2026 9 min read

Optimization with Constraints

So far, we have minimized functions freely — gradient descent moves in whatever direction reduces the loss, without restrictions. But many ML problems come with constraints:

  • SVMs: Maximize the margin subject to classification constraints
  • Regularization: Minimize the loss subject to w2c\|\mathbf{w}\|^2 \leq c
  • Probability distributions: Outputs must be non-negative and sum to 1
  • Fairness: Predictions must satisfy demographic parity or equal opportunity
  • Resource allocation: Total budget or capacity limits

Constrained optimization provides the mathematical tools to handle all of these.

Equality Constraints: Lagrange Multipliers

The Setup

We want to solve:

minxf(x)subject tog(x)=0\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g(\mathbf{x}) = 0

where ff is the objective and g(x)=0g(\mathbf{x}) = 0 is the constraint.

The Key Idea

At the constrained minimum, you cannot decrease ff while staying on the constraint surface. This happens when the gradient of ff is parallel to the gradient of gg — any component of f\nabla f along the constraint surface would allow further improvement.

Geometric interpretation: The constraint g(x)=0g(\mathbf{x}) = 0 defines a surface. Level sets of ff are curves (in 2D) or surfaces (in higher dimensions). At the constrained optimum, a level set of ff is tangent to the constraint surface. Their normals (f\nabla f and g\nabla g) must be parallel.

This condition f=λg\nabla f = -\lambda \nabla g for some scalar λ\lambda is the essence of Lagrange multipliers.

The Lagrangian

We form the Lagrangian:

L(x,λ)=f(x)+λg(x)\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda \, g(\mathbf{x})

The Lagrange multiplier λ\lambda is a new variable that enforces the constraint. The necessary conditions for a constrained optimum are:

xL=f+λg=0\nabla_{\mathbf{x}} \mathcal{L} = \nabla f + \lambda \nabla g = \mathbf{0} Lλ=g(x)=0\frac{\partial \mathcal{L}}{\partial \lambda} = g(\mathbf{x}) = 0

The first condition says the gradients are parallel. The second says the constraint is satisfied. Together, these are a system of n+1n + 1 equations in n+1n + 1 unknowns (x\mathbf{x} and λ\lambda).

Worked Example: Optimizing on a Circle

Minimize f(x,y)=x+2yf(x, y) = x + 2y subject to x2+y2=1x^2 + y^2 = 1.

Constraint function: g(x,y)=x2+y21=0g(x, y) = x^2 + y^2 - 1 = 0.

Lagrangian: L=x+2y+λ(x2+y21)\mathcal{L} = x + 2y + \lambda(x^2 + y^2 - 1).

Setting partial derivatives to zero:

Lx=1+2λx=0    x=12λLy=2+2λy=0    y=1λLλ=x2+y21=0\begin{aligned} \frac{\partial \mathcal{L}}{\partial x} &= 1 + 2\lambda x = 0 \implies x = -\frac{1}{2\lambda} \\[6pt] \frac{\partial \mathcal{L}}{\partial y} &= 2 + 2\lambda y = 0 \implies y = -\frac{1}{\lambda} \\[6pt] \frac{\partial \mathcal{L}}{\partial \lambda} &= x^2 + y^2 - 1 = 0 \end{aligned}

Substituting into the constraint:

14λ2+1λ2=1    54λ2=1    λ2=54    λ=±52\frac{1}{4\lambda^2} + \frac{1}{\lambda^2} = 1 \implies \frac{5}{4\lambda^2} = 1 \implies \lambda^2 = \frac{5}{4} \implies \lambda = \pm\frac{\sqrt{5}}{2}

For λ=52\lambda = \frac{\sqrt{5}}{2}: x=15x = -\frac{1}{\sqrt{5}}, y=25y = -\frac{2}{\sqrt{5}}, f=5f = -\sqrt{5} (minimum).

For λ=52\lambda = -\frac{\sqrt{5}}{2}: x=15x = \frac{1}{\sqrt{5}}, y=25y = \frac{2}{\sqrt{5}}, f=5f = \sqrt{5} (maximum).

Multiple Equality Constraints

With kk constraints g1(x)=0,,gk(x)=0g_1(\mathbf{x}) = 0, \ldots, g_k(\mathbf{x}) = 0, we introduce kk multipliers:

L(x,λ)=f(x)+i=1kλigi(x)\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{i=1}^{k} \lambda_i \, g_i(\mathbf{x})

Interpreting the Multiplier

The Lagrange multiplier λ\lambda^* has a concrete meaning: it is the sensitivity of the optimal objective value to the constraint. If we relax the constraint g(x)=0g(\mathbf{x}) = 0 to g(x)=ϵg(\mathbf{x}) = \epsilon, the optimal value changes by approximately λϵ\lambda^* \epsilon.

This is called the shadow price in economics: λ\lambda^* tells you the value of slightly relaxing the constraint.

Inequality Constraints: KKT Conditions

The Setup

The general constrained optimization problem:

minxf(x)subject togi(x)0,i=1,,m\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, m

Inequality constraints add a new subtlety: a constraint can be active (gi(x)=0g_i(\mathbf{x}) = 0, the solution sits on the boundary) or inactive (gi(x)<0g_i(\mathbf{x}) < 0, the solution is in the interior and the constraint is irrelevant).

The Karush-Kuhn-Tucker (KKT) Conditions

The KKT conditions are necessary conditions for optimality. At a constrained minimum x\mathbf{x}^* with multipliers μi\mu_i^*:

  1. Stationarity: f(x)+i=1mμigi(x)=0\nabla f(\mathbf{x}^*) + \sum_{i=1}^{m} \mu_i^* \nabla g_i(\mathbf{x}^*) = \mathbf{0}

  2. Primal feasibility: gi(x)0g_i(\mathbf{x}^*) \leq 0 for all ii

  3. Dual feasibility: μi0\mu_i^* \geq 0 for all ii

  4. Complementary slackness: μigi(x)=0\mu_i^* g_i(\mathbf{x}^*) = 0 for all ii

The first three conditions are natural extensions of Lagrange multipliers. The fourth — complementary slackness — is the new and crucial condition.

Understanding Complementary Slackness

For each constraint ii, either:

  • gi(x)=0g_i(\mathbf{x}^*) = 0 (the constraint is active / binding), or
  • μi=0\mu_i^* = 0 (the multiplier is zero — the constraint does not influence the solution)

Key insight: Inactive constraints (gi<0g_i < 0) are effectively absent from the problem. Only active constraints (gi=0g_i = 0) shape the solution. Complementary slackness formalizes this: if the constraint is not tight, its multiplier must be zero, meaning it has no influence. If the multiplier is positive, the constraint must be tight.

Worked Example

Minimize f(x)=(x3)2f(x) = (x - 3)^2 subject to x2x \leq 2.

Constraint: g(x)=x20g(x) = x - 2 \leq 0. KKT conditions:

  1. Stationarity: 2(x3)+μ=02(x - 3) + \mu = 0
  2. Primal feasibility: x2x \leq 2
  3. Dual feasibility: μ0\mu \geq 0
  4. Complementary slackness: μ(x2)=0\mu(x - 2) = 0

Case 1: μ=0\mu = 0 (constraint inactive). Then 2(x3)=0    x=32(x - 3) = 0 \implies x = 3. But x=3>2x = 3 > 2 violates primal feasibility. Rejected.

Case 2: x=2x = 2 (constraint active). Then 2(23)+μ=0    μ=2>02(2 - 3) + \mu = 0 \implies \mu = 2 > 0. This satisfies all conditions.

Solution: x=2x^* = 2, μ=2\mu^* = 2. The unconstrained minimum is at x=3x = 3, but the constraint pushes the solution to x=2x = 2.

Lagrangian Duality

The Dual Problem

Starting from the Lagrangian L(x,μ)=f(x)+μigi(x)\mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum \mu_i g_i(\mathbf{x}), we define:

Primal problem: minxmaxμ0L(x,μ)\min_{\mathbf{x}} \max_{\boldsymbol{\mu} \geq 0} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu})

Dual problem: maxμ0minxL(x,μ)\max_{\boldsymbol{\mu} \geq 0} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu})

The dual function d(μ)=minxL(x,μ)d(\boldsymbol{\mu}) = \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) gives a lower bound on the optimal primal value for any μ0\boldsymbol{\mu} \geq 0.

Weak and Strong Duality

Weak duality: d(μ)f(x)d(\boldsymbol{\mu}^*) \leq f(\mathbf{x}^*) always holds. The dual optimal value never exceeds the primal optimal value.

Strong duality: d(μ)=f(x)d(\boldsymbol{\mu}^*) = f(\mathbf{x}^*) — the gap is zero. This holds under Slater’s condition: the constraints are convex and there exists a strictly feasible point (gi(x)<0g_i(\mathbf{x}) < 0 for all ii).

The duality gap f(x)d(μ)f(\mathbf{x}^*) - d(\boldsymbol{\mu}^*) measures how tight the bound is. When strong duality holds, the gap is zero and we can solve the dual instead of the primal if that is easier.

Applications in Machine Learning

Support Vector Machines

The SVM is the quintessential constrained optimization problem in ML. The primal formulation (hard-margin SVM):

minw,b12w2subject toyi(wTxi+b)1i\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall \, i

Applying the KKT conditions and forming the dual:

maxαi=1nαi12i,jαiαjyiyjxiTxjs.t.αi0,  iαiyi=0\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.} \quad \alpha_i \geq 0, \; \sum_i \alpha_i y_i = 0

The dual form reveals that the solution depends only on dot products xiTxj\mathbf{x}_i^T\mathbf{x}_j, enabling the kernel trick — replacing dot products with kernel functions to handle nonlinear boundaries.

Complementary slackness gives us support vectors: only the training points where αi>0\alpha_i > 0 (those sitting on the margin boundary) influence the decision boundary. All other points can be removed without changing the solution.

Regularization as a Constraint

L2 regularization adds λw2\lambda\|\mathbf{w}\|^2 to the loss:

minwL(w)+λw2\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) + \lambda\|\mathbf{w}\|^2

By the duality we just developed, this is equivalent to the constrained form:

minwL(w)subject tow2c\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) \quad \text{subject to} \quad \|\mathbf{w}\|^2 \leq c

for a specific value of cc determined by λ\lambda. The Lagrange multiplier of the constraint is exactly the regularization strength.

Key insight: Regularization and constrained optimization are two views of the same problem. Adding a penalty λw2\lambda\|\mathbf{w}\|^2 to the loss is equivalent to constraining the norm w2c\|\mathbf{w}\|^2 \leq c. The regularization coefficient λ\lambda is the Lagrange multiplier of the norm constraint. This duality gives a deeper understanding of why regularization prevents overfitting — it constrains the model’s capacity.

Probability Constraints

Deriving the softmax function involves constrained optimization. Given logits z1,,zKz_1, \ldots, z_K, we want to maximize entropy (or equivalently, the log-partition function) subject to probabilities summing to 1:

maxpkzkpk+H(p)s.t.kpk=1,  pk0\max_{\mathbf{p}} \sum_{k} z_k p_k + H(\mathbf{p}) \quad \text{s.t.} \quad \sum_k p_k = 1, \; p_k \geq 0

The solution is the familiar softmax:

pk=ezkjezjp_k = \frac{e^{z_k}}{\sum_j e^{z_j}}

The Lagrange multiplier for the sum-to-one constraint is the log-partition function logjezj\log \sum_j e^{z_j}.

Penalty Methods (Brief)

When analytical KKT solutions are not tractable, penalty methods convert constrained problems to unconstrained ones by adding a penalty for constraint violations:

minxf(x)+ρimax(0,gi(x))2\min_{\mathbf{x}} f(\mathbf{x}) + \rho \sum_i \max(0, g_i(\mathbf{x}))^2

As ρ\rho \to \infty, the solution approaches the constrained optimum. The augmented Lagrangian method combines explicit multipliers with penalties, achieving faster convergence without requiring ρ\rho \to \infty.

Why This Matters for ML

Constrained optimization is woven throughout machine learning:

  • SVMs are solved via their dual, enabling the kernel trick
  • Regularization (L1, L2) is equivalent to norm constraints via Lagrangian duality
  • Softmax arises from constrained entropy maximization
  • Fairness constraints limit bias in model predictions
  • KKT conditions explain why only support vectors matter in SVMs and why Lasso produces sparse solutions (complementary slackness on the L1 constraint)

Summary

  • Lagrange multipliers transform constrained optimization into an unconstrained system by introducing multiplier variables
  • At the optimum, f\nabla f is parallel to g\nabla g — you cannot improve the objective while staying on the constraint
  • KKT conditions extend Lagrange multipliers to inequality constraints, adding dual feasibility and complementary slackness
  • Complementary slackness: inactive constraints have zero multipliers — they do not influence the solution
  • Strong duality (for convex problems) lets us solve the dual instead of the primal
  • SVMs are a direct application: the dual reveals support vectors and enables the kernel trick
  • Regularization is equivalent to constrained optimization through Lagrangian duality
  • Next: convexity and convergence provides the theoretical guarantees behind these methods

References

  • Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 5. stanford.edu
  • Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 12, 17.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 7.
  • Bertsekas, D. P. (2016). Nonlinear Programming (3rd ed.). Athena Scientific.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 4.5. deeplearningbook.org

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay