- 01 Limits and Continuity: The Foundation of Calculus 02 Derivatives and Differentiation: Measuring Rates of Change 03 Partial Derivatives and Gradients: Calculus in Multiple Dimensions 04 The Chain Rule and Computational Graphs: The Engine Behind Backpropagation 05 Taylor Series and Approximation: Local Models of Complex Functions 06 Gradient Descent: The Workhorse of Machine Learning Optimization 07 Stochastic Gradient Descent: Trading Precision for Speed 08 Adaptive Learning Rate Methods: From AdaGrad to Adam 09 Constrained Optimization: Lagrange Multipliers and KKT Conditions 10 Convexity and Convergence Theory: When Optimization Succeeds 11 Integration and Expectation: The Continuous Side of Probability 12 Calculus of Variations: Optimizing Over Functions 13 Second-Order and Natural Gradient Methods 14 Numerical Stability in Optimization: Making Training Work in Practice 15 Non-Smooth Optimization and Proximal Methods 16 Optimization Landscape of Neural Networks: Why Deep Learning Works 17 Implicit Differentiation and Differentiable Programming 18 Min-Max Optimization: Games, GANs, and Adversarial Training
Optimization with Constraints
So far, we have minimized functions freely — gradient descent moves in whatever direction reduces the loss, without restrictions. But many ML problems come with constraints:
- SVMs: Maximize the margin subject to classification constraints
- Regularization: Minimize the loss subject to
- Probability distributions: Outputs must be non-negative and sum to 1
- Fairness: Predictions must satisfy demographic parity or equal opportunity
- Resource allocation: Total budget or capacity limits
Constrained optimization provides the mathematical tools to handle all of these.
Equality Constraints: Lagrange Multipliers
The Setup
We want to solve:
where is the objective and is the constraint.
The Key Idea
At the constrained minimum, you cannot decrease while staying on the constraint surface. This happens when the gradient of is parallel to the gradient of — any component of along the constraint surface would allow further improvement.
Geometric interpretation: The constraint defines a surface. Level sets of are curves (in 2D) or surfaces (in higher dimensions). At the constrained optimum, a level set of is tangent to the constraint surface. Their normals ( and ) must be parallel.
This condition for some scalar is the essence of Lagrange multipliers.
The Lagrangian
We form the Lagrangian:
The Lagrange multiplier is a new variable that enforces the constraint. The necessary conditions for a constrained optimum are:
The first condition says the gradients are parallel. The second says the constraint is satisfied. Together, these are a system of equations in unknowns ( and ).
Worked Example: Optimizing on a Circle
Minimize subject to .
Constraint function: .
Lagrangian: .
Setting partial derivatives to zero:
Substituting into the constraint:
For : , , (minimum).
For : , , (maximum).
Multiple Equality Constraints
With constraints , we introduce multipliers:
Interpreting the Multiplier
The Lagrange multiplier has a concrete meaning: it is the sensitivity of the optimal objective value to the constraint. If we relax the constraint to , the optimal value changes by approximately .
This is called the shadow price in economics: tells you the value of slightly relaxing the constraint.
Inequality Constraints: KKT Conditions
The Setup
The general constrained optimization problem:
Inequality constraints add a new subtlety: a constraint can be active (, the solution sits on the boundary) or inactive (, the solution is in the interior and the constraint is irrelevant).
The Karush-Kuhn-Tucker (KKT) Conditions
The KKT conditions are necessary conditions for optimality. At a constrained minimum with multipliers :
-
Stationarity:
-
Primal feasibility: for all
-
Dual feasibility: for all
-
Complementary slackness: for all
The first three conditions are natural extensions of Lagrange multipliers. The fourth — complementary slackness — is the new and crucial condition.
Understanding Complementary Slackness
For each constraint , either:
- (the constraint is active / binding), or
- (the multiplier is zero — the constraint does not influence the solution)
Key insight: Inactive constraints () are effectively absent from the problem. Only active constraints () shape the solution. Complementary slackness formalizes this: if the constraint is not tight, its multiplier must be zero, meaning it has no influence. If the multiplier is positive, the constraint must be tight.
Worked Example
Minimize subject to .
Constraint: . KKT conditions:
- Stationarity:
- Primal feasibility:
- Dual feasibility:
- Complementary slackness:
Case 1: (constraint inactive). Then . But violates primal feasibility. Rejected.
Case 2: (constraint active). Then . This satisfies all conditions.
Solution: , . The unconstrained minimum is at , but the constraint pushes the solution to .
Lagrangian Duality
The Dual Problem
Starting from the Lagrangian , we define:
Primal problem:
Dual problem:
The dual function gives a lower bound on the optimal primal value for any .
Weak and Strong Duality
Weak duality: always holds. The dual optimal value never exceeds the primal optimal value.
Strong duality: — the gap is zero. This holds under Slater’s condition: the constraints are convex and there exists a strictly feasible point ( for all ).
The duality gap measures how tight the bound is. When strong duality holds, the gap is zero and we can solve the dual instead of the primal if that is easier.
Applications in Machine Learning
Support Vector Machines
The SVM is the quintessential constrained optimization problem in ML. The primal formulation (hard-margin SVM):
Applying the KKT conditions and forming the dual:
The dual form reveals that the solution depends only on dot products , enabling the kernel trick — replacing dot products with kernel functions to handle nonlinear boundaries.
Complementary slackness gives us support vectors: only the training points where (those sitting on the margin boundary) influence the decision boundary. All other points can be removed without changing the solution.
Regularization as a Constraint
L2 regularization adds to the loss:
By the duality we just developed, this is equivalent to the constrained form:
for a specific value of determined by . The Lagrange multiplier of the constraint is exactly the regularization strength.
Key insight: Regularization and constrained optimization are two views of the same problem. Adding a penalty to the loss is equivalent to constraining the norm . The regularization coefficient is the Lagrange multiplier of the norm constraint. This duality gives a deeper understanding of why regularization prevents overfitting — it constrains the model’s capacity.
Probability Constraints
Deriving the softmax function involves constrained optimization. Given logits , we want to maximize entropy (or equivalently, the log-partition function) subject to probabilities summing to 1:
The solution is the familiar softmax:
The Lagrange multiplier for the sum-to-one constraint is the log-partition function .
Penalty Methods (Brief)
When analytical KKT solutions are not tractable, penalty methods convert constrained problems to unconstrained ones by adding a penalty for constraint violations:
As , the solution approaches the constrained optimum. The augmented Lagrangian method combines explicit multipliers with penalties, achieving faster convergence without requiring .
Why This Matters for ML
Constrained optimization is woven throughout machine learning:
- SVMs are solved via their dual, enabling the kernel trick
- Regularization (L1, L2) is equivalent to norm constraints via Lagrangian duality
- Softmax arises from constrained entropy maximization
- Fairness constraints limit bias in model predictions
- KKT conditions explain why only support vectors matter in SVMs and why Lasso produces sparse solutions (complementary slackness on the L1 constraint)
Summary
- Lagrange multipliers transform constrained optimization into an unconstrained system by introducing multiplier variables
- At the optimum, is parallel to — you cannot improve the objective while staying on the constraint
- KKT conditions extend Lagrange multipliers to inequality constraints, adding dual feasibility and complementary slackness
- Complementary slackness: inactive constraints have zero multipliers — they do not influence the solution
- Strong duality (for convex problems) lets us solve the dual instead of the primal
- SVMs are a direct application: the dual reveals support vectors and enables the kernel trick
- Regularization is equivalent to constrained optimization through Lagrangian duality
- Next: convexity and convergence provides the theoretical guarantees behind these methods
References
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 5. stanford.edu
- Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 12, 17.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 7.
- Bertsekas, D. P. (2016). Nonlinear Programming (3rd ed.). Athena Scientific.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 4.5. deeplearningbook.org