- 01 Limits and Continuity: The Foundation of Calculus 02 Derivatives and Differentiation: Measuring Rates of Change 03 Partial Derivatives and Gradients: Calculus in Multiple Dimensions 04 The Chain Rule and Computational Graphs: The Engine Behind Backpropagation 05 Taylor Series and Approximation: Local Models of Complex Functions 06 Gradient Descent: The Workhorse of Machine Learning Optimization 07 Stochastic Gradient Descent: Trading Precision for Speed 08 Adaptive Learning Rate Methods: From AdaGrad to Adam 09 Constrained Optimization: Lagrange Multipliers and KKT Conditions 10 Convexity and Convergence Theory: When Optimization Succeeds 11 Integration and Expectation: The Continuous Side of Probability 12 Calculus of Variations: Optimizing Over Functions 13 Second-Order and Natural Gradient Methods 14 Numerical Stability in Optimization: Making Training Work in Practice 15 Non-Smooth Optimization and Proximal Methods 16 Optimization Landscape of Neural Networks: Why Deep Learning Works 17 Implicit Differentiation and Differentiable Programming 18 Min-Max Optimization: Games, GANs, and Adversarial Training
Why Theory Matters
Throughout this series, we have built optimization algorithms — from gradient descent to Adam. But when do these algorithms actually work? How fast do they converge? When can we trust that we have found a good solution?
Convexity provides the strongest answers. For convex problems, every local minimum is a global minimum, and gradient descent is guaranteed to converge. For non-convex problems (neural networks), the theory is harder, but convex analysis still provides essential intuition and tools.
Convex Sets
A set is convex if the line segment between any two points in lies entirely within :
Convex sets have no “dents” or “holes.” Examples:
- Lines, planes, and hyperplanes
- Balls:
- Halfspaces:
- The probability simplex:
The intersection of convex sets is convex. This is why adding multiple constraints (each defining a convex set) still yields a convex feasible region.
Convex Functions
A function is convex if for all and :
Geometrically, the function lies below (or on) the line segment connecting any two points on its graph. The surface “curves upward” everywhere.
Equivalent Characterizations
For twice-differentiable functions, convexity is equivalent to:
First-order condition: for all
This says the tangent plane (first-order Taylor approximation) always lies below the function. In other words, a linear approximation of a convex function always underestimates it.
Second-order condition: (positive semidefinite) for all
The Hessian has no negative eigenvalues anywhere — the function never curves downward.
Key insight: For convex functions, every local minimum is automatically a global minimum. This is because the first-order condition prevents any other point from having a lower value than a local minimum. Convexity eliminates the nightmare scenario of getting stuck in bad local minima.
Common Convex Functions in ML
| Function | Formula | Where it appears |
|---|---|---|
| Squared error | MSE loss | |
| Absolute error | MAE loss | |
| Cross-entropy | Logistic regression | |
| Hinge loss | SVM | |
| L2 norm | Regularization | |
| L1 norm | Lasso | |
| Negative log-likelihood | GLMs |
Convexity-Preserving Operations
Convexity is preserved under:
- Non-negative weighted sums: with
- Pointwise maximum:
- Composition with affine: if is convex
- Infimum/supremum (under certain conditions)
These rules let you verify convexity of complex objectives by decomposing them into simpler parts.
Strongly Convex Functions
A function is -strongly convex if it curves upward at least as fast as a quadratic:
Equivalently, — the minimum eigenvalue of the Hessian is at least .
Strong convexity guarantees:
- A unique global minimum
- Exponential convergence of gradient descent (instead of )
- Stability: the minimum does not change much with small perturbations to the problem
Key insight: Adding L2 regularization to any convex loss makes it -strongly convex. This is one reason regularization helps optimization (not just generalization): it ensures the problem has a unique, well-defined solution and gradient descent converges exponentially.
Smoothness
A function is -smooth if its gradient is -Lipschitz:
Equivalently, — the maximum eigenvalue of the Hessian is at most .
Smoothness bounds how fast the gradient can change, which limits how far the function can deviate from its linear approximation. This is the condition that determines the maximum safe learning rate: .
Convergence Rates
Convergence rates tell us how many iterations an algorithm needs to reach a given accuracy . Here, and .
Gradient Descent
| Problem class | Rate | Iterations for accuracy |
|---|---|---|
| Convex + -smooth | ||
| Strongly convex () + -smooth | ||
| Non-convex + -smooth | to find stationary point |
The condition number determines convergence speed for strongly convex problems. A large (ill-conditioned) means slow convergence.
Accelerated Methods
Nesterov’s accelerated gradient (discussed in the SGD article) achieves faster rates:
| Problem class | GD rate | Nesterov rate |
|---|---|---|
| Convex + smooth | ||
| Strongly convex + smooth |
The accelerated rate for strongly convex functions is versus . For , GD needs iterations per order of magnitude improvement, while Nesterov needs only .
Nesterov’s rate is provably optimal — no first-order method can do better on the class of smooth convex functions.
SGD Convergence
For stochastic gradient descent with bounded gradient variance :
| Setting | Rate |
|---|---|
| Convex, constant | |
| Strongly convex, |
SGD converges more slowly than GD due to gradient noise. The rate for convex problems is optimal for stochastic first-order methods.
The Non-Convex Reality
Neural network loss functions are highly non-convex. The theory above does not directly apply. Yet gradient-based methods work remarkably well. Why?
Local Minima Quality
Empirical and theoretical evidence suggests that:
- In overparameterized networks (more parameters than data), most local minima have loss values close to the global minimum
- The loss landscape has many symmetries (permuting neurons) that create equivalent minima
- SGD’s implicit regularization biases toward flat, well-generalizing minima
Loss Landscape Properties
Modern deep learning benefits from several structural properties:
- Overparameterization: With enough parameters, the loss surface becomes smoother and local minima become global (or near-global)
- Skip connections (ResNets): Create smoother loss landscapes with fewer problematic saddle points
- Batch normalization: Reduces the dependence between layers, effectively improving the condition number
- Gradient noise from SGD: Helps escape saddle points and sharp minima
Non-Convex Convergence Guarantees
For smooth non-convex functions, gradient descent guarantees convergence to a stationary point () in iterations. This does not guarantee a minimum — it could be a saddle point. However, with noise (SGD) or second-order information, saddle points can be escaped efficiently.
Key insight: Convex optimization theory provides the best-case guarantees. Non-convex optimization in deep learning works in practice because of overparameterization, architectural innovations, and the regularizing effect of SGD. The theory of why deep learning works is still an active research area, but convex intuitions remain the essential starting point.
Worked Example: Convergence Rate Comparison
Consider minimizing where has eigenvalues and .
Parameters: , , .
Gradient descent with :
To reduce the error by a factor of : iterations.
Nesterov accelerated gradient:
To reduce by : iterations.
The accelerated method is 16x faster on this ill-conditioned problem.
Key Takeaways for Practice
| Concept | Practical implication |
|---|---|
| Convexity | Guarantees any local solution is globally optimal |
| Strong convexity () | L2 regularization creates this, ensuring unique solutions |
| Smoothness () | Determines max learning rate: |
| Condition number () | Predicts optimization difficulty; large = slow convergence |
| Nesterov acceleration | Optimal rate for first-order methods; momentum approximates this |
| Non-convexity in DL | Works in practice due to overparameterization and SGD regularization |
Why This Matters for ML
Convexity and convergence theory are the lens through which we understand optimization:
- Convex ML models (linear/logistic regression, SVMs) have guaranteed global optima — their solutions are trustworthy
- Regularization converts merely convex problems into strongly convex ones, improving both convergence and generalization
- Learning rate selection () comes directly from smoothness analysis
- Condition number explains why some problems train slowly and why preconditioning (adaptive methods) helps
- Non-convex deep learning works empirically, but convex intuitions guide architecture and hyperparameter choices
- This theory connects back to everything in the series: gradients, Taylor approximation, gradient descent, and constrained optimization
Summary
- A convex function curves upward everywhere — every local minimum is a global minimum
- Strong convexity () guarantees a unique minimum and exponential convergence
- Smoothness () bounds gradient variation and determines the max learning rate
- Condition number predicts convergence speed — ill-conditioned problems converge slowly
- Gradient descent converges at rate for convex and for strongly convex functions
- Nesterov acceleration achieves optimal convergence — provably the best any first-order method can do
- Non-convex deep learning loss surfaces work in practice due to overparameterization and SGD’s implicit regularization
- L2 regularization makes problems strongly convex, improving both convergence and generalization
- This completes the calculus and optimization series — from limits to convergence guarantees
References
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapters 2-3, 9. stanford.edu
- Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.
- Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 2-3.
- Bubeck, S. (2015). Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3-4), 231-357.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 4, 8. deeplearningbook.org
- Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). Visualizing the Loss Landscape of Neural Nets. NeurIPS.