Convexity and Convergence Theory: When Optimization Succeeds

Understand convex functions, global vs local optima, convergence rates, and the theoretical guarantees that underpin ML optimization algorithms.

Calculus & Optimization March 7, 2026 9 min read

Why Theory Matters

Throughout this series, we have built optimization algorithms — from gradient descent to Adam. But when do these algorithms actually work? How fast do they converge? When can we trust that we have found a good solution?

Convexity provides the strongest answers. For convex problems, every local minimum is a global minimum, and gradient descent is guaranteed to converge. For non-convex problems (neural networks), the theory is harder, but convex analysis still provides essential intuition and tools.

Convex Sets

A set CRn\mathcal{C} \subseteq \mathbb{R}^n is convex if the line segment between any two points in C\mathcal{C} lies entirely within C\mathcal{C}:

x,yC    tx+(1t)yCt[0,1]\mathbf{x}, \mathbf{y} \in \mathcal{C} \implies t\mathbf{x} + (1-t)\mathbf{y} \in \mathcal{C} \quad \forall \, t \in [0, 1]

Convex sets have no “dents” or “holes.” Examples:

  • Lines, planes, and hyperplanes
  • Balls: {x:xcr}\{\mathbf{x} : \|\mathbf{x} - \mathbf{c}\| \leq r\}
  • Halfspaces: {x:aTxb}\{\mathbf{x} : \mathbf{a}^T\mathbf{x} \leq b\}
  • The probability simplex: {p:pi0,pi=1}\{\mathbf{p} : p_i \geq 0, \sum p_i = 1\}

The intersection of convex sets is convex. This is why adding multiple constraints (each defining a convex set) still yields a convex feasible region.

Convex Functions

A function f:RnRf: \mathbb{R}^n \to \mathbb{R} is convex if for all x,y\mathbf{x}, \mathbf{y} and t[0,1]t \in [0, 1]:

f(tx+(1t)y)tf(x)+(1t)f(y)f(t\mathbf{x} + (1-t)\mathbf{y}) \leq t \, f(\mathbf{x}) + (1-t) f(\mathbf{y})

Geometrically, the function lies below (or on) the line segment connecting any two points on its graph. The surface “curves upward” everywhere.

Equivalent Characterizations

For twice-differentiable functions, convexity is equivalent to:

First-order condition: f(y)f(x)+f(x)T(yx)f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y} - \mathbf{x}) for all x,y\mathbf{x}, \mathbf{y}

This says the tangent plane (first-order Taylor approximation) always lies below the function. In other words, a linear approximation of a convex function always underestimates it.

Second-order condition: H(x)0\mathbf{H}(\mathbf{x}) \succeq 0 (positive semidefinite) for all x\mathbf{x}

The Hessian has no negative eigenvalues anywhere — the function never curves downward.

Key insight: For convex functions, every local minimum is automatically a global minimum. This is because the first-order condition prevents any other point from having a lower value than a local minimum. Convexity eliminates the nightmare scenario of getting stuck in bad local minima.

Common Convex Functions in ML

FunctionFormulaWhere it appears
Squared error(yy^)2(y - \hat{y})^2MSE loss
Absolute erroryy^\|y - \hat{y}\|MAE loss
Cross-entropyylog(y^)(1y)log(1y^)-y\log(\hat{y}) - (1-y)\log(1-\hat{y})Logistic regression
Hinge lossmax(0,1yy^)\max(0, 1 - y\hat{y})SVM
L2 normw2\|\mathbf{w}\|^2Regularization
L1 normw1\|\mathbf{w}\|_1Lasso
Negative log-likelihoodlogp(yX,θ)-\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})GLMs

Convexity-Preserving Operations

Convexity is preserved under:

  • Non-negative weighted sums: αf+βg\alpha f + \beta g with α,β0\alpha, \beta \geq 0
  • Pointwise maximum: max(f1,,fk)\max(f_1, \ldots, f_k)
  • Composition with affine: f(Ax+b)f(\mathbf{A}\mathbf{x} + \mathbf{b}) if ff is convex
  • Infimum/supremum (under certain conditions)

These rules let you verify convexity of complex objectives by decomposing them into simpler parts.

Strongly Convex Functions

A function is μ\mu-strongly convex if it curves upward at least as fast as a quadratic:

f(y)f(x)+f(x)T(yx)+μ2yx2f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y} - \mathbf{x}) + \frac{\mu}{2}\|\mathbf{y} - \mathbf{x}\|^2

Equivalently, H(x)μI\mathbf{H}(\mathbf{x}) \succeq \mu \mathbf{I} — the minimum eigenvalue of the Hessian is at least μ\mu.

Strong convexity guarantees:

  • A unique global minimum
  • Exponential convergence of gradient descent (instead of O(1/t)O(1/t))
  • Stability: the minimum does not change much with small perturbations to the problem

Key insight: Adding L2 regularization λ2w2\frac{\lambda}{2}\|\mathbf{w}\|^2 to any convex loss makes it λ\lambda-strongly convex. This is one reason regularization helps optimization (not just generalization): it ensures the problem has a unique, well-defined solution and gradient descent converges exponentially.

Smoothness

A function ff is LL-smooth if its gradient is LL-Lipschitz:

f(x)f(y)Lxy\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x} - \mathbf{y}\|

Equivalently, H(x)LI\mathbf{H}(\mathbf{x}) \preceq L\mathbf{I} — the maximum eigenvalue of the Hessian is at most LL.

Smoothness bounds how fast the gradient can change, which limits how far the function can deviate from its linear approximation. This is the condition that determines the maximum safe learning rate: α1/L\alpha \leq 1/L.

Convergence Rates

Convergence rates tell us how many iterations an algorithm needs to reach a given accuracy ϵ\epsilon. Here, ft=f(θt)f_t = f(\boldsymbol{\theta}_t) and f=f(θ)f^* = f(\boldsymbol{\theta}^*).

Gradient Descent

Problem classRateIterations for ϵ\epsilon accuracy
Convex + LL-smoothftfO(1/t)f_t - f^* \leq O(1/t)O(1/ϵ)O(1/\epsilon)
Strongly convex (μ\mu) + LL-smoothftfO((1μ/L)t)f_t - f^* \leq O((1 - \mu/L)^t)O(κlog(1/ϵ))O(\kappa \log(1/\epsilon))
Non-convex + LL-smoothft2O(1/t)\|\nabla f_t\|^2 \leq O(1/t)O(1/ϵ)O(1/\epsilon) to find stationary point

The condition number κ=L/μ\kappa = L/\mu determines convergence speed for strongly convex problems. A large κ\kappa (ill-conditioned) means slow convergence.

Accelerated Methods

Nesterov’s accelerated gradient (discussed in the SGD article) achieves faster rates:

Problem classGD rateNesterov rate
Convex + smoothO(1/t)O(1/t)O(1/t2)O(1/t^2)
Strongly convex + smoothO((11/κ)t)O((1 - 1/\kappa)^t)O((11/κ)t)O((1 - 1/\sqrt{\kappa})^t)

The accelerated rate for strongly convex functions is O((11/κ)t)O((1 - 1/\sqrt{\kappa})^t) versus O((11/κ)t)O((1 - 1/\kappa)^t). For κ=100\kappa = 100, GD needs 100\sim 100 iterations per order of magnitude improvement, while Nesterov needs only 10\sim 10.

Nesterov’s rate is provably optimal — no first-order method can do better on the class of smooth convex functions.

SGD Convergence

For stochastic gradient descent with bounded gradient variance σ2\sigma^2:

SettingRate
Convex, constant α=1/T\alpha = 1/\sqrt{T}O(1/T)O(1/\sqrt{T})
Strongly convex, αt=1/(μt)\alpha_t = 1/(\mu t)O(1/T)O(1/T)

SGD converges more slowly than GD due to gradient noise. The O(1/T)O(1/\sqrt{T}) rate for convex problems is optimal for stochastic first-order methods.

The Non-Convex Reality

Neural network loss functions are highly non-convex. The theory above does not directly apply. Yet gradient-based methods work remarkably well. Why?

Local Minima Quality

Empirical and theoretical evidence suggests that:

  • In overparameterized networks (more parameters than data), most local minima have loss values close to the global minimum
  • The loss landscape has many symmetries (permuting neurons) that create equivalent minima
  • SGD’s implicit regularization biases toward flat, well-generalizing minima

Loss Landscape Properties

Modern deep learning benefits from several structural properties:

  • Overparameterization: With enough parameters, the loss surface becomes smoother and local minima become global (or near-global)
  • Skip connections (ResNets): Create smoother loss landscapes with fewer problematic saddle points
  • Batch normalization: Reduces the dependence between layers, effectively improving the condition number
  • Gradient noise from SGD: Helps escape saddle points and sharp minima

Non-Convex Convergence Guarantees

For smooth non-convex functions, gradient descent guarantees convergence to a stationary point (fϵ\|\nabla f\| \leq \epsilon) in O(1/ϵ2)O(1/\epsilon^2) iterations. This does not guarantee a minimum — it could be a saddle point. However, with noise (SGD) or second-order information, saddle points can be escaped efficiently.

Key insight: Convex optimization theory provides the best-case guarantees. Non-convex optimization in deep learning works in practice because of overparameterization, architectural innovations, and the regularizing effect of SGD. The theory of why deep learning works is still an active research area, but convex intuitions remain the essential starting point.

Worked Example: Convergence Rate Comparison

Consider minimizing f(x)=12xTAxf(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} where A\mathbf{A} has eigenvalues λ1=100\lambda_1 = 100 and λ2=1\lambda_2 = 1.

Parameters: L=100L = 100, μ=1\mu = 1, κ=100\kappa = 100.

Gradient descent with α=1/L=0.01\alpha = 1/L = 0.01:

ftf(κ1κ+1)2t(f0f)=(99101)2t(f0f)f_t - f^* \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2t}(f_0 - f^*) = \left(\frac{99}{101}\right)^{2t}(f_0 - f^*)

To reduce the error by a factor of 10610^{-6}: t6ln102ln(101/99)695t \approx \frac{6 \ln 10}{2 \ln(101/99)} \approx 695 iterations.

Nesterov accelerated gradient:

ftf(κ1κ+1)2t(f0f)=(911)2t(f0f)f_t - f^* \leq \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2t}(f_0 - f^*) = \left(\frac{9}{11}\right)^{2t}(f_0 - f^*)

To reduce by 10610^{-6}: t6ln102ln(11/9)42t \approx \frac{6 \ln 10}{2 \ln(11/9)} \approx 42 iterations.

The accelerated method is 16x faster on this ill-conditioned problem.

Key Takeaways for Practice

ConceptPractical implication
ConvexityGuarantees any local solution is globally optimal
Strong convexity (μ\mu)L2 regularization creates this, ensuring unique solutions
Smoothness (LL)Determines max learning rate: α1/L\alpha \leq 1/L
Condition number (κ=L/μ\kappa = L/\mu)Predicts optimization difficulty; large κ\kappa = slow convergence
Nesterov accelerationOptimal rate for first-order methods; momentum approximates this
Non-convexity in DLWorks in practice due to overparameterization and SGD regularization

Why This Matters for ML

Convexity and convergence theory are the lens through which we understand optimization:

  • Convex ML models (linear/logistic regression, SVMs) have guaranteed global optima — their solutions are trustworthy
  • Regularization converts merely convex problems into strongly convex ones, improving both convergence and generalization
  • Learning rate selection (α1/L\alpha \leq 1/L) comes directly from smoothness analysis
  • Condition number explains why some problems train slowly and why preconditioning (adaptive methods) helps
  • Non-convex deep learning works empirically, but convex intuitions guide architecture and hyperparameter choices
  • This theory connects back to everything in the series: gradients, Taylor approximation, gradient descent, and constrained optimization

Summary

  • A convex function curves upward everywhere — every local minimum is a global minimum
  • Strong convexity (μ>0\mu > 0) guarantees a unique minimum and exponential convergence
  • Smoothness (LL) bounds gradient variation and determines the max learning rate
  • Condition number κ=L/μ\kappa = L/\mu predicts convergence speed — ill-conditioned problems converge slowly
  • Gradient descent converges at rate O(1/t)O(1/t) for convex and O((1μ/L)t)O((1-\mu/L)^t) for strongly convex functions
  • Nesterov acceleration achieves optimal O(1/t2)O(1/t^2) convergence — provably the best any first-order method can do
  • Non-convex deep learning loss surfaces work in practice due to overparameterization and SGD’s implicit regularization
  • L2 regularization makes problems strongly convex, improving both convergence and generalization
  • This completes the calculus and optimization series — from limits to convergence guarantees

References

  • Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapters 2-3, 9. stanford.edu
  • Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.
  • Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 2-3.
  • Bubeck, S. (2015). Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3-4), 231-357.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 4, 8. deeplearningbook.org
  • Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). Visualizing the Loss Landscape of Neural Nets. NeurIPS.

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay