Convexity and Convergence Theory: When Optimization Succeeds

Calculus & Optimization Series 10 / 18

Why Theory Matters

Throughout this series, we have built optimization algorithms — from gradient descent to Adam. But when do these algorithms actually work? How fast do they converge? When can we trust that we have found a good solution?

Convexity provides the strongest answers. For convex problems, every local minimum is a global minimum, and gradient descent is guaranteed to converge. For non-convex problems (neural networks), the theory is harder, but convex analysis still provides essential intuition and tools.

Convex Sets

A set $\mathcal{C} \subseteq \mathbb{R}^n$ is convex if the line segment between any two points in $\mathcal{C}$ lies entirely within $\mathcal{C}$ :

\mathbf{x}, \mathbf{y} \in \mathcal{C} \implies t\mathbf{x} + (1-t)\mathbf{y} \in \mathcal{C} \quad \forall \, t \in [0, 1]

Convex sets have no “dents” or “holes.” Examples:

Lines, planes, and hyperplanes
Balls: $\{\mathbf{x} : \|\mathbf{x} - \mathbf{c}\| \leq r\}$
Halfspaces: $\{\mathbf{x} : \mathbf{a}^T\mathbf{x} \leq b\}$
The probability simplex: $\{\mathbf{p} : p_i \geq 0, \sum p_i = 1\}$

The intersection of convex sets is convex. This is why adding multiple constraints (each defining a convex set) still yields a convex feasible region.

Convex Functions

A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for all $\mathbf{x}, \mathbf{y}$ and $t \in [0, 1]$ :

f(t\mathbf{x} + (1-t)\mathbf{y}) \leq t \, f(\mathbf{x}) + (1-t) f(\mathbf{y})

Geometrically, the function lies below (or on) the line segment connecting any two points on its graph. The surface “curves upward” everywhere.

Equivalent Characterizations

For twice-differentiable functions, convexity is equivalent to:

First-order condition: $f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y} - \mathbf{x})$ for all $\mathbf{x}, \mathbf{y}$

This says the tangent plane (first-order Taylor approximation) always lies below the function. In other words, a linear approximation of a convex function always underestimates it.

Second-order condition: $\mathbf{H}(\mathbf{x}) \succeq 0$ (positive semidefinite) for all $\mathbf{x}$

The Hessian has no negative eigenvalues anywhere — the function never curves downward.

Key insight: For convex functions, every local minimum is automatically a global minimum. This is because the first-order condition prevents any other point from having a lower value than a local minimum. Convexity eliminates the nightmare scenario of getting stuck in bad local minima.

Common Convex Functions in ML

Function	Formula	Where it appears
Squared error	$(y - \hat{y})^2$	MSE loss
Absolute error	$\\|y - \hat{y}\\|$	MAE loss
Cross-entropy	$-y\log(\hat{y}) - (1-y)\log(1-\hat{y})$	Logistic regression
Hinge loss	$\max(0, 1 - y\hat{y})$	SVM
L2 norm	$\\|\mathbf{w}\\|^2$	Regularization
L1 norm	$\\|\mathbf{w}\\|_1$	Lasso
Negative log-likelihood	$-\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})$	GLMs

Convexity-Preserving Operations

Convexity is preserved under:

Non-negative weighted sums: $\alpha f + \beta g$ with $\alpha, \beta \geq 0$
Pointwise maximum: $\max(f_1, \ldots, f_k)$
Composition with affine: $f(\mathbf{A}\mathbf{x} + \mathbf{b})$ if $f$ is convex
Infimum/supremum (under certain conditions)

These rules let you verify convexity of complex objectives by decomposing them into simpler parts.

Strongly Convex Functions

A function is $\mu$ -strongly convex if it curves upward at least as fast as a quadratic:

f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y} - \mathbf{x}) + \frac{\mu}{2}\|\mathbf{y} - \mathbf{x}\|^2

Equivalently, $\mathbf{H}(\mathbf{x}) \succeq \mu \mathbf{I}$ — the minimum eigenvalue of the Hessian is at least $\mu$ .

Strong convexity guarantees:

A unique global minimum
Exponential convergence of gradient descent (instead of $O(1/t)$ )
Stability: the minimum does not change much with small perturbations to the problem

Key insight: Adding L2 regularization $\frac{\lambda}{2}\|\mathbf{w}\|^2$ to any convex loss makes it $\lambda$ -strongly convex. This is one reason regularization helps optimization (not just generalization): it ensures the problem has a unique, well-defined solution and gradient descent converges exponentially.

Smoothness

A function $f$ is $L$ -smooth if its gradient is $L$ -Lipschitz:

\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x} - \mathbf{y}\|

Equivalently, $\mathbf{H}(\mathbf{x}) \preceq L\mathbf{I}$ — the maximum eigenvalue of the Hessian is at most $L$ .

Smoothness bounds how fast the gradient can change, which limits how far the function can deviate from its linear approximation. This is the condition that determines the maximum safe learning rate: $\alpha \leq 1/L$ .

Convergence Rates

Convergence rates tell us how many iterations an algorithm needs to reach a given accuracy $\epsilon$ . Here, $f_t = f(\boldsymbol{\theta}_t)$ and $f^* = f(\boldsymbol{\theta}^*)$ .

Gradient Descent

Problem class	Rate	Iterations for $\epsilon$ accuracy
Convex + $L$ -smooth	$f_t - f^* \leq O(1/t)$	$O(1/\epsilon)$
Strongly convex ( $\mu$ ) + $L$ -smooth	$f_t - f^* \leq O((1 - \mu/L)^t)$	$O(\kappa \log(1/\epsilon))$
Non-convex + $L$ -smooth	$\\|\nabla f_t\\|^2 \leq O(1/t)$	$O(1/\epsilon)$ to find stationary point

The condition number $\kappa = L/\mu$ determines convergence speed for strongly convex problems. A large $\kappa$ (ill-conditioned) means slow convergence.

Accelerated Methods

Nesterov’s accelerated gradient (discussed in the SGD article) achieves faster rates:

Problem class	GD rate	Nesterov rate
Convex + smooth	$O(1/t)$	$O(1/t^2)$
Strongly convex + smooth	$O((1 - 1/\kappa)^t)$	$O((1 - 1/\sqrt{\kappa})^t)$

The accelerated rate for strongly convex functions is $O((1 - 1/\sqrt{\kappa})^t)$ versus $O((1 - 1/\kappa)^t)$ . For $\kappa = 100$ , GD needs $\sim 100$ iterations per order of magnitude improvement, while Nesterov needs only $\sim 10$ .

Nesterov’s rate is provably optimal — no first-order method can do better on the class of smooth convex functions.

SGD Convergence

For stochastic gradient descent with bounded gradient variance $\sigma^2$ :

Setting	Rate
Convex, constant $\alpha = 1/\sqrt{T}$	$O(1/\sqrt{T})$
Strongly convex, $\alpha_t = 1/(\mu t)$	$O(1/T)$

SGD converges more slowly than GD due to gradient noise. The $O(1/\sqrt{T})$ rate for convex problems is optimal for stochastic first-order methods.

The Non-Convex Reality

Neural network loss functions are highly non-convex. The theory above does not directly apply. Yet gradient-based methods work remarkably well. Why?

Local Minima Quality

Empirical and theoretical evidence suggests that:

In overparameterized networks (more parameters than data), most local minima have loss values close to the global minimum
The loss landscape has many symmetries (permuting neurons) that create equivalent minima
SGD’s implicit regularization biases toward flat, well-generalizing minima

Loss Landscape Properties

Modern deep learning benefits from several structural properties:

Overparameterization: With enough parameters, the loss surface becomes smoother and local minima become global (or near-global)
Skip connections (ResNets): Create smoother loss landscapes with fewer problematic saddle points
Batch normalization: Reduces the dependence between layers, effectively improving the condition number
Gradient noise from SGD: Helps escape saddle points and sharp minima

Non-Convex Convergence Guarantees

For smooth non-convex functions, gradient descent guarantees convergence to a stationary point ( $\|\nabla f\| \leq \epsilon$ ) in $O(1/\epsilon^2)$ iterations. This does not guarantee a minimum — it could be a saddle point. However, with noise (SGD) or second-order information, saddle points can be escaped efficiently.

Key insight: Convex optimization theory provides the best-case guarantees. Non-convex optimization in deep learning works in practice because of overparameterization, architectural innovations, and the regularizing effect of SGD. The theory of why deep learning works is still an active research area, but convex intuitions remain the essential starting point.

Worked Example: Convergence Rate Comparison

Consider minimizing $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}$ where $\mathbf{A}$ has eigenvalues $\lambda_1 = 100$ and $\lambda_2 = 1$ .

Parameters: $L = 100$ , $\mu = 1$ , $\kappa = 100$ .

Gradient descent with $\alpha = 1/L = 0.01$ :

f_t - f^* \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2t}(f_0 - f^*) = \left(\frac{99}{101}\right)^{2t}(f_0 - f^*)

To reduce the error by a factor of $10^{-6}$ : $t \approx \frac{6 \ln 10}{2 \ln(101/99)} \approx 695$ iterations.

Nesterov accelerated gradient:

f_t - f^* \leq \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2t}(f_0 - f^*) = \left(\frac{9}{11}\right)^{2t}(f_0 - f^*)

To reduce by $10^{-6}$ : $t \approx \frac{6 \ln 10}{2 \ln(11/9)} \approx 42$ iterations.

The accelerated method is 16x faster on this ill-conditioned problem.

Key Takeaways for Practice

Concept	Practical implication
Convexity	Guarantees any local solution is globally optimal
Strong convexity ( $\mu$ )	L2 regularization creates this, ensuring unique solutions
Smoothness ( $L$ )	Determines max learning rate: $\alpha \leq 1/L$
Condition number ( $\kappa = L/\mu$ )	Predicts optimization difficulty; large $\kappa$ = slow convergence
Nesterov acceleration	Optimal rate for first-order methods; momentum approximates this
Non-convexity in DL	Works in practice due to overparameterization and SGD regularization

Why This Matters for ML

Convexity and convergence theory are the lens through which we understand optimization:

Convex ML models (linear/logistic regression, SVMs) have guaranteed global optima — their solutions are trustworthy
Regularization converts merely convex problems into strongly convex ones, improving both convergence and generalization
Learning rate selection ( $\alpha \leq 1/L$ ) comes directly from smoothness analysis
Condition number explains why some problems train slowly and why preconditioning (adaptive methods) helps
Non-convex deep learning works empirically, but convex intuitions guide architecture and hyperparameter choices
This theory connects back to everything in the series: gradients, Taylor approximation, gradient descent, and constrained optimization

Summary

A convex function curves upward everywhere — every local minimum is a global minimum
Strong convexity ( $\mu > 0$ ) guarantees a unique minimum and exponential convergence
Smoothness ( $L$ ) bounds gradient variation and determines the max learning rate
Condition number $\kappa = L/\mu$ predicts convergence speed — ill-conditioned problems converge slowly
Gradient descent converges at rate $O(1/t)$ for convex and $O((1-\mu/L)^t)$ for strongly convex functions
Nesterov acceleration achieves optimal $O(1/t^2)$ convergence — provably the best any first-order method can do
Non-convex deep learning loss surfaces work in practice due to overparameterization and SGD’s implicit regularization
L2 regularization makes problems strongly convex, improving both convergence and generalization
This completes the calculus and optimization series — from limits to convergence guarantees

References

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapters 2-3, 9. stanford.edu
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 2-3.
Bubeck, S. (2015). Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3-4), 231-357.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 4, 8. deeplearningbook.org
Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). Visualizing the Loss Landscape of Neural Nets. NeurIPS.