Non-Smooth Optimization and Proximal Methods

Calculus & Optimization Series 15 / 18

When Gradients Do Not Exist

The optimization methods we have studied — gradient descent, Adam — assume the objective is differentiable. But many important ML objectives are not:

L1 regularization $\|\mathbf{w}\|_1 = \sum |w_i|$ has a kink at every $w_i = 0$
ReLU $\max(0, x)$ is not differentiable at $x = 0$
Hinge loss $\max(0, 1 - y\hat{y})$ has a corner at $y\hat{y} = 1$
Max-pooling selects the maximum element — non-smooth at ties
Quantization (rounding to integers) is piecewise constant — zero derivative almost everywhere

These functions are non-smooth: they have points where the derivative does not exist. We need tools that generalize gradient descent to handle them.

Subgradients

A subgradient of a convex function $f$ at $\mathbf{x}$ is any vector $\mathbf{g}$ satisfying:

f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{g}^T(\mathbf{y} - \mathbf{x}) \quad \forall \, \mathbf{y}

Geometrically, $\mathbf{g}$ defines a supporting hyperplane that touches the graph of $f$ at $\mathbf{x}$ and lies below $f$ everywhere.

At differentiable points, the only subgradient is the gradient: $\mathbf{g} = \nabla f(\mathbf{x})$ . At non-differentiable points, there are multiple subgradients — the set of all subgradients is called the subdifferential $\partial f(\mathbf{x})$ .

Example: Absolute Value

For $f(x) = |x|$ :

\partial f(x) = \begin{cases} \{-1\} & x < 0 \\ [-1, 1] & x = 0 \\ \{1\} & x > 0 \end{cases}

At $x = 0$ , any value in $[-1, 1]$ is a valid subgradient. At all other points, the subdifferential is a single point (the derivative).

Example: L1 Norm

For $f(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_i |w_i|$ , the subdifferential is component-wise:

[\partial f(\mathbf{w})]_i = \begin{cases} \{-1\} & w_i < 0 \\ [-1, 1] & w_i = 0 \\ \{1\} & w_i > 0 \end{cases}

This is the sign function with a set-valued extension at zero: $\partial \|\mathbf{w}\|_1 = \text{sign}(\mathbf{w})$ .

Subgradient Descent

Replace the gradient with a subgradient:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha_t \mathbf{g}_t, \quad \mathbf{g}_t \in \partial f(\boldsymbol{\theta}_t)

Unlike gradient descent, subgradient descent does not guarantee descent at every step. The subgradient direction may not be a descent direction. Convergence requires a diminishing step size:

\alpha_t \to 0 \quad \text{and} \quad \sum_{t=1}^{\infty} \alpha_t = \infty

A common choice is $\alpha_t = c / \sqrt{t}$ . The convergence rate is $O(1/\sqrt{t})$ — slower than gradient descent’s $O(1/t)$ for smooth functions.

Key insight: Subgradient methods are simple but slow. They work for any convex function but converge at the worst-case rate. For structured non-smooth problems (like L1 regularization), proximal methods are much faster.

Proximal Operators

The proximal operator of a function $h$ is:

\text{prox}_{\alpha h}(\mathbf{v}) = \arg\min_{\mathbf{x}} \left\{h(\mathbf{x}) + \frac{1}{2\alpha}\|\mathbf{x} - \mathbf{v}\|^2\right\}

It finds the point that balances minimizing $h$ with staying close to $\mathbf{v}$ . The parameter $\alpha$ controls the trade-off.

Proximal Operator of the L1 Norm

For $h(\mathbf{x}) = \lambda\|\mathbf{x}\|_1$ , the proximal operator is soft thresholding:

[\text{prox}_{\alpha\lambda\|\cdot\|_1}(\mathbf{v})]_i = \text{sign}(v_i) \max(|v_i| - \alpha\lambda, 0)

This shrinks each component toward zero by $\alpha\lambda$ . Components smaller than $\alpha\lambda$ are set to exactly zero — this is the mechanism behind L1 sparsity.

Key insight: Soft thresholding is why Lasso regression produces sparse solutions. The proximal operator of the L1 norm literally zeros out small coefficients. This is fundamentally different from L2 regularization, which shrinks coefficients but never sets them to exactly zero.

Other Important Proximal Operators

Function $h(\mathbf{x})$	Proximal operator
$\frac{\lambda}{2}\\|\mathbf{x}\\|^2$ (L2)	$\frac{\mathbf{v}}{1 + \alpha\lambda}$ (scaling)
$\lambda\\|\mathbf{x}\\|_1$ (L1)	Soft thresholding
$I_C(\mathbf{x})$ (indicator of set $C$ )	Projection onto $C$
$\lambda\\|\mathbf{x}\\|_*$ (nuclear norm)	Singular value thresholding

The indicator function $I_C(\mathbf{x}) = 0$ if $\mathbf{x} \in C$ , $+\infty$ otherwise. Its proximal operator is the Euclidean projection onto $C$ — connecting proximal methods to constrained optimization.

Proximal Gradient Descent

When the objective splits into a smooth part $f$ and a non-smooth part $h$ :

\min_{\mathbf{x}} f(\mathbf{x}) + h(\mathbf{x})

Proximal gradient descent alternates between a gradient step on $f$ and a proximal step on $h$ :

\boldsymbol{\theta}_{t+1} = \text{prox}_{\alpha h}\left(\boldsymbol{\theta}_t - \alpha \nabla f(\boldsymbol{\theta}_t)\right)

This is also called ISTA (Iterative Shrinkage-Thresholding Algorithm) when applied to L1-regularized problems.

Application: Lasso Regression

The Lasso objective $\mathcal{L} = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1$ splits as:

Smooth: $f(\mathbf{w}) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
Non-smooth: $h(\mathbf{w}) = \lambda\|\mathbf{w}\|_1$

Proximal gradient descent:

\mathbf{w}_{t+1} = \text{soft}_{\alpha\lambda}\left(\mathbf{w}_t - \frac{\alpha}{n}\mathbf{X}^T(\mathbf{X}\mathbf{w}_t - \mathbf{y})\right)

where $\text{soft}_\tau(v) = \text{sign}(v)\max(|v| - \tau, 0)$ .

FISTA (Fast ISTA)

Adding Nesterov-style acceleration to proximal gradient descent gives FISTA, improving convergence from $O(1/t)$ to $O(1/t^2)$ :

\begin{aligned} \mathbf{z}_{t+1} &= \text{prox}_{\alpha h}\left(\mathbf{y}_t - \alpha \nabla f(\mathbf{y}_t)\right) \\[6pt] t_{t+1} &= \frac{1 + \sqrt{1 + 4t_t^2}}{2} \\[6pt] \mathbf{y}_{t+1} &= \mathbf{z}_{t+1} + \frac{t_t - 1}{t_{t+1}}(\mathbf{z}_{t+1} - \mathbf{z}_t) \end{aligned}

ADMM

The Alternating Direction Method of Multipliers (ADMM) solves problems of the form:

\min_{\mathbf{x}, \mathbf{z}} f(\mathbf{x}) + g(\mathbf{z}) \quad \text{subject to} \quad \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c}

ADMM iterates three steps:

\begin{aligned} \mathbf{x}_{t+1} &= \arg\min_\mathbf{x} \left\{f(\mathbf{x}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z}_t - \mathbf{c} + \mathbf{u}_t\|^2\right\} \\[6pt] \mathbf{z}_{t+1} &= \arg\min_\mathbf{z} \left\{g(\mathbf{z}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x}_{t+1} + \mathbf{B}\mathbf{z} - \mathbf{c} + \mathbf{u}_t\|^2\right\} \\[6pt] \mathbf{u}_{t+1} &= \mathbf{u}_t + \mathbf{A}\mathbf{x}_{t+1} + \mathbf{B}\mathbf{z}_{t+1} - \mathbf{c} \end{aligned}

ADMM decomposes the problem into simpler subproblems that often have closed-form solutions. It is particularly useful for distributed optimization (splitting data across machines) and for problems with multiple non-smooth terms.

ADMM for L1 + L2 Regularization (Elastic Net)

For $\min \frac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda_1\|\mathbf{w}\|_1 + \frac{\lambda_2}{2}\|\mathbf{w}\|^2$ , ADMM splits into:

$\mathbf{x}$ -step: Ridge regression (smooth, closed-form)
$\mathbf{z}$ -step: Soft thresholding (proximal of L1)
$\mathbf{u}$ -step: Simple addition

Non-Smooth Optimization in Deep Learning

ReLU and the “Gradient” at Zero

ReLU $f(x) = \max(0, x)$ is not differentiable at $x = 0$ . In practice, frameworks define:

\text{ReLU}'(x) = \begin{cases} 0 & x < 0 \\ 1 & x > 0 \\ 0 \text{ or } 1 & x = 0 \end{cases}

This works because $P(x = 0) = 0$ for continuous input distributions — the non-differentiable point has measure zero.

Straight-Through Estimator

For truly non-differentiable operations like quantization (rounding to integers), we cannot even define a useful subgradient. The Straight-Through Estimator (STE) simply passes the gradient through the non-differentiable operation unchanged:

\frac{\partial \text{round}(x)}{\partial x} \approx 1

STE is mathematically unjustified but empirically effective. It enables:

Binary neural networks (weights in $\{-1, +1\}$ )
Quantization-aware training (INT8 models)
Discrete latent variables (VQ-VAE)

Gumbel-Softmax

For differentiable sampling from categorical distributions, the Gumbel-Softmax relaxation replaces discrete samples with continuous approximations:

y_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_j \exp((\log \pi_j + g_j)/\tau)}

where $g_i$ are Gumbel noise samples and $\tau$ is a temperature. As $\tau \to 0$ , this approaches a discrete sample. The gradient is well-defined for $\tau > 0$ , enabling backpropagation through discrete choices.

Model Pruning and Sparsity

Non-smooth optimization directly enables model compression:

L1 regularization: Produces sparse weights (many exactly zero) via the proximal/soft-thresholding mechanism
Group Lasso: $\sum_g \|\mathbf{w}_g\|_2$ zeros out entire groups (neurons, filters)
Iterative magnitude pruning: Train, remove smallest weights, retrain — structured sparsity
$L_0$ regularization: Directly penalize the number of nonzero weights (NP-hard in general, relaxed via continuous approximations)

Key insight: Sparsity is fundamentally a non-smooth optimization problem. L1 and its variants produce exact zeros because the proximal operator (soft thresholding) has a “dead zone” that forces small values to zero. L2 regularization shrinks but never zeros — this is why L1 is preferred for feature selection and pruning.

Why This Matters for ML

Non-smooth optimization is pervasive in modern ML:

L1/Lasso produces sparse models and enables feature selection — proximal methods are the standard solver
ReLU is non-differentiable but works due to measure-zero arguments
Quantization for model compression relies on straight-through estimators
ADMM enables distributed optimization across multiple machines
Pruning uses L1 and group sparsity to compress neural networks by 10-100x
Understanding non-smooth optimization explains when gradient-based methods can and cannot be used

Summary

Subgradients generalize gradients to non-differentiable convex functions via supporting hyperplanes
Subgradient descent is simple but converges slowly at $O(1/\sqrt{t})$
Proximal operators handle non-smooth terms efficiently — the L1 proximal is soft thresholding
Proximal gradient descent (ISTA/FISTA) splits objectives into smooth + non-smooth parts, achieving $O(1/t^2)$ with acceleration
ADMM decomposes problems into simpler subproblems, useful for distributed optimization
ReLU’s non-differentiability at zero is harmless because it has measure zero
Straight-through estimators enable backprop through discrete operations (quantization, binary networks)
L1 sparsity comes from soft thresholding — the proximal operator zeros out small weights
Next: optimization landscape of neural networks examines the loss surfaces that these algorithms navigate

References

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 6. stanford.edu
Parikh, N., & Boyd, S. (2014). Proximal Algorithms. Foundations and Trends in Optimization, 1(3), 127-239.
Beck, A., & Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1), 183-202.
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed Optimization and Statistical Learning via ADMM. Foundations and Trends in Machine Learning, 3(1), 1-122.
Bengio, Y., Leonard, N., & Courville, A. (2013). Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv:1308.3432