Non-Smooth Optimization and Proximal Methods

Handle non-differentiable objectives with subgradients, proximal operators, and ADMM — the tools behind L1 sparsity, pruning, and robust losses.

Calculus & Optimization March 7, 2026 8 min read

When Gradients Do Not Exist

The optimization methods we have studied — gradient descent, Adam — assume the objective is differentiable. But many important ML objectives are not:

  • L1 regularization w1=wi\|\mathbf{w}\|_1 = \sum |w_i| has a kink at every wi=0w_i = 0
  • ReLU max(0,x)\max(0, x) is not differentiable at x=0x = 0
  • Hinge loss max(0,1yy^)\max(0, 1 - y\hat{y}) has a corner at yy^=1y\hat{y} = 1
  • Max-pooling selects the maximum element — non-smooth at ties
  • Quantization (rounding to integers) is piecewise constant — zero derivative almost everywhere

These functions are non-smooth: they have points where the derivative does not exist. We need tools that generalize gradient descent to handle them.

Subgradients

A subgradient of a convex function ff at x\mathbf{x} is any vector g\mathbf{g} satisfying:

f(y)f(x)+gT(yx)yf(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{g}^T(\mathbf{y} - \mathbf{x}) \quad \forall \, \mathbf{y}

Geometrically, g\mathbf{g} defines a supporting hyperplane that touches the graph of ff at x\mathbf{x} and lies below ff everywhere.

At differentiable points, the only subgradient is the gradient: g=f(x)\mathbf{g} = \nabla f(\mathbf{x}). At non-differentiable points, there are multiple subgradients — the set of all subgradients is called the subdifferential f(x)\partial f(\mathbf{x}).

Example: Absolute Value

For f(x)=xf(x) = |x|:

f(x)={{1}x<0[1,1]x=0{1}x>0\partial f(x) = \begin{cases} \{-1\} & x < 0 \\ [-1, 1] & x = 0 \\ \{1\} & x > 0 \end{cases}

At x=0x = 0, any value in [1,1][-1, 1] is a valid subgradient. At all other points, the subdifferential is a single point (the derivative).

Example: L1 Norm

For f(w)=w1=iwif(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_i |w_i|, the subdifferential is component-wise:

[f(w)]i={{1}wi<0[1,1]wi=0{1}wi>0[\partial f(\mathbf{w})]_i = \begin{cases} \{-1\} & w_i < 0 \\ [-1, 1] & w_i = 0 \\ \{1\} & w_i > 0 \end{cases}

This is the sign function with a set-valued extension at zero: w1=sign(w)\partial \|\mathbf{w}\|_1 = \text{sign}(\mathbf{w}).

Subgradient Descent

Replace the gradient with a subgradient:

θt+1=θtαtgt,gtf(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha_t \mathbf{g}_t, \quad \mathbf{g}_t \in \partial f(\boldsymbol{\theta}_t)

Unlike gradient descent, subgradient descent does not guarantee descent at every step. The subgradient direction may not be a descent direction. Convergence requires a diminishing step size:

αt0andt=1αt=\alpha_t \to 0 \quad \text{and} \quad \sum_{t=1}^{\infty} \alpha_t = \infty

A common choice is αt=c/t\alpha_t = c / \sqrt{t}. The convergence rate is O(1/t)O(1/\sqrt{t}) — slower than gradient descent’s O(1/t)O(1/t) for smooth functions.

Key insight: Subgradient methods are simple but slow. They work for any convex function but converge at the worst-case rate. For structured non-smooth problems (like L1 regularization), proximal methods are much faster.

Proximal Operators

The proximal operator of a function hh is:

proxαh(v)=argminx{h(x)+12αxv2}\text{prox}_{\alpha h}(\mathbf{v}) = \arg\min_{\mathbf{x}} \left\{h(\mathbf{x}) + \frac{1}{2\alpha}\|\mathbf{x} - \mathbf{v}\|^2\right\}

It finds the point that balances minimizing hh with staying close to v\mathbf{v}. The parameter α\alpha controls the trade-off.

Proximal Operator of the L1 Norm

For h(x)=λx1h(\mathbf{x}) = \lambda\|\mathbf{x}\|_1, the proximal operator is soft thresholding:

[proxαλ1(v)]i=sign(vi)max(viαλ,0)[\text{prox}_{\alpha\lambda\|\cdot\|_1}(\mathbf{v})]_i = \text{sign}(v_i) \max(|v_i| - \alpha\lambda, 0)

This shrinks each component toward zero by αλ\alpha\lambda. Components smaller than αλ\alpha\lambda are set to exactly zero — this is the mechanism behind L1 sparsity.

Key insight: Soft thresholding is why Lasso regression produces sparse solutions. The proximal operator of the L1 norm literally zeros out small coefficients. This is fundamentally different from L2 regularization, which shrinks coefficients but never sets them to exactly zero.

Other Important Proximal Operators

Function h(x)h(\mathbf{x})Proximal operator
λ2x2\frac{\lambda}{2}\|\mathbf{x}\|^2 (L2)v1+αλ\frac{\mathbf{v}}{1 + \alpha\lambda} (scaling)
λx1\lambda\|\mathbf{x}\|_1 (L1)Soft thresholding
IC(x)I_C(\mathbf{x}) (indicator of set CC)Projection onto CC
λx\lambda\|\mathbf{x}\|_* (nuclear norm)Singular value thresholding

The indicator function IC(x)=0I_C(\mathbf{x}) = 0 if xC\mathbf{x} \in C, ++\infty otherwise. Its proximal operator is the Euclidean projection onto CC — connecting proximal methods to constrained optimization.

Proximal Gradient Descent

When the objective splits into a smooth part ff and a non-smooth part hh:

minxf(x)+h(x)\min_{\mathbf{x}} f(\mathbf{x}) + h(\mathbf{x})

Proximal gradient descent alternates between a gradient step on ff and a proximal step on hh:

θt+1=proxαh(θtαf(θt))\boldsymbol{\theta}_{t+1} = \text{prox}_{\alpha h}\left(\boldsymbol{\theta}_t - \alpha \nabla f(\boldsymbol{\theta}_t)\right)

This is also called ISTA (Iterative Shrinkage-Thresholding Algorithm) when applied to L1-regularized problems.

Application: Lasso Regression

The Lasso objective L=12nyXw2+λw1\mathcal{L} = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1 splits as:

  • Smooth: f(w)=12nyXw2f(\mathbf{w}) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2
  • Non-smooth: h(w)=λw1h(\mathbf{w}) = \lambda\|\mathbf{w}\|_1

Proximal gradient descent:

wt+1=softαλ(wtαnXT(Xwty))\mathbf{w}_{t+1} = \text{soft}_{\alpha\lambda}\left(\mathbf{w}_t - \frac{\alpha}{n}\mathbf{X}^T(\mathbf{X}\mathbf{w}_t - \mathbf{y})\right)

where softτ(v)=sign(v)max(vτ,0)\text{soft}_\tau(v) = \text{sign}(v)\max(|v| - \tau, 0).

FISTA (Fast ISTA)

Adding Nesterov-style acceleration to proximal gradient descent gives FISTA, improving convergence from O(1/t)O(1/t) to O(1/t2)O(1/t^2):

zt+1=proxαh(ytαf(yt))tt+1=1+1+4tt22yt+1=zt+1+tt1tt+1(zt+1zt)\begin{aligned} \mathbf{z}_{t+1} &= \text{prox}_{\alpha h}\left(\mathbf{y}_t - \alpha \nabla f(\mathbf{y}_t)\right) \\[6pt] t_{t+1} &= \frac{1 + \sqrt{1 + 4t_t^2}}{2} \\[6pt] \mathbf{y}_{t+1} &= \mathbf{z}_{t+1} + \frac{t_t - 1}{t_{t+1}}(\mathbf{z}_{t+1} - \mathbf{z}_t) \end{aligned}

ADMM

The Alternating Direction Method of Multipliers (ADMM) solves problems of the form:

minx,zf(x)+g(z)subject toAx+Bz=c\min_{\mathbf{x}, \mathbf{z}} f(\mathbf{x}) + g(\mathbf{z}) \quad \text{subject to} \quad \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c}

ADMM iterates three steps:

xt+1=argminx{f(x)+ρ2Ax+Bztc+ut2}zt+1=argminz{g(z)+ρ2Axt+1+Bzc+ut2}ut+1=ut+Axt+1+Bzt+1c\begin{aligned} \mathbf{x}_{t+1} &= \arg\min_\mathbf{x} \left\{f(\mathbf{x}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z}_t - \mathbf{c} + \mathbf{u}_t\|^2\right\} \\[6pt] \mathbf{z}_{t+1} &= \arg\min_\mathbf{z} \left\{g(\mathbf{z}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x}_{t+1} + \mathbf{B}\mathbf{z} - \mathbf{c} + \mathbf{u}_t\|^2\right\} \\[6pt] \mathbf{u}_{t+1} &= \mathbf{u}_t + \mathbf{A}\mathbf{x}_{t+1} + \mathbf{B}\mathbf{z}_{t+1} - \mathbf{c} \end{aligned}

ADMM decomposes the problem into simpler subproblems that often have closed-form solutions. It is particularly useful for distributed optimization (splitting data across machines) and for problems with multiple non-smooth terms.

ADMM for L1 + L2 Regularization (Elastic Net)

For min12yXw2+λ1w1+λ22w2\min \frac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda_1\|\mathbf{w}\|_1 + \frac{\lambda_2}{2}\|\mathbf{w}\|^2, ADMM splits into:

  • x\mathbf{x}-step: Ridge regression (smooth, closed-form)
  • z\mathbf{z}-step: Soft thresholding (proximal of L1)
  • u\mathbf{u}-step: Simple addition

Non-Smooth Optimization in Deep Learning

ReLU and the “Gradient” at Zero

ReLU f(x)=max(0,x)f(x) = \max(0, x) is not differentiable at x=0x = 0. In practice, frameworks define:

ReLU(x)={0x<01x>00 or 1x=0\text{ReLU}'(x) = \begin{cases} 0 & x < 0 \\ 1 & x > 0 \\ 0 \text{ or } 1 & x = 0 \end{cases}

This works because P(x=0)=0P(x = 0) = 0 for continuous input distributions — the non-differentiable point has measure zero.

Straight-Through Estimator

For truly non-differentiable operations like quantization (rounding to integers), we cannot even define a useful subgradient. The Straight-Through Estimator (STE) simply passes the gradient through the non-differentiable operation unchanged:

round(x)x1\frac{\partial \text{round}(x)}{\partial x} \approx 1

STE is mathematically unjustified but empirically effective. It enables:

  • Binary neural networks (weights in {1,+1}\{-1, +1\})
  • Quantization-aware training (INT8 models)
  • Discrete latent variables (VQ-VAE)

Gumbel-Softmax

For differentiable sampling from categorical distributions, the Gumbel-Softmax relaxation replaces discrete samples with continuous approximations:

yi=exp((logπi+gi)/τ)jexp((logπj+gj)/τ)y_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_j \exp((\log \pi_j + g_j)/\tau)}

where gig_i are Gumbel noise samples and τ\tau is a temperature. As τ0\tau \to 0, this approaches a discrete sample. The gradient is well-defined for τ>0\tau > 0, enabling backpropagation through discrete choices.

Model Pruning and Sparsity

Non-smooth optimization directly enables model compression:

  • L1 regularization: Produces sparse weights (many exactly zero) via the proximal/soft-thresholding mechanism
  • Group Lasso: gwg2\sum_g \|\mathbf{w}_g\|_2 zeros out entire groups (neurons, filters)
  • Iterative magnitude pruning: Train, remove smallest weights, retrain — structured sparsity
  • L0L_0 regularization: Directly penalize the number of nonzero weights (NP-hard in general, relaxed via continuous approximations)

Key insight: Sparsity is fundamentally a non-smooth optimization problem. L1 and its variants produce exact zeros because the proximal operator (soft thresholding) has a “dead zone” that forces small values to zero. L2 regularization shrinks but never zeros — this is why L1 is preferred for feature selection and pruning.

Why This Matters for ML

Non-smooth optimization is pervasive in modern ML:

  • L1/Lasso produces sparse models and enables feature selection — proximal methods are the standard solver
  • ReLU is non-differentiable but works due to measure-zero arguments
  • Quantization for model compression relies on straight-through estimators
  • ADMM enables distributed optimization across multiple machines
  • Pruning uses L1 and group sparsity to compress neural networks by 10-100x
  • Understanding non-smooth optimization explains when gradient-based methods can and cannot be used

Summary

  • Subgradients generalize gradients to non-differentiable convex functions via supporting hyperplanes
  • Subgradient descent is simple but converges slowly at O(1/t)O(1/\sqrt{t})
  • Proximal operators handle non-smooth terms efficiently — the L1 proximal is soft thresholding
  • Proximal gradient descent (ISTA/FISTA) splits objectives into smooth + non-smooth parts, achieving O(1/t2)O(1/t^2) with acceleration
  • ADMM decomposes problems into simpler subproblems, useful for distributed optimization
  • ReLU’s non-differentiability at zero is harmless because it has measure zero
  • Straight-through estimators enable backprop through discrete operations (quantization, binary networks)
  • L1 sparsity comes from soft thresholding — the proximal operator zeros out small weights
  • Next: optimization landscape of neural networks examines the loss surfaces that these algorithms navigate

References

  • Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 6. stanford.edu
  • Parikh, N., & Boyd, S. (2014). Proximal Algorithms. Foundations and Trends in Optimization, 1(3), 127-239.
  • Beck, A., & Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1), 183-202.
  • Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed Optimization and Statistical Learning via ADMM. Foundations and Trends in Machine Learning, 3(1), 1-122.
  • Bengio, Y., Leonard, N., & Courville, A. (2013). Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv:1308.3432

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay