- 01 Limits and Continuity: The Foundation of Calculus 02 Derivatives and Differentiation: Measuring Rates of Change 03 Partial Derivatives and Gradients: Calculus in Multiple Dimensions 04 The Chain Rule and Computational Graphs: The Engine Behind Backpropagation 05 Taylor Series and Approximation: Local Models of Complex Functions 06 Gradient Descent: The Workhorse of Machine Learning Optimization 07 Stochastic Gradient Descent: Trading Precision for Speed 08 Adaptive Learning Rate Methods: From AdaGrad to Adam 09 Constrained Optimization: Lagrange Multipliers and KKT Conditions 10 Convexity and Convergence Theory: When Optimization Succeeds 11 Integration and Expectation: The Continuous Side of Probability 12 Calculus of Variations: Optimizing Over Functions 13 Second-Order and Natural Gradient Methods 14 Numerical Stability in Optimization: Making Training Work in Practice 15 Non-Smooth Optimization and Proximal Methods 16 Optimization Landscape of Neural Networks: Why Deep Learning Works 17 Implicit Differentiation and Differentiable Programming 18 Min-Max Optimization: Games, GANs, and Adversarial Training
When Gradients Do Not Exist
The optimization methods we have studied — gradient descent, Adam — assume the objective is differentiable. But many important ML objectives are not:
- L1 regularization has a kink at every
- ReLU is not differentiable at
- Hinge loss has a corner at
- Max-pooling selects the maximum element — non-smooth at ties
- Quantization (rounding to integers) is piecewise constant — zero derivative almost everywhere
These functions are non-smooth: they have points where the derivative does not exist. We need tools that generalize gradient descent to handle them.
Subgradients
A subgradient of a convex function at is any vector satisfying:
Geometrically, defines a supporting hyperplane that touches the graph of at and lies below everywhere.
At differentiable points, the only subgradient is the gradient: . At non-differentiable points, there are multiple subgradients — the set of all subgradients is called the subdifferential .
Example: Absolute Value
For :
At , any value in is a valid subgradient. At all other points, the subdifferential is a single point (the derivative).
Example: L1 Norm
For , the subdifferential is component-wise:
This is the sign function with a set-valued extension at zero: .
Subgradient Descent
Replace the gradient with a subgradient:
Unlike gradient descent, subgradient descent does not guarantee descent at every step. The subgradient direction may not be a descent direction. Convergence requires a diminishing step size:
A common choice is . The convergence rate is — slower than gradient descent’s for smooth functions.
Key insight: Subgradient methods are simple but slow. They work for any convex function but converge at the worst-case rate. For structured non-smooth problems (like L1 regularization), proximal methods are much faster.
Proximal Operators
The proximal operator of a function is:
It finds the point that balances minimizing with staying close to . The parameter controls the trade-off.
Proximal Operator of the L1 Norm
For , the proximal operator is soft thresholding:
This shrinks each component toward zero by . Components smaller than are set to exactly zero — this is the mechanism behind L1 sparsity.
Key insight: Soft thresholding is why Lasso regression produces sparse solutions. The proximal operator of the L1 norm literally zeros out small coefficients. This is fundamentally different from L2 regularization, which shrinks coefficients but never sets them to exactly zero.
Other Important Proximal Operators
| Function | Proximal operator |
|---|---|
| (L2) | (scaling) |
| (L1) | Soft thresholding |
| (indicator of set ) | Projection onto |
| (nuclear norm) | Singular value thresholding |
The indicator function if , otherwise. Its proximal operator is the Euclidean projection onto — connecting proximal methods to constrained optimization.
Proximal Gradient Descent
When the objective splits into a smooth part and a non-smooth part :
Proximal gradient descent alternates between a gradient step on and a proximal step on :
This is also called ISTA (Iterative Shrinkage-Thresholding Algorithm) when applied to L1-regularized problems.
Application: Lasso Regression
The Lasso objective splits as:
- Smooth:
- Non-smooth:
Proximal gradient descent:
where .
FISTA (Fast ISTA)
Adding Nesterov-style acceleration to proximal gradient descent gives FISTA, improving convergence from to :
ADMM
The Alternating Direction Method of Multipliers (ADMM) solves problems of the form:
ADMM iterates three steps:
ADMM decomposes the problem into simpler subproblems that often have closed-form solutions. It is particularly useful for distributed optimization (splitting data across machines) and for problems with multiple non-smooth terms.
ADMM for L1 + L2 Regularization (Elastic Net)
For , ADMM splits into:
- -step: Ridge regression (smooth, closed-form)
- -step: Soft thresholding (proximal of L1)
- -step: Simple addition
Non-Smooth Optimization in Deep Learning
ReLU and the “Gradient” at Zero
ReLU is not differentiable at . In practice, frameworks define:
This works because for continuous input distributions — the non-differentiable point has measure zero.
Straight-Through Estimator
For truly non-differentiable operations like quantization (rounding to integers), we cannot even define a useful subgradient. The Straight-Through Estimator (STE) simply passes the gradient through the non-differentiable operation unchanged:
STE is mathematically unjustified but empirically effective. It enables:
- Binary neural networks (weights in )
- Quantization-aware training (INT8 models)
- Discrete latent variables (VQ-VAE)
Gumbel-Softmax
For differentiable sampling from categorical distributions, the Gumbel-Softmax relaxation replaces discrete samples with continuous approximations:
where are Gumbel noise samples and is a temperature. As , this approaches a discrete sample. The gradient is well-defined for , enabling backpropagation through discrete choices.
Model Pruning and Sparsity
Non-smooth optimization directly enables model compression:
- L1 regularization: Produces sparse weights (many exactly zero) via the proximal/soft-thresholding mechanism
- Group Lasso: zeros out entire groups (neurons, filters)
- Iterative magnitude pruning: Train, remove smallest weights, retrain — structured sparsity
- regularization: Directly penalize the number of nonzero weights (NP-hard in general, relaxed via continuous approximations)
Key insight: Sparsity is fundamentally a non-smooth optimization problem. L1 and its variants produce exact zeros because the proximal operator (soft thresholding) has a “dead zone” that forces small values to zero. L2 regularization shrinks but never zeros — this is why L1 is preferred for feature selection and pruning.
Why This Matters for ML
Non-smooth optimization is pervasive in modern ML:
- L1/Lasso produces sparse models and enables feature selection — proximal methods are the standard solver
- ReLU is non-differentiable but works due to measure-zero arguments
- Quantization for model compression relies on straight-through estimators
- ADMM enables distributed optimization across multiple machines
- Pruning uses L1 and group sparsity to compress neural networks by 10-100x
- Understanding non-smooth optimization explains when gradient-based methods can and cannot be used
Summary
- Subgradients generalize gradients to non-differentiable convex functions via supporting hyperplanes
- Subgradient descent is simple but converges slowly at
- Proximal operators handle non-smooth terms efficiently — the L1 proximal is soft thresholding
- Proximal gradient descent (ISTA/FISTA) splits objectives into smooth + non-smooth parts, achieving with acceleration
- ADMM decomposes problems into simpler subproblems, useful for distributed optimization
- ReLU’s non-differentiability at zero is harmless because it has measure zero
- Straight-through estimators enable backprop through discrete operations (quantization, binary networks)
- L1 sparsity comes from soft thresholding — the proximal operator zeros out small weights
- Next: optimization landscape of neural networks examines the loss surfaces that these algorithms navigate
References
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 6. stanford.edu
- Parikh, N., & Boyd, S. (2014). Proximal Algorithms. Foundations and Trends in Optimization, 1(3), 127-239.
- Beck, A., & Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1), 183-202.
- Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed Optimization and Statistical Learning via ADMM. Foundations and Trends in Machine Learning, 3(1), 1-122.
- Bengio, Y., Leonard, N., & Courville, A. (2013). Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv:1308.3432