- 01 Limits and Continuity: The Foundation of Calculus 02 Derivatives and Differentiation: Measuring Rates of Change 03 Partial Derivatives and Gradients: Calculus in Multiple Dimensions 04 The Chain Rule and Computational Graphs: The Engine Behind Backpropagation 05 Taylor Series and Approximation: Local Models of Complex Functions 06 Gradient Descent: The Workhorse of Machine Learning Optimization 07 Stochastic Gradient Descent: Trading Precision for Speed 08 Adaptive Learning Rate Methods: From AdaGrad to Adam 09 Constrained Optimization: Lagrange Multipliers and KKT Conditions 10 Convexity and Convergence Theory: When Optimization Succeeds 11 Integration and Expectation: The Continuous Side of Probability 12 Calculus of Variations: Optimizing Over Functions 13 Second-Order and Natural Gradient Methods 14 Numerical Stability in Optimization: Making Training Work in Practice 15 Non-Smooth Optimization and Proximal Methods 16 Optimization Landscape of Neural Networks: Why Deep Learning Works 17 Implicit Differentiation and Differentiable Programming 18 Min-Max Optimization: Games, GANs, and Adversarial Training
The Optimization Problem
Training a machine learning model means finding parameters that minimize a loss function:
For linear regression, we can solve this analytically (normal equations). For neural networks, the loss is a complicated nonlinear function of millions of parameters — no closed-form solution exists. We need an iterative algorithm.
Gradient descent is that algorithm. It uses the first-order Taylor approximation to make a small improvement at each step, and repeats until convergence.
The Gradient Descent Algorithm
The update rule is remarkably simple:
where is the learning rate (step size) and is the gradient of the loss.
Intuition: Think of the loss surface as a hilly landscape and the parameters as your position. The gradient points uphill. Taking a step in the direction is like walking downhill along the steepest slope. The learning rate controls how big a step you take.
The full algorithm:
import numpy as np
def gradient_descent(grad_fn, theta_init, lr=0.01, n_steps=1000):
theta = theta_init.copy()
for t in range(n_steps):
grad = grad_fn(theta)
theta = theta - lr * grad
return theta
At each iteration:
- Compute the gradient
- Take a step in the negative gradient direction
- Repeat until the gradient is approximately zero
The Learning Rate
The learning rate is the most important hyperparameter in gradient descent. It controls the trade-off between speed and stability.
Too Large: Divergence
If is too large, the algorithm overshoots the minimum and can diverge:
The values explode because the step overshoots further each time.
Too Small: Slow Convergence
If is very small, the algorithm converges but painfully slowly. For with :
After steps, . Reaching requires steps.
Just Right
For , the optimal learning rate is :
One step to the exact minimum. In general, the optimal rate depends on the curvature (second derivative), which motivates second-order methods and adaptive optimizers.
Key insight: The maximum stable learning rate is , where is the Lipschitz constant of the gradient (the maximum curvature). Larger curvature demands smaller steps. This is why loss functions with high curvature are harder to optimize.
Batch Gradient Descent
In ML, the loss is typically an average over the training set:
Batch gradient descent computes the full gradient using all samples:
This gives the exact gradient direction at each step.
| Pros | Cons |
|---|---|
| Deterministic — same path every run | Expensive: per update |
| Smooth convergence | Memory-intensive for large |
| Guaranteed descent per step | Cannot exploit data redundancy |
For datasets with millions of samples, computing the full gradient at every step is wasteful — many samples provide redundant information. This motivates stochastic gradient descent.
Convergence Analysis
Convex Functions
For convex functions with -Lipschitz gradients () and learning rate :
This means gradient descent converges at rate — to halve the error, you need twice as many steps.
Strongly Convex Functions
If is also strongly convex with parameter (minimum curvature), convergence is exponentially faster:
The ratio is the condition number. A large means the loss surface is elongated (some directions have high curvature, others low), and gradient descent zigzags slowly. A well-conditioned problem ( close to 1) converges quickly.
Key insight: The condition number determines how many iterations gradient descent needs. Ill-conditioned problems (large ) cause zigzagging. This is the fundamental motivation for preconditioning and adaptive learning rates.
Sufficient Decrease (Armijo Condition)
In practice, we often want a formal guarantee that each step reduces the loss. The Armijo condition requires:
where is a small constant (typically ). Line search methods find the largest satisfying this condition.
Local Minima and Saddle Points
For non-convex functions (like neural network losses), the landscape is complex.
Critical Points
A critical point is where . Critical points can be:
- Local minimum: loss increases in all directions (Hessian positive definite)
- Local maximum: loss decreases in all directions (Hessian negative definite)
- Saddle point: loss increases in some directions and decreases in others (Hessian indefinite)
Saddle Points Dominate in High Dimensions
For a random function in dimensions, a critical point is a local minimum only if all Hessian eigenvalues are positive. If each eigenvalue is equally likely to be positive or negative, the probability is — vanishingly small for large .
Key insight: In the high-dimensional parameter spaces of neural networks, most critical points are saddle points, not local minima. Gradient descent naturally escapes saddle points because the gradient is nonzero in the directions of negative curvature. The real challenge is navigating flat regions where the gradient is small but nonzero.
Local Minima are Often Good Enough
Empirical evidence suggests that local minima in deep networks tend to have similar loss values to the global minimum. The concern is not getting stuck in bad local minima but rather the speed of convergence through flat regions and narrow valleys.
Worked Example: Linear Regression
Consider fitting with MSE loss over 4 data points: .
The gradients:
import numpy as np
X = np.array([1, 2, 3, 4])
y = np.array([2, 4, 5, 4])
w1, w2 = 0.0, 0.0 # initial parameters
lr = 0.01
for step in range(500):
pred = w1 * X + w2
residual = y - pred
grad_w1 = -2 * np.mean(X * residual)
grad_w2 = -2 * np.mean(residual)
w1 -= lr * grad_w1
w2 -= lr * grad_w2
print(f"w1 = {w1:.4f}, w2 = {w2:.4f}")
# w1 ≈ 0.8, w2 ≈ 1.5
The parameters converge to the least-squares solution. With a larger learning rate, convergence is faster but risks instability. With a smaller rate, it is slower but more stable.
Gradient Descent Variants Preview
Batch gradient descent is the foundation, but practical training uses several improvements:
- Stochastic Gradient Descent (SGD): Uses a random subset of data per step — much faster for large datasets
- Momentum: Accumulates velocity to smooth out oscillations and accelerate through flat regions
- Adaptive methods: Adam, RMSProp, AdaGrad — automatically adjust learning rates per parameter
These build on the foundation established here, addressing the practical limitations of vanilla gradient descent.
Why This Matters for ML
Gradient descent is the algorithm that trains virtually every neural network and many classical ML models:
- It is the training loop: forward pass compute loss backward pass update parameters
- Learning rate is the single most impactful hyperparameter — it determines whether training converges, how fast, and to what quality solution
- Convergence rate depends on the condition number of the loss surface — ill-conditioned problems need more iterations or smarter optimizers
- Non-convexity in deep learning means multiple local minima and saddle points, but empirically gradient descent finds good solutions
- Every variant (SGD, Adam, etc.) is a modification of this core algorithm
Summary
- Gradient descent iterates to minimize the loss
- The learning rate must satisfy where is the gradient Lipschitz constant
- Batch gradient descent uses the full dataset — exact but expensive for large
- Convergence is for convex functions and for strongly convex
- The condition number determines convergence speed — ill-conditioned problems zigzag
- In high dimensions, saddle points vastly outnumber local minima
- Next: stochastic gradient descent trades exact gradients for speed
References
- Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 2-3.
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Chapter 9. stanford.edu
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8. deeplearningbook.org
- Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2), 223-311.
- Ruder, S. (2016). An Overview of Gradient Descent Optimization Algorithms. arXiv:1609.04747