- 01 Limits and Continuity: The Foundation of Calculus 02 Derivatives and Differentiation: Measuring Rates of Change 03 Partial Derivatives and Gradients: Calculus in Multiple Dimensions 04 The Chain Rule and Computational Graphs: The Engine Behind Backpropagation 05 Taylor Series and Approximation: Local Models of Complex Functions 06 Gradient Descent: The Workhorse of Machine Learning Optimization 07 Stochastic Gradient Descent: Trading Precision for Speed 08 Adaptive Learning Rate Methods: From AdaGrad to Adam 09 Constrained Optimization: Lagrange Multipliers and KKT Conditions 10 Convexity and Convergence Theory: When Optimization Succeeds 11 Integration and Expectation: The Continuous Side of Probability 12 Calculus of Variations: Optimizing Over Functions 13 Second-Order and Natural Gradient Methods 14 Numerical Stability in Optimization: Making Training Work in Practice 15 Non-Smooth Optimization and Proximal Methods 16 Optimization Landscape of Neural Networks: Why Deep Learning Works 17 Implicit Differentiation and Differentiable Programming 18 Min-Max Optimization: Games, GANs, and Adversarial Training
Why Go Beyond First-Order Methods?
Gradient descent and Adam use only first-order information — the gradient. They ignore curvature, which leads to problems on ill-conditioned loss surfaces: zigzagging in steep directions and crawling in flat ones.
Second-order methods use curvature (the Hessian) to take smarter steps. Natural gradient methods go further — they account for the geometry of the parameter space itself, not just the loss surface.
These methods are computationally expensive for large models, but understanding them explains why Adam works, motivates better optimizers, and is essential for advanced topics like meta-learning and Bayesian deep learning.
Recap: Newton’s Method
From our Taylor series article, Newton’s method minimizes the second-order Taylor approximation at each step:
where is the Hessian.
Newton’s method has quadratic convergence — the number of correct digits doubles each step. But it requires:
- Computing : entries for parameters
- Inverting : computation
- Storing : memory
For a model with parameters, the Hessian has entries — completely infeasible. The rest of this article explores practical approximations.
The Gauss-Newton Method
For least-squares problems where is the residual vector, the Hessian is:
where is the Jacobian of the residuals. The Gauss-Newton approximation drops the second-order term (which is small near the solution):
This approximation is always positive semidefinite (unlike the full Hessian, which can be indefinite), making the method more stable. The Gauss-Newton update:
Key insight: Gauss-Newton avoids computing second derivatives entirely — it only needs the Jacobian of the residuals. This makes it much cheaper than full Newton while retaining superlinear convergence near the solution. The Levenberg-Marquardt algorithm adds a damping term for stability far from the solution.
The Fisher Information Matrix
The Fisher Information Matrix (FIM) measures how much information the data carries about the model parameters:
This is the expected outer product of the score function .
Key Properties
- is always positive semidefinite
- equals the negative expected Hessian of the log-likelihood:
- For exponential family models, has a clean closed-form expression
- defines a Riemannian metric on parameter space — it measures “distances” between distributions
The Empirical Fisher
In practice, we approximate using the training data:
Warning: The empirical Fisher (using data labels) is not the same as the true Fisher (using model samples). The distinction matters theoretically, though both are used in practice.
Natural Gradient Descent
Standard gradient descent moves in the direction of steepest descent in Euclidean parameter space — it treats all parameter directions equally. But parameters are not all equal. A small change in one weight might dramatically change the output distribution, while a large change in another might barely affect it.
Natural gradient descent accounts for this by measuring steepest descent in distribution space, using the Fisher information as a metric:
The Fisher inverse rescales the gradient so that equal steps correspond to equal changes in the output distribution, not equal changes in parameter values.
Why Natural Gradient is Better
Consider reparameterizing a model: . Standard gradient descent gives different trajectories for vs — the optimization depends on the arbitrary choice of parameterization.
Natural gradient is parameterization-invariant: it gives the same trajectory regardless of how parameters are represented. This is because the Fisher information transforms like a metric tensor, automatically accounting for the parameterization.
Key insight: Natural gradient descent is the “correct” way to do gradient descent on probability distributions. It avoids the pathologies of standard gradient descent when the parameter space has a non-Euclidean geometry — which is the case for virtually all probabilistic models. Adam can be seen as a diagonal approximation to the natural gradient.
Connection to Adam
Adam’s update approximately rescales each parameter by the inverse square root of its typical gradient magnitude. This is a diagonal approximation to the natural gradient:
This connection explains why Adam works so well with minimal tuning — it implicitly accounts for the geometry of parameter space, albeit only along the coordinate axes.
K-FAC: Kronecker-Factored Approximate Curvature
K-FAC makes natural gradient practical for deep networks by exploiting the structure of neural network layers.
The Key Approximation
For a fully connected layer , the Fisher information block for is:
where is the input activation, is the output gradient, and is the Kronecker product.
K-FAC approximates this as a Kronecker product of two smaller matrices:
where and .
Why Kronecker Structure Helps
If is , the full Fisher block is — storing and inverting it is . The Kronecker factorization stores two matrices of sizes and , and inversion uses:
This reduces cost from to — a massive reduction. For a layer with 1000 inputs and 1000 outputs, the speedup is .
K-FAC in Practice
K-FAC maintains running averages of and for each layer, periodically inverts them, and uses the Kronecker-factored inverse Fisher for parameter updates. It typically converges in fewer iterations than Adam, though each iteration is more expensive.
Hessian-Free Optimization
Hessian-free (truncated Newton) methods avoid storing the full Hessian by computing Hessian-vector products efficiently.
The key identity (Pearlmutter, 1994):
This computes using one forward and one backward pass — the same cost as computing the gradient itself. No need to store .
With Hessian-vector products, we can solve using the conjugate gradient (CG) algorithm, which only requires matrix-vector products. A few CG iterations give an approximate Newton direction without ever forming the Hessian.
Comparison of Methods
| Method | Uses | Per-step cost | Memory | Convergence |
|---|---|---|---|---|
| GD | Linear | |||
| Adam | + moments | Adaptive linear | ||
| L-BFGS | history | Superlinear | ||
| K-FAC | Kronecker Fisher | Near-quadratic | ||
| Hessian-free | products | per CG step | Near-quadratic | |
| Newton | Full | Quadratic |
where is the total parameter count, is the dimension of layer , is the L-BFGS memory size, and is the number of CG iterations.
Practical Considerations
When to Use Second-Order Methods
- Small to medium models (): L-BFGS or K-FAC can significantly speed up training
- Supervised learning with well-defined loss: K-FAC works well for classification and regression
- Fine-tuning: When starting near a good solution, second-order methods converge faster
- Research and understanding: Hessian analysis reveals loss landscape structure
When First-Order Methods Win
- Very large models (LLMs, large vision models): Adam/AdamW with careful scheduling is hard to beat due to simplicity and parallelism
- GANs and adversarial training: The loss landscape is not well-approximated by a quadratic
- Reinforcement learning: Non-stationary objectives make curvature estimates unreliable
Key insight: Second-order methods converge in fewer iterations but each iteration is more expensive. The total wall-clock time depends on the trade-off. For large-scale deep learning, first-order methods usually win because they parallelize better on GPUs. But understanding second-order methods illuminates why optimizers like Adam work and how to design better ones.
Why This Matters for ML
Second-order and natural gradient methods provide the theoretical backbone for practical optimization:
- Adam is a diagonal natural gradient — understanding the Fisher explains why Adam adapts so well
- K-FAC achieves near-second-order convergence at tractable cost for medium-scale networks
- The Fisher information defines the geometry of distribution space — essential for understanding why some parameterizations train better
- Hessian-vector products enable loss landscape analysis (eigenvalue spectra, saddle point detection)
- Gauss-Newton connects to generalized linear models and is the theoretical basis for many classical ML optimizers
Summary
- Newton’s method uses the full Hessian for quadratic convergence but costs — infeasible for large models
- Gauss-Newton approximates the Hessian using only the Jacobian — always positive semidefinite, cheaper, and stable
- The Fisher Information Matrix measures sensitivity of the model output to parameter changes and defines a Riemannian metric on parameter space
- Natural gradient descent uses — the correct descent direction in distribution space, invariant to reparameterization
- Adam diagonal natural gradient — this connection explains its robustness
- K-FAC makes natural gradient practical by factoring the Fisher as Kronecker products per layer
- Hessian-free methods compute products cheaply, enabling conjugate gradient solvers
- Next: numerical stability addresses the practical challenges of implementing these methods
References
- Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251-276.
- Martens, J. (2010). Deep Learning via Hessian-Free Optimization. ICML.
- Martens, J., & Grosse, R. (2015). Optimizing Neural Networks with Kronecker-Factored Approximate Curvature. ICML. arXiv:1503.05671
- Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 6-7.
- Pascanu, R., & Bengio, Y. (2013). Revisiting Natural Gradient for Deep Networks. ICLR.
- Kunstner, F., Balles, L., & Hennig, P. (2019). Limitations of the Empirical Fisher Approximation. NeurIPS.