Second-Order and Natural Gradient Methods

Calculus & Optimization Series 13 / 18

Why Go Beyond First-Order Methods?

Gradient descent and Adam use only first-order information — the gradient. They ignore curvature, which leads to problems on ill-conditioned loss surfaces: zigzagging in steep directions and crawling in flat ones.

Second-order methods use curvature (the Hessian) to take smarter steps. Natural gradient methods go further — they account for the geometry of the parameter space itself, not just the loss surface.

These methods are computationally expensive for large models, but understanding them explains why Adam works, motivates better optimizers, and is essential for advanced topics like meta-learning and Bayesian deep learning.

Recap: Newton’s Method

From our Taylor series article, Newton’s method minimizes the second-order Taylor approximation at each step:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \mathbf{H}_t^{-1} \nabla \mathcal{L}(\boldsymbol{\theta}_t)

where $\mathbf{H}_t = \nabla^2 \mathcal{L}(\boldsymbol{\theta}_t)$ is the Hessian.

Near a minimum where the Hessian is positive definite, Newton’s method converges quadratically: the number of correct digits roughly doubles each step. But it requires:

Computing $\mathbf{H}$: $O(n^2)$ entries for $n$ parameters
Inverting $\mathbf{H}$: $O(n^3)$ computation
Storing $\mathbf{H}$: $O(n^2)$ memory

For a model with $n = 10^8$ parameters, the Hessian has $10^{16}$ entries — completely infeasible. The rest of this article explores practical approximations.
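To make the contrast with gradient descent concrete, here is a minimal sketch on a two-dimensional quadratic. The matrix `Q`, the vector `b`, and the step counts are illustrative assumptions, not taken from any particular model. On a quadratic the second-order Taylor model is exact, so a single Newton step lands on the minimizer, while gradient descent is still far away after 100 steps.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: L(theta) = 0.5 * theta^T Q theta - b^T theta
Q = np.array([[100.0, 0.0],
              [0.0, 1.0]])          # condition number 100
b = np.array([1.0, 1.0])

def grad(theta):
    return Q @ theta - b

def hessian(theta):
    return Q                        # constant for a quadratic

theta_star = np.linalg.solve(Q, b)  # exact minimizer, used only for comparison

# One Newton step: solve H d = -grad instead of forming H^{-1} explicitly
theta_newton = np.zeros(2)
theta_newton = theta_newton + np.linalg.solve(hessian(theta_newton), -grad(theta_newton))

# Gradient descent with step size 1 / (largest eigenvalue of Q)
theta_gd = np.zeros(2)
lr = 1.0 / 100.0
for _ in range(100):
    theta_gd = theta_gd - lr * grad(theta_gd)

newton_error = np.linalg.norm(theta_newton - theta_star)
gd_error = np.linalg.norm(theta_gd - theta_star)
print(f"Newton error after 1 step:  {newton_error:.2e}")
print(f"GD error after 100 steps:   {gd_error:.2e}")
```

Gradient descent is held back by the flat direction (eigenvalue 1), where a step size chosen for the steep direction makes very little progress.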

The Gauss-Newton Method

For least-squares problems $\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{r}(\boldsymbol{\theta})\|^2$ where $\mathbf{r}$ is the residual vector, the Hessian is:

\mathbf{H} = \mathbf{J}_r^T \mathbf{J}_r + \sum_i r_i \nabla^2 r_i

where $\mathbf{J}_r$ is the Jacobian of the residuals. The Gauss-Newton approximation drops the second-order term, which is small when the residuals are small or nearly linear in $\boldsymbol{\theta}$:

\mathbf{H} \approx \mathbf{J}_r^T \mathbf{J}_r

This approximation is always positive semidefinite (unlike the full Hessian, which can be indefinite), making the method more stable. The Gauss-Newton update:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - (\mathbf{J}_r^T \mathbf{J}_r)^{-1} \mathbf{J}_r^T \mathbf{r}

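Key insight: Gauss-Newton avoids computing second derivatives entirely; it only needs the Jacobian of the residuals. This makes it much cheaper than full Newton. Near a solution with small residuals it converges almost as fast as Newton (quadratically when the residuals vanish at the solution), but the rate degrades to linear when the residuals stay large. The Levenberg-Marquardt algorithm replaces $(\mathbf{J}_r^T\mathbf{J}_r)^{-1}$ with the damped inverse $(\mathbf{J}_r^T\mathbf{J}_r + \lambda\mathbf{I})^{-1}$ for stability far from the solution.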
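As a rough illustration of the update above, the following sketch fits a two-parameter exponential curve with Gauss-Newton steps and a fixed Levenberg-Marquardt damping term. The model $y = a e^{bx}$, the data, the starting point, and the function names are all assumptions made for this example; practical Levenberg-Marquardt implementations also adapt $\lambda$ between steps, which is omitted here.

```python
import numpy as np

def residuals(theta, x, y):
    a, b = theta
    return a * np.exp(b * x) - y

def jacobian(theta, x):
    # Jacobian of the residuals: columns are d r / d a and d r / d b
    a, b = theta
    e = np.exp(b * x)
    return np.stack([e, a * x * e], axis=1)   # shape (num_points, 2)

def gauss_newton(theta0, x, y, damping=0.0, steps=20):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        r = residuals(theta, x, y)
        J = jacobian(theta, x)
        # J^T J + lambda I; damping = 0 gives plain Gauss-Newton
        JTJ = J.T @ J + damping * np.eye(theta.size)
        theta = theta - np.linalg.solve(JTJ, J.T @ r)
    return theta

# Hypothetical noise-free data generated from a = 2.0, b = -1.5
x = np.linspace(0.0, 1.0, 20)
true_theta = np.array([2.0, -1.5])
y = true_theta[0] * np.exp(true_theta[1] * x)

theta_hat = gauss_newton([1.0, 0.0], x, y, damping=1e-3)
print(theta_hat)
```

No second derivatives of the model appear anywhere: only the residual vector and its Jacobian are evaluated.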

The Fisher Information Matrix

The Fisher Information Matrix (FIM) measures how much information the data carries about the model parameters:

\mathbf{F} = \mathbb{E}_{p(\mathbf{x} \mid \boldsymbol{\theta})}\left[\nabla \log p(\mathbf{x} \mid \boldsymbol{\theta}) \, \nabla \log p(\mathbf{x} \mid \boldsymbol{\theta})^T\right]

This is the expected outer product of the score function $\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta})$.

Key Properties

$\mathbf{F}$ is always positive semidefinite
Under standard regularity conditions, $\mathbf{F}$ equals the negative expected Hessian of the log-likelihood:

\mathbf{F} = -\mathbb{E}\left[\nabla^2 \log p(\mathbf{x} \mid \boldsymbol{\theta})\right]

For exponential family models, $\mathbf{F}$ has a clean closed-form expression
$\mathbf{F}$ defines a Riemannian metric on parameter space — it measures “distances” between distributions

The Empirical Fisher

In practice, we approximate $\mathbf{F}$ using the training data:

\hat{\mathbf{F}} = \frac{1}{n}\sum_{i=1}^{n} \nabla \log p(\mathbf{x}_i \mid \boldsymbol{\theta}) \, \nabla \log p(\mathbf{x}_i \mid \boldsymbol{\theta})^T

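Warning: The empirical Fisher (gradients evaluated at the observed data or labels) is not the same as the true Fisher (expectation taken over samples from the model itself). The two coincide only when the model distribution matches the data distribution, and the empirical version can be a poor curvature estimate away from that regime (Kunstner et al., 2019). Both are nevertheless used in practice.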
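The difference between the two estimators is easy to see numerically. The sketch below uses logistic regression as an assumed example model, with the Fisher taken conditionally on the inputs and averaged over them. The weights, the random inputs, and the deliberately mismatched labels are all hypothetical. It checks that the true Fisher matches the negative expected Hessian, and that the empirical Fisher does not match the true Fisher when the labels are not drawn from the model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(w, x, y):
    # Gradient of log p(y | x, w) for logistic regression
    return (y - sigmoid(x @ w)) * x

def true_fisher(w, X):
    # Expectation over labels drawn from the MODEL, averaged over inputs
    F = np.zeros((w.size, w.size))
    for x in X:
        p = sigmoid(x @ w)
        for y, prob in ((1.0, p), (0.0, 1.0 - p)):
            s = score(w, x, y)
            F += prob * np.outer(s, s)
    return F / len(X)

def empirical_fisher(w, X, Y):
    # Outer products of scores evaluated at the OBSERVED labels
    F = np.zeros((w.size, w.size))
    for x, y in zip(X, Y):
        s = score(w, x, y)
        F += np.outer(s, s)
    return F / len(X)

def neg_expected_hessian(w, X):
    # For logistic regression the Hessian of log p does not depend on y
    p = sigmoid(X @ w)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w = np.array([0.5, -1.0, 2.0])
Y = (rng.random(200) < 0.5).astype(float)   # labels NOT drawn from the model

F_true = true_fisher(w, X)
F_emp = empirical_fisher(w, X, Y)
H_neg = neg_expected_hessian(w, X)
print(np.abs(F_true - H_neg).max(), np.abs(F_true - F_emp).max())
```

Both matrices are positive semidefinite by construction, since each is an average of outer products.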

Natural Gradient Descent

Standard gradient descent moves in the direction of steepest descent in Euclidean parameter space — it treats all parameter directions equally. But parameters are not all equal. A small change in one weight might dramatically change the output distribution, while a large change in another might barely affect it.

Natural gradient descent accounts for this by measuring steepest descent in distribution space, using the Fisher information as a metric:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \mathbf{F}^{-1} \nabla \mathcal{L}(\boldsymbol{\theta}_t)

The Fisher inverse $\mathbf{F}^{-1}$ rescales the gradient so that equal steps correspond to equal changes in the output distribution, not equal changes in parameter values.

Why Natural Gradient is Better

Consider reparameterizing a model: $\boldsymbol{\phi} = g(\boldsymbol{\theta})$. Standard gradient descent gives different trajectories for $\boldsymbol{\theta}$ vs $\boldsymbol{\phi}$: the optimization depends on the arbitrary choice of parameterization.

Natural gradient is parameterization-invariant: in the limit of small step sizes, it traces the same path through distribution space regardless of how the parameters are represented. This is because the Fisher information transforms like a metric tensor, automatically accounting for the parameterization.

Key insight: Natural gradient descent is the “correct” way to do gradient descent on probability distributions. It avoids the pathologies of standard gradient descent when the parameter space has a non-Euclidean geometry, which is the case for virtually all probabilistic models. Adam can be loosely viewed as a diagonal approximation to the natural gradient.
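The invariance claim can be checked on the smallest possible example. The sketch below takes a Bernoulli model written two ways, by its mean $p$ and by its logit $\eta = \log\frac{p}{1-p}$, and applies one small step of each method in each parameterization. The data fraction `y_bar`, the starting point, and the step size are assumptions for illustration. The Fisher information is $1/(p(1-p))$ in the mean parameterization and $p(1-p)$ in the logit parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_bar = 0.9                      # hypothetical fraction of ones in the data
p0 = 0.2                         # starting distribution
eta0 = np.log(p0 / (1 - p0))     # same distribution in logit form
alpha = 1e-3                     # small step size

# Gradients of the negative log-likelihood in each parameterization
grad_p = (p0 - y_bar) / (p0 * (1 - p0))
grad_eta = p0 - y_bar

# Fisher information in each parameterization
fisher_p = 1.0 / (p0 * (1 - p0))
fisher_eta = p0 * (1 - p0)

# Plain gradient descent: the resulting p depends on the parameterization
p_gd_from_p = p0 - alpha * grad_p
p_gd_from_eta = sigmoid(eta0 - alpha * grad_eta)

# Natural gradient: both parameterizations give (almost) the same new p
p_ng_from_p = p0 - alpha * grad_p / fisher_p
p_ng_from_eta = sigmoid(eta0 - alpha * grad_eta / fisher_eta)

gd_gap = abs(p_gd_from_p - p_gd_from_eta)
ng_gap = abs(p_ng_from_p - p_ng_from_eta)
print(f"plain GD gap:         {gd_gap:.2e}")
print(f"natural gradient gap: {ng_gap:.2e}")
```

The remaining natural gradient gap is second order in the step size, so it shrinks much faster than the step itself as `alpha` decreases.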

Connection to Adam

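Adam’s update $\frac{\hat{m}_t}{\sqrt{\hat{v}_t}}$ rescales each parameter by the inverse square root of its running average squared gradient $\hat{v}_t$. Since $\hat{v}_t$ estimates the diagonal of the empirical Fisher, this resembles a diagonal natural gradient, except that Adam uses the power $-1/2$ where the natural gradient would use $-1$: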

\text{Adam} \approx \text{diag}(\mathbf{F})^{-1/2} \nabla \mathcal{L}

This connection offers one explanation for why Adam works well with minimal tuning: it implicitly accounts for the geometry of parameter space, albeit only along the coordinate axes and only approximately.
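A small simulation shows what Adam's second-moment estimate is tracking. The sketch below feeds synthetic gradients with very different per-coordinate scales into the $\hat{v}_t$ recursion. The scales, the Gaussian noise model, and the step count are assumptions chosen only for illustration. The bias-corrected $\hat{v}_t$ approaches $\mathbb{E}[g^2]$, the diagonal of the gradient second-moment matrix, and dividing by its square root equalizes the coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([10.0, 1.0, 0.1])   # hypothetical per-coordinate gradient scales
beta2 = 0.999                         # Adam's usual second-moment decay
steps = 5000

v = np.zeros(3)
for t in range(1, steps + 1):
    g = scales * rng.normal(size=3)   # synthetic stand-in for a minibatch gradient
    v = beta2 * v + (1 - beta2) * g ** 2

v_hat = v / (1 - beta2 ** steps)      # Adam's bias correction
second_moment = scales ** 2           # exact E[g^2] for this synthetic model

# Adam-style diagonal preconditioner: 1 / sqrt(v_hat)
adam_scale = 1.0 / np.sqrt(v_hat)
rescaled = adam_scale * scales        # typical gradient size after rescaling

print(v_hat, second_moment)
print(rescaled)
```

After rescaling, all three coordinates have typical magnitude close to 1, even though the raw gradient scales span two orders of magnitude.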

K-FAC: Kronecker-Factored Approximate Curvature

K-FAC makes natural gradient practical for deep networks by exploiting the structure of neural network layers.

The Key Approximation

For a fully connected layer $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$, the Fisher information block for $\mathbf{W}$ is:

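\mathbf{F}_\mathbf{W} = \mathbb{E}\left[(\mathbf{a}\mathbf{a}^T) \otimes (\mathbf{g}\mathbf{g}^T)\right]

where $\mathbf{a}$ is the input activation (the layer input $\mathbf{x}$ above), $\mathbf{g}$ is the gradient of the loss with respect to the layer output $\mathbf{y}$, and $\otimes$ is the Kronecker product.

K-FAC replaces this expectation of a Kronecker product with a Kronecker product of expectations, which would be exact if $\mathbf{a}$ and $\mathbf{g}$ were statistically independent. The result is a product of two much smaller matrices: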

\mathbf{F}_\mathbf{W} \approx \mathbf{A} \otimes \mathbf{G}

where $\mathbf{A} = \mathbb{E}[\mathbf{a}\mathbf{a}^T]$ and $\mathbf{G} = \mathbb{E}[\mathbf{g}\mathbf{g}^T]$.

Why Kronecker Structure Helps

If $\mathbf{W}$ is $m \times n$, the full Fisher block is $mn \times mn$: storing it takes $O(m^2n^2)$ memory and inverting it takes $O(m^3n^3)$ time. The Kronecker factorization stores two matrices of sizes $n \times n$ and $m \times m$, and inversion uses:

(\mathbf{A} \otimes \mathbf{G})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{G}^{-1}

This reduces the inversion cost from $O(m^3 n^3)$ to $O(m^3 + n^3)$, a massive reduction. For a layer with 1000 inputs and 1000 outputs, that is roughly $10^{18}$ versus $2 \times 10^9$ operations, a speedup of about $5 \times 10^8$.
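The inverse identity can be verified numerically on a small case. The sketch below builds random symmetric positive definite factors (the sizes and the helper `random_spd` are assumptions for the example) and compares two ways of applying the inverse Fisher block to a gradient: forming the full $mn \times mn$ matrix, and working only with the small factors. It relies on the standard identity $(\mathbf{A} \otimes \mathbf{G})\,\mathrm{vec}(\mathbf{V}) = \mathrm{vec}(\mathbf{G}\mathbf{V}\mathbf{A}^T)$ for column-stacking $\mathrm{vec}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                      # W is m x n: n inputs, m outputs

def random_spd(d):
    # Hypothetical helper: random symmetric positive definite matrix
    M = rng.normal(size=(d, d))
    return M @ M.T + d * np.eye(d)

A = random_spd(n)                # stands in for E[a a^T], n x n
G = random_spd(m)                # stands in for E[g g^T], m x m
grad_W = rng.normal(size=(m, n))

# Direct route: build the mn x mn block and solve against vec(grad_W)
F = np.kron(A, G)
vec_grad = grad_W.reshape(-1, order="F")          # column-stacking vec
direct = np.linalg.solve(F, vec_grad).reshape((m, n), order="F")

# Factored route: G^{-1} grad_W A^{-1}, never forming F
factored = np.linalg.solve(G, grad_W)             # G^{-1} grad_W
factored = np.linalg.solve(A, factored.T).T       # ... times A^{-1} (A is symmetric)

inverse_identity_holds = np.allclose(
    np.linalg.inv(F), np.kron(np.linalg.inv(A), np.linalg.inv(G))
)
print(np.abs(direct - factored).max(), inverse_identity_holds)
```

The factored route only ever solves $m \times m$ and $n \times n$ systems, which is where the cost saving comes from.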

K-FAC in Practice

K-FAC maintains running averages of $\mathbf{A}$ and $\mathbf{G}$ for each layer, periodically inverts them, and uses the Kronecker-factored inverse Fisher for parameter updates. It typically converges in fewer iterations than Adam, though each iteration is more expensive.
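The loop described above might look roughly like the following for a single linear layer. This is a simplified sketch, not the algorithm from the K-FAC paper: the function name `kfac_layer_step`, the additive damping, the hyperparameter values, and the toy regression problem are all assumptions. It also re-solves against the factors at every step rather than inverting periodically, and it uses gradients at the observed targets, which corresponds to an empirical-Fisher variant.

```python
import numpy as np

def kfac_layer_step(W, acts, out_grads, A_avg, G_avg,
                    lr=0.1, decay=0.95, damping=1e-2):
    """One simplified K-FAC-style update for a linear layer y = W a.

    acts:      (batch, n) layer inputs a
    out_grads: (batch, m) per-example gradients g of the loss w.r.t. y
    """
    batch = acts.shape[0]
    # Running averages of the two Kronecker factors
    A_avg = decay * A_avg + (1 - decay) * (acts.T @ acts) / batch
    G_avg = decay * G_avg + (1 - decay) * (out_grads.T @ out_grads) / batch
    # Ordinary gradient for W: batch average of g a^T, shape (m, n)
    grad_W = out_grads.T @ acts / batch
    # Damped factors (simple additive damping is an assumption of this sketch)
    A_d = A_avg + damping * np.eye(A_avg.shape[0])
    G_d = G_avg + damping * np.eye(G_avg.shape[0])
    # Apply the Kronecker-factored inverse without forming it: G^{-1} grad_W A^{-1}
    step = np.linalg.solve(G_d, grad_W)
    step = np.linalg.solve(A_d, step.T).T
    return W - lr * step, A_avg, G_avg

# Hypothetical toy problem: fit a 2 x 3 linear map with squared error
rng = np.random.default_rng(0)
W_true = rng.normal(size=(2, 3))
acts = rng.normal(size=(64, 3))
targets = acts @ W_true.T

def loss(W):
    return 0.5 * np.mean(np.sum((acts @ W.T - targets) ** 2, axis=1))

W = np.zeros((2, 3))
A_avg, G_avg = np.eye(3), np.eye(2)
initial_loss = loss(W)
for _ in range(10):
    out_grads = acts @ W.T - targets      # dL/dy for each example
    W, A_avg, G_avg = kfac_layer_step(W, acts, out_grads, A_avg, G_avg)
final_loss = loss(W)
print(initial_loss, final_loss)
```

Real implementations differ in how they damp, how often they refresh the inverses, and how they sample gradients for the Fisher, so this should be read as an outline of the data flow only.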

Hessian-Free Optimization

Hessian-free (truncated Newton) methods avoid storing the full Hessian by computing Hessian-vector products $\mathbf{H}\mathbf{v}$ efficiently.

The key identity (Pearlmutter, 1994):

\mathbf{H}\mathbf{v} = \nabla_{\boldsymbol{\theta}}\left[(\nabla_{\boldsymbol{\theta}} \mathcal{L})^T \mathbf{v}\right]

This computes $\mathbf{H}\mathbf{v}$ by differentiating the scalar $(\nabla_{\boldsymbol{\theta}} \mathcal{L})^T \mathbf{v}$, at a cost within a small constant factor of computing the gradient itself. There is no need to form or store $\mathbf{H}$.

With Hessian-vector products, we can solve $\mathbf{H}\mathbf{d} = -\nabla \mathcal{L}$ using the conjugate gradient (CG) algorithm, which only requires matrix-vector products. A few CG iterations give an approximate Newton direction without ever forming the Hessian.
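The combination of Hessian-vector products and CG can be sketched without an autodiff library. In the example below, a central finite difference of the gradient stands in for the exact Hessian-vector product; this is a swapped-in approximation, not Pearlmutter's method, and costs two gradient evaluations per product. The regularized logistic regression loss, the data, and the function names are assumptions for illustration. The explicit Hessian is built only to check the answer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_grad(X, y, l2):
    # Gradient of the mean logistic loss plus (l2 / 2) * ||theta||^2
    def grad(theta):
        return X.T @ (sigmoid(X @ theta) - y) / len(y) + l2 * theta
    return grad

def hvp(grad, theta, v, eps=1e-5):
    # Finite-difference approximation to H v along the unit direction v / ||v||
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:
        return np.zeros_like(v)
    u = v / norm_v
    return norm_v * (grad(theta + eps * u) - grad(theta - eps * u)) / (2 * eps)

def conjugate_gradient(matvec, b, iters=10, tol=1e-8):
    # Solves matvec(x) = b using only matrix-vector products
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        step = rs / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (rng.random(100) < 0.5).astype(float)
grad = make_grad(X, y, l2=0.1)

theta = np.zeros(5)
g = grad(theta)
d_cg = conjugate_gradient(lambda v: hvp(grad, theta, v), -g)

# Explicit Hessian, formed here only to verify the CG result
p = sigmoid(X @ theta)
H = (X * (p * (1 - p))[:, None]).T @ X / len(y) + 0.1 * np.eye(5)
d_exact = np.linalg.solve(H, -g)
print(np.abs(d_cg - d_exact).max())
```

With an autodiff framework, `hvp` would be replaced by an exact product, but the CG loop would stay the same since it only ever calls `matvec`.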

Comparison of Methods

Method	Uses	Per-step cost	Memory	Convergence
GD	$\nabla \mathcal{L}$	$O(n)$	$O(n)$	Linear
Adam	$\nabla \mathcal{L}$ + moments	$O(n)$	$O(n)$	Adaptive linear
L-BFGS	$\nabla \mathcal{L}$ history	$O(mn)$	$O(mn)$	Superlinear
K-FAC	Kronecker Fisher	$O(n + \sum d_i^3)$	$O(\sum d_i^2)$	Near-quadratic
Hessian-free	$\mathbf{H}\mathbf{v}$ products	$O(kn)$ per update	$O(n)$	Near-quadratic
Newton	Full $\mathbf{H}$	$O(n^3)$	$O(n^2)$	Quadratic

where $n$ is the total parameter count, $d_i$ is the dimension of layer $i$, $m$ is the L-BFGS memory size, and $k$ is the number of CG iterations.
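To put numbers on the memory column, the following sketch counts stored entries for a small fully connected network. The layer shapes and the L-BFGS memory size are hypothetical, biases are ignored, and K-FAC is counted as one $d_{\text{in}} \times d_{\text{in}}$ factor plus one $d_{\text{out}} \times d_{\text{out}}$ factor per layer.

```python
# Hypothetical (inputs, outputs) per layer; biases ignored for simplicity
layer_shapes = [(784, 512), (512, 512), (512, 10)]
lbfgs_memory = 10   # assumed number of stored (s, y) vector pairs

n_params = sum(d_in * d_out for d_in, d_out in layer_shapes)

full_hessian_entries = n_params ** 2
kfac_entries = sum(d_in ** 2 + d_out ** 2 for d_in, d_out in layer_shapes)
lbfgs_entries = 2 * lbfgs_memory * n_params

print(f"parameters:        {n_params:,}")
print(f"full Hessian:      {full_hessian_entries:,}")
print(f"K-FAC factors:     {kfac_entries:,}")
print(f"L-BFGS history:    {lbfgs_entries:,}")
```

Even for this small network the full Hessian needs several hundred billion entries, while the K-FAC factors take only a few times the parameter count.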

Practical Considerations

When to Use Second-Order Methods

Small to medium models ($n < 10^6$): L-BFGS or K-FAC can significantly speed up training
Supervised learning with well-defined loss: K-FAC works well for classification and regression
Fine-tuning: When starting near a good solution, second-order methods converge faster
Research and understanding: Hessian analysis reveals loss landscape structure

When First-Order Methods Win

Very large models (LLMs, large vision models): Adam/AdamW with careful scheduling is hard to beat due to simplicity and parallelism
GANs and adversarial training: The loss landscape is not well-approximated by a quadratic
Reinforcement learning: Non-stationary objectives make curvature estimates unreliable

Key insight: Second-order methods converge in fewer iterations but each iteration is more expensive. The total wall-clock time depends on the trade-off. For large-scale deep learning, first-order methods usually win because they parallelize better on GPUs. But understanding second-order methods illuminates why optimizers like Adam work and how to design better ones.

Why This Matters for ML

Second-order and natural gradient methods provide the theoretical backbone for practical optimization:

Adam resembles a diagonal natural gradient: understanding the Fisher helps explain why Adam adapts so well
K-FAC achieves near-second-order convergence at tractable cost for medium-scale networks
The Fisher information defines the geometry of distribution space — essential for understanding why some parameterizations train better
Hessian-vector products enable loss landscape analysis (eigenvalue spectra, saddle point detection)
Gauss-Newton connects to generalized linear models (where it is closely related to Fisher scoring) and underlies classical nonlinear least-squares solvers such as Levenberg-Marquardt

Summary

Newton’s method uses the full Hessian for quadratic convergence but costs $O(n^3)$ — infeasible for large models
Gauss-Newton approximates the Hessian using only the Jacobian — always positive semidefinite, cheaper, and stable
The Fisher Information Matrix measures sensitivity of the model output to parameter changes and defines a Riemannian metric on parameter space
Natural gradient descent uses $\mathbf{F}^{-1}\nabla\mathcal{L}$ — the correct descent direction in distribution space, invariant to reparameterization
Adam $\approx$ diagonal natural gradient (with a square-root scaling); this connection helps explain its robustness
K-FAC makes natural gradient practical by factoring the Fisher as Kronecker products per layer
Hessian-free methods compute $\mathbf{Hv}$ products cheaply, enabling conjugate gradient solvers
Next: numerical stability addresses the practical challenges of implementing these methods

References

Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251-276.
Martens, J. (2010). Deep Learning via Hessian-Free Optimization. ICML.
Martens, J., & Grosse, R. (2015). Optimizing Neural Networks with Kronecker-Factored Approximate Curvature. ICML. arXiv:1503.05671
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Chapters 6-7.
Pascanu, R., & Bengio, Y. (2013). Revisiting Natural Gradient for Deep Networks. ICLR.
Kunstner, F., Balles, L., & Hennig, P. (2019). Limitations of the Empirical Fisher Approximation. NeurIPS.