Calculus of Variations: Optimizing Over Functions

Calculus & Optimization Series 12 / 18

Beyond Parameter Optimization

Standard optimization finds the best parameters — a vector $\boldsymbol{\theta}^*$ that minimizes a loss. The calculus of variations goes further: it finds the best function $f^*$ that minimizes a functional (a function of functions).

This may sound abstract, but it directly underpins:

Variational inference — approximate intractable posteriors in Bayesian models
VAEs — the ELBO objective is derived from variational principles
Maximum entropy — finding the least-biased distribution subject to constraints
Optimal control — finding the best policy in reinforcement learning

If integration is the engine of probabilistic reasoning, the calculus of variations is its optimization counterpart.

Functionals

A functional maps a function to a scalar. We write $J[f]$ to distinguish it from ordinary functions $f(x)$.

Example: The arc length of a curve $y = f(x)$ from $x = a$ to $x = b$ is a functional:

$J[f] = \int_a^b \sqrt{1 + [f'(x)]^2} \, dx$

Different functions $f$ produce different arc lengths. The calculus of variations asks: which $f$ minimizes $J$?
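As a rough numerical illustration, the arc-length functional can be approximated by summing the lengths of short straight segments along the curve. The helper name `arc_length`, the grid size, and the two candidate curves are illustrative choices rather than anything prescribed by the text.

```python
import math

def arc_length(f, a, b, n=10_000):
    """Approximate J[f] = integral of sqrt(1 + f'(x)^2) dx by summing
    the lengths of n straight segments along the curve y = f(x)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x0, x1 = a + i * h, a + (i + 1) * h
        total += math.hypot(x1 - x0, f(x1) - f(x0))
    return total

# Two candidate curves joining (0, 0) and (1, 1)
line_length = arc_length(lambda x: x, 0.0, 1.0)
parabola_length = arc_length(lambda x: x * x, 0.0, 1.0)

print(f"J[x]   = {line_length:.4f}")
print(f"J[x^2] = {parabola_length:.4f}")
```

Each curve is assigned a single number, which is exactly what makes $J$ a functional. Of these two candidates, the straight line gives the smaller value.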

General Form

Many functionals have the form:

$J[f] = \int_a^b L(x, f(x), f'(x)) \, dx$

where $L$ is called the Lagrangian (not to be confused with the Lagrangian in constrained optimization, though the ideas are related). The Lagrangian depends on the independent variable $x$, the function value $f(x)$, and its derivative $f'(x)$.

The Euler-Lagrange Equation

The function $f^*$ that makes $J[f]$ stationary (minimizes, maximizes, or is a saddle point) satisfies the Euler-Lagrange equation:

$\frac{\partial L}{\partial f} - \frac{d}{dx}\frac{\partial L}{\partial f'} = 0$

This is the analog of setting the gradient to zero, $\nabla_{\boldsymbol{\theta}} \mathcal{L} = 0$, in finite-dimensional optimization, but for functions. Instead of a gradient, the quantity set to zero is a functional derivative.

Derivation Sketch

Consider a small perturbation $f(x) + \epsilon \eta(x)$ where $\eta$ vanishes at the endpoints. The functional becomes $J[f + \epsilon\eta]$, and stationarity requires:

$\frac{d}{d\epsilon} J[f + \epsilon\eta] \bigg|_{\epsilon=0} = 0 \quad \forall \, \eta$

Expanding and integrating by parts yields the Euler-Lagrange equation. The “for all $\eta$” condition is what produces a differential equation rather than a single constraint.
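For readers who want the intermediate steps, the expansion can be written out as follows. This sketch assumes $L$ is smooth enough to differentiate under the integral sign and that $\eta(a) = \eta(b) = 0$.

```latex
\begin{aligned}
\frac{d}{d\epsilon} J[f + \epsilon\eta] \bigg|_{\epsilon=0}
  &= \int_a^b \left( \frac{\partial L}{\partial f}\,\eta(x)
     + \frac{\partial L}{\partial f'}\,\eta'(x) \right) dx \\
  &= \int_a^b \frac{\partial L}{\partial f}\,\eta \, dx
     + \left[ \frac{\partial L}{\partial f'}\,\eta \right]_a^b
     - \int_a^b \left( \frac{d}{dx}\frac{\partial L}{\partial f'} \right) \eta \, dx \\
  &= \int_a^b \left( \frac{\partial L}{\partial f}
     - \frac{d}{dx}\frac{\partial L}{\partial f'} \right) \eta(x) \, dx
\end{aligned}
```

The boundary term vanishes because $\eta$ is zero at both endpoints. Since the final integral must be zero for every admissible $\eta$, the factor in parentheses must itself be zero everywhere (the fundamental lemma of the calculus of variations), which is the Euler-Lagrange equation.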

Worked Example: Shortest Path

Find the curve $y = f(x)$ of shortest length connecting $(0, 0)$ to $(1, 1)$.

Functional: $J[f] = \int_0^1 \sqrt{1 + [f'(x)]^2} \, dx$

Here $L(x, f, f') = \sqrt{1 + (f')^2}$. Since $L$ does not depend on $f$, we have $\frac{\partial L}{\partial f} = 0$, so the Euler-Lagrange equation becomes:

$\frac{d}{dx}\frac{\partial L}{\partial f'} = 0 \implies \frac{\partial L}{\partial f'} = \frac{f'}{\sqrt{1 + (f')^2}} = C$

Solving for $f'$ gives $f' = C / \sqrt{1 - C^2}$, which is a constant, so $f$ is a straight line. With boundary conditions $f(0) = 0$ and $f(1) = 1$, the solution is $f(x) = x$.

The shortest path between two points is a straight line — the calculus of variations confirms what geometry tells us.
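The same answer can be checked numerically by discretizing the curve and minimizing the total length directly. The sketch below uses plain gradient descent on the heights of a polyline with fixed endpoints. The grid size, step size, iteration count, and parabolic starting guess are all hand-picked for illustration.

```python
import math

def polyline_length(ys, h):
    """Length of the polyline through equally spaced points (i*h, ys[i])."""
    return sum(math.hypot(h, ys[i + 1] - ys[i]) for i in range(len(ys) - 1))

def length_gradient(ys, h):
    """Gradient of polyline_length with respect to each height.
    Endpoint entries are left at zero so the boundary conditions stay fixed."""
    grad = [0.0] * len(ys)
    for i in range(1, len(ys) - 1):
        left = (ys[i] - ys[i - 1]) / math.hypot(h, ys[i] - ys[i - 1])
        right = (ys[i + 1] - ys[i]) / math.hypot(h, ys[i + 1] - ys[i])
        grad[i] = left - right
    return grad

n = 10                       # number of segments (illustrative)
h = 1.0 / n
xs = [i * h for i in range(n + 1)]
ys = [x * x for x in xs]     # start from a parabola; f(0) = 0 and f(1) = 1 already hold

step = 0.02                  # hand-picked step size, small enough to be stable here
for _ in range(5000):
    grad = length_gradient(ys, h)
    ys = [y - step * g for y, g in zip(ys, grad)]

max_deviation = max(abs(y - x) for x, y in zip(xs, ys))
final_length = polyline_length(ys, h)
print(f"max |f(x) - x| = {max_deviation:.2e}, length = {final_length:.6f}")
```

The interior gradient entry `left - right` is a discrete counterpart of $-\frac{d}{dx}\frac{\partial L}{\partial f'}$, so driving it to zero is the discrete version of solving the Euler-Lagrange equation.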

The Functional Derivative

The functional derivative $\frac{\delta J}{\delta f(x)}$ generalizes the gradient to function spaces:

$J[f + \epsilon\eta] \approx J[f] + \epsilon \int \frac{\delta J}{\delta f(x)} \eta(x) \, dx$

For the standard integral functional $J[f] = \int L(x, f, f') \, dx$:

$\frac{\delta J}{\delta f(x)} = \frac{\partial L}{\partial f} - \frac{d}{dx}\frac{\partial L}{\partial f'}$

Setting the functional derivative to zero gives the Euler-Lagrange equation. This is directly analogous to setting $\nabla_{\boldsymbol{\theta}} \mathcal{L} = 0$ in gradient-based optimization.
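The first-order expansion can be tested numerically on a simple case. The sketch below assumes the Lagrangian $L = \frac{1}{2}(f')^2$ on $[0, 1]$, for which the formula gives $\frac{\delta J}{\delta f(x)} = -f''(x)$. It compares a finite-difference estimate of $\frac{d}{d\epsilon} J[f + \epsilon\eta]$ with the integral $\int \frac{\delta J}{\delta f(x)} \eta(x) \, dx$. The choices of $f$, $\eta$, grid size, and $\epsilon$ are illustrative.

```python
import math

N = 2000          # number of grid cells on [0, 1] (illustrative)
H = 1.0 / N

def J(f):
    """Approximate J[f] = integral over [0, 1] of 0.5 * f'(x)^2 dx,
    using a finite-difference slope on each grid cell."""
    total = 0.0
    for i in range(N):
        slope = (f((i + 1) * H) - f(i * H)) / H
        total += 0.5 * slope * slope * H
    return total

f = lambda x: math.sin(math.pi * x)
eta = lambda x: x * (1.0 - x)          # perturbation vanishing at both endpoints

# Left side: directional derivative d/d(eps) J[f + eps*eta] at eps = 0
eps = 1e-4
directional = (J(lambda x: f(x) + eps * eta(x))
               - J(lambda x: f(x) - eps * eta(x))) / (2 * eps)

# Right side: integral of (delta J / delta f)(x) * eta(x) by the midpoint rule,
# where for this L the functional derivative is -f''(x) = pi^2 * sin(pi * x)
predicted = 0.0
for i in range(N):
    x_mid = (i + 0.5) * H
    predicted += (math.pi ** 2) * math.sin(math.pi * x_mid) * eta(x_mid) * H

print(f"finite difference: {directional:.6f}")
print(f"integral formula:  {predicted:.6f}")
```

Both numbers approximate $4/\pi$, the exact value of the integral for this choice of $f$ and $\eta$.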

Maximum Entropy Principle

The maximum entropy distribution subject to constraints is found using variational methods. Given constraints on moments $\mathbb{E}[g_k(x)] = c_k$, we maximize:

$H[p] = -\int p(x) \ln p(x) \, dx$

subject to $\int p(x) \, dx = 1$ and $\int g_k(x) p(x) \, dx = c_k$.

Using Lagrange multipliers on the functional:

$\mathcal{L}[p] = -\int p \ln p \, dx + \lambda_0\left(\int p \, dx - 1\right) + \sum_k \lambda_k\left(\int g_k p \, dx - c_k\right)$

Setting the functional derivative to zero:

$\frac{\delta \mathcal{L}}{\delta p(x)} = -\ln p(x) - 1 + \lambda_0 + \sum_k \lambda_k g_k(x) = 0$

Solving: $p^*(x) = \exp\left(\lambda_0 - 1 + \sum_k \lambda_k g_k(x)\right)$

This is the exponential family form. Different constraints yield different distributions:

Constraints	Max-entropy distribution
Normalization only (bounded support $[a, b]$)	Uniform
Fixed mean and variance ($x \in \mathbb{R}$)	Gaussian
Fixed mean ($x \geq 0$)	Exponential
Fixed $\mathbb{E}[\ln x]$ and $\mathbb{E}[x]$ ($x > 0$)	Gamma

Key insight: The maximum entropy principle provides a principled way to choose a probability distribution when you have limited information. It selects the distribution that makes the fewest assumptions beyond the known constraints. The exponential family — the workhorse of statistical ML — arises naturally from this principle.
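A discrete analogue makes the exponential form concrete. The sketch below assumes a six-sided die whose average roll is constrained to 4.5 (an illustrative choice). With a single mean constraint, the solution has the form $p_i \propto \exp(\lambda x_i)$, and the multiplier $\lambda$ is found by bisection. The function name `maxent_distribution` and the search bracket are hypothetical.

```python
import math

def maxent_distribution(values, target_mean, lo=-20.0, hi=20.0, iters=200):
    """Maximum-entropy distribution on a finite set of values with a fixed mean.

    The solution has the form p_i proportional to exp(lam * x_i); the
    multiplier lam is found by bisection, since the mean increases with lam.
    """
    def distribution(lam):
        weights = [math.exp(lam * x) for x in values]
        z = sum(weights)                      # normalization constant
        return [w / z for w in weights]

    def mean(lam):
        return sum(p * x for p, x in zip(distribution(lam), values))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, distribution(lam)

faces = [1, 2, 3, 4, 5, 6]
lam, probs = maxent_distribution(faces, target_mean=4.5)
entropy = -sum(p * math.log(p) for p in probs)

print("lambda =", round(lam, 4))
print("p      =", [round(p, 4) for p in probs])
print("H[p]   =", round(entropy, 4))
```

Because the target mean exceeds the fair-die mean of 3.5, the multiplier comes out positive and the probabilities increase geometrically with the face value. With no mean constraint, $\lambda = 0$ and the uniform distribution is recovered.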

Variational Inference

Variational inference (VI) is one of the most widely used applications of variational methods in modern ML. It turns an intractable integration problem into an optimization problem.

The Problem

In Bayesian inference, we want the posterior:

$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\mathcal{D})}$

The denominator $p(\mathcal{D}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}$ integrates over every parameter setting and is typically intractable for complex models.

The Variational Approach

Instead of computing $p(\boldsymbol{\theta} \mid \mathcal{D})$ exactly, we find the closest approximation $q(\boldsymbol{\theta})$ from a tractable family (e.g., Gaussians). “Closest” means minimizing the KL divergence:

$q^* = \arg\min_{q \in \mathcal{Q}} D_{KL}(q(\boldsymbol{\theta}) \| p(\boldsymbol{\theta} \mid \mathcal{D}))$

This is a variational problem — we are optimizing over the function $q$ .

The Evidence Lower Bound (ELBO)

The KL divergence above involves the unknown posterior, so it cannot be minimized directly. However, the log-evidence decomposes as:

$\ln p(\mathcal{D}) = \underbrace{\mathbb{E}_{q}\left[\ln \frac{p(\mathcal{D}, \boldsymbol{\theta})}{q(\boldsymbol{\theta})}\right]}_{\text{ELBO}(q)} + D_{KL}(q \| p(\boldsymbol{\theta} \mid \mathcal{D}))$

Since $D_{KL} \geq 0$, the ELBO is a lower bound on the log-evidence. Because $\ln p(\mathcal{D})$ does not depend on $q$, maximizing the ELBO is equivalent to minimizing the KL divergence. Unlike the KL divergence, the ELBO involves only the joint $p(\mathcal{D}, \boldsymbol{\theta})$ and $q$, so it can be evaluated or estimated without knowing the posterior:

$\text{ELBO}(q) = \mathbb{E}_{q}[\ln p(\mathcal{D} \mid \boldsymbol{\theta})] - D_{KL}(q(\boldsymbol{\theta}) \| p(\boldsymbol{\theta}))$

The first term is the expected log-likelihood (fit the data). The second term is the KL divergence between the approximate posterior and the prior (stay close to prior beliefs). Together they express a trade-off between data fit and regularization toward the prior.

Key insight: The ELBO converts an intractable integration problem (computing the evidence) into a tractable optimization problem (maximizing a lower bound). This is the core idea of variational inference: replace integration with optimization.
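The decomposition of the log-evidence can be verified on a toy model where everything is computable exactly. The sketch below assumes a parameter $\boldsymbol{\theta}$ that takes only three values, so the evidence integral becomes a sum. All probabilities are made-up numbers, and `q` is an arbitrary approximating distribution rather than an optimized one.

```python
import math

# Toy model (all numbers are made up for illustration):
# theta takes one of three values, and D is a single fixed observation.
prior      = [0.5, 0.3, 0.2]        # p(theta)
likelihood = [0.10, 0.40, 0.70]     # p(D | theta) for the observed D
q          = [0.2, 0.3, 0.5]        # an arbitrary approximating distribution

joint     = [l * p for l, p in zip(likelihood, prior)]   # p(D, theta)
evidence  = sum(joint)                                    # p(D): a sum here, not an integral
posterior = [j / evidence for j in joint]                 # p(theta | D)

def kl(a, b):
    """KL divergence D_KL(a || b) between two discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# ELBO in its two equivalent forms
elbo_joint = sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint))
elbo_split = sum(qi * math.log(li) for qi, li in zip(q, likelihood)) - kl(q, prior)

gap = kl(q, posterior)
print(f"ln p(D)       = {math.log(evidence):.6f}")
print(f"ELBO + KL gap = {elbo_joint + gap:.6f}")
print(f"ELBO (split)  = {elbo_split:.6f}")
```

The first two printed values agree, and the ELBO alone is smaller than $\ln p(\mathcal{D})$ by exactly the KL gap. Setting `q` equal to `posterior` closes the gap, which is why maximizing the ELBO over $q$ pushes it toward the true posterior.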

Variational Autoencoders (VAEs)

The VAE applies variational inference to generative modeling. The model has:

Decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$: generates data $\mathbf{x}$ from latent code $\mathbf{z}$
Prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$: latent space distribution
Encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$: approximate posterior (recognition model)

The VAE loss (negative ELBO) for a single data point:

$\mathcal{L}(\theta, \phi; \mathbf{x}) = -\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\ln p_\theta(\mathbf{x} \mid \mathbf{z})] + D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$

The first term is the reconstruction loss (how well the decoder reconstructs $\mathbf{x}$). The second term is the regularization loss (how close the encoder’s posterior is to the prior).

Training uses the reparameterization trick: $\mathbf{z}$ is written as a deterministic function of the encoder outputs and independent noise, so that gradients can be backpropagated through the sampling step.
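To make the KL term and the sampling step concrete, the sketch below assumes a one-dimensional Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2)$, the common choice from Kingma and Welling's paper, for which the KL term against the standard normal prior has a closed form. The values of `mu` and `sigma` stand in for encoder outputs and are hypothetical. No neural network or automatic differentiation is involved here.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension."""
    return 0.5 * (mu * mu + sigma * sigma - 1.0 - math.log(sigma * sigma))

def reparameterized_sample(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma)
    and parameter-free noise eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

def log_normal_pdf(z, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at z."""
    return -0.5 * math.log(2 * math.pi * sigma * sigma) - (z - mu) ** 2 / (2 * sigma * sigma)

mu, sigma = 0.5, 0.8            # hypothetical encoder outputs for one input x
rng = random.Random(0)          # fixed seed so the run is repeatable

# Monte Carlo estimate of the same KL term using reparameterized samples
n_samples = 100_000
kl_mc = 0.0
for _ in range(n_samples):
    z = reparameterized_sample(mu, sigma, rng)
    kl_mc += log_normal_pdf(z, mu, sigma) - log_normal_pdf(z, 0.0, 1.0)
kl_mc /= n_samples

kl_exact = kl_to_standard_normal(mu, sigma)
print(f"closed form: {kl_exact:.4f}   Monte Carlo: {kl_mc:.4f}")
```

In an actual VAE, `mu` and `sigma` would be produced by the encoder network, and because `z = mu + sigma * eps` is differentiable in both, an autodiff framework can pass gradients of the reconstruction term back to the encoder parameters.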

Optimal Control (Brief)

In reinforcement learning, the agent seeks a policy (a function) that maximizes cumulative reward. The Hamilton-Jacobi-Bellman (HJB) equation is the continuous-time analog of the Bellman equation. It follows from the dynamic programming principle and is closely tied to the classical calculus of variations through Hamilton-Jacobi theory:

$-\frac{\partial V}{\partial t} = \max_{\mathbf{u}} \left[r(\mathbf{x}, \mathbf{u}) + \nabla_\mathbf{x} V \cdot f(\mathbf{x}, \mathbf{u})\right]$

where $V$ is the value function, $\mathbf{u}$ is the control (action), $r$ is the reward, and $f$ is the dynamics. This connects the calculus of variations to RL’s core optimization problem.

Why This Matters for ML

The calculus of variations provides the theoretical foundation for some of the most important techniques in modern ML:

Variational inference approximates intractable posteriors by optimizing over distribution families
The ELBO is the objective function for VAEs and many Bayesian deep learning methods
Maximum entropy derives the exponential family — the foundation of generalized linear models
Optimal control theory connects to policy optimization in reinforcement learning
The functional derivative generalizes gradients to infinite-dimensional spaces

Summary

Functionals map functions to scalars; the calculus of variations optimizes over functions
The Euler-Lagrange equation is a necessary condition for a function to make a functional stationary, the analog of setting the gradient to zero in finite-dimensional optimization
Maximum entropy uses variational methods to derive the exponential family from moment constraints
Variational inference replaces intractable integration with optimization by maximizing the ELBO
The ELBO = expected log-likelihood minus KL regularization — the training objective for VAEs
VAEs combine variational inference with deep learning using the reparameterization trick
Next: second-order and natural gradient methods refine optimization using curvature information

References

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 10.
Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. probml.github.io
Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR. arXiv:1312.6114
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. JASA, 112(518), 859-877. arXiv:1601.00670
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.