Calculus of Variations: Optimizing Over Functions

Learn the Euler-Lagrange equation, variational inference, and the ELBO — how optimizing over functions powers VAEs and Bayesian deep learning.

Calculus & Optimization March 7, 2026 8 min read

Beyond Parameter Optimization

Standard optimization finds the best parameters, a vector $\boldsymbol{\theta}^*$ that minimizes a loss. The calculus of variations goes further: it finds the best function $f^*$ that minimizes a functional (a function of functions).

This may sound abstract, but it directly underpins:

  • Variational inference — approximate intractable posteriors in Bayesian models
  • VAEs — the ELBO objective is derived from variational principles
  • Maximum entropy — finding the least-biased distribution subject to constraints
  • Optimal control — finding the best policy in reinforcement learning

If integration is the engine of probabilistic reasoning, the calculus of variations is its optimization counterpart.

Functionals

A functional maps a function to a scalar. We write $J[f]$ to distinguish it from ordinary functions $f(x)$.

Example: The arc length of a curve $y = f(x)$ from $x = a$ to $x = b$ is a functional:

$$J[f] = \int_a^b \sqrt{1 + [f'(x)]^2} \, dx$$

Different functions $f$ produce different arc lengths. The calculus of variations asks: which $f$ minimizes $J$?
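To make the idea of a functional concrete, here is a minimal numerical sketch that evaluates the arc-length functional for two candidate curves. The helper name `arc_length`, the grid size, and the two test curves are illustrative assumptions, not part of the original text.

```python
import numpy as np

def arc_length(f, a=0.0, b=1.0, n=10_001):
    """Approximate J[f] = integral of sqrt(1 + f'(x)^2) over [a, b]."""
    x = np.linspace(a, b, n)
    y = f(x)
    # Sum the lengths of the straight segments between neighbouring samples.
    return float(np.sum(np.hypot(np.diff(x), np.diff(y))))

# Two curves through (0, 0) and (1, 1): same endpoints, different J[f].
line_length = arc_length(lambda x: x)
parabola_length = arc_length(lambda x: x**2)

print(f"straight line: {line_length:.6f}")
print(f"parabola:      {parabola_length:.6f}")
```

Each function passed in yields a single number, which is exactly what "maps a function to a scalar" means.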

General Form

Many functionals have the form:

$$J[f] = \int_a^b L(x, f(x), f'(x)) \, dx$$

where $L$ is called the Lagrangian (not to be confused with the Lagrangian in constrained optimization, though the ideas are related). The Lagrangian depends on the independent variable $x$, the function value $f(x)$, and its derivative $f'(x)$.

The Euler-Lagrange Equation

The function $f^*$ that makes $J[f]$ stationary (minimizes, maximizes, or is a saddle point) satisfies the Euler-Lagrange equation:

$$\frac{\partial L}{\partial f} - \frac{d}{dx}\frac{\partial L}{\partial f'} = 0$$

This is the analog of $\nabla f = 0$ from finite-dimensional optimization, but for functions. Instead of setting a gradient to zero, we set a functional derivative to zero.

Derivation Sketch

Consider a small perturbation $f(x) + \epsilon \eta(x)$ where $\eta$ vanishes at the endpoints. The functional becomes $J[f + \epsilon\eta]$, and stationarity requires:

$$\frac{d}{d\epsilon} J[f + \epsilon\eta] \bigg|_{\epsilon=0} = 0 \quad \forall \, \eta$$

Expanding and integrating by parts yields the Euler-Lagrange equation. The “for all $\eta$” condition is what produces a differential equation rather than a single constraint.
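The expansion and integration by parts can be written out as follows. This is a sketch that assumes $L$ is smooth enough to differentiate under the integral sign, an assumption the text does not state explicitly.

```latex
\begin{aligned}
\frac{d}{d\epsilon} J[f + \epsilon\eta] \bigg|_{\epsilon=0}
  &= \int_a^b \left( \frac{\partial L}{\partial f}\,\eta(x)
     + \frac{\partial L}{\partial f'}\,\eta'(x) \right) dx \\
  &= \int_a^b \left( \frac{\partial L}{\partial f}
     - \frac{d}{dx}\frac{\partial L}{\partial f'} \right) \eta(x) \, dx
     + \left[ \frac{\partial L}{\partial f'}\,\eta(x) \right]_a^b
\end{aligned}
```

The boundary term vanishes because $\eta(a) = \eta(b) = 0$. Since the remaining integral must be zero for every admissible $\eta$, the factor in parentheses must vanish at every $x$ (the fundamental lemma of the calculus of variations), which gives the Euler-Lagrange equation.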

Worked Example: Shortest Path

Find the curve $y = f(x)$ of shortest length connecting $(0, 0)$ to $(1, 1)$.

Functional: $J[f] = \int_0^1 \sqrt{1 + [f'(x)]^2} \, dx$

Here $L(x, f, f') = \sqrt{1 + (f')^2}$. Since $L$ does not depend on $f$, we have $\frac{\partial L}{\partial f} = 0$, so the Euler-Lagrange equation becomes:

$$\frac{d}{dx}\frac{\partial L}{\partial f'} = 0 \implies \frac{\partial L}{\partial f'} = \frac{f'}{\sqrt{1 + (f')^2}} = C$$

Solving for $f'$ gives $f' = C / \sqrt{1 - C^2}$, which is a constant, so $f$ is a straight line. With boundary conditions $f(0) = 0$ and $f(1) = 1$: $f(x) = x$.

The shortest path between two points is a straight line — the calculus of variations confirms what geometry tells us.
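The same result can be checked numerically. The sketch below discretizes the curve on a grid, holds the endpoints fixed, and runs plain gradient descent on the polyline length. The grid size, step size, iteration count, and the wavy starting curve are all illustrative assumptions.

```python
import numpy as np

n = 21
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]

# Start from a curve with f(0) = 0 and f(1) = 1 that is clearly not straight.
f = x + 0.3 * np.sin(np.pi * x)

def length(f):
    """Length of the polyline through the points (x_i, f_i)."""
    return float(np.sum(np.sqrt(h**2 + np.diff(f) ** 2)))

initial_length = length(f)
for _ in range(5000):
    df = np.diff(f)
    slope_term = df / np.sqrt(h**2 + df**2)   # discrete version of dL/df' per segment
    grad = np.zeros_like(f)
    # Derivative of the total length with respect to each interior value f_i.
    grad[1:-1] = slope_term[:-1] - slope_term[1:]
    f -= 0.01 * grad                          # endpoints stay fixed: their gradient is zero

final_length = length(f)
max_deviation = float(np.max(np.abs(f - x)))
print(initial_length, final_length, max_deviation)
```

The gradient is the difference of `slope_term` across neighbouring segments, a discrete counterpart of $\frac{d}{dx}\frac{\partial L}{\partial f'}$, so it vanishes exactly when the slope is the same on every segment.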

The Functional Derivative

The functional derivative $\frac{\delta J}{\delta f(x)}$ generalizes the gradient to function spaces:

$$J[f + \epsilon\eta] \approx J[f] + \epsilon \int \frac{\delta J}{\delta f(x)} \eta(x) \, dx$$

For the standard integral functional $J[f] = \int L(x, f, f') \, dx$:

$$\frac{\delta J}{\delta f(x)} = \frac{\partial L}{\partial f} - \frac{d}{dx}\frac{\partial L}{\partial f'}$$

Setting the functional derivative to zero gives the Euler-Lagrange equation. This is directly analogous to setting $\nabla_{\boldsymbol{\theta}} \mathcal{L} = 0$ in gradient-based optimization.
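As a small sanity check of the first-order expansion, the sketch below uses the functional $J[f] = \int_0^1 f(x)^2 \, dx$, chosen here purely for illustration. Its Lagrangian $L = f^2$ has no $f'$ dependence, so the formula gives $\frac{\delta J}{\delta f(x)} = 2 f(x)$. The code compares a finite-difference directional derivative with $\int 2 f \eta \, dx$; the choices of $f$, $\eta$, and the grid are assumptions.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]

def integrate(values):
    """Trapezoidal rule on the uniform grid x."""
    return float(dx * (np.sum(values) - 0.5 * (values[0] + values[-1])))

def J(f):
    """Illustrative functional J[f] = integral of f(x)^2 over [0, 1]."""
    return integrate(f**2)

f = np.sin(np.pi * x)     # the point in function space where we differentiate
eta = x * (1.0 - x)       # a perturbation that vanishes at both endpoints
eps = 1e-4

# Central finite difference of J along the direction eta.
directional = (J(f + eps * eta) - J(f - eps * eta)) / (2 * eps)
# Prediction from the functional derivative 2 f(x).
predicted = integrate(2.0 * f * eta)
print(directional, predicted)
```

The two numbers agree up to rounding, which is the function-space version of checking a gradient against finite differences.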

Maximum Entropy Principle

The maximum entropy distribution subject to constraints is found using variational methods. Given constraints on moments $\mathbb{E}[g_k(x)] = c_k$, we maximize:

$$H[p] = -\int p(x) \ln p(x) \, dx$$

subject to $\int p(x) \, dx = 1$ and $\int g_k(x) p(x) \, dx = c_k$.

Using Lagrange multipliers on the functional:

$$\mathcal{L}[p] = -\int p \ln p \, dx + \lambda_0\left(\int p \, dx - 1\right) + \sum_k \lambda_k\left(\int g_k p \, dx - c_k\right)$$

Setting the functional derivative to zero:

$$\frac{\delta \mathcal{L}}{\delta p(x)} = -\ln p(x) - 1 + \lambda_0 + \sum_k \lambda_k g_k(x) = 0$$

Solving: $p^*(x) = \exp\left(\lambda_0 - 1 + \sum_k \lambda_k g_k(x)\right)$

This is the exponential family form. Different constraints yield different distributions:

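| Constraints | Max-entropy distribution |
|---|---|
| None (just normalization, on a bounded interval) | Uniform |
| Fixed mean and variance (on the real line) | Gaussian |
| Fixed mean (support $x \geq 0$) | Exponential |
| Fixed $\mathbb{E}[\ln x]$ and $\mathbb{E}[x]$ (support $x > 0$) | Gamma |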

Key insight: The maximum entropy principle provides a principled way to choose a probability distribution when you have limited information. It selects the distribution that makes the fewest assumptions beyond the known constraints. The exponential family — the workhorse of statistical ML — arises naturally from this principle.
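A discrete analog makes the exponential form easy to see. The sketch below takes a six-sided die with a prescribed mean, where the stationarity condition gives $p_i \propto \exp(\lambda i)$, and finds the multiplier $\lambda$ by bisection. The target mean of 4.5 and the search bracket are assumptions chosen for illustration; the setup follows the style of Jaynes's dice problem rather than anything computed in this article.

```python
import numpy as np

faces = np.arange(1, 7)
target_mean = 4.5           # assumed constraint E[x] = 4.5 (a fair die has 3.5)

def maxent_pmf(lam):
    """Exponential-family solution p_i proportional to exp(lam * i)."""
    w = np.exp(lam * faces)
    return w / w.sum()       # normalization plays the role of lambda_0

def mean_for(lam):
    return float(maxent_pmf(lam) @ faces)

# The mean increases monotonically with lam, so bisection finds the multiplier.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean_for(mid) < target_mean:
        lo = mid
    else:
        hi = mid

lam = 0.5 * (lo + hi)
p = maxent_pmf(lam)
entropy = float(-(p * np.log(p)).sum())
print(lam, p, entropy)
```

Any other distribution on the six faces with the same mean, for example putting probability 0.5 on each of faces 4 and 5, has lower entropy than `p`.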

Variational Inference

Variational inference (VI) is arguably the most widely used application of variational methods in modern ML. It transforms an intractable integration problem into an optimization problem.

The Problem

In Bayesian inference, we want the posterior:

$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\mathcal{D})}$$

The denominator $p(\mathcal{D}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}$ is typically an intractable integral for complex models.

The Variational Approach

Instead of computing $p(\boldsymbol{\theta} \mid \mathcal{D})$ exactly, we find the closest approximation $q(\boldsymbol{\theta})$ from a tractable family $\mathcal{Q}$ (e.g., Gaussians). “Closest” means minimizing the KL divergence:

$$q^* = \arg\min_{q \in \mathcal{Q}} D_{KL}(q(\boldsymbol{\theta}) \| p(\boldsymbol{\theta} \mid \mathcal{D}))$$

This is a variational problem: we are optimizing over the function $q$.

The Evidence Lower Bound (ELBO)

The KL divergence above requires the unknown posterior. We can rewrite it:

$$\ln p(\mathcal{D}) = \underbrace{\mathbb{E}_{q}\left[\ln \frac{p(\mathcal{D}, \boldsymbol{\theta})}{q(\boldsymbol{\theta})}\right]}_{\text{ELBO}(q)} + D_{KL}(q \| p(\boldsymbol{\theta} \mid \mathcal{D}))$$

Since $D_{KL} \geq 0$, the ELBO is a lower bound on the log-evidence. Because $\ln p(\mathcal{D})$ does not depend on $q$, maximizing the ELBO is equivalent to minimizing the KL divergence. Unlike the KL divergence, the ELBO involves only the joint $p(\mathcal{D}, \boldsymbol{\theta})$ and $q$, both of which can be evaluated, so it is tractable:

$$\text{ELBO}(q) = \mathbb{E}_{q}[\ln p(\mathcal{D} \mid \boldsymbol{\theta})] - D_{KL}(q(\boldsymbol{\theta}) \| p(\boldsymbol{\theta}))$$

The first term is the expected log-likelihood (fit the data). The second term is the KL divergence between the approximate posterior and the prior (stay close to prior beliefs). This is a trade-off between data fit and regularization, loosely analogous to the bias-variance trade-off.

Key insight: The ELBO converts an intractable integration problem (computing the evidence) into a tractable optimization problem (maximizing a lower bound). This is the core idea of variational inference: replace integration with optimization.
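The decomposition of the log-evidence into ELBO plus KL can be verified exactly when the parameter takes only a few discrete values, so that every integral becomes a finite sum. The sketch below does this for a three-valued parameter; the prior, the likelihood values, and the approximation `q` are made-up numbers for illustration only.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])            # p(theta), assumed
likelihood = np.array([0.10, 0.40, 0.70])    # p(D | theta) for the observed D, assumed
joint = prior * likelihood                   # p(D, theta)
evidence = float(joint.sum())                # p(D), a finite sum in this toy case
posterior = joint / evidence                 # p(theta | D)

def kl(q, p):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

def elbo(q):
    """E_q[ ln p(D, theta) - ln q(theta) ]."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

q = np.array([0.2, 0.5, 0.3])                # an arbitrary approximate posterior

gap = np.log(evidence) - elbo(q)             # should equal KL(q || posterior)
elbo_alt = float(q @ np.log(likelihood)) - kl(q, prior)   # likelihood-minus-KL form
print(gap, kl(q, posterior), elbo(q), elbo_alt)
```

Setting `q` equal to `posterior` closes the gap, which is why maximizing the ELBO over an unrestricted family recovers the exact posterior.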

Variational Autoencoders (VAEs)

The VAE applies variational inference to generative modeling. The model has:

  • Decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$: generates data $\mathbf{x}$ from latent code $\mathbf{z}$
  • Prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$: latent space distribution
  • Encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$: approximate posterior (recognition model)

The VAE loss (negative ELBO) for a single data point:

$$\mathcal{L}(\theta, \phi; \mathbf{x}) = -\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\ln p_\theta(\mathbf{x} \mid \mathbf{z})] + D_{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$$

The first term is the reconstruction loss (how well the decoder reconstructs $\mathbf{x}$). The second term is the regularization loss (how close the encoder’s posterior is to the prior).

Training uses the reparameterization trick to backpropagate through the sampling of $\mathbf{z}$.
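A minimal sketch of the two pieces that make the VAE loss computable, assuming the common choice of a diagonal Gaussian encoder (the text does not fix the encoder family). Sampling is rewritten as a deterministic function of the encoder outputs plus independent noise, and the KL term to the standard normal prior has a closed form. The function names and the example values of `mu` and `log_var` are hypothetical stand-ins for real encoder outputs. NumPy is used only to show the arithmetic; an autodiff framework would be needed to take gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)      # noise that does not depend on mu or log_var
    return mu + np.exp(0.5 * log_var) * eps  # differentiable in mu and log_var

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return float(0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var))

# Hypothetical encoder outputs for one data point with a 2-D latent space.
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.5])

z = reparameterize(mu, log_var, rng)
kl_value = kl_to_standard_normal(mu, log_var)
print(z, kl_value)
```

Because the randomness lives entirely in `eps`, gradients of the reconstruction term with respect to `mu` and `log_var` can pass through `z`, which is the point of the trick.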

Optimal Control (Brief)

In reinforcement learning, the agent seeks a policy (function) that maximizes cumulative reward. The Hamilton-Jacobi-Bellman (HJB) equation is the continuous-time analog of the Bellman equation. It is derived by dynamic programming and is closely tied to the calculus of variations:

$$-\frac{\partial V}{\partial t} = \max_{\mathbf{u}} \left[r(\mathbf{x}, \mathbf{u}) + \nabla_\mathbf{x} V \cdot f(\mathbf{x}, \mathbf{u})\right]$$

where $V$ is the value function, $\mathbf{u}$ is the control (action), $r$ is the reward, and $f$ is the dynamics, i.e. $\dot{\mathbf{x}} = f(\mathbf{x}, \mathbf{u})$. This connects the calculus of variations to RL’s core optimization problem.

Why This Matters for ML

The calculus of variations provides the theoretical foundation for some of the most important techniques in modern ML:

  • Variational inference approximates intractable posteriors by optimizing over distribution families
  • The ELBO is the objective function for VAEs and many Bayesian deep learning methods
  • Maximum entropy derives the exponential family — the foundation of generalized linear models
  • Optimal control theory connects to policy optimization in reinforcement learning
  • The functional derivative generalizes gradients to infinite-dimensional spaces

Summary

  • Functionals map functions to scalars; the calculus of variations optimizes over functions
  • The Euler-Lagrange equation is the necessary condition for a function to optimize a functional, the analog of $\nabla f = 0$
  • Maximum entropy uses variational methods to derive the exponential family from moment constraints
  • Variational inference replaces intractable integration with optimization by maximizing the ELBO
  • The ELBO = expected log-likelihood minus KL regularization — the training objective for VAEs
  • VAEs combine variational inference with deep learning using the reparameterization trick
  • Next: second-order and natural gradient methods refine optimization using curvature information

References

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 10.
  • Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. probml.github.io
  • Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR. arXiv:1312.6114
  • Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. JASA, 112(518), 859-877. arXiv:1601.00670
  • Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
