Beyond Parameter Optimization
Standard optimization finds the best parameters — a vector that minimizes a loss. The calculus of variations goes further: it finds the best function that minimizes a functional (a function of functions).
This may sound abstract, but it directly underpins:
- Variational inference — approximate intractable posteriors in Bayesian models
- VAEs — the ELBO objective is derived from variational principles
- Maximum entropy — finding the least-biased distribution subject to constraints
- Optimal control — finding the best policy in reinforcement learning
If integration is the engine of probabilistic reasoning, the calculus of variations is its optimization counterpart.
Functionals
A functional maps a function to a scalar. We write $J[y]$ with square brackets to distinguish it from ordinary functions $f(x)$.
Example: The arc length of a curve $y(x)$ from $x = a$ to $x = b$ is a functional:
$$J[y] = \int_a^b \sqrt{1 + y'(x)^2}\,dx$$
Different functions $y$ produce different arc lengths. The calculus of variations asks: which $y$ minimizes $J[y]$?
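To make the idea of "a function goes in, a number comes out" concrete, here is a minimal sketch that approximates the arc-length functional numerically. The helper name `arc_length`, the endpoints $(0, 0)$ and $(1, 1)$, and the three candidate curves are assumptions chosen for illustration only.

```python
import math

def arc_length(y, a, b, n=10_000):
    """Approximate J[y] = integral of sqrt(1 + y'(x)^2) over [a, b]
    by summing the lengths of n short chords along the curve."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x0, x1 = a + i * h, a + (i + 1) * h
        total += math.hypot(x1 - x0, y(x1) - y(x0))
    return total

# Three candidate curves, all joining (0, 0) to (1, 1).
line = arc_length(lambda x: x, 0.0, 1.0)
parabola = arc_length(lambda x: x**2, 0.0, 1.0)
wiggly = arc_length(lambda x: x + 0.1 * math.sin(2 * math.pi * x), 0.0, 1.0)

print(f"line:     {line:.4f}")
print(f"parabola: {parabola:.4f}")
print(f"wiggly:   {wiggly:.4f}")
```

Each curve is a different input to the same functional, and the straight line should report the smallest value of the three.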
General Form
Many functionals have the form:
$$J[y] = \int_a^b L(x, y(x), y'(x))\,dx$$
where $L$ is called the Lagrangian (not to be confused with the Lagrangian in constrained optimization, though the ideas are related). The Lagrangian depends on the independent variable $x$, the function value $y(x)$, and its derivative $y'(x)$.
The Euler-Lagrange Equation
The function $y$ that makes $J[y]$ stationary (minimizes, maximizes, or is a saddle point) satisfies the Euler-Lagrange equation:
$$\frac{\partial L}{\partial y} - \frac{d}{dx}\frac{\partial L}{\partial y'} = 0$$
This is the analog of $\nabla f(\mathbf{x}) = 0$ from finite-dimensional optimization, but for functions. Instead of setting a gradient to zero, we set a functional derivative to zero.
Derivation Sketch
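Consider a small perturbation $y(x) + \epsilon\,\eta(x)$ where $\eta$ vanishes at the endpoints. The functional becomes $J[y + \epsilon\eta]$, and stationarity requires:
$$\frac{d}{d\epsilon} J[y + \epsilon\eta]\,\Big|_{\epsilon = 0} = 0 \quad \text{for all } \eta$$
Expanding and integrating by parts yields the Euler-Lagrange equation. The “for all $\eta$” condition is what produces a differential equation rather than a single constraint.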
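The expansion and integration by parts can be written out explicitly. This is a sketch, assuming the Lagrangian $L$ is smooth enough to differentiate under the integral, the perturbed curve is written $y + \epsilon\eta$, and $\eta(a) = \eta(b) = 0$:

$$
\frac{d}{d\epsilon} J[y + \epsilon\eta]\,\Big|_{\epsilon = 0}
= \int_a^b \left( \frac{\partial L}{\partial y}\,\eta + \frac{\partial L}{\partial y'}\,\eta' \right) dx
= \left[ \frac{\partial L}{\partial y'}\,\eta \right]_a^b
+ \int_a^b \left( \frac{\partial L}{\partial y} - \frac{d}{dx}\frac{\partial L}{\partial y'} \right) \eta \, dx
$$

The boundary term vanishes because $\eta$ is zero at both endpoints. The remaining integral must be zero for every admissible $\eta$, which is only possible if the expression in parentheses is zero at every $x$.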
Worked Example: Shortest Path
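Find the curve of shortest length connecting $(a, y_a)$ to $(b, y_b)$.
Functional: $J[y] = \int_a^b \sqrt{1 + y'^2}\,dx$
Here $L = \sqrt{1 + y'^2}$. Since $L$ does not depend on $y$, we have $\partial L / \partial y = 0$, so the Euler-Lagrange equation becomes:
$$\frac{d}{dx}\left( \frac{y'}{\sqrt{1 + y'^2}} \right) = 0$$
This implies $y' = c$ for some constant $c$, so $y$ is a straight line. With boundary conditions $y(a) = y_a$ and $y(b) = y_b$: $y(x) = y_a + \frac{y_b - y_a}{b - a}(x - a)$.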
The shortest path between two points is a straight line — the calculus of variations confirms what geometry tells us.
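The same conclusion can be checked numerically by discretizing the curve and minimizing its length with plain gradient descent. This is only a sketch: the endpoints $(0, 0)$ and $(1, 1)$, the grid size, the step size, and the bumpy starting curve are all assumptions made for illustration.

```python
import math

def path_length(ys, h):
    """Length of the polyline through equally spaced nodes with heights ys."""
    return sum(math.hypot(h, ys[i + 1] - ys[i]) for i in range(len(ys) - 1))

def length_gradient(ys, h):
    """Gradient of path_length with respect to the interior node heights."""
    grad = [0.0] * len(ys)
    for i in range(1, len(ys) - 1):
        left = (ys[i] - ys[i - 1]) / math.hypot(h, ys[i] - ys[i - 1])
        right = (ys[i + 1] - ys[i]) / math.hypot(h, ys[i + 1] - ys[i])
        grad[i] = left - right
    return grad

n = 20                      # number of segments
h = 1.0 / n                 # node spacing on [0, 1]
xs = [i * h for i in range(n + 1)]
# Start from a bumpy curve that already satisfies y(0) = 0 and y(1) = 1.
ys = [x + 0.3 * math.sin(math.pi * x) for x in xs]

step = 0.01
for _ in range(20_000):
    grad = length_gradient(ys, h)
    # Endpoint entries of grad are zero, so the boundary conditions are kept.
    ys = [y - step * g for y, g in zip(ys, grad)]

max_dev = max(abs(y - x) for x, y in zip(xs, ys))
print(f"final length = {path_length(ys, h):.6f}, sqrt(2) = {math.sqrt(2):.6f}")
print(f"max deviation from y = x: {max_dev:.2e}")
```

The interior gradient used here is a discrete version of the Euler-Lagrange expression for arc length: it vanishes exactly when neighboring segments have the same slope.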
The Functional Derivative
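The functional derivative $\delta J / \delta y$ generalizes the gradient to function spaces. It is defined by how $J$ responds to a perturbation $\eta$:
$$\frac{d}{d\epsilon} J[y + \epsilon\eta]\,\Big|_{\epsilon = 0} = \int_a^b \frac{\delta J}{\delta y(x)}\,\eta(x)\,dx$$
For the standard integral functional $J[y] = \int_a^b L(x, y, y')\,dx$:
$$\frac{\delta J}{\delta y} = \frac{\partial L}{\partial y} - \frac{d}{dx}\frac{\partial L}{\partial y'}$$
Setting the functional derivative to zero gives the Euler-Lagrange equation. This is directly analogous to setting $\nabla f(\mathbf{x}) = 0$ in gradient-based optimization.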
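One way to see the link between ordinary gradients and the functional derivative is to discretize a functional and differentiate with respect to a single grid value. The sketch below uses an assumed example, $J[y] = \int_0^1 \tfrac{1}{2} y'(x)^2\,dx$, whose functional derivative is $-y''(x)$, and the test curve $y(x) = \sin(\pi x)$. The function names are hypothetical.

```python
import math

def dirichlet_energy(ys, h):
    """Discretized J[y] = integral of 0.5 * y'(x)^2 dx on a uniform grid."""
    return sum(0.5 * ((ys[i + 1] - ys[i]) / h) ** 2 * h for i in range(len(ys) - 1))

n = 200
h = 1.0 / n
xs = [i * h for i in range(n + 1)]
ys = [math.sin(math.pi * x) for x in xs]

def functional_derivative_at(i, eps=1e-4):
    """Central finite difference of J with respect to node i, divided by h.

    Dividing by h turns the ordinary partial derivative into an
    approximation of the functional derivative at x_i."""
    up = ys[:i] + [ys[i] + eps] + ys[i + 1:]
    down = ys[:i] + [ys[i] - eps] + ys[i + 1:]
    return (dirichlet_energy(up, h) - dirichlet_energy(down, h)) / (2 * eps * h)

i = n // 4                                            # grid node at x = 0.25
numeric = functional_derivative_at(i)
analytic = math.pi ** 2 * math.sin(math.pi * xs[i])   # -y''(x) for y = sin(pi x)
print(f"numeric  = {numeric:.4f}")
print(f"analytic = {analytic:.4f}")
```

The division by the grid spacing `h` is the key design choice: a partial derivative with respect to one node shrinks as the grid is refined, while the functional derivative is the per-unit-length limit.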
Maximum Entropy Principle
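The maximum entropy distribution subject to constraints is found using variational methods. Given constraints on moments $\mathbb{E}_p[f_i(x)] = \mu_i$, we maximize the entropy:
$$H[p] = -\int p(x) \ln p(x)\,dx$$
subject to $\int p(x)\,dx = 1$ and $\int p(x) f_i(x)\,dx = \mu_i$.
Using Lagrange multipliers on the functional:
$$\mathcal{L}[p] = -\int p(x) \ln p(x)\,dx + \lambda_0 \left( \int p(x)\,dx - 1 \right) + \sum_i \lambda_i \left( \int p(x) f_i(x)\,dx - \mu_i \right)$$
Setting the functional derivative to zero:
$$\frac{\delta \mathcal{L}}{\delta p(x)} = -\ln p(x) - 1 + \lambda_0 + \sum_i \lambda_i f_i(x) = 0$$
Solving:
$$p(x) = \exp\left( \lambda_0 - 1 + \sum_i \lambda_i f_i(x) \right) = \frac{1}{Z} \exp\left( \sum_i \lambda_i f_i(x) \right), \quad Z = e^{1 - \lambda_0}$$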
This is the exponential family form. Different constraints yield different distributions:
| Constraints | Max-entropy distribution |
|---|---|
| None (just normalization, bounded support) | Uniform |
| Fixed mean and variance | Gaussian |
| Fixed mean ($x \geq 0$) | Exponential |
| Fixed $\mathbb{E}[x]$ and $\mathbb{E}[\ln x]$ ($x > 0$) | Gamma |
Key insight: The maximum entropy principle provides a principled way to choose a probability distribution when you have limited information. It selects the distribution that makes the fewest assumptions beyond the known constraints. The exponential family — the workhorse of statistical ML — arises naturally from this principle.
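A small discrete example makes the exponential-family form tangible. The sketch below assumes a six-sided die whose average roll is constrained to 4.5 (a setup in the spirit of Jaynes' dice problem), solves for the single Lagrange multiplier by bisection, and compares the resulting entropy with another distribution that has the same mean. The function names and the comparison distribution are hypothetical.

```python
import math

faces = [1, 2, 3, 4, 5, 6]
target_mean = 4.5   # assumed constraint: the average roll is 4.5, not 3.5

def maxent_dist(lam):
    """Exponential-family form p_k proportional to exp(lam * k)."""
    weights = [math.exp(lam * k) for k in faces]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p):
    return sum(k * pk for k, pk in zip(faces, p))

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# The mean is increasing in lam, so bisection finds the multiplier.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean(maxent_dist(mid)) < target_mean:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
p_maxent = maxent_dist(lam)

# Another distribution with the same mean, for comparison.
p_other = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]

print("lambda  =", round(lam, 4))
print("max-ent =", [round(pk, 4) for pk in p_maxent])
print("entropy (max-ent) =", round(entropy(p_maxent), 4))
print("entropy (other)   =", round(entropy(p_other), 4))
```

Both distributions satisfy the mean constraint, but the exponential-form solution spreads probability over all faces and so has the higher entropy.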
Variational Inference
Variational inference (VI) is the most important application of variational methods in modern ML. It transforms an intractable integration problem into an optimization problem.
The Problem
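In Bayesian inference, we want the posterior over latent variables $z$ given data $x$:
$$p(z \mid x) = \frac{p(x \mid z)\,p(z)}{p(x)} = \frac{p(x \mid z)\,p(z)}{\int p(x \mid z)\,p(z)\,dz}$$
The denominator $p(x)$, called the evidence, is an intractable integral for complex models.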
The Variational Approach
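Instead of computing $p(z \mid x)$ exactly, we find the closest approximation $q(z)$ from a tractable family $\mathcal{Q}$ (e.g., Gaussians). “Closest” means minimizing the KL divergence:
$$q^* = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big( q(z) \,\|\, p(z \mid x) \big)$$
This is a variational problem: we are optimizing over the function $q$.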
The Evidence Lower Bound (ELBO)
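The KL divergence above requires the unknown posterior. We can rewrite it:
$$\mathrm{KL}\big( q(z) \,\|\, p(z \mid x) \big) = \ln p(x) - \mathrm{ELBO}(q), \quad \mathrm{ELBO}(q) = \mathbb{E}_{q(z)}\big[ \ln p(x, z) - \ln q(z) \big]$$
Since $\mathrm{KL} \geq 0$, the ELBO is a lower bound on the log-evidence $\ln p(x)$. Maximizing the ELBO is equivalent to minimizing the KL divergence, but the ELBO is tractable:
$$\mathrm{ELBO}(q) = \mathbb{E}_{q(z)}\big[ \ln p(x \mid z) \big] - \mathrm{KL}\big( q(z) \,\|\, p(z) \big)$$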
The first term is the expected log-likelihood (fit the data). The second term is the KL divergence between the approximate posterior and the prior (stay close to prior beliefs). Together they express a trade-off between data fit and regularization toward the prior, loosely analogous to the bias-variance trade-off.
Key insight: The ELBO converts an intractable integration problem (computing the evidence) into a tractable optimization problem (maximizing a lower bound). This is the core idea of variational inference: replace integration with optimization.
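The relationship between the ELBO, the KL divergence, and the log-evidence can be verified on a model small enough to enumerate. The sketch below assumes a hypothetical toy model with a binary latent variable and a single fixed observation. The prior and likelihood numbers are made up for illustration.

```python
import math

# Hypothetical toy model: binary latent z, one fixed observation x.
prior = [0.7, 0.3]          # p(z = 0), p(z = 1)
likelihood = [0.2, 0.9]     # p(x | z = 0), p(x | z = 1) for the observed x

evidence = sum(pz * lx for pz, lx in zip(prior, likelihood))      # p(x)
posterior = [pz * lx / evidence for pz, lx in zip(prior, likelihood)]

def kl(q, p):
    """KL(q || p) for discrete distributions with matching support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def elbo(q):
    """E_q[ln p(x | z)] - KL(q || prior)."""
    expected_loglik = sum(qi * math.log(lx) for qi, lx in zip(q, likelihood))
    return expected_loglik - kl(q, prior)

for q1 in (0.1, 0.5, posterior[1]):
    q = [1.0 - q1, q1]
    gap = kl(q, posterior)
    print(f"q(z=1)={q1:.3f}  ELBO={elbo(q):.4f}  KL={gap:.4f}  "
          f"ELBO+KL={elbo(q) + gap:.4f}  ln p(x)={math.log(evidence):.4f}")
```

For every choice of the approximating distribution, the ELBO plus the KL gap should equal the log-evidence, and the gap closes when the approximation equals the true posterior.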
Variational Autoencoders (VAEs)
The VAE applies variational inference to generative modeling. The model has:
- Decoder $p_\theta(x \mid z)$: generates data $x$ from latent code $z$
- Prior $p(z)$: latent space distribution
- Encoder $q_\phi(z \mid x)$: approximate posterior (recognition model)
The VAE loss (negative ELBO) for a single data point $x$:
$$\mathcal{L}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z \mid x)}\big[ \ln p_\theta(x \mid z) \big] + \mathrm{KL}\big( q_\phi(z \mid x) \,\|\, p(z) \big)$$
The first term is the reconstruction loss (how well the decoder reconstructs $x$). The second term is the regularization loss (how close the encoder’s posterior is to the prior).
Training uses the reparameterization trick to backpropagate through the sampling of $z$.
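The reparameterization trick can be illustrated without a neural network. The sketch below assumes a scalar Gaussian encoder output and a standard normal prior, and uses $f(z) = z^2$ as a hypothetical stand-in for the decoder loss. It writes each sample as $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, estimates gradients through that expression, and also checks the closed-form Gaussian KL term against a Monte Carlo estimate.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a scalar latent."""
    return 0.5 * (mu**2 + sigma**2 - 1.0) - math.log(sigma)

def log_normal_pdf(z, mu, sigma):
    """Log density of N(mu, sigma^2) at z."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2)

rng = random.Random(0)
mu, sigma = 0.8, 0.5      # hypothetical encoder outputs for one data point
n = 100_000

kl_sum = grad_mu_sum = grad_sigma_sum = 0.0
for _ in range(n):
    eps = rng.gauss(0.0, 1.0)      # noise that does not depend on mu or sigma
    z = mu + sigma * eps           # reparameterized sample from N(mu, sigma^2)
    kl_sum += log_normal_pdf(z, mu, sigma) - log_normal_pdf(z, 0.0, 1.0)
    # Pathwise gradients of the stand-in loss f(z) = z^2:
    # df/dmu = f'(z) * dz/dmu = 2z * 1, df/dsigma = f'(z) * dz/dsigma = 2z * eps.
    grad_mu_sum += 2.0 * z
    grad_sigma_sum += 2.0 * z * eps

mc_kl = kl_sum / n
grad_mu = grad_mu_sum / n
grad_sigma = grad_sigma_sum / n

print(f"KL closed form = {kl_to_standard_normal(mu, sigma):.4f}, Monte Carlo = {mc_kl:.4f}")
print(f"d/dmu    E[z^2]: estimate = {grad_mu:.3f}, exact = {2 * mu:.3f}")
print(f"d/dsigma E[z^2]: estimate = {grad_sigma:.3f}, exact = {2 * sigma:.3f}")
```

Because the randomness is isolated in `eps`, the sample `z` is a deterministic function of `mu` and `sigma`, which is what lets an automatic differentiation framework pass gradients through the sampling step.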
Optimal Control (Brief)
In reinforcement learning, the agent seeks a policy (function) that maximizes cumulative reward. The Hamilton-Jacobi-Bellman (HJB) equation is the continuous-time analog of the Bellman equation, derived from the calculus of variations:
$$-\frac{\partial V}{\partial t}(x, t) = \max_u \left[ r(x, u) + \nabla_x V(x, t) \cdot f(x, u) \right]$$
where $V$ is the value function of the state $x$, $u$ is the control (action), $r$ is the reward, and $f$ is the dynamics. This connects the calculus of variations to RL’s core optimization problem.
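A one-dimensional example shows what the maximization over controls looks like in practice. The sketch below assumes a hypothetical problem with dynamics $dx/dt = u$, reward $r(x, u) = -(x^2 + u^2)$, and an infinite horizon, for which the time-independent candidate $V(x) = -x^2$ should make the maximized bracket equal to zero. The control grid and the test states are arbitrary choices.

```python
def reward(x, u):
    return -(x**2 + u**2)          # assumed quadratic cost, written as a reward

def dynamics(x, u):
    return u                       # assumed dynamics: dx/dt = u

def value_gradient(x):
    return -2.0 * x                # derivative of the candidate V(x) = -x^2

def hjb_rhs(x, u):
    """Bracketed term of the HJB equation: r(x, u) + V'(x) * f(x, u)."""
    return reward(x, u) + value_gradient(x) * dynamics(x, u)

controls = [i / 1000.0 for i in range(-3000, 3001)]   # grid of candidate actions
for x in (-1.5, 0.5, 2.0):
    best_u = max(controls, key=lambda u: hjb_rhs(x, u))
    print(f"x={x:+.1f}  best u={best_u:+.3f}  max of RHS={hjb_rhs(x, best_u):.6f}")
```

In this example the maximizing control works out to $u = -x$, so the optimal policy is read off directly from the gradient of the value function.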
Why This Matters for ML
The calculus of variations provides the theoretical foundation for some of the most important techniques in modern ML:
- Variational inference approximates intractable posteriors by optimizing over distribution families
- The ELBO is the objective function for VAEs and many Bayesian deep learning methods
- Maximum entropy derives the exponential family — the foundation of generalized linear models
- Optimal control theory connects to policy optimization in reinforcement learning
- The functional derivative generalizes gradients to infinite-dimensional spaces
Summary
- Functionals map functions to scalars; the calculus of variations optimizes over functions
- The Euler-Lagrange equation is the necessary condition for a function to optimize a functional, the analog of $\nabla f(\mathbf{x}) = 0$
- Maximum entropy uses variational methods to derive the exponential family from moment constraints
- Variational inference replaces intractable integration with optimization by maximizing the ELBO
- The ELBO = expected log-likelihood minus KL regularization — the training objective for VAEs
- VAEs combine variational inference with deep learning using the reparameterization trick
- Next: second-order and natural gradient methods refine optimization using curvature information