Integration and Expectation: The Continuous Side of Probability

Calculus & Optimization Series 11 / 18

Why Integration Matters for ML

Derivatives tell us how functions change. Integrals tell us how functions accumulate. In machine learning, integration is everywhere:

Probability densities are defined through integrals: $P(a \leq X \leq b) = \int_a^b f(x) \, dx$
Expectations are integrals: $\mathbb{E}[X] = \int x \, f(x) \, dx$
Marginalizing over latent variables requires integrating them out
Normalizing constants ensure densities integrate to 1
Evidence in Bayesian inference is an integral over all possible parameters

If derivatives are the engine of optimization, integrals are the engine of probabilistic reasoning.

The Definite Integral

The definite integral of $f$ from $a$ to $b$ is the signed area under the curve:

\int_a^b f(x) \, dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^*) \Delta x

where $\Delta x = (b - a)/n$ and $x_i^*$ is a sample point in the $i$-th subinterval. The sum on the right is a Riemann sum: it approximates the area with $n$ rectangles, and the limit lets the rectangles become infinitely thin.
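The limit can be watched numerically. The following is a minimal sketch, assuming the midpoint of each subinterval as the sample point $x_i^*$ and the illustrative integrand $f(x) = x^2$ on $[0, 1]$, whose exact integral is $1/3$. The names `riemann_sum` and `square` are made up for this example.

```python
def riemann_sum(f, a, b, n):
    """Approximate the integral of f over [a, b] with n midpoint rectangles."""
    dx = (b - a) / n
    # The sample point x_i^* is taken to be the midpoint of the i-th subinterval.
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


def square(x):
    return x * x


# The exact value of the integral of x^2 over [0, 1] is 1/3.
for n in (10, 100, 1000):
    approx = riemann_sum(square, 0.0, 1.0, n)
    print(n, approx, abs(approx - 1.0 / 3.0))
```

The printed error shrinks as $n$ grows, which is the limit in the definition at work.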

Intuition: If the derivative answers “how fast is this changing?”, the integral answers “how much has accumulated?” They are inverse operations — the Fundamental Theorem of Calculus makes this precise.

The Fundamental Theorem of Calculus

The Fundamental Theorem connects differentiation and integration:

Part 1: If $f$ is continuous on $[a, b]$ and $F(x) = \int_a^x f(t) \, dt$, then $F'(x) = f(x)$.

Part 2: If $F$ is an antiderivative of $f$ (meaning $F' = f$), then:

\int_a^b f(x) \, dx = F(b) - F(a)

This transforms the problem of computing areas into the problem of finding antiderivatives.

Essential Antiderivatives

$f(x)$	$\int f(x) \, dx$	ML relevance
$x^n$ ($n \neq -1$)	$\frac{x^{n+1}}{n+1} + C$	Polynomial features
$e^x$	$e^x + C$	Exponential family
$1/x$	$\ln|x| + C$	Log-likelihood
$e^{-x^2}$	$\frac{\sqrt{\pi}}{2} \operatorname{erf}(x) + C$ (not elementary)	Gaussian; requires the error function
$\frac{1}{1+e^{-x}}$	$\ln(1 + e^x) + C$	Sigmoid / softplus
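
The sigmoid row of the table can be checked against Part 2 of the Fundamental Theorem: the area under the sigmoid between $a$ and $b$ should equal the difference of softplus values. This is a rough numerical sketch. The endpoints $a = -2$, $b = 3$ and the helper `midpoint_integral` are illustrative choices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    # ln(1 + e^x), an antiderivative of the sigmoid.
    return math.log1p(math.exp(x))


def midpoint_integral(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


a, b = -2.0, 3.0
numeric = midpoint_integral(sigmoid, a, b)
exact = softplus(b) - softplus(a)  # F(b) - F(a)
print(numeric, exact)
```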

The integrand $e^{-x^2}$ has no elementary antiderivative, yet the definite integral over the whole real line has a closed-form value: $\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$. This result underpins the normalization of the normal distribution.

Integration Techniques

Substitution (Change of Variables)

If $u = g(x)$, then:

\int f(g(x)) \, g'(x) \, dx = \int f(u) \, du

This is the integral counterpart of the chain rule.

Example: $\int 2x \, e^{x^2} dx$. Let $u = x^2$, so $du = 2x \, dx$ and the integral becomes $\int e^u \, du = e^u + C = e^{x^2} + C$.

Integration by Parts

\int u \, dv = uv - \int v \, du

This is the integral counterpart of the product rule. It is essential for deriving expectations of products and for working with information-theoretic quantities.
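
As a worked instance, consider the mean of an exponential random variable with density $f(x) = \lambda e^{-\lambda x}$ on $x \geq 0$ (this distribution is an illustrative choice). Take $u = x$ and $dv = \lambda e^{-\lambda x} \, dx$, so that $du = dx$ and $v = -e^{-\lambda x}$:

\mathbb{E}[X] = \int_0^{\infty} x \, \lambda e^{-\lambda x} \, dx = \left[ -x e^{-\lambda x} \right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x} \, dx = 0 + \frac{1}{\lambda} = \frac{1}{\lambda}

The boundary term vanishes because $x e^{-\lambda x} \to 0$ as $x \to \infty$ for $\lambda > 0$.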

The Gaussian Integral

The integral $\int_{-\infty}^{\infty} e^{-x^2/2} dx = \sqrt{2\pi}$ is foundational. Substituting $t = (x - \mu)/\sigma$, so that $dx = \sigma \, dt$, shows that the density of the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ is correctly normalized:

\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = 1
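
Both statements can be sanity-checked numerically. The sketch below assumes the infinite range can be truncated to twelve standard deviations on either side (the neglected tails are far below floating-point precision) and uses illustrative values $\mu = 1$, $\sigma = 0.5$. The helper names are hypothetical.

```python
import math


def midpoint_integral(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Integral of exp(-x^2 / 2), truncated to [-12, 12]; compare with sqrt(2 * pi).
base = midpoint_integral(lambda x: math.exp(-0.5 * x * x), -12.0, 12.0)
print(base, math.sqrt(2.0 * math.pi))

# The normalized density should integrate to 1 for any mu and sigma > 0.
mu, sigma = 1.0, 0.5  # illustrative parameters
total = midpoint_integral(
    lambda x: gaussian_pdf(x, mu, sigma), mu - 12.0 * sigma, mu + 12.0 * sigma
)
print(total)
```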

Generalizing to $n$ dimensions with mean vector $\boldsymbol{\mu}$ and positive-definite covariance matrix $\boldsymbol{\Sigma}$:

\int_{\mathbb{R}^n} \frac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) d\mathbf{x} = 1

Multiple Integrals

Functions of several variables require multiple integrals. For $f: \mathbb{R}^2 \to \mathbb{R}$ and a region $D$ described by $a \leq x \leq b$ and $c(x) \leq y \leq d(x)$:

\iint_D f(x, y) \, dA = \int_a^b \int_{c(x)}^{d(x)} f(x, y) \, dy \, dx

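When $f$ is integrable over $D$, Fubini’s theorem lets us compute the double integral as iterated single integrals, integrating one variable at a time. Marginalization works the same way: integrating a joint density over one variable leaves the density of the other.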
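
An iterated integral translates directly into nested one-dimensional integrations. The sketch below assumes the illustrative integrand $f(x, y) = xy$ over the triangle $0 \leq x \leq 1$, $0 \leq y \leq x$, for which the exact value is $\int_0^1 x \cdot \frac{x^2}{2} \, dx = \frac{1}{8}$. The helper names are made up for this example.

```python
def midpoint_integral(f, a, b, n=400):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


def f(x, y):
    return x * y


def inner(x):
    # Integrate over y from c(x) = 0 to d(x) = x, with x held fixed.
    return midpoint_integral(lambda y: f(x, y), 0.0, x)


# Outer integral over x from a = 0 to b = 1.
double_integral = midpoint_integral(inner, 0.0, 1.0)
print(double_integral)  # the exact value is 1/8
```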

Change of Variables in Multiple Integrals

When transforming coordinates with an invertible, differentiable map $\mathbf{u} = g(\mathbf{x})$:

\int f(\mathbf{x}) \, d\mathbf{x} = \int f(g^{-1}(\mathbf{u})) \, |\det \mathbf{J}_{g^{-1}}| \, d\mathbf{u}

The Jacobian determinant $|\det \mathbf{J}|$ accounts for how the transformation stretches or compresses volume. This is the mathematical foundation of normalizing flows — a class of generative models that transform simple distributions into complex ones using invertible functions with tractable Jacobians.
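
A one-dimensional instance makes the Jacobian factor concrete. The sketch assumes $x \sim \mathcal{N}(0, 1)$ and the illustrative map $u = g(x) = e^x$, so $g^{-1}(u) = \ln u$ and the Jacobian factor is $1/u$. It then compares a probability computed from the transformed density with the same probability computed from the normal CDF. All function names are hypothetical.

```python
import math


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def transformed_pdf(u):
    # g(x) = exp(x), so g^{-1}(u) = ln(u) and |d g^{-1} / du| = 1 / u.
    return std_normal_pdf(math.log(u)) * (1.0 / u)


def midpoint_integral(f, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


# P(0.5 <= U <= 2) computed two ways.
via_density = midpoint_integral(transformed_pdf, 0.5, 2.0)
via_cdf = std_normal_cdf(math.log(2.0)) - std_normal_cdf(math.log(0.5))
print(via_density, via_cdf)
```

Dropping the $1/u$ factor would make the two numbers disagree. A normalizing flow tracks the same correction, layer by layer, in many dimensions.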

Expectation as Integration

The expected value of a continuous random variable $X$ with density $f(x)$ is:

\mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx

More generally, for any function $h(X)$ :

\mathbb{E}[h(X)] = \int h(x) \, f(x) \, dx

This formula computes every statistical quantity we care about:

Quantity	Formula	Integral form
Mean	$\mu = \mathbb{E}[X]$	$\int x \, f(x) \, dx$
Variance	$\text{Var}(X) = \mathbb{E}[(X-\mu)^2]$	$\int (x-\mu)^2 f(x) \, dx$
Entropy	$H(X) = -\mathbb{E}[\log f(X)]$	$-\int f(x) \log f(x) \, dx$
KL divergence	$D_{KL}(p \| q)$	$\int p(x) \log \frac{p(x)}{q(x)} dx$
Cross-entropy	$H(p, q)$	$-\int p(x) \log q(x) \, dx$

Key insight: Nearly every loss function and evaluation metric in ML can be written as an expectation — an integral of some function weighted by a probability distribution. Understanding this unifies many seemingly different concepts under one framework.
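
Every row of the table can be evaluated with one generic routine. The sketch below assumes two illustrative Gaussians, $p = \mathcal{N}(1, 2^2)$ and $q = \mathcal{N}(0, 3^2)$, and approximates each integral with a midpoint rule on a truncated range. The function `expectation` and the choice of distributions are assumptions made for this example.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def expectation(h, pdf, a=-20.0, b=20.0, n=40_000):
    """Midpoint-rule approximation of the integral of h(x) * pdf(x) over [a, b]."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        total += h(x) * pdf(x)
    return total * dx


def p(x):
    return normal_pdf(x, 1.0, 2.0)  # p = N(1, 2^2)


def q(x):
    return normal_pdf(x, 0.0, 3.0)  # q = N(0, 3^2)


mean = expectation(lambda x: x, p)
variance = expectation(lambda x: (x - mean) ** 2, p)
entropy = expectation(lambda x: -math.log(p(x)), p)
cross_entropy = expectation(lambda x: -math.log(q(x)), p)
kl = expectation(lambda x: math.log(p(x) / q(x)), p)

# Known closed forms for Gaussians, used here only as a check.
entropy_exact = 0.5 * math.log(2.0 * math.pi * math.e * 2.0**2)
kl_exact = math.log(3.0 / 2.0) + (2.0**2 + (1.0 - 0.0) ** 2) / (2.0 * 3.0**2) - 0.5
print(mean, variance, entropy, cross_entropy, kl)
```

The results also satisfy the identity $H(p, q) = H(p) + D_{KL}(p \| q)$, which follows by splitting the logarithm inside the KL integral.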

Marginalization

Marginalization integrates out (eliminates) variables we do not need. If $p(x, z)$ is a joint density:

p(x) = \int p(x, z) \, dz = \int p(x \mid z) \, p(z) \, dz

This is the continuous version of the law of total probability.

Why Marginalization is Hard

In Bayesian inference, the evidence (marginal likelihood) is:

p(\mathcal{D}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}

For a neural network with millions of parameters, this integral runs over a space with millions of dimensions and has no closed form, so evaluating it exactly is computationally intractable. This intractability motivates:

Variational inference: Approximate the integral with an optimization problem (see calculus of variations)
Monte Carlo methods: Estimate the integral using random samples
Laplace approximation: Approximate the integrand as Gaussian using a second-order Taylor expansion

Monte Carlo Integration

When integrals lack closed-form solutions, Monte Carlo integration estimates them using random samples. The key identity:

\mathbb{E}_{x \sim p}[h(x)] = \int h(x) \, p(x) \, dx \approx \frac{1}{N}\sum_{i=1}^{N} h(x_i), \quad x_i \sim p

Draw $N$ independent samples from $p$, evaluate $h$ at each, and average. By the law of large numbers, this average converges to the true integral as $N \to \infty$, provided $\mathbb{E}[|h(X)|]$ is finite.
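
In code the estimator is a sample average. This minimal sketch assumes $p = \mathcal{N}(0, 1)$ and $h(x) = x^2$, so the true value is $\mathbb{E}[X^2] = 1$. The function name `monte_carlo` and the fixed seed are illustrative choices.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def monte_carlo(h, sampler, n):
    """Average h over n independent draws from sampler()."""
    return sum(h(sampler()) for _ in range(n)) / n


# E[X^2] for X ~ N(0, 1) equals the variance, which is 1.
estimate = monte_carlo(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 100_000)
print(estimate)
```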

Convergence Rate

The error of a Monte Carlo estimator shrinks at rate $O(1/\sqrt{N})$ regardless of dimension, provided $h(X)$ has finite variance. The dimension can still enter through the constant, namely the variance of $h(X)$. By contrast, grid-based quadrature needs a number of evaluation points that grows exponentially with dimension (the curse of dimensionality).

Key insight: Monte Carlo integration is a major reason probabilistic ML scales to high dimensions. A 1000-dimensional integral is out of reach for grid-based methods but can be feasible for Monte Carlo when samples are cheap and the variance is moderate. This is why sampling-based methods (MCMC, variational inference with reparameterization) are so widely used in Bayesian deep learning.
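
The $1/\sqrt{N}$ rate can be observed empirically: multiplying $N$ by 16 should cut the root-mean-square error by roughly a factor of 4. The sketch below assumes a 3-dimensional standard normal and the illustrative integrand $h(\mathbf{x}) = \frac{1}{d}\sum_j x_j^2$, whose expectation is exactly 1. The sample sizes, seed, and repetition count are arbitrary choices.

```python
import math
import random

rng = random.Random(1)
DIM = 3  # illustrative dimension


def h(x):
    # Mean of squared coordinates; its expectation under N(0, I) is exactly 1.
    return sum(v * v for v in x) / len(x)


def mc_estimate(n):
    """Monte Carlo estimate of E[h(x)] from n draws of x ~ N(0, I)."""
    total = 0.0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        total += h(x)
    return total / n


def rmse(n, repeats=100):
    # Root-mean-square error of the estimator over independent repetitions.
    return math.sqrt(sum((mc_estimate(n) - 1.0) ** 2 for _ in range(repeats)) / repeats)


rmse_small = rmse(100)
rmse_large = rmse(1600)
ratio = rmse_small / rmse_large  # theory predicts about sqrt(16) = 4
print(rmse_small, rmse_large, ratio)
```

The measured ratio is itself noisy, so it will only be close to 4 rather than equal to it.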

Importance Sampling

When sampling from $p$ is difficult, we can sample from a different distribution $q$ and reweight:

\mathbb{E}_{p}[h(x)] = \mathbb{E}_{q}\left[h(x) \frac{p(x)}{q(x)}\right] \approx \frac{1}{N}\sum_{i=1}^{N} h(x_i) \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q

The ratio $w(x) = p(x)/q(x)$ is the importance weight; the identity requires $q(x) > 0$ wherever $h(x) \, p(x) \neq 0$. Importance sampling appears in:

Off-policy reinforcement learning (correcting for behavior policy)
Variational autoencoders (importance-weighted ELBO)
Rare event estimation
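
Rare event estimation shows the benefit most clearly. The sketch below estimates $P(X > 4)$ for $X \sim \mathcal{N}(0, 1)$, an event that plain sampling from $p$ would almost never hit, by sampling from the shifted proposal $q = \mathcal{N}(4, 1)$ instead. The proposal, threshold, and sample size are illustrative assumptions, not a recommended recipe.

```python
import math
import random

rng = random.Random(2)
THRESHOLD = 4.0


def importance_estimate(n):
    """Estimate P(X > 4) for X ~ N(0, 1) using the proposal q = N(4, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(THRESHOLD, 1.0)  # x_i ~ q
        if x > THRESHOLD:  # h(x) is the indicator of the rare event
            # w(x) = p(x) / q(x) = exp(-x^2/2 + (x - 4)^2/2) = exp(8 - 4x)
            total += math.exp(0.5 * THRESHOLD**2 - THRESHOLD * x)
    return total / n


estimate = importance_estimate(20_000)
exact = 0.5 * math.erfc(THRESHOLD / math.sqrt(2.0))  # exact Gaussian tail probability
print(estimate, exact)
```

About half of the proposal's samples land in the region of interest, and the weights correct for having sampled from the wrong distribution.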

The Reparameterization Trick

Variational autoencoders need to backpropagate through an expectation:

\nabla_\phi \mathbb{E}_{z \sim q_\phi(z)}[f(z)]

The problem: $z$ is sampled from $q_\phi$, which depends on the parameters $\phi$ we want to differentiate with respect to. The sampling step is not a differentiable function of $\phi$, so we cannot backpropagate through it directly.

For a Gaussian $q_\phi$ with mean $\mu_\phi$ and standard deviation $\sigma_\phi$, the reparameterization trick rewrites the sampling as a deterministic function of a noise variable:

z = \mu_\phi + \sigma_\phi \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})

Now the expectation is over $\epsilon$, whose distribution does not depend on $\phi$, so we can move the gradient inside the expectation:

\nabla_\phi \mathbb{E}_{\epsilon}[f(\mu_\phi + \sigma_\phi \odot \epsilon)] = \mathbb{E}_{\epsilon}[\nabla_\phi f(\mu_\phi + \sigma_\phi \odot \epsilon)]

Key insight: The reparameterization trick connects integration and differentiation — it allows us to compute gradients of expectations, enabling end-to-end training of models with stochastic latent variables (VAEs, stochastic neural networks).
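
The identity can be tested on a case with a known answer. The sketch below assumes a scalar latent variable and the illustrative function $f(z) = z^2$, for which $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$. The chain rule is applied by hand here. An automatic differentiation framework would do this step in practice.

```python
import random

rng = random.Random(3)


def reparam_gradients(mu, sigma, n):
    """Pathwise gradient estimates of E[f(z)] for f(z) = z^2, z = mu + sigma * eps."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)  # noise, independent of the parameters
        z = mu + sigma * eps       # deterministic function of (mu, sigma, eps)
        df_dz = 2.0 * z            # f'(z) for f(z) = z^2
        grad_mu += df_dz * 1.0     # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule: dz/dsigma = eps
    return grad_mu / n, grad_sigma / n


mu, sigma = 1.5, 0.8  # illustrative parameter values
g_mu, g_sigma = reparam_gradients(mu, sigma, 50_000)
# Since E[z^2] = mu^2 + sigma^2, the exact gradients are 2 * mu and 2 * sigma.
print(g_mu, 2 * mu)
print(g_sigma, 2 * sigma)
```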

Worked Example: Computing Evidence

Consider a simple Bayesian model: $y \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, \sigma_0^2)$. The evidence for observing $y = 3$ is:

\begin{aligned} p(y = 3) &= \int_{-\infty}^{\infty} p(y = 3 \mid \theta) \, p(\theta) \, d\theta \\[6pt] &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(3-\theta)^2/2} \cdot \frac{1}{\sigma_0\sqrt{2\pi}} e^{-\theta^2/(2\sigma_0^2)} \, d\theta \\[6pt] &= \frac{1}{2\pi\sigma_0} \int_{-\infty}^{\infty} \exp\left(-\frac{(3-\theta)^2}{2} - \frac{\theta^2}{2\sigma_0^2}\right) d\theta \end{aligned}

This is a Gaussian integral. Completing the square in the exponent yields:

p(y = 3) = \frac{1}{\sqrt{2\pi(1 + \sigma_0^2)}} \exp\left(-\frac{9}{2(1 + \sigma_0^2)}\right)

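Viewed as a function of $y$, the evidence is itself a Gaussian density with mean $0$ and variance $1 + \sigma_0^2$: the observation noise variance and the prior variance add. This closed-form solution is the exception; for most models, the evidence integral is intractable.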
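
The closed form can be checked by evaluating the evidence integral numerically. The sketch below assumes an illustrative prior standard deviation $\sigma_0 = 2$ and truncates the integral over $\theta$ to $[-40, 40]$. The function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def evidence_numeric(y, sigma0, lo=-40.0, hi=40.0, n=80_000):
    # Midpoint-rule approximation of the integral of p(y | theta) p(theta) over theta.
    d = (hi - lo) / n
    total = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * d
        total += normal_pdf(y, theta, 1.0) * normal_pdf(theta, 0.0, sigma0)
    return total * d


def evidence_closed_form(y, sigma0):
    # Gaussian density in y with mean 0 and variance 1 + sigma0^2.
    var = 1.0 + sigma0**2
    return math.exp(-y * y / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


sigma0 = 2.0  # illustrative prior standard deviation
numeric = evidence_numeric(3.0, sigma0)
closed = evidence_closed_form(3.0, sigma0)
print(numeric, closed)
```

A one-dimensional grid works here only because $\theta$ is a scalar. The same approach is hopeless for the high-dimensional integrals discussed under marginalization.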

Why This Matters for ML

Integration is the mathematical backbone of probabilistic ML:

Probability densities are normalized by integrals — without integration, we cannot define continuous distributions
Expectations (means, variances, losses) are all integrals over distributions
Marginalization integrates out latent variables — essential for Bayesian inference and mixture models
Monte Carlo methods estimate intractable integrals using random samples, scaling to arbitrary dimensions
The reparameterization trick allows backpropagation through stochastic sampling, enabling VAE training
Normalizing flows use the change-of-variables formula (Jacobian determinant) to define complex distributions

Summary

The definite integral computes accumulated area; the Fundamental Theorem links it to antiderivatives
Multiple integrals and the change-of-variables formula (Jacobian) generalize to higher dimensions
Expectation is an integral: $\mathbb{E}[h(X)] = \int h(x) f(x) \, dx$ — nearly every ML quantity is an expectation
Marginalization integrates out unwanted variables but is often intractable in high dimensions
Monte Carlo integration estimates integrals via sampling at rate $O(1/\sqrt{N})$ regardless of dimension
Importance sampling reweights samples from an easy distribution to estimate expectations under a hard one
The reparameterization trick enables gradient-based optimization through stochastic sampling
Next: calculus of variations optimizes over entire functions, not just parameters
